Learning from Demonstrations for Real World Reinforcement Learning

In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages this data to massively accelerate the learning process even from relatively small amounts of demonstration data. DQfD works by combining temporal difference updates with large-margin classification of the demonstrator’s actions. We show that DQfD has better initial performance than Deep Q-Networks (DQN) on 40 of 42 Atari games and it receives more average rewards than DQN on 27 of 42 Atari games. We also demonstrate that DQfD learns faster than DQN even when given poor demonstration data.