https://arxiv.org/pdf/1602.04621v3.pdf Deep Exploration via Bootstrapped DQN

Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.

Key breakthroughs in this paper include the following:

* We present the first practical reinforcement learning algorithm that combines deep learning with deep exploration: Bootstrapped DQN.
* We show that this algorithm can lead to exponentially faster learning.
* We present new state-of-the-art results on Atari 2600.
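
The exploration mechanism is simple to sketch: maintain K value-function heads, sample one head uniformly at the start of each episode, act greedily with respect to that head for the whole episode, and train each head on its own bootstrapped subsample of the data. Below is a minimal tabular sketch of that idea, not the paper's implementation: the paper trains K heads on a shared deep Q-network torso from a replay buffer, whereas the tabular Q-tables, the `env` interface, and the hyperparameters here are illustrative assumptions.

```python
import numpy as np

def bootstrapped_q_learning(env, n_states, n_actions, n_heads=10,
                            episodes=500, alpha=0.1, gamma=0.99, mask_p=0.5):
    """Sketch only. env must expose reset() -> state and step(a) -> (next_state, reward, done)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_heads, n_states, n_actions))   # one value estimate per head

    for _ in range(episodes):
        k = rng.integers(n_heads)                  # sample one head for the whole episode
        s = env.reset()
        done = False
        while not done:
            a = int(np.argmax(Q[k, s]))            # act greedily w.r.t. the sampled head
            s2, r, done = env.step(a)
            bootstrap = np.zeros(n_heads) if done else gamma * Q[:, s2].max(axis=1)
            target = r + bootstrap                 # per-head TD target
            mask = rng.random(n_heads) < mask_p    # Bernoulli bootstrap mask over heads
            Q[mask, s, a] += alpha * (target[mask] - Q[mask, s, a])
            s = s2

    greedy = Q.mean(axis=0).argmax(axis=1)         # ensemble-averaged greedy policy
    return Q, greedy
```

Committing to a single head for an entire episode is what makes the exploration temporally extended, in contrast to per-step epsilon-greedy dithering.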

https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM

https://papers.nips.cc/paper/6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning.pdf Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning

https://arxiv.org/pdf/1706.05087.pdf Plan, Attend, Generate: Character-Level Neural Machine Translation with Planning

https://arxiv.org/abs/1801.03354 Planning with Pixels in (Almost) Real Time

Recently, width-based planning methods have been shown to yield state-of-the-art results in the Atari 2600 video games, with states associated with the (RAM) memory states of the simulator. In this work, we consider the same planning problem but using the screen instead; by using the same visual inputs, the planning results can be compared with those of humans and learning methods. We show that the planning approach, out of the box and without training, results in scores that compare well with those obtained by humans and learning methods. Moreover, by developing an episodic, rollout version of the IW(k) algorithm, we show that such scores can be obtained in almost real time.
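
For reference, IW(1) is a breadth-first search in which a newly generated state is kept only if it makes some single feature atom true for the first time; all other states are pruned. The sketch below shows that novelty-pruned search over a generic simulator; the `sim`/`features` interfaces, the action-trace bookkeeping, and the node budget are assumptions for illustration, and the paper's episodic, rollout IW(k) over screen features is more involved.

```python
from collections import deque

def iw1_plan(sim, root, features, actions, max_nodes=10_000):
    """Return the first action of the best trace found by IW(1) novelty search.

    sim(state, action) -> (next_state, reward, done)   # deterministic simulator
    features(state)    -> iterable of hashable atoms    # e.g. (pixel, value) pairs
    """
    seen_atoms = set(features(root))
    queue = deque([(root, [], 0.0)])        # (state, action trace, accumulated reward)
    best_trace, best_return = [actions[0]], float("-inf")
    expanded = 0

    while queue and expanded < max_nodes:
        state, trace, ret = queue.popleft()
        expanded += 1
        for a in actions:
            nxt, r, done = sim(state, a)
            new_trace, new_ret = trace + [a], ret + r
            if new_ret > best_return:
                best_return, best_trace = new_ret, new_trace
            if done:
                continue
            novel = [f for f in features(nxt) if f not in seen_atoms]
            if not novel:                   # prune: state adds no new atom of size 1
                continue
            seen_atoms.update(novel)
            queue.append((nxt, new_trace, new_ret))

    return best_trace[0]                    # execute the first action, then replan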

https://openreview.net/forum?id=HJw8fAgA-&noteId=HJw8fAgA Learning Dynamic State Abstractions for Model-Based Reinforcement Learning

RL agents that use Monte-Carlo rollouts of these learned abstraction models as features for decision making outperform strong model-free baselines on the game MS_PACMAN, demonstrating the benefits of planning with learned dynamic state abstractions.
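
As a rough illustration of that decision-time scheme, one can roll a learned abstract model forward under each candidate first action and summarize the simulated returns as per-action features. The `model`/`encode` interfaces, the random rollout policy, and the mean-return summary below are placeholder assumptions, not the architecture from the paper.

```python
import numpy as np

def rollout_features(model, encode, state, actions, horizon=10, n_rollouts=5,
                     gamma=0.99, rng=None):
    """Sketch: model.step(z, a) -> (z_next, predicted_reward); encode(state) -> abstract state z."""
    rng = rng or np.random.default_rng()
    feats = []
    for a0 in actions:
        returns = []
        for _ in range(n_rollouts):
            z, g, discount = encode(state), 0.0, 1.0
            a = a0
            for _ in range(horizon):
                z, r = model.step(z, a)              # imagined transition in the learned model
                g += discount * r
                discount *= gamma
                a = actions[rng.integers(len(actions))]  # random rollout policy after the first step
            returns.append(g)
        feats.append(np.mean(returns))               # one feature per candidate action
    return np.asarray(feats)                         # fed to the agent's decision layer
```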