# Deep Reinforcement Learning

**Intent**

**Motivation**

**Sketch**

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

https://johncarlosbaez.wordpress.com/2014/10/30/sensing-and-acting-under-information-constraints/ Bellman Equation and Euler-Lagrange, Hamilton-Jacobi “If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation”
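For reference, the Bellman optimality equation that the quotation alludes to can be written as below. The notation ($V^*$ for the optimal value function, $R$ for reward, $P$ for transition probabilities, $\gamma$ for the discount factor) is the usual textbook one and is an assumption here, not taken from the linked post.

$$
V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
$$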

http://robotics.ai.uiuc.edu/~scandido/?Developing_Reinforcement_Learning_from_the_Bellman_Equation For value iteration (solving the optimal control problem), we assume full knowledge of the process model and reward function. For Q-learning (solving the reinforcement learning problem), we assume no knowledge of process model or reward function.
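The difference between the two settings can be sketched on a toy problem. The two-state, two-action deterministic MDP below (tables `P` and `R`, discount `GAMMA`) is a hypothetical example invented for illustration; it does not come from the linked page. Value iteration reads the model directly, while Q-learning only consumes sampled transitions.

```python
import random

# Hypothetical deterministic MDP (illustrative assumption):
# P[s][a] is the next state, R[s][a] is the reward.
P = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [0.0, 2.0]]
GAMMA = 0.9


def value_iteration(sweeps=200):
    """Model-based: uses P and R directly in the Bellman backup."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [max(R[s][a] + GAMMA * v[P[s][a]] for a in range(2))
             for s in range(2)]
    return v


def q_learning(steps=20000, alpha=0.1, seed=0):
    """Model-free: the learner only sees sampled (s, a, r, s') tuples."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)              # uniform random exploration
        r, s_next = R[s][a], P[s][a]      # environment step, opaque to the learner
        target = r + GAMMA * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
        s = s_next
    return q
```

In this toy case both routes should agree: `max(q[s])` from `q_learning()` approaches `value_iteration()[s]` for each state.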

http://arxiv.org/abs/1509.06461 Deep Reinforcement Learning with Double Q-learning

A deep Q network (DQN) is a multi-layered neural network that for a given state $s$ outputs a vector of action values $Q(s, \cdot\,; \theta)$, where $\theta$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$.
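To make the "function from $\mathbb{R}^n$ to $\mathbb{R}^m$" reading concrete, here is a minimal sketch of a one-hidden-layer network in plain Python. The layer sizes, the ReLU nonlinearity, the initialization, and all function names are assumptions chosen for illustration; the paper's networks are convolutional and much larger.

```python
import random


def init_mlp(n_state, n_hidden, n_actions, seed=0):
    """Create parameters theta for a tiny MLP (illustrative sizes and init)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(n_hidden, n_state), "b1": [0.0] * n_hidden,
        "W2": matrix(n_actions, n_hidden), "b2": [0.0] * n_actions,
    }


def affine(weights, bias, x):
    """Compute weights @ x + bias for list-based matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def q_values(theta, state):
    """Map an n-dimensional state to a vector of m action values."""
    hidden = [max(0.0, h) for h in affine(theta["W1"], theta["b1"], state)]
    return affine(theta["W2"], theta["b2"], hidden)


def greedy_action(theta, state):
    """Index of the action with the largest estimated value."""
    q = q_values(theta, state)
    return max(range(len(q)), key=q.__getitem__)
```

For example, `q_values(init_mlp(4, 8, 3), [0.1, 0.2, 0.3, 0.4])` returns a list of three action values, one per action.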

Two important ingredients of the DQN algorithm as proposed by Mnih et al. (2015) are the use of a target network, and the use of experience replay.
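The two ingredients can be sketched roughly as follows. The class and function names, the uniform sampling, and the dictionary-of-lists parameter layout are illustrative assumptions, not the implementation from Mnih et al.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random minibatches break the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Hard-copy online parameters into the target network's parameters.

    Called only every so often, so bootstrap targets stay fixed in between.
    """
    target_params.clear()
    target_params.update({name: list(values)
                          for name, values in online_params.items()})
```

A training loop would push every transition into the buffer, draw minibatches from it for each gradient step, and call `sync_target` at a fixed interval.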

The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation. This is the idea behind Double Q-learning (van Hasselt, 2010).

In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights. For each update, one set of weights is used to determine the greedy policy and the other to determine its value.
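A tabular sketch of this update, assuming Q-tables stored as nested lists indexed by state and action. The function name, the argument layout, and the injected `rng` are illustrative choices, not taken from the paper.

```python
import random


def double_q_update(q_a, q_b, s, a, r, s_next, done, alpha, gamma, rng):
    """One Double Q-learning step on tables q_a and q_b (modified in place)."""
    # Each experience is randomly assigned to update one of the two tables.
    if rng.random() < 0.5:
        updated, other = q_a, q_b
    else:
        updated, other = q_b, q_a

    if done:
        target = r
    else:
        next_values = updated[s_next]
        # Select the greedy action with the table being updated ...
        best = max(range(len(next_values)), key=next_values.__getitem__)
        # ... but evaluate that action with the other table.
        target = r + gamma * other[s_next][best]

    updated[s][a] += alpha * (target - updated[s][a])
```

A caller would typically pass `rng=random.Random(seed)` and act greedily with respect to the sum or average of the two tables.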

http://arxiv.org/pdf/1606.05174v1.pdf Deep Reinforcement Learning Discovers Internal Models

In this work we present the Semi-Aggregated MDP (SAMDP) model, a model best suited to describe policies exhibiting both spatial and temporal hierarchies. We describe its advantages for analyzing trained policies over other modeling approaches, and show that under the right state representation, like that of DQN agents, SAMDP can help to identify skills.

https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf Human-level control through deep reinforcement learning

http://karpathy.github.io/2016/05/31/rl/

https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html

http://outlace.com/Reinforcement-Learning-Part-1/

http://faculty.washington.edu/paymana/swarm/gambardella95-icml.pdf

http://www.wildml.com/2016/10/learning-reinforcement-learning/

https://arxiv.org/pdf/1502.05477v4.pdf Trust Region Policy Optimization

https://arxiv.org/abs/1602.01783 Asynchronous Methods for Deep Reinforcement Learning

We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU.

https://gym.openai.com/docs/rl

https://arxiv.org/abs/1609.05473v4 SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search.

http://biorxiv.org/content/early/2016/10/29/084111 Modeling Cognitive Processes with Neural Reinforcement Learning

http://openreview.net/pdf?id=HkLXCE9lx RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

http://zacklipton.com/media/papers/iclr-combating-reinforcement-lipton2016.pdf

https://arxiv.org/abs/1611.02796v2 Tuning Recurrent Neural Networks with Reinforcement Learning

In this paper we propose a novel approach for sequence training which combines Maximum Likelihood (ML) and RL training. We refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from stochastic optimal control (SOC).

We have derived a novel sequence learning framework which uses RL rewards to correct properties of sequences generated by an RNN, while keeping much of the information learned from supervised training on data. We proposed and evaluated three alternative techniques for achieving this, and showed promising results on music generation tasks.

In addition to the ability to train models to generate pleasant-sounding melodies, we believe our approach of using RL to refine RNN models could be promising for a number of applications. For example, it is well known that a common failure mode of RNNs is to repeatedly generate the same token. In text generation and automatic question answering, this can take the form of repeatedly generating the same response (e.g. “How are you?” → “How are you?” → “How are you?” …). We have demonstrated that with our approach we can correct for this unwanted behavior, while still maintaining information that the model learned from data. Although manually writing a reward function may seem unappealing to those who believe in training models end-to-end based only on data, that approach is limited by the quality of the data that can be collected. If the data contains hidden biases, this can lead to highly undesirable consequences. Recent research has shown that the word2vec embeddings in popular language models trained on standard corpora consistently contain the same harmful biases with respect to race and gender that are revealed by implicit association tests on humans (Caliskan-Islam et al., 2016). In contrast to relying solely on possibly biased data, our approach allows for encoding high-level domain knowledge into the RNN, providing a general, alternative tool for training sequence models.

https://medium.com/@joshdotai/deep-reinforcement-learning-papers-a2167c136fc7#.wo6bjzwxd

http://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf

https://blog.ought.com/nips-2016-875bb8fadb8c#.tna5eeblv

https://arxiv.org/abs/1612.00563 Self-critical Sequence Training for Image Captioning

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a “baseline” to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) is avoided, while at the same time harmonizing the model with respect to its test-time inference procedure.
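The core of the idea can be sketched as a surrogate loss in which the reward of the greedily decoded (test-time) caption serves as the baseline. The function name and the scalar inputs are illustrative assumptions; in a real system the log-probabilities would come from the captioning model and the rewards from a metric such as CIDEr.

```python
import math


def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """REINFORCE surrogate loss with the greedy decode's reward as baseline.

    sample_log_probs: per-token log-probabilities of the sampled caption.
    sample_reward:    metric score of the sampled caption.
    greedy_reward:    metric score of the greedy (test-time) caption.
    """
    # Treated as a constant when differentiating with respect to the model.
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the
    # greedy decode and lowers it for samples that do worse.
    return -advantage * sum(sample_log_probs)
```

For instance, `scst_loss([math.log(0.5), math.log(0.25)], 1.0, 0.6)` is positive, so gradient descent would increase the log-probability of that sampled caption.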

https://arxiv.org/abs/1701.07274v1 Deep Reinforcement Learning: An Overview

We give an overview of recent exciting achievements of deep reinforcement learning (RL). We start with background of deep learning and reinforcement learning, as well as introduction of testbeds. Next we discuss Deep Q-Network (DQN) and its extensions, asynchronous methods, policy optimization, reward, and planning. After that, we talk about attention and memory, unsupervised learning, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, spoken dialogue systems (a.k.a. chatbot), machine translation, text sequence prediction, neural architecture design, personalized web services, healthcare, finance, and music generation. We mention topics/papers not reviewed yet. After listing a collection of RL resources, we close with discussions.

https://arxiv.org/abs/1702.08892v1 Bridging the Gap Between Value and Policy Based Reinforcement Learning

We formulate a new notion of softmax temporal consistency that generalizes the standard hard-max Bellman consistency usually considered in value based reinforcement learning (RL). In particular, we show how softmax consistent action values correspond to optimal policies that maximize entropy regularized expected reward. More importantly, we establish that softmax consistent action values and the optimal policy must satisfy a mutual compatibility property that holds across any state-action subsequence. Based on this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes the total inconsistency measured along multi-step subsequences extracted from both on and off policy traces. An experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmark tasks.
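A small numerical sketch of how a softmax (log-sum-exp) backup relates to the hard max. The temperature name `tau` and the function names are assumptions for illustration, and the full multi-step PCL objective is not reproduced here.

```python
import math


def soft_value(q, tau):
    """Softmax state value: tau * log(sum_a exp(q[a] / tau)), computed stably."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))


def soft_policy(q, tau):
    """Entropy-regularized policy: pi(a) = exp((q[a] - V) / tau)."""
    v = soft_value(q, tau)
    return [math.exp((x - v) / tau) for x in q]
```

As `tau` shrinks toward zero, `soft_value` approaches `max(q)` and the policy concentrates on the greedy action, which is the sense in which the softmax form generalizes the hard-max one.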

https://arxiv.org/abs/1703.01327 Multi-step Reinforcement Learning: A Unifying Algorithm

Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) which unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance.
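The one-step version of the backup can be sketched as an interpolation between the Sarsa target and the Expected Sarsa target. The function signature is an illustrative assumption, and the multi-step form studied in the paper is not shown.

```python
def q_sigma_target(reward, gamma, q_next, policy_next, sampled_action, sigma):
    """One-step Q(sigma) target.

    sigma = 1 gives the Sarsa target (full sampling);
    sigma = 0 gives the Expected Sarsa target (pure expectation).
    """
    sampled = q_next[sampled_action]
    expected = sum(p * q for p, q in zip(policy_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```

With `q_next=[1.0, 3.0]`, a uniform policy, sampled action 1, `reward=1.0`, and `gamma=0.5`, the target moves from 2.0 at `sigma=0` to 2.5 at `sigma=1`.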

https://arxiv.org/abs/1704.06440 Equivalence Between Policy Gradients and Soft Q-Learning

https://arxiv.org/abs/1708.05866 A Brief Survey of Deep Reinforcement Learning

https://arxiv.org/abs/1709.06560

https://arxiv.org/abs/1710.02298v1 Rainbow: Combining Improvements in Deep Reinforcement Learning

This paper examines six extensions to the DQN algorithm and empirically studies their combination.

https://arxiv.org/pdf/1803.02811.pdf Accelerated Methods for Deep Reinforcement Learning

We confirm that both policy gradient and Q-value learning algorithms can be adapted to learn using many parallel simulator instances. We further find it possible to train using batch sizes considerably larger than are standard, without negatively affecting sample complexity or final performance. We leverage these facts to build a unified framework for parallelization that dramatically hastens experiments in both classes of algorithm.

https://arxiv.org/pdf/1710.02298.pdf Rainbow: Combining Improvements in Deep Reinforcement Learning. The six DQN extensions that the paper combines:

Double DQN (DDQN; van Hasselt, Guez, and Silver 2016) addresses an overestimation bias of Q-learning (van Hasselt 2010), by decoupling selection and evaluation of the bootstrap action.
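A sketch contrasting the two bootstrap targets, given next-state action values from the online and target networks as plain lists. The function names and arguments are illustrative assumptions.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double DQN: online network selects, target network evaluates."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]
```

When the two networks disagree about the best action, the Double DQN target is never larger than the DQN target, which is how the decoupling tempers overestimation.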

Prioritized experience replay (Schaul et al. 2015) improves data efficiency, by replaying more often transitions from which there is more to learn.
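A sketch of the proportional variant: transitions are sampled with probability proportional to priority raised to a power, and importance weights correct for the non-uniform sampling. The exponent names `alpha` and `beta` follow the usual convention; the function names are assumptions.

```python
def sampling_probabilities(priorities, alpha):
    """P(i) = p_i**alpha / sum_k p_k**alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta):
    """w_i = (N * P(i))**(-beta), normalized so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    largest = max(weights)
    return [w / largest for w in weights]
```

Priorities are typically derived from the magnitude of each transition's last TD error, so surprising transitions are replayed more often and down-weighted accordingly in the loss.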

The dueling network architecture (Wang et al. 2016) helps to generalize across actions by separately representing state values and action advantages.
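The way the two streams are recombined can be sketched as follows, using the mean-subtracted aggregation. The inputs stand in for the outputs of the value and advantage streams; the function name is an assumption.

```python
def dueling_q(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

Subtracting the mean makes the decomposition identifiable: the average of the resulting action values equals the state value.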

Learning from multi-step bootstrap targets (Sutton 1988; Sutton and Barto 1998), as used in A3C (Mnih et al. 2016), shifts the bias-variance tradeoff and helps to propagate newly observed rewards faster to earlier visited states.
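A multi-step target sums discounted rewards over n steps and then bootstraps from a value estimate. A minimal sketch, with an assumed function name and list-based inputs:

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """sum_k gamma**k * rewards[k] + gamma**n * bootstrap_value, n = len(rewards)."""
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```

With `rewards=[1.0, 1.0, 1.0]`, `bootstrap_value=10.0`, and `gamma=0.5`, this gives 1 + 0.5 + 0.25 + 0.125 × 10 = 3.0.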

Distributional Q-learning (Bellemare, Dabney, and Munos 2017) learns a categorical distribution of discounted returns, instead of estimating the mean.
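A sketch of the representation only: the return distribution is a set of probabilities over a fixed grid of return values ("atoms"), and the usual scalar action value is recovered as its mean. The function names and the evenly spaced grid parameters are illustrative assumptions, and the projection step used during training is omitted.

```python
def make_support(v_min, v_max, n_atoms):
    """Evenly spaced return values z_0 .. z_{n-1} between v_min and v_max."""
    step = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * step for i in range(n_atoms)]


def expected_return(probs, support):
    """Mean of the categorical distribution: sum_i p_i * z_i."""
    return sum(p * z for p, z in zip(probs, support))
```

Action selection can still be greedy with respect to `expected_return`, while the learning signal compares whole distributions rather than means.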

Noisy DQN (Fortunato et al. 2017) uses stochastic network layers for exploration.
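A sketch of a noisy linear layer with independent Gaussian noise per weight and bias. The learned parameters are the means `mu_*` and noise scales `sigma_*`; the function name and list-based layout are assumptions, and the factorized-noise variant is not shown.

```python
import random


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """y = (mu_w + sigma_w * eps_w) @ x + (mu_b + sigma_b * eps_b)."""
    outputs = []
    for mu_row, sigma_row, mb, sb in zip(mu_w, sigma_w, mu_b, sigma_b):
        total = mb + sb * rng.gauss(0.0, 1.0)          # noisy bias
        for xi, m, s in zip(x, mu_row, sigma_row):
            total += (m + s * rng.gauss(0.0, 1.0)) * xi  # noisy weight
        outputs.append(total)
    return outputs
```

Because fresh noise is drawn on each forward pass, the same state can yield different action values, which drives exploration without a separate epsilon-greedy schedule. With all `sigma_*` set to zero the layer reduces to an ordinary linear layer.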