Name: Equations of Evolution (aka Dynamics)

$$V(x(t), t) = \min_u \left\{\int_t^{t + dt} C(x(s), u(s)) \, ds + V(x(t+dt), t+dt) \right\}.$$
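In discrete time, the dynamic programming principle above reduces to the familiar Bellman backup $V(s) = \min_u [C(s,u) + V(s')]$. A minimal value-iteration sketch on a hypothetical 4-state shortest-path chain (cost 1 per step, state 3 absorbing):

```python
import numpy as np

# Discrete-time analogue of the dynamic programming principle above:
# V(s) = min_u [ C(s, u) + V(next(s, u)) ] on a toy deterministic chain.
n_states = 4
goal = 3
actions = [-1, +1]  # move left or right along the chain

V = np.zeros(n_states)
for _ in range(50):  # iterate the backup to its fixed point
    V_new = np.zeros(n_states)
    for s in range(n_states):
        if s == goal:
            continue  # absorbing goal state: zero cost-to-go
        V_new[s] = min(1.0 + V[min(max(s + a, 0), n_states - 1)] for a in actions)
    V = V_new

# cost-to-go from each state: [3, 2, 1, 0]
```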

References

https://arxiv.org/pdf/1102.1360v1.pdf Quantum simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert space

https://arxiv.org/abs/1704.04463 On Generalized Bellman Equations and Temporal-Difference Learning

https://arxiv.org/abs/1704.04932v1 Deep Relaxation: partial differential equations for optimizing deep neural networks

We establish connections between non-convex optimization methods for training deep neural networks (DNNs) and the theory of partial differential equations (PDEs). In particular, we focus on relaxation techniques initially developed in statistical physics, which we show to be solutions of a nonlinear Hamilton-Jacobi-Bellman equation. We employ the underlying stochastic control problem to analyze the geometry of the relaxed energy landscape and its convergence properties, thereby confirming empirical evidence. This paper opens non-convex optimization problems arising in deep learning to ideas from the PDE literature. In particular, we show that the non-viscous Hamilton-Jacobi equation leads to an elegant algorithm based on the Hopf-Lax formula that outperforms state-of-the-art methods. Furthermore, we show that these algorithms scale well in practice and can effectively tackle the high dimensionality of modern neural networks.
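The non-viscous Hamilton-Jacobi smoothing the abstract refers to is, in the simplest setting, the Hopf-Lax (Moreau-envelope) formula $u(x,t) = \min_y \, f(y) + \|x-y\|^2/(2t)$. A one-dimensional sketch on a hypothetical non-convex loss (the grid minimization stands in for the inner optimization the paper solves differently):

```python
import numpy as np

# Hopf-Lax smoothing: u(x, t) = min_y [ f(y) + |x - y|^2 / (2t) ] solves the
# non-viscous Hamilton-Jacobi equation u_t + |grad u|^2 / 2 = 0, u(x, 0) = f(x).
# The minimizer y* also gives the gradient of the smoothed loss: (x - y*) / t.

def f(y):
    return np.sin(3 * y) + 0.1 * y ** 2  # hypothetical non-convex loss

def hopf_lax(x, t, grid):
    vals = f(grid) + (x - grid) ** 2 / (2 * t)
    i = np.argmin(vals)
    u = vals[i]            # smoothed (relaxed) loss value at x
    g = (x - grid[i]) / t  # gradient of the smoothed loss at x
    return u, g

grid = np.linspace(-5, 5, 20001)
u, g = hopf_lax(1.0, 0.5, grid)
# u <= f(1.0): the relaxation lower-bounds the original loss pointwise
```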

https://arxiv.org/pdf/1701.07403v1.pdf Learning Light Transport the Reinforced Way

We show that the equations of reinforcement learning and light transport simulation are related integral equations. Based on this correspondence, a scheme to learn importance while sampling path space is derived. The new approach is demonstrated in a consistent light transport simulation algorithm that uses reinforcement learning to progressively learn where light comes from. As using this information for importance sampling includes information about visibility, too, the number of light transport paths with non-zero contribution is dramatically increased, resulting in much less noisy images within a fixed time budget.
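The correspondence can be caricatured in a few lines: treat incident radiance like a Q-value, learn it online with a TD-style update, and importance-sample directions in proportion to it. Everything below (8 direction bins per patch, the fake radiance data) is hypothetical, not the paper's actual estimator:

```python
import random

n_bins = 8
alpha = 0.3
q = [1.0] * n_bins  # start uniform so every direction remains sampleable

def sample_direction(rng):
    # importance sampling: pick a bin proportional to learned radiance,
    # returning its probability too (needed to keep the estimator unbiased).
    total = sum(q)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(q):
        acc += w
        if r <= acc:
            return i, w / total
    return n_bins - 1, q[-1] / total

def update(bin_i, observed_radiance):
    # TD-style update: blend the old estimate with the observed contribution
    q[bin_i] = (1 - alpha) * q[bin_i] + alpha * observed_radiance

rng = random.Random(0)
for _ in range(1000):
    i, pdf = sample_direction(rng)
    radiance = 5.0 if i == 2 else 0.1  # pretend light mostly arrives from bin 2
    update(i, radiance)

# after learning, bin 2 dominates q and is sampled most often
```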

https://arxiv.org/pdf/1703.01585.pdf A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes

In this work we consider RL objectives with information-theoretic limitations. For the first time we derive a Bellman-type recursive equation for the causal information between the environment and the agent, which is combined plausibly with the Bellman recursion for the value function. The unified equation serves to explore the typical behavior of artificial agents in an infinite time horizon.

https://arxiv.org/abs/1709.05380v1 The Uncertainty Bellman Equation and Exploration

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ϵ-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

https://arxiv.org/abs/1710.10044v1 Distributional Reinforcement Learning with Quantile Regression
https://arxiv.org/abs/1805.11593 Observe and Look Further: Achieving Consistent Performance on Atari

A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of γ=0.999 (instead of γ=0.99), extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma's Revenge.
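The transformed operator squashes targets with $h(z) = \operatorname{sign}(z)(\sqrt{|z|+1}-1) + \varepsilon z$ and bootstraps through its closed-form inverse, i.e. $Q \leftarrow h(r + \gamma\, h^{-1}(Q'))$. A sketch of the pair (the ε = 10⁻² value follows the paper; the backup helper is illustrative):

```python
import math

EPS = 1e-2

def h(z):
    # compresses large rewards; the eps*z term keeps h strictly monotone
    return math.copysign(math.sqrt(abs(z) + 1) - 1, z) + EPS * z

def h_inv(x):
    # closed-form inverse of h, used to un-squash Q before bootstrapping
    return math.copysign(
        ((math.sqrt(1 + 4 * EPS * (abs(x) + 1 + EPS)) - 1) / (2 * EPS)) ** 2 - 1, x)

def transformed_backup(reward, gamma, q_next):
    # one application of the transformed Bellman operator to a bootstrap value
    return h(reward + gamma * h_inv(q_next))

# round trip: h_inv(h(z)) recovers z across very different reward scales
for z in (-1000.0, -1.0, 0.0, 0.5, 1000.0):
    assert abs(h_inv(h(z)) - z) < 1e-6 * max(1.0, abs(z))
```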