https://arxiv.org/pdf/1602.02867v2.pdf Value Iteration Networks

We introduce the value iteration network (VIN): a fully differentiable neural network with a ‘planning module’ embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
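
As a rough sketch of the core idea, the Bellman backup over a grid MDP can be written as a convolution over a stacked reward-and-value map followed by a max over the action channel. The module below is illustrative only and assumes PyTorch; the `ValueIterationModule` name and layer sizes are not from the paper's code.

```python
# Minimal sketch of a VIN-style value-iteration module (PyTorch assumed).
# Each iteration convolves [reward; value] with learned transition kernels
# to get per-action Q maps, then max-pools over the action channel.
import torch
import torch.nn as nn

class ValueIterationModule(nn.Module):
    def __init__(self, num_actions=8, iterations=20):
        super().__init__()
        self.iterations = iterations
        # 2 input channels: reward map and current value map
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W)
        value = torch.zeros_like(reward_map)
        for _ in range(self.iterations):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))  # (B, A, H, W)
            value, _ = q.max(dim=1, keepdim=True)                   # Bellman max over actions
        return value

if __name__ == "__main__":
    vi = ValueIterationModule()
    print(vi(torch.randn(1, 1, 16, 16)).shape)  # torch.Size([1, 1, 16, 16])
```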

https://arxiv.org/abs/1611.03673v2 Learning to Navigate in Complex Environments

https://arxiv.org/abs/1611.05397v1 Reinforcement Learning with Unsupervised Auxiliary Tasks

https://arxiv.org/pdf/1701.02392.pdf Reinforcement Learning via Recurrent Convolutional Neural Networks

We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs), to better exploit their inherent structure. We define three such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs over states in partially observable environments, and choose optimal actions, respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL.
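
For the belief-propagation variant, the intuition is that a Bayes-filter prediction step on a grid POMDP is a convolution of the belief map with an action-conditioned transition kernel, followed by reweighting with the observation likelihood. The snippet below is a hedged sketch assuming PyTorch; the function and variable names are illustrative, not the paper's.

```python
# Sketch of one belief-update step as convolution (assumed grid POMDP).
import torch
import torch.nn.functional as F

def belief_update(belief, transition_kernel, obs_likelihood):
    # belief: (1, 1, H, W); transition_kernel: (1, 1, k, k); obs_likelihood: (1, 1, H, W)
    predicted = F.conv2d(belief, transition_kernel, padding=transition_kernel.shape[-1] // 2)
    posterior = predicted * obs_likelihood
    return posterior / posterior.sum().clamp_min(1e-8)

if __name__ == "__main__":
    b = torch.full((1, 1, 8, 8), 1.0 / 64)                       # uniform prior
    kernel = torch.softmax(torch.randn(9), 0).view(1, 1, 3, 3)   # stochastic motion model
    obs = torch.rand(1, 1, 8, 8)                                 # per-cell observation likelihood
    print(belief_update(b, kernel, obs).sum())                   # ~1.0
```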

http://interactive.mit.edu/sites/default/files/documents/Kim_AAAI_2017_preprint.pdf

The generation of near-optimal plans for multi-agent systems with numerical states and temporal actions is computationally challenging. Current off-the-shelf planners can take a very long time before generating a near-optimal solution. In an effort to reduce plan computation time, increase the quality of the resulting plans, and make them more interpretable by humans, we explore collaborative planning techniques that actively involve human users in plan generation. Specifically, we explore a framework in which users provide high-level strategies encoded as soft preferences to guide the low-level search of the planner. Through human subject experimentation, we empirically demonstrate that this approach results in statistically significant improvements to plan quality, without substantially increasing computation time. We also show that the resulting plans achieve greater similarity to those generated by humans with regard to the produced sequences of actions, as compared to plans that do not incorporate user provided strategies.

https://arxiv.org/abs/1702.03920 Cognitive Mapping and Planning for Visual Navigation

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person viewpoints and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as 'go to a chair'.

https://arxiv.org/pdf/1705.08439.pdf Thinking Fast and Slow with Deep Learning and Tree Search

Solving sequential decision making problems, such as text parsing, robotic control, and game playing, requires a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration, a novel algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. In contrast, standard Deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that our method substantially outperforms Policy Gradients in the board game Hex, winning 84.4% of games against it when trained for equal time.
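
Structurally, Expert Iteration alternates between an expert (tree search guided by the current network) that produces improved move distributions at visited states and an apprentice network trained to imitate them; the stronger apprentice then guides the next round of search. The skeleton below only illustrates that control flow; `tree_search`, `train_apprentice`, and the dummy state/action sets are placeholders, not the paper's implementation.

```python
# Structural sketch of the Expert Iteration loop (all names illustrative).
def legal_actions(state):
    return [0, 1, 2]  # dummy action set for the sketch

def tree_search(state, apprentice, simulations=50):
    # Placeholder expert: in the paper this is MCTS biased by the apprentice's priors.
    actions = legal_actions(state)
    return {a: 1.0 / len(actions) for a in actions}

def train_apprentice(apprentice, dataset):
    return apprentice  # stand-in for a supervised policy-imitation update

def expert_iteration(apprentice, num_iterations=3, games_per_iter=10):
    for _ in range(num_iterations):
        dataset = []
        for _ in range(games_per_iter):
            state = 0  # dummy initial state
            search_policy = tree_search(state, apprentice)
            dataset.append((state, search_policy))   # expert targets for imitation
        apprentice = train_apprentice(apprentice, dataset)
    return apprentice

expert_iteration(apprentice=None)
```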

https://arxiv.org/pdf/1706.01433.pdf Visual Interaction Networks

https://arxiv.org/abs/1706.09597v1 Path Integral Networks: End-to-End Differentiable Optimal Control

PI-Net is fully differentiable, learning both dynamics and cost models end-to-end by back-propagation and stochastic gradient descent. Because of this, PI-Net can learn to plan. PI-Net has several advantages: it can generalize to unseen states thanks to planning, it can be applied to continuous control tasks, and it allows for a wide variety of learning schemes, including imitation and reinforcement learning.

https://arxiv.org/abs/1707.06170 Learning model-based planning from scratch

Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the “Imagination-based Planner”, the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a “plan context” which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex “imagination tree” by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.
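
In rough pseudocode, each real action is preceded by a variable number of imagination steps whose outcomes are accumulated into a plan context that conditions later proposals. The sketch below is a toy illustration with stand-in `learned_model` and `propose_action` functions; it is not the paper's architecture.

```python
# Toy sketch of an imagination-based planning loop (all names hypothetical).
import numpy as np

def learned_model(state, action):
    # Stand-in for the agent's learned transition/reward model.
    next_state = state + 0.1 * action
    reward = -np.abs(next_state).sum()
    return next_state, reward

def propose_action(state, plan_context):
    # Stand-in for a learned policy conditioned on the plan context.
    return np.random.uniform(-1, 1, size=state.shape)

def act_with_imagination(state, imagination_steps=5):
    plan_context = []
    for _ in range(imagination_steps):
        imagined_action = propose_action(state, plan_context)
        imagined_next, imagined_reward = learned_model(state, imagined_action)
        plan_context.append((imagined_action, imagined_next, imagined_reward))
    # Commit to the imagined action with the best predicted reward.
    best_action = max(plan_context, key=lambda item: item[2])[0]
    return best_action

print(act_with_imagination(np.zeros(2)))
```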

https://arxiv.org/abs/1707.06203 Imagination-Augmented Agents for Deep Reinforcement Learning

https://www.researchgate.net/profile/Anne_Collins3/publication/262779120_Human_cognition_Foundations_of_human_reasoning_in_the_prefrontal_cortex/links/53d962410cf2631430c64c5c/Human-cognition-Foundations-of-human-reasoning-in-the-prefrontal-cortex.pdf Foundations of human reasoning in the prefrontal cortex

The prefrontal cortex (PFC) subserves reasoning in the service of adaptive behavior. Little is known however about the architecture of reasoning processes in the PFC. Using computational modeling and neuroimaging, we show here that the human PFC comprises two concurrent inferential tracks: one from ventromedial to dorsomedial PFC regions that makes probabilistic inferences about the reliability of the ongoing behavioral strategy and arbitrates between adjusting this strategy vs. exploring new ones from long-term memory; another from polar to lateral PFC regions that makes probabilistic inferences about the reliability of two/three alternative strategies and arbitrates between exploring new strategies vs. exploiting these alternative ones. The two tracks interact and along with the striatum, realize hypothesis-testing for accepting vs. rejecting newly created strategies.

https://arxiv.org/abs/1705.04350v2 Imagination improves Multimodal Translation

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.

https://arxiv.org/abs/1708.06742 Twin Networks: Using the Future as a Regularizer

Being able to model long-term dependencies in sequential data, such as text, has been among the long-standing challenges of recurrent neural networks (RNNs). This issue is closely related to the absence of explicit planning in current RNN architectures: the RNNs are trained to predict only the next token given previous ones. In this paper, we introduce a simple way of encouraging RNNs to plan for the future. To accomplish this, we introduce an additional neural network which is trained to generate the sequence in reverse order, and we require closeness between the states of the forward RNN and backward RNN that predict the same token. At each step, the states of the forward RNN are required to match the future information contained in the backward states. We hypothesize that this approach eases modeling of long-term dependencies, thus helping to generate more globally consistent samples. The model trained with conditional generation for a speech recognition task achieved a 12% relative improvement (CER of 6.7 compared to a baseline of 7.6).
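
The regularizer itself is easy to state: run a second RNN over the reversed sequence and penalize the distance between forward and backward hidden states aligned on the same target token. A minimal PyTorch sketch follows; module names, dimensions, and the affine alignment layer are illustrative assumptions.

```python
# Hedged sketch of the twin-network regularizer.
import torch
import torch.nn as nn

seq_len, batch, dim = 12, 4, 32
x = torch.randn(seq_len, batch, dim)

forward_rnn = nn.GRU(dim, dim)
backward_rnn = nn.GRU(dim, dim)
affine = nn.Linear(dim, dim)  # learned map between the two state spaces

h_fwd, _ = forward_rnn(x)                          # (T, B, dim)
h_bwd_rev, _ = backward_rnn(torch.flip(x, [0]))    # run over the reversed sequence
h_bwd = torch.flip(h_bwd_rev, [0])                 # re-align with forward time

twin_loss = ((h_fwd - affine(h_bwd)) ** 2).mean()  # added to the usual prediction loss
print(twin_loss.item())
```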

https://arxiv.org/abs/1708.07280 Learning Generalized Reactive Policies using Deep Neural Networks

https://arxiv.org/pdf/1711.05313.pdf Simulating Action Dynamics with Neural Process Networks

https://arxiv.org/abs/1710.11417 TreeQN and ATreeC: Differentiable Tree Planning for Deep Reinforcement Learning

To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values.
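
Concretely, Q-values come from recursively expanding per-action learned transitions in an abstract state space and backing up predicted rewards and values with a max. A hedged PyTorch sketch, with illustrative module names and without the paper's encoder or backup-weighting details:

```python
# Sketch of a TreeQN-style recursive tree backup over a learned abstract model.
import torch
import torch.nn as nn

class TreeQNSketch(nn.Module):
    def __init__(self, state_dim=16, num_actions=4, depth=2, gamma=0.99):
        super().__init__()
        self.num_actions, self.depth, self.gamma = num_actions, depth, gamma
        self.transition = nn.ModuleList(nn.Linear(state_dim, state_dim) for _ in range(num_actions))
        self.reward = nn.ModuleList(nn.Linear(state_dim, 1) for _ in range(num_actions))
        self.value = nn.Linear(state_dim, 1)

    def q_values(self, state, depth=None):
        depth = self.depth if depth is None else depth
        qs = []
        for a in range(self.num_actions):
            next_state = torch.tanh(self.transition[a](state))   # learned abstract transition
            r = self.reward[a](state)
            if depth == 1:
                backup = self.value(next_state)                  # leaf value estimate
            else:
                backup = self.q_values(next_state, depth - 1).max(dim=-1, keepdim=True).values
            qs.append(r + self.gamma * backup)
        return torch.cat(qs, dim=-1)  # (batch, num_actions)

model = TreeQNSketch()
print(model.q_values(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```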

https://arxiv.org/abs/1702.08360 Neural Map: Structured Memory for Deep Reinforcement Learning

More recent work (Oh et al., 2016) has gone beyond these architectures by using memory networks, which allow more sophisticated addressing schemes over the past k frames. But even these architectures are unsatisfactory because they are limited to remembering information only from the last k frames. In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.
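
The distinctive piece is the spatially indexed write: the memory is a C x H x W tensor and only the cell under the agent's current position is updated. A toy PyTorch sketch of such a write operator is below; the names and dimensions are assumptions, and the paper's read and context operations are omitted.

```python
# Toy sketch of a Neural-Map-style position-indexed write operator.
import torch
import torch.nn as nn

class NeuralMapWrite(nn.Module):
    def __init__(self, feat_dim=32, map_size=(8, 8)):
        super().__init__()
        self.write = nn.Linear(2 * feat_dim, feat_dim)
        self.memory = torch.zeros(feat_dim, *map_size)  # C x H x W memory image

    def step(self, agent_state, position):
        x, y = position
        old_cell = self.memory[:, y, x]
        new_cell = torch.tanh(self.write(torch.cat([agent_state, old_cell])))
        memory = self.memory.clone()
        memory[:, y, x] = new_cell      # write only under the agent's position
        self.memory = memory.detach()   # toy bookkeeping; the paper keeps the write differentiable
        return memory

nm = NeuralMapWrite()
print(nm.step(torch.randn(32), position=(3, 5)).shape)  # torch.Size([32, 8, 8])
```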

https://arxiv.org/abs/1802.04697v1 Learning to Search with MCTSnets

Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.

https://openreview.net/forum?id=Bk9zbyZCZ Neural Map: Structured Memory for Deep Reinforcement Learning

In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.

https://openreview.net/pdf?id=H1dh6Ax0Z TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions.

https://arxiv.org/abs/1804.00645 Universal Planning Networks

UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach new target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. We were able to achieve successful transfer of visuomotor planning strategies across robots with significantly different morphologies and actuation capabilities.
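
The inner loop can be pictured as ordinary gradient descent on an action sequence through a learned latent forward model, minimizing the distance between the unrolled final latent and the goal latent. A hedged PyTorch sketch with stand-in encoder outputs and forward model (none of the names or dimensions are from the paper's code):

```python
# Sketch of plan-by-gradient-descent over an action sequence in latent space.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 16, 4, 5
forward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 64), nn.Tanh(),
                              nn.Linear(64, latent_dim))

start_latent = torch.randn(latent_dim)   # stand-in encoder output for the current observation
goal_latent = torch.randn(latent_dim)    # stand-in encoder output for the goal image

actions = torch.zeros(horizon, action_dim, requires_grad=True)
optimizer = torch.optim.SGD([actions], lr=0.1)

for _ in range(100):                     # inner planning loop
    state = start_latent
    for t in range(horizon):
        state = forward_model(torch.cat([state, actions[t]]))
    loss = ((state - goal_latent) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())  # distance of the final unrolled latent to the goal latent
```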

https://arxiv.org/abs/1805.11199 Value Propagation Networks

We present Value Propagation (VProp), a parameter-efficient differentiable planning module built on Value Iteration which can successfully be trained using reinforcement learning to solve unseen tasks, has the capability to generalize to larger map sizes, and can learn to navigate in dynamic environments. Furthermore, we show that the module enables learning to plan when the environment also includes stochastic elements, providing a cost-efficient learning system to build low-level size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of several different sizes, and on a StarCraft navigation scenario, with more complex dynamics, and pixels as input.

https://arxiv.org/pdf/1802.04697.pdf Learning to Search with MCTSnets

https://web.stanford.edu/~jaywhang/soorl.pdf Strategic Object Oriented Reinforcement Learning

We compare different exploration strategies in a model-based setting in which exact planning is impossible. Additionally, we test our approach on Pitfall!, perhaps the hardest Atari game, and achieve significantly improved exploration and performance over prior methods.

https://icml.cc/Conferences/2018/Schedule?showEvent=2659 Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

We propose a method for hierarchical reinforcement learning that combines representation learning of trajectories with model-based planning in a continuous latent space of behaviors. We describe how to train such a model and use it for long-horizon planning, as well as for exploration.

https://arxiv.org/abs/1807.07665v1 Multitask Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies

https://arxiv.org/abs/1807.09341 Learning Plannable Representations with Causal InfoGAN

Our framework learns a generative model of sequential observations, where the generative process is induced by a transition in a low-dimensional planning model plus additional noise.

https://arxiv.org/abs/1807.04742 Visual Reinforcement Learning with Imagined Goals

In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised “practice” phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. https://bair.berkeley.edu/blog/2018/09/06/rig/
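
The retroactive relabeling step is close in spirit to hindsight experience replay in the learned latent space: stored transitions are relabeled with latent states actually reached later in the same trajectory, and the reward becomes a negative latent distance. A small NumPy sketch, with illustrative function names and data layout:

```python
# Sketch of retroactive goal relabeling with latent goals and distance-based reward.
import numpy as np

def latent_reward(latent_state, latent_goal):
    return -np.linalg.norm(latent_state - latent_goal)

def relabel_trajectory(trajectory, rng=None):
    # trajectory: list of (latent_state, action, next_latent_state, latent_goal)
    rng = rng or np.random.default_rng(0)
    relabeled = []
    for t, (z, a, z_next, goal) in enumerate(trajectory):
        future = trajectory[rng.integers(t, len(trajectory))][2]  # a latent reached later on
        relabeled.append((z, a, z_next, future, latent_reward(z_next, future)))
    return relabeled

traj = [(np.random.randn(4), 0, np.random.randn(4), np.random.randn(4)) for _ in range(5)]
print(len(relabel_trajectory(traj)))  # 5 relabeled transitions
```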

https://arxiv.org/abs/1809.02378v1 Monte Carlo Tree Search with Scalable Simulation Periods for Continuously Running Tasks

We extended the Hierarchical Optimistic Optimization applied to Trees (HOOT) method to adapt our approach to environments with a continuous decision space. We evaluated our approach on environments with a continuous decision space through OpenAI Gym's Pendulum and Continuous Mountain Car environments, and on environments with a discrete action space through the Arcade Learning Environment (ALE) platform. The evaluation results show that, with variable simulation times, the proposed approach outperforms conventional MCTS in the evaluated continuous decision space tasks and improves the performance of MCTS in most of the ALE tasks.

https://openreview.net/forum?id=BJemQ209FQ Learning to Navigate the Web

https://github.com/codeaudit/15_by_15_AlphaGomoku

https://arxiv.org/abs/1811.00090 SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning

We propose Symbolic Deep Reinforcement Learning (SDRL), a framework that can handle both high-dimensional sensory inputs and symbolic planning. Task-level interpretability is enabled by relating symbolic actions to options. This framework features a planner-controller-meta-controller architecture, which takes charge of subtask scheduling, data-driven subtask learning, and subtask evaluation, respectively. The three components cross-fertilize each other and eventually converge to an optimal symbolic plan along with the learned subtasks, bringing together the advantages of long-term planning capability with symbolic knowledge and end-to-end reinforcement learning directly from high-dimensional sensory input. Experimental results validate the interpretability of subtasks, along with improved data efficiency compared with state-of-the-art approaches.

https://danijar.com/project/planet/ Learning Latent Dynamics for Planning from Pixels