**Name** Neural Decision Tree

**Intent**

Embed decision tree nodes as layers

**Motivation**

How can we leverage decision trees in constructing a deep network?

**Sketch**

<Diagram>

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

http://arxiv.org/abs/1606.05390 Making Tree Ensembles Interpretable

https://www.microsoft.com/en-us/research/publication/deep-neural-decision-forests/

http://www.robots.ox.ac.uk/~tvg/publications/talks/deepNeuralDecisionForests.pdf

https://arxiv.org/pdf/1603.01250v1.pdf Decision Forests, Convolutional Networks and the Models in-Between

We present a systematic analysis of how to fuse conditional computation with representation learning and achieve a continuum of hybrid models with different ratios of accuracy vs. efficiency. We call this new family of hybrid models conditional networks. Conditional networks can be thought of as: i) decision trees augmented with data transformation operators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions.

Our automatically-optimized conditional architecture (green circle) is ∼5 times faster and ∼6 times smaller than NiN, with same accuracy.

Representing decision trees. Data routing functions (a.k.a. routers, in red) direct the data to one or more child nodes. Identity matrices copy the data without transforming it.

A generic conditional network. Conditional networks fuse efficient data routing with accurate data transformation in a single model. Vector concatenations are denoted with ⊕.

The conditional network used on the Imagenet experiments employs implicit data routing in the (usually expensive) convolutional layers to yield higher compute efficiency than the corresponding, unrouted deep CNN (here VGG11). Small red lines indicate “groups” of feature maps as implemented in Caffe.

Automatically learned conditional architecture for image classification in CIFAR. Both structure and parameters of this conditional network have been learned automatically via Bayesian optimization.

Computational efficiency of implicit conditional networks. (top) A standard CNN (one route). (bottom) A two-routed architecture with no explicit routers. The larger boxes denote feature maps, the smaller ones the filters. Due to branching, the depth of the second set of kernels (in yellow) changes between the two architectures. The reduction in kernel size yields fewer computations and thus higher efficiency in the branched network.

https://arxiv.org/pdf/1702.08835v1.pdf Deep Forest: Towards An Alternative to Deep Neural Networks

https://arxiv.org/pdf/1703.06217v1.pdf Deciding How to Decide: Dynamic Routing in Artificial Neural Networks

We propose and systematically evaluate three strategies for training dynamically-routed artificial neural networks: graphs of learned transformations through which different input signals may take different paths. Though some approaches have advantages over others, the resulting networks are often qualitatively similar. We find that, in dynamically-routed networks trained to classify images, layers and branches become specialized to process distinct categories of images. Additionally, given a fixed computational budget, dynamically-routed networks tend to perform better than comparable statically-routed networks.

https://arxiv.org/pdf/1702.08835v2.pdf Deep Forest: Towards an Alternative to Deep Neural Networks∗

https://openreview.net/pdf?id=ry8dvM-R- ROUTING NETWORKS: ADAPTIVE SELECTION OF NON-LINEAR FUNCTIONS FOR MULTI-TASK LEARNING

https://openreview.net/pdf?id=HJWLfGWRb MATRIX CAPSULES WITH EM ROUTING

https://arxiv.org/abs/1310.6008 Computing in permutation groups without memory

https://arxiv.org/abs/1801.01423v1 Overcoming catastrophic forgetting with hard attention to the task

An attention mask is learned concurrently to every task through stochastic gradient descent, and previous masks are exploited to constrain such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 33 to 84%. We also show that it is robust to different hyperparameter choices and that it offers a number of monitoring capabilities. The approach features the possibility to control both the stability and compactness of the learned knowledge, which we believe makes it also attractive for online learning and network compression applications.

https://arxiv.org/pdf/1803.05784.pdf Mondrian Trees

https://openreview.net/pdf?id=H1dh6Ax0Z TREEQN AND ATREEC: DIFFERENTIABLE TREE-STRUCTURED MODELS FOR DEEP REINFORCEMENT LEARNING

However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions.

https://arxiv.org/abs/1804.04241v1 Capsules for Object Segmentation

We extend the idea of convolutional capsules with locally-connected routing and propose the concept of deconvolutional capsules. Further, we extend the masked reconstruction to reconstruct the positive input class. The proposed convolutional-deconvolutional capsule network, called SegCaps, shows strong results for the task of object segmentation with substantial decrease in parameter space.

https://arxiv.org/abs/1711.09784v1 Distilling a Neural Network Into a Soft Decision Tree

Deep neural networks have proved to be a very effective way to perform classification tasks. They excel when the input data is high dimensional, the relationship between the input and the output is complicated, and the number of labeled training examples is large. But it is hard to explain why a learned network makes a particular classification decision on a particular test case. This is due to their reliance on distributed hierarchical representations. If we could take the knowledge acquired by the neural net and express the same knowledge in a model that relies on hierarchical decisions instead, explaining a particular decision would be much easier. We describe a way of using a trained neural net to create a type of soft decision tree that generalizes better than one learned directly from the training data.

https://arxiv.org/abs/1805.05185v1 Generative Adversarial Forests for Better Conditioned Adversarial Learning

Our method embeds the powerful discriminating capabilities of a decision forest into the discriminator of a GAN. This results in a better conditioned model which learns in an extremely stable way.

https://arxiv.org/abs/1806.05780 Sample-Efficient Deep RL with Generative Adversarial Tree Search

We propose Generative Adversarial Tree Search (GATS), a sample-efficient Deep Reinforcement Learning (DRL) algorithm. While Monte Carlo Tree Search (MCTS) is known to be effective for search and planning in RL, it is often sample-inefficient and therefore expensive to apply in practice. In this work, we develop a Generative Adversarial Network (GAN) architecture to model an environment's dynamics and a predictor model for the reward function. We exploit collected data from interaction with the environment to learn these models, which we then use for model-based planning. During planning, we deploy a finite depth MCTS, using the learned model for tree search and a learned Q-value for the leaves, to find the best action. We theoretically show that GATS improves the bias-variance trade-off in value-based DRL. Moreover, we show that the generative model learns the model dynamics using orders of magnitude fewer samples than the Q-learner. In non-stationary settings where the environment model changes, we find the generative model adapts significantly faster than the Q-learner to the new environment.

https://arxiv.org/abs/1806.07857 RUDDER: Return Decomposition for Delayed Rewards

https://arxiv.org/abs/1806.09550v1 Inference Trees: Adaptive Inference with Exploration

build on ideas from Monte Carlo tree search to perform adaptive sampling in a manner that balances exploration with exploitation, ensures consistency, and alleviates pathologies in existing adaptive methods. ITs adaptively sample from hierarchical partitions of the parameter space, while simultaneously learning these partitions in an online manner. This enables ITs to not only identify regions of high posterior mass, but also maintain uncertainty estimates to track regions where significant posterior mass may have been missed. ITs can be based on any inference method that provides a consistent estimate of the marginal likelihood. They are particularly effective when combined with sequential Monte Carlo, where they capture long-range dependencies and yield improvements beyond proposal adaptation alone.

https://arxiv.org/abs/1807.06699v1 Adaptive Neural Trees

(i) faster inference via conditional computation, (ii) increased interpretability via hierarchical clustering e.g. learning meaningful class associations, such as separating natural vs. man-made objects, and (iii) a mechanism to adapt the architecture to the size and complexity of the training dataset.

https://arxiv.org/abs/1808.08692v1 Generalized Capsule Networks with Trainable Routing Procedure

https://www.nature.com/articles/ncomms11061 Dynamic information routing in complex networks

Flexible information routing fundamentally underlies the function of many biological and artificial networks. Yet, how such systems may specifically communicate and dynamically route information is not well understood. Here we identify a generic mechanism to route information on top of collective dynamical reference states in complex networks. Switching between collective dynamics induces flexible reorganization of information sharing and routing patterns, as quantified by delayed mutual information and transfer entropy measures between activities of a network’s units. We demonstrate the power of this mechanism specifically for oscillatory dynamics and analyse how individual unit properties, the network topology and external inputs co-act to systematically organize information routing. For multi-scale, modular architectures, we resolve routing patterns at all levels. Interestingly, local interventions within one sub-network may remotely determine nonlocal network-wide communication. These results help understanding and designing information routing patterns across systems where collective dynamics co-occurs with a communication function.

https://arxiv.org/abs/1807.06699 Adaptive Neural Trees

We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). ANTs allow increased interpretability via hierarchical clustering, e.g., learning meaningful class associations, such as separating natural vs. man-made objects. We demonstrate this whilst achieving over 99% and 90% accuracy on the MNIST and CIFAR-10 datasets. Furthermore, ANT optimisation naturally adapts the architecture to the size and complexity of the training data.