Heuristic Credit Assignment







Known Uses

Related Patterns


Not sure whether non-gradient-descent or second-order methods should be in scope.

Examples: Feedback Alignment, RMSProp, NEAT


http://arxiv.org/pdf/1506.06472v1.pdf The Ebb and Flow of Deep Learning: a Theory of Local Learning

duplicate of credit_assignment


MuProp: Unbiased Backpropagation For Stochastic Neural Networks

Because backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance with a control variate based on the first-order Taylor expansion of a mean-field network.
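The control-variate idea behind MuProp can be sketched on a single Bernoulli variable. The constant baseline below is a simplified stand-in for MuProp's Taylor-expansion term, and `f` is an arbitrary toy objective:

```python
import numpy as np

# Score-function gradient of E_{b~Bern(theta)}[f(b)] with and without a
# baseline (control variate). A constant baseline keeps the estimator
# unbiased but reduces its variance -- the effect MuProp exploits with a
# baseline built from a mean-field Taylor expansion.
rng = np.random.default_rng(0)
theta = 0.3
f = lambda b: (b - 0.45) ** 2

def score_grad(n, baseline=0.0):
    b = (rng.random(n) < theta).astype(float)
    dlogp = b / theta - (1 - b) / (1 - theta)  # d/dtheta log Bern(b; theta)
    return (f(b) - baseline) * dlogp

g_plain = score_grad(100_000)
g_base = score_grad(100_000, baseline=f(theta))  # baseline: f at the mean
grad_true = f(1.0) - f(0.0)  # d/dtheta [theta*f(1) + (1-theta)*f(0)] = 0.1
```

Both estimators average to the true gradient; the baselined one does so with visibly lower variance.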


http://arxiv.org/abs/1506.05254 Gradient Estimation Using Stochastic Computation Graphs

We introduce the formalism of stochastic computation graphs—directed acyclic graphs that include both deterministic functions and conditional probability distributions—and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm.

http://deliprao.com/archives/187 https://arxiv.org/abs/1608.05343 Decoupled Neural Interfaces using Synthetic Gradients


https://arxiv.org/pdf/1609.01596v1.pdf Direct Feedback Alignment Provides Learning in Deep Neural Networks

In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error backpropagation.
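A minimal numpy sketch of direct feedback alignment on a toy regression task (the architecture, weight scales, and learning rate are illustrative choices, not the paper's setup): each hidden layer receives the output error through its own fixed random matrix instead of through the transposed forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_h, n_out = 4, 16, 1
W1 = rng.normal(0.0, 0.5, (n_in, n_h))
W2 = rng.normal(0.0, 0.5, (n_h, n_h))
W3 = rng.normal(0.0, 0.5, (n_h, n_out))
B1 = rng.normal(0.0, 0.5, (n_out, n_h))  # fixed random feedback, never trained
B2 = rng.normal(0.0, 0.5, (n_out, n_h))

X = rng.normal(size=(64, n_in))
y = X[:, :1] * X[:, 1:2]                 # toy regression target

def mse():
    h1 = np.tanh(X @ W1)
    h2 = np.tanh(h1 @ W2)
    return float(np.mean((h2 @ W3 - y) ** 2))

loss_before = mse()
lr = 0.05
for _ in range(500):
    h1 = np.tanh(X @ W1)
    h2 = np.tanh(h1 @ W2)
    e = h2 @ W3 - y                      # output error
    d2 = (e @ B2) * (1 - h2 ** 2)        # error sent directly via B2
    d1 = (e @ B1) * (1 - h1 ** 2)        # error sent directly via B1
    W3 -= lr * h2.T @ e / len(X)
    W2 -= lr * h1.T @ d2 / len(X)
    W1 -= lr * X.T @ d1 / len(X)

loss_after = mse()
```

Note the backward pass never uses W2.T or W3.T; only the last-layer update is a true gradient step.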

https://arxiv.org/abs/1602.05179v4 Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation

This algorithm involves only one kind of neural computation both for the first phase (when the prediction is made) and the second phase (after the target is revealed) of training. Contrary to backpropagation in feedforward networks, there is no need for special computation in the second phase of our learning algorithm.
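The two-phase estimate can be checked in closed form on a deliberately tiny model (one weight and a quadratic energy, both chosen for illustration): nudging the fixed point with a small β and differencing the local quantity ∂E/∂w recovers the loss gradient.

```python
# Toy Equilibrium Propagation check on a one-weight "network":
#   energy E(w, s) = 0.5*s^2 - s*w*x,   cost C(s) = 0.5*(s - y)^2.
# Free phase: s minimizes E; nudged phase: s minimizes E + beta*C.
x, y, w, beta = 2.0, 1.0, 0.3, 1e-4

s_free = w * x                              # argmin_s E
s_nudge = (w * x + beta * y) / (1 + beta)   # argmin_s E + beta*C

dE_dw = lambda s: -s * x                    # same local quantity in both phases
grad_est = (dE_dw(s_nudge) - dE_dw(s_free)) / beta
grad_true = x * (w * x - y)                 # analytic dC/dw at the free fixed point
```

As β → 0 the finite difference of ∂E/∂w across the two phases converges to the true gradient, with no separate backward computation.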

https://arxiv.org/abs/1610.02306 Optimization of Convolutional Neural Network using Microcanonical Annealing Algorithm

The performance of the proposed method is tested on the MNIST and CIFAR-10 datasets. Although the MNIST experiments show an increase in computation time (1.02x - 1.38x), the proposed method considerably enhances the performance of the original CNN (by up to 4.60%). On the CIFAR-10 dataset, the current state of the art is 96.53% using fractional pooling, while the proposed method achieves 99.14%.



Dense-Sparse-Dense (DSD) training is a novel training method that first regularizes the model through sparsity-constrained optimization, then improves prediction accuracy by recovering and retraining the pruned weights. At test time, the final model produced by DSD training has the same architecture and dimensions as the original dense model, and DSD training doesn't incur any inference overhead.
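A least-squares sketch of the three DSD phases (the toy regression problem and the magnitude-pruning threshold are my assumptions; the paper applies this to deep networks):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                    # mostly-sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=200)

def train(w, mask, steps=300, lr=0.05):
    for _ in range(steps):
        g = X.T @ (X @ w - y) / len(X)           # least-squares gradient
        w = (w - lr * g) * mask                  # pruned weights stay at zero
    return w

w = train(np.zeros(10), np.ones(10))             # 1) dense phase
thresh = np.sort(np.abs(w))[len(w) // 2]
mask = (np.abs(w) >= thresh).astype(float)       # 2) prune half by magnitude,
w = train(w * mask, mask)                        #    retrain under the mask
w = train(w, np.ones(10))                        # 3) re-dense: recover pruned weights
loss = float(np.mean((X @ w - y) ** 2))
```

The final model is dense again, matching the observation that DSD adds no inference overhead.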

https://arxiv.org/pdf/1505.05401.pdf A Max-Sum algorithm for training discrete neural networks

The algorithm we present performs as well as BP on binary perceptron learning problems, and may be better suited to address the problem on fully-connected two-layer networks, since inherent symmetries in two-layer networks are naturally broken using the MS approach.


http://www.iro.umontreal.ca/~bengioy/talks/NIPSAutoDiffWorkshop10dec2016.key.pdf Credit Assignment: Beyond Backpropagation

https://arxiv.org/pdf/1612.02734v1.pdf Learning in the Machine: Random Backpropagation and the Learning Channel

To better understand random backpropagation, we first connect it to the notions of local learning and the learning channel. Through this connection, we derive several alternatives to RBP, including skipped RBP (SRBP), adaptive RBP (ARBP), sparse RBP, and their combinations (e.g. ASRBP) and analyze their computational complexity. We then study their behavior through simulations using the MNIST and CIFAR-10 benchmark datasets. These simulations show that most of these variants work robustly, almost as well as backpropagation, and that multiplication by the derivatives of the activation functions is important.

https://arxiv.org/pdf/1612.05596.pdf Event-driven Random Backpropagation: Enabling Neuromorphic Deep Learning Machines

https://arxiv.org/abs/1703.07950 Failures of Deep Learning

Failure due to Non-Informative Gradients - we discuss a family of problems for which with high probability, at any fixed point, the gradient, ∇Fh(w), will be essentially the same regardless of the underlying target function h.

https://arxiv.org/pdf/1608.03983.pdf SGDR: Stochastic Gradient Descent with Warm Restarts https://github.com/loshchil/SGDR

In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks.
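The restart schedule itself is simple; a sketch of the cosine-annealed learning rate with cycles that grow by a factor `t_mult` (the parameter names are mine, not the paper's):

```python
import math

def sgdr_lr(step, eta_min=0.0, eta_max=0.1, t0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR-style).

    step is the 0-based epoch index; t0 is the length of the first cycle
    and each subsequent cycle is t_mult times longer."""
    t_i, t_cur = t0, step
    while t_cur >= t_i:          # locate the current restart cycle
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

The rate decays from `eta_max` toward `eta_min` within each cycle and snaps back to `eta_max` at every restart (steps 10, 30, 70, ... here).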

https://arxiv.org/pdf/1705.07505v1.pdf Annealed Generative Adversarial Networks

We introduce a novel framework for adversarial training where the target distribution is annealed between the uniform distribution and the data distribution. We posited a conjecture that learning under continuous annealing in the nonparametric regime is stable irrespective of the divergence measures in the objective function and, as a corollary, proposed an algorithm dubbed β-GAN. In this framework, the fact that the initial support of the generative network is the whole ambient space, combined with annealing, is key to balancing the minimax game. In our experiments on synthetic data, MNIST, and CelebA, β-GAN with a fixed annealing schedule was stable and did not suffer from mode collapse.

https://arxiv.org/abs/1705.07795v2 Backprop without Learning Rates Through Coin Betting

We reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms.

https://arxiv.org/abs/1706.04859 Sobolev Training for Neural Networks

This paper introduces Sobolev Training for neural networks, a method for incorporating target derivatives, in addition to the target values, while training. By optimising neural networks to approximate not only the function's outputs but also the function's derivatives, we encode additional information about the target function within the parameters of the neural network.
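A linear-algebra toy of the idea: fitting a cubic to two samples of sin(x) from values alone is underdetermined, while adding the target derivatives (Sobolev-style) pins the slope down. The setup is illustrative; the paper trains deep networks with derivative losses.

```python
import numpy as np

# Fit p(x) = c0 + c1*x + c2*x^2 + c3*x^3 to sin(x) at x = -1, 1.
# Values alone underdetermine the cubic (the min-norm fit has a wildly
# wrong slope); adding target derivatives cos(x) fixes it.
x = np.array([-1.0, 1.0])
V = np.vander(x, 4, increasing=True)                   # value rows [1, x, x^2, x^3]
D = np.stack([np.zeros_like(x), np.ones_like(x), 2 * x, 3 * x**2], axis=1)

c_val, *_ = np.linalg.lstsq(V, np.sin(x), rcond=None)  # values only
c_sob, *_ = np.linalg.lstsq(np.vstack([V, D]),
                            np.concatenate([np.sin(x), np.cos(x)]),
                            rcond=None)                # values + derivatives

slope_val = D @ c_val            # fitted derivative at the sample points
slope_sob = D @ c_sob
err_val = float(np.max(np.abs(slope_val - np.cos(x))))
err_sob = float(np.max(np.abs(slope_sob - np.cos(x))))
```

The Sobolev fit reduces to Hermite interpolation here: the derivative error at the sample points drops to numerical zero.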


Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.


However, since gradient descent is not applicable to hard-threshold functions, it is not clear how to learn them in a principled way. We address this problem by observing that setting targets for hard-threshold hidden units in order to minimize loss is a discrete optimization problem, and can be solved as such. The discrete optimization goal is to find a set of targets such that each unit, including the output, has a linearly separable problem to solve. Given these targets, the network decomposes into individual perceptrons, which can then be learned with standard convex approaches. Based on this, we develop a recursive mini-batch algorithm for learning deep hard-threshold networks that includes the popular but poorly justified straight-through estimator as a special case. Empirically, we show that our algorithm improves classification accuracy in a number of settings, including for AlexNet and ResNet-18 on ImageNet, when compared to the straight-through estimator.
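The straight-through estimator mentioned above can be sketched in a few lines on linearly separable toy data (this is the STE baseline, not the paper's target-setting algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # separable labels

w = rng.normal(size=3)

def predict(w):
    return (X @ w > 0).astype(float)     # hard threshold: zero gradient a.e.

lr = 0.1
for _ in range(200):
    e = predict(w) - y                   # nonzero only on misclassified points
    # STE: pretend d(step)/d(pre-activation) = 1, so the grad wrt w is X^T e
    w -= lr * X.T @ e / len(X)

accuracy = float(np.mean(predict(w) == y))
```

On this separable problem the STE update reduces to a batch perceptron rule and quickly drives the error down.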

https://arxiv.org/pdf/1703.07370.pdf REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models

https://arxiv.org/abs/1802.00027 A New Backpropagation Algorithm without Gradient Descent

https://arxiv.org/abs/1804.08682v1 Boltzmann Encoded Adversarial Machines

Restricted Boltzmann Machines (RBMs) are a class of generative neural network that are typically trained to maximize a log-likelihood objective function. We argue that likelihood-based training strategies may fail because the objective does not sufficiently penalize models that place a high probability in regions where the training data distribution has low probability. To overcome this problem, we introduce Boltzmann Encoded Adversarial Machines (BEAMs). A BEAM is an RBM trained against an adversary that uses the hidden layer activations of the RBM to discriminate between the training data and the probability distribution generated by the model. We present experiments demonstrating that BEAMs outperform RBMs and GANs on multiple benchmarks.

https://arxiv.org/abs/1808.04873 Generalization of Equilibrium Propagation to Vector Field Dynamics

We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method generalizes Equilibrium Propagation to vector field dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that our algorithm optimizes the objective function.