**This is an old revision of the document!**

**Name** Stochastic Gradient Descent

**Intent**
Train the Model by taking random samples from the training set.

**Motivation**
How can we update the Model without considering the entire training set?

**Sketch**

<Diagram>

**Discussion**

The concept of gradient descent is simple in that we calculate the Fitness function and from there adjust our model weights accordingly. This direction of the adjustment is called the gradient. However, calculating the gradient can be a massive considering that all the training data is required. The true gradient depends on every single instance of training data. Now consider that DL gains the most with the more data we can use. To overcome this difficulty, we take a shortcut, in that we only estimate the gradient a random sample at a time. Considering that this random sample may be orders of magnitude smaller than the entire training set, one has to wonder if estimating in this manner should even work at all. The randomness is very import because what we've found in practice is that if the samples aren't random enough, then the method does not work at all.

We sample a small fraction of training data, compute the Fitness function for that sample, compute the gradient for that sample and then use that as the direction to use to update our Model. Of course, it will not be always in the correct direction and often it will lead to a worse Fitness function. However if we do this for so many times then perhaps the method compensates for the errors. This method requires less computational resources as computing the true gradient however in contrast to just performing a single gradient adjustment, we have to make this adjustment many, many times. This method is known as Stochastic Gradient Descent and serves as the core of DL. Stochastic Gradient Descent (SGD) scales very well in practice with both big training data and with large Models [SGD2014].

Normalizing the training data is essential for SGD. Random initialization is also a very important step for SGD. There are additional knobs in how we calculated the estimated gradient that aids in faster training. The first such knob is we can adjust the “momentum” of the descent. What we do is leverage the information that we've accumulated in previous calculations. We can for instance calculate the average of the previous gradient calculations and use it to influence the current gradient calculation. This has analogies mathematically with the concept of momentum and it just so happens that this method works well in practice.

Another knob that we can tweak is that of learning rate decay. We could influence how small a gradient step we make per iteration. It does appear that in practice that it is effect to take smaller and smaller steps as you train your data. There are many ways in practice that you can adjust the learning rate the key point is that it becomes lower over time.

Deep Learning networks are built by stacking layers on top of each other. To calculate the gradient across multiple layers, one uses the “chain rule” that we learn from basic calculus. The chain rule is simply defined such that when you have a function composed into another function, then the derivative is simply the product of the derivative of each function. So you can basically compute the derivative for each layer which is dependent on the layers of derivatives that precede it. In practice, most DL frameworks have a symbolic auto-differentiation capability so that you don't have to calculate the derivative by hand.

So when training data is input to the machine you have it flow up the stack of layers until it reaches its the top where it resolves into a classification. To compute the gradient, you need a flow that goes in the other direction, backwards through the network. The gradients are calculated via chain rule and auto-differentiation. The first phase is called forward propagation, while the backward phase is known as back propagation. The back propagation phase is where the model weights are updated based on its current weight and the gradient. This happens for many iterations.

The entire success of Deep Learning hinges on the surprising effectiveness of SGD to train networks. It is surprising because the objective function of neural networks are all to often non-convex. Non-convex optimization in theory are extremely difficult. Yet, without any explanation, SGD seems to be very good at training extremely large deep neural networks. In theory, the the problem of training neural networks is NP-hard. There is a proof that shows the problem of finding the best neural network for a network with just three hidden units is NP-hard! We are just fortunate to be able to avoid the theory given SGD is so effective in practice.

source: Leon Bottou

**Known Uses**

**Related Patterns**

<Diagram>

- Entropy
- Random Orthoginal Initialization
- Dissipative Adaptation
- Distributed Model
- Hierarchical Model
- Scale Invariance
- Fitness Function
- Differentiable Layer
- Activation Function
- Convolution
- Recurrent Layer
- Gain
- In Layer Transform
- Structured Matrix
- Layer Reversibility (Reversible Layer)
- Network in Network (Recursive Layer)
- Recurrent Network
- Mutable Layer
- Dimensionless Features
- Regularization
- Relaxed Backpropagation =Credit Assignment
- Natural Gradient Descent
- Transfer Learning
- Curriculum Learning
- DropOut
- Unsupervised Pretraining
- AutoEncoder
- Differential Training
- Genetic Algorithm
- Manifold Traversal
- Metric Learning

**References**

http://www.magicbroom.info/Papers/DuchiHaSi10.pdf Adagrad

http://arxiv.org/pdf/1212.5701v1.pdf Adadelta

http://arxiv.org/pdf/1412.6980v8.pdf Adan

http://www.deeplearningbook.org/contents/optimization.html

http://sebastianruder.com/optimizing-gradient-descent/

http://www.offconvex.org/2016/05/08/almostconvexitySATM/

http://www.offconvex.org/2016/03/22/saddlepoints/

http://www.offconvex.org/2016/03/24/saddles-again/

http://arxiv.org/pdf/1406.2572.pdf Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

http://arxiv.org/pdf/1504.07278.pdf

A control theoretic approach is presented in this paper for both batch and instantaneous updates of weights in feed-forward neural networks. The popular Hamilton-Jacobi- Bellman (HJB) equation has been used to generate an optimal weight update law. The remarkable contribution in this paper is that closed form solutions for both optimal cost and weight update can be achieved for any feed-forward network using HJB equation in a simple yet elegant manner.

https://arxiv.org/pdf/1404.1089.pdf

In this work a method to solve the Hamilton Jacobi Bellman equation for nonlinear, stochastic systems with complexity the scales linearly with dimension has been proposed.

http://www.kdnuggets.com/2016/05/concise-overview-model-fitting-methods.html

http://www.offconvex.org/2016/04/04/markov-chains-dynamical-systems/

Our viewpoint connects evolutionary Markov chains, nature’s algorithms, with stochastic descent methods, popular in machine learning and optimization, and the readers interested in the latter might benefit from our techniques.

http://arxiv.org/abs/1408.2923

http://www.argmin.net/2016/05/31/mechanics-of-lagrangians

http://www.argmin.net/2016/05/18/mates-of-costate/

A few years ago, Steve Wright introduced me to an older method from optimal control, called the method of adjoints, which is equivalent to backpropagation.

[SGD2014] http://arxiv.org/abs/1412.6544v6 Qualitatively characterizing neural network optimization problems

Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.

The reason for the success of SGD on a wide variety of tasks is now clear: these tasks are relatively easy to optimize. The primary difficulty is finding the correct search direction. While this task is still difficult, it is not nearly as difficult as escaping sequences of local minima or threading a winding, high-dimensional ravine.

http://arxiv.org/abs/1511.06807 http://deliprao.com/archives/153 http://deliprao.com/archives/156 http://web.mit.edu/6.435/www/Gelfand93.pdf http://arxiv.org/abs/1503.02101 Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition

We analyze stochastic gradient descent for optimizing non-convex functions. In many cases for non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates are trapped in saddle points. In this paper we identify strict saddle property for non-convex problem that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has strict saddle property. As a result we get the first online algorithm for orthogonal tensor decomposition with global convergence guarantee.

http://arxiv.org/pdf/1607.04917v1.pdf Piecewise convexity of artificial neural networks

http://arxiv.org/abs/1605.06444v2 Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

The intuitive interpretation of the method is also quite straightforward: a set of coupled systems is less likely to get trapped in narrow minima, and will instead be attracted to wide regions of good (and mostly equivalent) configurations, thus naturally implementing a kind of robustness to details of the configurations.

https://arxiv.org/abs/1609.04836 On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions – and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

http://stanford.edu/~imit/tuneyourmomentum

https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network.html

https://arxiv.org/abs/1605.07110 Deep Learning without Poor Local Minima

for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist “bad” saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.

https://arxiv.org/abs/1412.6615 Explorations on high dimensional landscapes

Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.

http://moderndescartes.com/essays/backpropagation_pascal_triangle

https://arxiv.org/pdf/1701.07974v1.pdf Reinforced backpropagation improves test performance of deep networks: a toy-model

https://arxiv.org/pdf/1611.06310.pdf LOCAL MINIMA IN TRAINING OF DEEP NETWORKS

The overwhelming amount of empirical evidence points towards learning being well behaved in practice. We argue that the only way to reconcile these different observations is if the well-behaved learning dynamics are local and conditioned on data structure, initialization and maybe other architectural choices. One can imagine a continuum from the very specific, where every detail of the setup is important to attain good learning dynamics, to the generic, where learning is globally well behaved regardless of dataset or initialization. We believe that a crucial and important step forward in the theoretical study of the neural networks can be made by identifying where exactly this class of models falls on this continuum. In particular, what are the most generic set of constraints that need to be respected in order to attain the good behaviour. Our results focus on constructing counterexamples which result in a bad learning dynamics. While this does not directly lead to sufficient conditions for well-behaved systems, we hope that by carving out the space of possible conditions we move forward towards our goal.

https://arxiv.org/abs/1609.01037 Distribution-specific hardness of learning neural networks

To formally explain the practical success of neural network learning, it seems that one would need to employ a careful combination of assumptions on both the input distribution and the target function. To prove our results, we develop some tools which may be of independent interest, such as extension of Fourier-based techniques for proving hardness in the statistical queries framework \cite{blum1994weakly}, from the Boolean cube to Euclidean space.

http://www.stat.cmu.edu/~acthomas/724/Geman.pdf Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images

https://arxiv.org/pdf/1703.04782v1.pdf Online Learning Rate Adaptation with Hypergradient Descent

Our method works by dynamically updating the learning rate during optimization using the gradient with respect to the learning rate of the update rule itself. Computing this “hypergradient” needs little additional computation, requires only one extra copy of the original gradient to be stored in memory, and relies upon nothing more than what is provided by reverse-mode automatic differentiation.

https://arxiv.org/abs/1704.02708 Distribution-free Evolvability of Vector Spaces: All it takes is a Generating Set

Evolution appears to occur in a mean-divergence model reminiscent of Markowitz mean-variance model for portfolio selection, and the risk-return efficient frontier of evolution shows an interesting pattern: when far from the optimum, the mutator always has access to mutations close to the efficient frontier. Toy experiments in supervised and unsupervised learning display promising directions for this scheme to be used as a (new) provable gradient-free stochastic optimisation algorithm.

https://arxiv.org/pdf/1704.04289.pdf Stochastic Gradient Descent as Approximate Bayesian Inference

https://arxiv.org/abs/1611.01505v2 Improving Stochastic Gradient Descent with Feedback

The method tracks the relative changes in the objective function with a running average, and uses it to adaptively tune the learning rate in stochastic gradient descent.

https://arxiv.org/pdf/1705.08292v1.pdf The Marginal Value of Adaptive Gradient Methods in Machine Learning

We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance.

First, with the same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive methods on the development/test set across all evaluated models and tasks. This is true even when the adaptive methods achieve the same training loss or lower than non-adaptive methods.

Second, adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set.

Third, the same amount of tuning was required for all methods, including adaptive methods. This challenges the conventional wisdom that adaptive methods require less tuning.

Moreover, as a useful guide to future practice, we propose a simple scheme for tuning learning rates and decays that performs well on all deep learning tasks we studied.

https://arxiv.org/pdf/1705.10412.pdf Gradient Descent Can Take Exponential Time to Escape Saddle Points

GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points—it can find an approximate local minimizer in polynomial time.

http://cs.stanford.edu/~zjian/project/YellowFin/

TLDR; Hand-tuned momentum SGD is competitive with state-of-the-art adaptive methods, like Adam. We introduce YellowFin, an automatic tuner for the hyperparameters of momentum SGD. YellowFin trains models such as large ResNets and LSTMs in fewer iterations than the state of the art. It performs even better in asynchronous settings via an on-the-fly momentum adaptation scheme that uses a novel momentum measurement component along with a negative-feedback loop mechanism.

https://hackernoon.com/training-your-deep-model-faster-and-sharper-e85076c3b047

https://arxiv.org/abs/1710.11029 Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such “out-of-equilibrium” behavior is a consequence of the fact that the gradient noise in SGD is highly non-isotropic; the covariance matrix of mini-batch gradients has a rank as small as 1% of its dimension. We provide extensive empirical validation of these claims, proven in the appendix.

https://arxiv.org/abs/1802.05074 L4: Practical loss-based stepsize adaptation for deep learning

It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss.

https://arxiv.org/abs/1803.05407v1 Averaging Weights Leads to Wider Optima and Better Generalization

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model.