  
It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss.
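A minimal sketch of that idea, assuming a simple first-order model of the loss decrease: a step of size eta along the gradient g reduces the loss by roughly eta*||g||^2, so choosing eta = alpha*L(w)/||g||^2 makes the predicted decrease a fixed fraction alpha of the current loss. The quadratic toy objective and the value of alpha below are illustrative choices, not taken from the referenced method.

<code python>
import numpy as np

def loss(w):
    # illustrative quadratic loss, a stand-in for a real objective
    return 0.5 * np.sum(w ** 2)

def grad(w):
    return w

def loss_rescaled_step(w, alpha=0.15, eps=1e-12):
    # First-order prediction: L(w - eta*g) ~= L(w) - eta*||g||^2,
    # so eta = alpha * L(w) / ||g||^2 removes a fixed fraction alpha
    # of the current loss per step (in the predicted sense).
    g = grad(w)
    eta = alpha * loss(w) / (np.dot(g, g) + eps)
    return w - eta * g

w = np.random.randn(10)
for t in range(50):
    w = loss_rescaled_step(w)
print("final loss:", loss(w))
</code>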

https://arxiv.org/abs/1803.05407v1 Averaging Weights Leads to Wider Optima and Better Generalization

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model.
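A minimal sketch of the SWA procedure as described in the abstract, assuming a flattened weight vector and a hypothetical sgd_epoch helper that stands in for ordinary SGD training; it only shows where the running average enters, not the authors' implementation.

<code python>
import numpy as np

def sgd_epoch(w, lr):
    """Hypothetical helper standing in for one epoch of SGD on the real
    model and loss; here it just perturbs the weights for illustration."""
    return w - lr * np.random.randn(*w.shape) * 0.01

w = np.random.randn(1000)       # current SGD iterate (all weights, flattened)
w_swa, n_avg = None, 0

for epoch in range(100):
    w = sgd_epoch(w, lr=0.05)   # constant learning rate; a cyclical schedule also works
    if epoch >= 75:             # start averaging once SGD has reached a good region
        if w_swa is None:
            w_swa, n_avg = w.copy(), 1
        else:
            n_avg += 1
            w_swa += (w - w_swa) / n_avg   # running mean of the collected iterates

# w_swa is the model used at test time; with batch normalization its running
# statistics would have to be recomputed under the averaged weights.
</code>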

https://openreview.net/forum?id=rJTutzbA- On the insufficiency of existing momentum schemes for Stochastic Optimization https://github.com/rahulkidambi/AccSGD

https://arxiv.org/abs/1802.10026v2 Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

http://mdolab.engin.umich.edu/sites/default/files/Martins2003CSD.pdf The Complex-Step Derivative Approximation
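The complex-step approximation itself is compact enough to show directly: for a real-analytic f, f'(x) ≈ Im(f(x + ih))/h, which has no subtractive cancellation and therefore tolerates extremely small h. A minimal sketch follows; the example function is an illustrative choice, not tied to the paper.

<code python>
import numpy as np

def f(x):
    # any real-analytic function works; this one is just an example
    return np.exp(x) / np.sqrt(np.sin(x) ** 3 + np.cos(x) ** 3)

def complex_step_derivative(f, x, h=1e-20):
    # Im(f(x + ih)) / h approximates f'(x) without subtracting nearly
    # equal numbers, so h can be far smaller than in finite differencing.
    return np.imag(f(x + 1j * h)) / h

x = 1.5
print(complex_step_derivative(f, x))
# forward finite difference for comparison (loses accuracy as h shrinks)
print((f(x + 1e-7) - f(x)) / 1e-7)
</code>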

https://arxiv.org/abs/1810.00150 Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep learning

We empirically verify our result using deep convolutional networks and observe that the gradient stochasticity correlates more strongly with the proposed directional uniformity than with the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients are a major factor behind the behavior of SGD.
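A rough way to probe such directional statistics is the resultant length of the unit-normalized minibatch gradients, the quantity used to fit a von Mises-Fisher concentration: values near 1 mean the directions agree, small values mean they are spread nearly uniformly over the sphere. The sketch below uses synthetic gradients and only illustrates the statistic, not the paper's estimator.

<code python>
import numpy as np

def directional_resultant_length(grads):
    """grads: (num_minibatches, dim) array of minibatch gradients.
    Returns ||mean of unit-normalized gradients||, a standard proxy
    for the von Mises-Fisher concentration of their directions."""
    units = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    return np.linalg.norm(units.mean(axis=0))

rng = np.random.default_rng(0)
dim = 1000

# nearly aligned minibatch gradients -> resultant length close to 1
aligned = rng.normal(size=(64, dim)) * 0.1 + np.ones(dim)
# nearly uniform directions -> small resultant length
uniform = rng.normal(size=(64, dim))

print(directional_resultant_length(aligned))
print(directional_resultant_length(uniform))
</code>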

https://arxiv.org/abs/1810.02054 Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
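A small NumPy experiment in the spirit of this claim (the two-layer ReLU setup, width, step size, and iteration count are my own choices, not the paper's): train only the first layer of a very wide network with full-batch gradient descent and watch the training loss fall while the weights barely move from their initialization.

<code python>
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 2000                 # few samples, heavily over-parameterized width

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.normal(size=n)

W0 = rng.normal(size=(m, d))           # random init of the trained first layer
a = rng.choice([-1.0, 1.0], size=m)    # fixed random second layer
W = W0.copy()

def predict(W):
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

lr = 0.1
for t in range(500):
    err = predict(W) - y                          # residual on the training set
    act = (X @ W.T > 0).astype(float)             # ReLU activation pattern
    # gradient of 0.5*||err||^2 with respect to each hidden weight vector
    gW = ((err[:, None] * act).T @ X) * (a[:, None] / np.sqrt(m))
    W -= lr * gW
    if t % 100 == 0:
        move = np.linalg.norm(W - W0) / np.linalg.norm(W0)
        print(f"iter {t:3d}  loss {0.5 * np.sum(err**2):.4f}  relative weight movement {move:.4f}")
</code>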

https://arxiv.org/abs/1810.11393 Dendritic cortical microcircuits approximate the backpropagation algorithm

https://arxiv.org/abs/1811.03962 A Convergence Theory for Deep Learning via Over-Parameterization