Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model.
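
A minimal sketch of the SWA recipe described above, on a toy least-squares problem rather than a deep network (the objective, batch size, and cyclical schedule below are illustrative assumptions, not the paper's settings): run SGD with a cyclical learning rate and keep a running average of the iterates collected at the end of each cycle.

<code python>
# Sketch of Stochastic Weight Averaging (SWA) on a toy least-squares problem.
# Illustrative only: the objective, batch size, and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)

def grad(w, idx):
    xb, yb = X[idx], y[idx]
    return 2.0 * xb.T @ (xb @ w - yb) / len(idx)

w = np.zeros(10)                  # current SGD iterate
w_swa, n_avg = np.zeros(10), 0    # running average of collected iterates

base_lr, cycle = 0.05, 50         # cyclical learning-rate schedule
for step in range(1, 2001):
    t = ((step - 1) % cycle) / cycle
    lr = base_lr * (1.0 - 0.9 * t)                 # decay within each cycle
    idx = rng.choice(len(X), size=32, replace=False)
    w -= lr * grad(w, idx)
    if step % cycle == 0:                          # end of a cycle: collect the iterate
        n_avg += 1
        w_swa += (w - w_swa) / n_avg               # running mean of SGD iterates

print("SGD loss:", np.mean((X @ w - y) ** 2))
print("SWA loss:", np.mean((X @ w_swa - y) ** 2))
</code>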

https://openreview.net/forum?id=rJTutzbA- On the insufficiency of existing momentum schemes for Stochastic Optimization https://github.com/rahulkidambi/AccSGD
  
https://arxiv.org/abs/1802.10026v2 Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
  
http://mdolab.engin.umich.edu/sites/default/files/Martins2003CSD.pdf The Complex-Step Derivative Approximation
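
The complex-step approximation referenced above computes a real derivative as f'(x) ≈ Im(f(x + ih)) / h for a tiny step h, avoiding the subtractive cancellation of finite differences. A minimal sketch (the test function and step size are illustrative choices):

<code python>
# Complex-step derivative approximation: f'(x) ≈ Im(f(x + i*h)) / h.
# Unlike finite differences, there is no subtraction, so h can be made tiny.
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    return np.imag(f(x + 1j * h)) / h

# Illustrative test function; any function built from complex-safe operations works.
f = lambda x: np.exp(x) / np.sqrt(np.sin(x) ** 3 + np.cos(x) ** 3)
print(complex_step_derivative(f, 1.5))   # agrees with the analytic derivative to machine precision
</code>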

https://arxiv.org/abs/1810.00150 Directional Analysis of Stochastic Gradient Descent via von Mises-Fisher Distributions in Deep Learning

We empirically verify our result using deep convolutional networks and observe a higher correlation of the gradient stochasticity with the proposed directional uniformity than with the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients are a major factor behind SGD.
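
As a rough illustration of what "directional statistics of minibatch gradients" means here, the sketch below draws many minibatch gradients of a toy logistic-regression model, normalizes each to unit length, and reports the mean resultant length (a standard concentration statistic in von Mises-Fisher analyses) alongside the spread of the gradient norms. The model, data, and estimator are illustrative assumptions, not the paper's exact setup.

<code python>
# Directional statistics of minibatch gradients on a toy logistic-regression model.
# Mean resultant length of the unit gradients measures directional concentration;
# the norm spread measures magnitude stochasticity. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 20))
y = (X @ rng.normal(size=20) > 0).astype(float)
w = 0.01 * rng.normal(size=20)

def minibatch_grad(w, idx):
    # logistic-regression gradient on one minibatch
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))
    return xb.T @ (p - yb) / len(idx)

grads = np.stack([minibatch_grad(w, rng.choice(len(X), 64, replace=False))
                  for _ in range(200)])
norms = np.linalg.norm(grads, axis=1)
unit = grads / norms[:, None]
print("mean resultant length (directional concentration):", np.linalg.norm(unit.mean(axis=0)))
print("relative spread of gradient norms:", norms.std() / norms.mean())
</code>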

https://arxiv.org/abs/1810.02054 Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
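
A toy numerical illustration of this claim (not the paper's construction): full-batch gradient descent on a wide two-layer ReLU network in which only the hidden weights are trained and the output weights are fixed random signs. On this instance the training loss decreases steadily while the weights barely move from their initialization. The width, step size, and data below are illustrative assumptions.

<code python>
# Over-parameterized two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x).
# Only W is trained; a is a fixed random sign vector. Illustrative sizes and step size.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 2000                       # few samples, many hidden units
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

W0 = rng.normal(size=(m, d))                # random initialization
a = rng.choice([-1.0, 1.0], size=m)         # fixed output weights
W = W0.copy()

def forward(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

lr = 2.0
for step in range(2001):
    err = forward(W) - y                                  # residuals, shape (n,)
    act = (X @ W.T > 0.0).astype(float)                   # ReLU patterns, shape (n, m)
    grad = ((act * err[:, None]) * a).T @ X / np.sqrt(m)  # dL/dW for L = (1/2n) * sum of squared residuals
    W -= lr * grad / n
    if step % 500 == 0:
        print(step,
              "loss:", 0.5 * np.mean(err ** 2),
              "relative distance from init:", np.linalg.norm(W - W0) / np.linalg.norm(W0))
</code>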