  
We empirically verify our result using deep convolutional networks and observe a higher correlation of the gradient stochasticity with the proposed directional uniformity than with the gradient norm stochasticity, suggesting that the directional statistics of minibatch gradients are a major factor behind SGD.
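
The comparison in that abstract can be made concrete with a toy experiment. The sketch below is my own illustration, not the paper's code or measure: it samples many minibatch gradients of a linear least-squares model at a fixed parameter vector and contrasts a simple directional-uniformity proxy (the resultant length of the unit-normalized gradients) with the relative spread of the gradient norms. The model, data, batch size and statistics are all assumptions chosen only to make the distinction visible.

<code python>
# Hedged sketch (not the paper's code): contrast directional spread of minibatch
# gradients with the spread of their norms on a toy linear least-squares problem.
#   - directional uniformity proxy: norm of the mean of unit-normalized minibatch
#     gradients (1.0 = all directions aligned, near 0 = directions spread out)
#   - gradient norm stochasticity: coefficient of variation of minibatch gradient norms
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 2048, 20, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def minibatch_grad(w, idx):
    """Gradient of 0.5 * mean squared error over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
grads = []
for _ in range(200):  # sample many minibatch gradients at the same point w
    idx = rng.choice(n, size=batch, replace=False)
    grads.append(minibatch_grad(w, idx))
G = np.array(grads)

norms = np.linalg.norm(G, axis=1)
unit = G / norms[:, None]
directional_uniformity = np.linalg.norm(unit.mean(axis=0))  # resultant length of directions
norm_stochasticity = norms.std() / norms.mean()             # relative spread of norms

print(f"directional uniformity (resultant length): {directional_uniformity:.3f}")
print(f"gradient norm stochasticity (CV of norms): {norm_stochasticity:.3f}")
</code>

Low directional uniformity alongside comparatively stable norms would indicate that direction, rather than magnitude, dominates the minibatch noise, which is the distinction the abstract draws.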

https://arxiv.org/abs/1810.02054 Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
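
A minimal sketch of that setting, under my own assumptions (width, step size, iteration count and data are arbitrary; only the hidden layer is trained, with output weights fixed at ±1, mirroring the two-layer ReLU model studied in the paper), trains a wide network by full-batch gradient descent and measures how far each hidden weight vector moves from its random initialization:

<code python>
# Hedged sketch (illustrative, not the paper's experiments): full-batch gradient descent
# on a wide two-layer ReLU network f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x),
# training only the hidden weights W while the output signs a_r stay fixed.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 4096                           # n samples, input dim d, hidden width m
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = rng.normal(size=n)

W0 = rng.normal(size=(m, d))                     # random init of hidden weights (trained)
a = rng.choice([-1.0, 1.0], size=m)              # output weights fixed at +/-1 (not trained)
W = W0.copy()
lr = 1.0

def predict(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

for _ in range(2000):
    pre = X @ W.T                                # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    # gradient of 0.5 * mean_i (f(x_i) - y_i)^2 w.r.t. the hidden weight matrix
    grad = ((resid[:, None] * (pre > 0.0)) * (a / np.sqrt(m))).T @ X / n
    W -= lr * grad

loss = 0.5 * np.mean((predict(W) - y) ** 2)
rel_move = np.linalg.norm(W - W0, axis=1) / np.linalg.norm(W0, axis=1)
print(f"final training loss: {loss:.3e}")
print(f"largest relative movement of a hidden weight vector: {rel_move.max():.4f}")
</code>

With a large width the hidden weight vectors typically move only a small fraction of their initial norm while the training loss drops, which is the "close to initialization" behaviour the convergence analysis exploits.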

https://arxiv.org/abs/1810.11393 Dendritic cortical microcircuits approximate the backpropagation algorithm

https://arxiv.org/abs/1811.03962 A Convergence Theory for Deep Learning via Over-Parameterization