This chapter covers the mechanisms that are known to lead to a trained model. Why are neural networks able to generalize? Why does back-propagation eventually lead to convergence? Many such questions still lack a good theoretical explanation. However, DL is an experimental science, and in practice the simplistic method of back-propagation is surprisingly effective.
Early objections to neural networks were that the equivalent optimization problem was non-convex, which suggested it would be extremely difficult to train a model to convergence. Recent research, however, contradicts this original intuition. In high-dimensional spaces, a critical point of the loss is far more likely to be a saddle point than a poor local minimum, so with high probability gradient descent finds a direction in which it can keep rolling down the optimization landscape.
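A minimal sketch of this intuition on a toy surface: f(x, y) = x^2 - y^2 has a saddle point at the origin with zero gradient, yet any small perturbation lets gradient descent escape and keep lowering the loss. The function, starting point, and step size here are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy loss surface with a saddle point at the origin:
# f(x, y) = x^2 - y^2 has zero gradient at (0, 0), but (0, 0) is not a minimum.
def f(p):
    x, y = p
    return x**2 - y**2

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Start near the saddle point with a tiny offset in the y direction.
p = np.array([1.0, 1e-6])
lr = 0.1

for step in range(100):
    p = p - lr * grad_f(p)

# The x-coordinate collapses toward 0, while the tiny perturbation in y grows
# and carries the iterate off the saddle, so the loss keeps decreasing.
print(p, f(p))
```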
The requirements for back-propagation in Deep Learning are surprisingly simple. If one can calculate the derivative of each layer's output with respect to its model parameters and its inputs, then back-propagation can be applied via the chain rule. In practice, back-propagation works extremely well at discovering a convergence basin where a model has learned to generalize.
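A minimal sketch of that recipe for a tiny two-layer network in plain NumPy; the architecture, squared-error loss, and learning rate are assumptions made only to show the chain of layer-wise derivatives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network: x -> W1 -> tanh -> W2 -> prediction
W1 = rng.normal(scale=0.1, size=(4, 3))   # hidden_dim x input_dim
W2 = rng.normal(scale=0.1, size=(1, 4))   # output_dim x hidden_dim

x = rng.normal(size=(3, 1))               # one input example
y = np.array([[1.0]])                     # its target
lr = 0.1

for step in range(100):
    # Forward pass: keep each layer's activations for the backward pass.
    h_pre = W1 @ x            # hidden pre-activation
    h = np.tanh(h_pre)        # hidden activation
    y_hat = W2 @ h            # linear output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: each layer only needs the derivative of its own output
    # with respect to its parameters and its input; the chain rule does the rest.
    d_yhat = y_hat - y                          # dL/dy_hat
    dW2 = d_yhat @ h.T                          # dL/dW2
    d_h = W2.T @ d_yhat                         # dL/dh
    d_hpre = d_h * (1.0 - np.tanh(h_pre) ** 2)  # dL/dh_pre (tanh derivative)
    dW1 = d_hpre @ x.T                          # dL/dW1

    # Gradient descent step on every parameter.
    W1 -= lr * dW1
    W2 -= lr * dW2

print(loss)
```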
This chapter covers recurring learning patterns we find in different neural network architectures. In its most abstract form, learning is a credit assignment problem: as a consequence of observed data, which parts of a model do we need to change, and by how much? We will explore many of the techniques that have been shown to be effective in practice.
Learning to Optimize (note: different from meta-learning)
https://arxiv.org/pdf/1606.04838v1.pdf Optimization Methods for Large-Scale Machine Learning
"We present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations."
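As a rough illustration, the plain SG iterate at the center of that analysis samples one example (or a small mini-batch) per step and moves against its gradient. The sketch below applies it to a synthetic least-squares problem; the data, step size, and iteration count are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear regression data: y = X @ w_true + noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)
lr = 0.01

# Plain stochastic gradient: at each step sample one example and take a step
# along the negative gradient of that single example's squared-error loss.
for step in range(5000):
    i = rng.integers(n)
    grad_i = (X[i] @ w - y[i]) * X[i]
    w = w - lr * grad_i

# Distance to the generating weights shrinks as the iterates converge.
print(np.linalg.norm(w - w_true))
```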
Recent Advances in Non-Convex Optimization and its Implications to Learning, Anima Anandkumar, ICML 2016 Tutorial
http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning