This is an old revision of the document!




Empirical Risk Minimization


Generalization is a measure that is based on predictive accuracy against non-training data.


How can we condition a network to accurately perform predictions on information it has previously never encountered?


This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.


Generalization is an interesting term in deep learning. We all talk about it, however we don't have a precise definition.

We can define it as the behavior of our system in response to validation data. That is against data that we have not included as part of the training set. We be a bit more ambitious and define it as behavior when the system is deployed to analyze real world data. We essentially would like to see our trained system perform accurately in the context of data it has never seen. This is one definition.

Another generalization revolves around the idea of minimizing risk. When we train our system, there is an uncertainty in the context in which it will be deployed. So we train our models with mechanisms to anticipate unpredictable situations. The hope is that the system is robust to contexts that have not been previously predicted. This is kind of a game theoretic definition. We can envision an environment where information will always remain imperfect and generalization effectively means executing a particular strategy within the environment. This may be the most abstract definition of generalization that we have.

A third definition is based on the idea of Occam's Razor. That is, the most simplest of explanations is like the best explanation. Here we make certain assumptions of the form of the data and we drive our regularization to constrain the solution toward our assumptions. So for example in the field of compressive sensing, we assume that a sparse basis exists. From there we can drive an optimization problem that searches solutions that have a sparse basis.

A fourth definition is based on the systems ability to recreate or reconstruct the features. This is the approach taken by generative models. If a neural network is able to accurately generate realistic images, then it able to capture the concept of images in its entirety.

A fifth definition involves the notion of ignoring invariant features or nuisance variables. That is, a system is able to generalize well if it able to ignore invariant features for its tasks. Remove away as much features as possible until you can't remove anymore. This is somewhat similar to the third definition however it tackles the problem from another perspective.

In looking through these definitions, it becomes apparent that generalization concerns itself with ignorance of a future observation. This differs from entropy which is a measure of our ignorance of a current observation.

A useful schema in discussing ignorance can be stated as follows. Knowable knowns, meaning models that will converge on training data. Knowable unknowns, are trainable models that are able to make accurate predictions on non-training data. This is the domain of generalization. This is what drives our preference of models and regularization choices. Unknowable knowns, reflects an inability to learn a known system, this covers the area of computational irreducible computations and involves an appreciation of anti-causality. Unknowable unknowns which I will leave to the reader to figure out.

You can divide the potential capabilities of deep learning into 4 classes. You can divide the potential capabilities of deep learning into 4 classes. (1) Knowable Knowns - Given a large enough set of knowns (i.e. training data) we can get good convergence of our prediction errors. (2) Knowable Unknowns - With good generalization, we can know about test data that we a machine has never seen. (3) Unknowable Knowns - However, there are certain classes of system that deep learning can never learn. These are in the class of computational irreducible systems. However a Deep Learning system may detect the direction of causality and be able to know that is able to learn from the system. What it will not be able to know (unknowable) is if a system is in the class of computational irreducible systems. (4) Unknowable Unknowns - Finally, there is a class of total ignorance. This is really just a meta-physical statement. The class of unknowables that cannot be known is unknowable.

Even though the above schema of ignorance looks complicated, it is but the tip of the iceberg of understanding ignorance. Consider more complexities such as misinformation, detecting model bias, ambiguity, disagreement, or accommodating for change.

This leads us to a conclusion about achieving good generalization. Given that it can be measured only through a future observation, an appropriate learning process will therefore require iterative feedback. This gets us into the realm of reinforcement learning which we discuss only briefly in this book and likely the main topic of a future sequel.

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be as precise or may be fuzzy, so we provide further explanation into the nature of the relationship. We also describe other patterns may not be conceptually related but work well in combination with this pattern.

Relationship to other canonical patterns:

Relationship to other Patterns

Cited in these patterns:

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.

References Generalization Bounds and Stability Relative Entropy in Biological Systems

In this paper we review various information-theoretic characterizations of the approach to equilibrium in biological systems. The replicator equation, evolutionary game theory, Markov processes and chemical reaction networks all describe the dynamics of a population or probability distribution. Under suitable assumptions, the distribution will approach an equilibrium with the passage of time. Relative entropy - that is, the Kullback–Leibler divergence, or various generalizations of this - provides a quantitative measure of how far from equilibrium the system is. We explain various theorems that give conditions under which relative entropy is nonincreasing. In biochemical applications these results can be seen as versions of the Second Law of Thermodynamics, stating that free energy can never increase with the passage of time. In ecological applications, they make precise the notion that a population gains information from its environment as it approaches equilibrium. Algorithmic Game Theory Semantics, Representations and Grammars for Deep Learning

The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimized a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge. Causal Entropic Forces Taxonomies of chance, risk and unpredictability. AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Our method optimizes for generalization performance, and it explicitly and automatically addresses the trade-off between model architecture and empirical risk minimization, ideas that have been under-explored in deep learning. GENERALIZABLE FEATURES FROM UNSUPERVISED LEARNING

In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to blocks configurations outside the training set distribution.

In this paper, we showed that data generated from an unsupervised model could help a supervised learner to generalize to unseen scenarios. We argue that this ability of transfer learning and generalization by observing the world could be one of the ingredients to construct a model of the world that could have applications in many tasks, such as model-based RL. UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

What is it then that distinguishes neural networks that generalize well from those that don’t? A satisfying answer to this question would not only help to make neural networks more interpretable, but it might also lead to more principled and reliable model architecture design. Sharp Minima Can Generalize For Deep Nets

This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties. Theory II: Landscape of the Empirical Risk in Deep Learning

In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers – in the case of classification – have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released). On Generalization and Regularization in Deep Learning Adaptive Learning with Robust Generalization Guarantees Geometry of Optimization and Implicit Regularization in Deep Learning

In this work we show proof-of-concept success, and we expect Path-SGD to be beneficial also in large-scale training for very deep convolutional networks. Combining Path-SGD with AdaGrad, with momentum or with other optimization heuristics might further enhance results. On the Emergence of Invariance and Disentangling in Deep Representations

Using classical notions of statistical decision and information theory, we show that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions. We use an Information Decomposition of the empirical loss to show that overfitting can be reduced by limiting the information content stored in the weights. We then present a sharp inequality that relates the information content in the weights – which are a representation of the training set and inferred by generic optimization agnostic of invariance and disentanglement – and the minimality and total correlation of the activation functions, which are a representation of the test datum. This allows us to tackle recent puzzles concerning the generalization properties of deep networks and their relation to the geometry of the optimization residual. Exploring Generalization in Deep Learning

Learning with deep neural networks displays good generalization behavior in practice, a phenomenon that remains largely unexplained. In this paper we discussed different candidate complexity measures that might explain generalization in neural networks. We outline a concrete methodology for investigating such measures, and report on experiments studying how well the measures explain different phenomena. While there is no clear choice yet, some combination of expected sharpness and norms do seem to capture much of the generalization behavior of neural networks. A major issue still left unresolved is how the choice of optimization algorithm biases such complexity to be low, and what is the precise relationship between optimization and implicit regularization Implicit Regularization in Deep Learning