This is an old revision of the document!

Name Implicit Ensemble aka Exponential Ensemble, Ensembles by Training

Intent

Create an ensemble of classifiers while keeping the number of Model parameters constant.

Motivation

How can we create more accurate classifiers using ensembles without increasing the number of model parameters?

Structure

<Diagram>

Discussion

One of the sure fire ways of improving generalization and predictive accuracy is to employ the use of the Ensemble Method. A mixture of experts tends to perform much better than individual experts. The problem however of using the Ensembles is that the number of Model parameters grows which each additional expert that gets involved in the classification. This is definitely detrimental to the speed of inference.

We would like to train as many experts while keeping the resources the same. The accidentally discovery of the DropOut method by G. Hinton reveals that you can effectively training multiple random expert in an Ensemble with each member sharing weights of the same layer. DropOut is known as a Regularization method, more specifically a Regularization in Training. However, DropOut is effective because it composes an implicit ensemble. Here, members of the the ensemble are intermixed with each other in the shared weights.

Known Uses

DropOut

Related Patterns

<Diagram>

References

http://arxiv.org/pdf/1605.06431v1.pdf Residual Networks are Exponential Ensembles of Relatively Shallow Networks

Hinton et al. show that dropping out individual neurons during training leads to a network which is equivalent to averaging over an ensemble of exponentially many networks. Similar in spirit, stochastic depth [9] trains an ensemble of networks by dropping out entire layers during training. These two strategies are “ensembles by training” because the ensemble arises only as a result of the special training strategy. However, we show that residual networks are “ensembles by construction” as a natural result of the structure of the architecture.

The paths through the network that contribute gradient are shorter than expected, because deep paths do not contribute any gradient during training due to vanishing gradients. If most of the paths that contribute gradient are very short compared to the overall depth of the network, increased depth alone can’t be the key characteristic of residual networks. We now believe that multiplicity, the network’s expressability in the terms of the number of paths, plays a key role.

https://arxiv.org/abs/1606.08415v1 Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

https://arxiv.org/abs/1606.01305 Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization.