**Name** DropOut

**Intent** Improve generalization by training an implicit ensemble of sub-networks that share weights.

**Motivation** How can we improve predictions while using fewer model weights?

**Structure**

<Diagram>

**Discussion**

There is an important technique that can be considered a form of regularization applied during training. It emerged relatively recently and is known to work quite well: Dropout. Dropout works as follows. Consider a layer that connects to another layer; the representations flowing from one layer to the next are transformed by the model. For every observation you train the network on, randomly set half of these transformed activations to zero. In other words, completely at random, you take half of the data flowing through the layer and simply discard it.
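The mechanism above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; it uses the common "inverted" variant, where surviving activations are scaled up during training so that nothing needs to change at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_prob=0.5, training=True):
    """Randomly zero a fraction of activations during training.

    Inverted dropout: survivors are scaled by 1 / (1 - drop_prob),
    so the expected activation is unchanged and inference needs no
    special handling.
    """
    if not training or drop_prob == 0.0:
        return activations
    # One fresh random mask per observation flowing through the layer.
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

x = np.ones((4, 8))          # mini-batch of 4 observations, 8 features
y = dropout_forward(x, 0.5)  # roughly half the entries are zeroed
```

With `drop_prob=0.5` and an all-ones input, every surviving entry becomes 2.0 and the rest become 0.0, which makes the zero-mean-preserving scaling easy to see.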

So why does Dropout work? The network can never rely on any given transformation being present, because any of them might be zeroed at random during training. It is thus constrained to learn a redundant model, ensuring that at least some of the original information is preserved. Constraining the network to learn redundant models might sound inefficient; in practice, however, it is quite robust and does prevent overfitting. It also makes the network behave as if it were taking the consensus of an ensemble of networks. This is referred to as an implicit ensemble.

Dropout works because the process trains many implicit sub-networks that share weights. For each training example, you randomly remove a fraction of the neurons (typically 50%). So effectively, you momentarily have a subset of the original neural net that runs the forward pass and gets its weights updated. The result is many neural nets working as an ensemble to eventually perform the classification.
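The ensemble view can be checked numerically. For a single linear unit, the average prediction over many randomly masked sub-networks converges to the full network's prediction scaled by the keep probability, which is exactly the scaling Dropout applies. A small Monte Carlo sketch (illustrative setup, not from the original papers):

```python
import numpy as np

rng = np.random.default_rng(1)

# One linear unit y = x . w; dropout on the inputs samples a different
# sub-network (subset of inputs) for every training step.
w = rng.normal(size=16)
x = rng.normal(size=16)
p_keep = 0.5

# Ensemble prediction: average the outputs of many masked sub-networks.
n = 200_000
masks = rng.random((n, 16)) < p_keep
mc = np.mean((masks * x) @ w)

# The full network with activations scaled by p_keep gives the same
# answer in expectation -- the test-time approximation of the ensemble.
exact = p_keep * (x @ w)
```

This is why test-time Dropout needs only a deterministic rescaling (or the inverted-dropout training scaling shown earlier) rather than actually averaging thousands of networks.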

Batch Normalization differs in that it dynamically normalizes the inputs on a per-mini-batch basis. The research indicates that removing Dropout while using Batch Normalization yields much faster learning without a loss in generalization. This research appears to have been done on Google's Inception architecture. The intuition is that Inception already has a lot of weight sharing going on as a consequence of its structure; therefore, the generalization benefits of Dropout have diminishing returns.
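For contrast with the masking code above, here is a minimal sketch of the per-mini-batch normalization that Batch Normalization performs. The scalar `gamma`/`beta` are a simplification; in a real implementation they are learned per-feature parameters:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the
    current mini-batch, then apply a learnable scale and shift.

    Shown with scalar gamma/beta for brevity; real layers learn one
    gamma and beta per feature, plus running statistics for inference.
    """
    mu = x.mean(axis=0)           # per-feature mean over the batch
    var = x.var(axis=0)           # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(2).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(batch)           # each column now ~zero mean, unit std
```

Note the different character of the two techniques: Dropout injects noise by destroying activations, while Batch Normalization stabilizes the distribution of activations; both regularize, which is why the paper cited below finds them partly redundant.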

**Known Uses**

**Related Patterns**

<Diagram>

**References**


Reducing Co-dependence in Deep Networks. Hinton et al. [13] introduced dropout for regularization of deep networks. When training a network layer with dropout, a random subset of neurons is excluded from both the forward and backward pass for each mini-batch. Effectively a different (random) network topology is trained at each iteration. As the authors observe, this approach has some similarities with that of using model ensembles, another effective way to increase generalization. However, one limitation of dropout is that it increases the number of training iterations required for convergence, typically by a factor of two. Recently, Szegedy et al. [35] have suggested that dropout provides little incremental accuracy improvement compared to simply training using batch normalization.

http://arxiv.org/pdf/1602.02389v3.pdf Ensemble Robustness of Deep Learning Algorithms

http://nlp.stanford.edu/pubs/sidaw13fast.pdf https://hips.seas.harvard.edu/blog/2013/08/01/icml-highlight-fast-dropout-training/

https://arxiv.org/pdf/1602.04484v4.pdf Surprising properties of dropout in deep networks

The effects of dropout in deep neural networks are rather complicated, and approximations can be misleading since the dropout penalty is very non-convex even in 1-layer networks. We show that dropout does enjoy several scale-invariance properties that are not shared by weight-decay. A perhaps surprising consequence of these invariances is that there are never isolated local minima when learning a deep network with dropout.

https://arxiv.org/pdf/1611.01232v1.pdf DEEP INFORMATION PROPAGATION

We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks.

https://arxiv.org/abs/1611.01353 Information Dropout: learning optimal representations through noise

We introduce Information Dropout, a generalization of dropout that is motivated by the Information Bottleneck principle and highlights the way in which injecting noise in the activations can help in learning optimal representations of the data. Information Dropout is rooted in information theoretic principles, it includes as special cases several existing dropout methods, like Gaussian Dropout and Variational Dropout, and, unlike classical dropout, it can learn and build representations that are invariant to nuisances of the data, like occlusions and clutter. When the task is the reconstruction of the input, we show that the information dropout method yields a variational autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Our experiments validate the theoretical intuitions behind our method, and we find that information dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.

http://www.computervisionblog.com/2016/06/making-deep-networks-probabilistic-via.html

https://www.reddit.com/r/MachineLearning/comments/5l3f1c/d_what_happened_to_dropout/

https://arxiv.org/abs/1611.06791 Generalized Dropout

In this work, we generalize this notion and introduce a rich family of regularizers which we call Generalized Dropout. One set of methods in this family, called Dropout++, is a version of Dropout with trainable parameters. Classical Dropout emerges as a special case of this method. Another member of this family selects the width of neural network layers. Experiments show that these methods help in improving generalization performance over Dropout.

https://arxiv.org/abs/1502.02478 Efficient batchwise dropout training using submatrices

We explore a very simple alternative to the dropout mask. Instead of masking dropped out units by setting them to zero, we perform matrix multiplication using a submatrix of the weight matrix: unneeded hidden units are never calculated. Performing dropout batchwise, so that one pattern of dropout is used for each sample in a minibatch, we can substantially reduce training times. Batchwise dropout can be used with fully-connected and convolutional neural networks.
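The submatrix trick described above is easy to verify for a fully-connected layer: with one dropout pattern shared across the whole mini-batch, slicing both the activations and the weight matrix gives the same result as masking, while never computing the dropped rows. A small sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 6))    # weights: 8 input units -> 6 output units
x = rng.normal(size=(4, 8))    # mini-batch of 4 samples

keep = rng.random(8) < 0.5     # ONE dropout pattern for the whole batch

# Mask-based dropout: zero the dropped input units, then multiply.
masked = (x * keep) @ W

# Submatrix dropout: slice activations and weights so dropped units are
# never touched -- a smaller matrix multiply with the same result.
sub = x[:, keep] @ W[keep, :]
```

The payoff is that the submatrix multiply does strictly less work, which is where the reported training-time savings come from.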