# Batch Normalization

References

http://arxiv.org/abs/1502.03167v3 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

http://arxiv.org/pdf/1602.07868v3.pdf Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

http://arxiv.org/abs/1603.01431v5 Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

We address these drawbacks by proposing a non-adaptive normalization technique for removing internal covariate shift, which we call Normalization Propagation. Our approach does not depend on batch statistics; instead it uses a data-independent parametric estimate of the mean and standard deviation in every layer, and is therefore computationally faster than BN.

https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b#.mhq99s2m8

http://blog.smola.org/post/4110255196/real-simple-covariate-shift-correction

http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/node2.html

http://arxiv.org/abs/1607.08022v1 Instance Normalization: The Missing Ingredient for Fast Stylization

In this short note, we demonstrate that by replacing batch normalization with instance normalization it is possible to dramatically improve the performance of certain deep neural networks for image generation.
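A minimal NumPy sketch of the axis difference between the two schemes (shapes and eps are illustrative, not from the paper): instance normalization computes statistics per sample and per channel over spatial positions only, while batch normalization shares them across the batch.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization: normalize each (sample, channel) pair
    over its spatial dimensions only. x has shape (N, C, H, W)."""
    mean = x.mean(axis=(2, 3), keepdims=True)   # per-sample, per-channel
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Batch normalization: statistics are shared across the batch,
    computed per channel over (N, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```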

https://www.quora.com/Is-there-a-theory-for-why-batch-normalization-has-a-regularizing-effect

Batch norm is similar to dropout in the sense that it multiplies each hidden unit by a random value at each step of training. In this case, the random value is the standard deviation of all the hidden units in the minibatch. Because different examples are randomly chosen for inclusion in the minibatch at each step, the standard deviation randomly fluctuates.

Batch norm also subtracts a random value (the mean of the minibatch) from each hidden unit at each step.

Both of these sources of noise mean that every layer has to learn to be robust to a lot of variation in its input, just like with dropout.
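A toy NumPy illustration of the fluctuation argument above (sizes and seed are arbitrary): resampling minibatches from a fixed set of activations gives a different mean and standard deviation each step, which is exactly the additive and multiplicative noise described.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(10_000,))  # one hidden unit over a dataset

# Each training step normalizes with a freshly sampled minibatch, so the
# subtracted mean and dividing std fluctuate randomly from step to step.
means, stds = [], []
for _ in range(5):
    batch = rng.choice(activations, size=32, replace=False)
    means.append(batch.mean())
    stds.append(batch.std())
```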

http://arxiv.org/abs/1609.04836 On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions – and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.

https://gab41.lab41.org/my-first-convergence-796718bc0104#.t3h4fnody

http://openreview.net/pdf?id=r1VdcHcxx RECURRENT BATCH NORMALIZATION

We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks.

http://openreview.net/pdf?id=BJuysoFeg REVISITING BATCH NORMALIZATION FOR PRACTICAL DOMAIN ADAPTATION

A recent study (Tommasi et al., 2015) shows that a DNN has a strong dependency on its training dataset, and the learned features cannot be easily transferred to a different but relevant task without fine-tuning. In this paper, we propose a simple yet powerful remedy, called Adaptive Batch Normalization (AdaBN), to increase the generalization ability of a DNN. By modulating the statistics in all Batch Normalization layers across the network, our approach achieves a deep adaptation effect for domain adaptation tasks. In contrast to other deep learning domain adaptation methods, our method does not require additional components and is parameter-free.
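A minimal sketch of the AdaBN idea in NumPy (the 2-D case, shapes, and names are assumptions for illustration): at adaptation time the BN mean and variance are simply re-estimated from target-domain features, while the learned affine parameters gamma and beta are reused unchanged.

```python
import numpy as np

def adabn(target_features, gamma, beta, eps=1e-5):
    """AdaBN sketch: re-estimate the BN mean/variance from target-domain
    features, keeping the source-trained gamma/beta as-is.
    target_features: (N, C); gamma, beta: (C,)."""
    mu = target_features.mean(axis=0)    # target-domain statistics
    var = target_features.var(axis=0)
    return gamma * (target_features - mu) / np.sqrt(var + eps) + beta
```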

http://openreview.net/pdf?id=rk5upnsxe NORMALIZING THE NORMALIZERS: COMPARING AND EXTENDING NETWORK NORMALIZATION SCHEMES

In this paper we propose a unified view of normalization techniques, as forms of divisive normalization, which includes layer and batch normalization as special cases. Our second contribution is the finding that a small modification to these normalization schemes, in conjunction with a sparse regularizer on the activations, leads to significant benefits over standard normalization techniques.

We have proposed a unified view of normalization techniques which contains batch and layer normalization as special cases. We have shown that when combined with a sparse regularizer on the activations, our framework has significant benefits over standard normalization techniques. We have demonstrated this in the context of both convolutional neural nets as well as recurrent neural networks.

https://arxiv.org/pdf/1610.06160.pdf Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning

We systematically explored a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed — recurrent and convolutional) and compare favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization is well-performing, simple to implement, fast to compute, more biologically-plausible and thus ideal for GPU or hardware implementations.
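A hedged NumPy sketch of the L1 case (normalization over the batch axis is an assumption here; a common refinement rescales the divisor by sqrt(pi/2) so it matches the standard deviation for Gaussian activations, omitted for brevity): divide by the mean absolute deviation, which avoids per-element squares and square roots.

```python
import numpy as np

def l1_norm(x, eps=1e-5):
    """Lp normalization with p = 1: divide by the mean absolute deviation
    (the first absolute moment) instead of the standard deviation.
    x: (N, C), normalized over the batch axis."""
    mu = x.mean(axis=0, keepdims=True)
    mad = np.abs(x - mu).mean(axis=0, keepdims=True)
    return (x - mu) / (mad + eps)
```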

https://arxiv.org/abs/1702.03275 Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.

Batch Renormalization is as easy to implement as batchnorm itself, runs at the same speed during both training and inference, and significantly improves training on small or non-i.i.d. minibatches. Our method does have extra hyperparameters: the update rate ∆ for the moving averages, and the schedules for correction limits dmax, rmax. A more extensive investigation of the effect of these is a part of future work.
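A sketch of the Batch Renormalization forward pass in NumPy (2-D input, no gamma/beta, and the rmax/dmax values are illustrative): the correction terms r and d pull the minibatch statistics toward the moving averages, and during training no gradient is propagated through them.

```python
import numpy as np

def batch_renorm(x, moving_mean, moving_std, rmax=3.0, dmax=5.0, eps=1e-5):
    """Batch Renormalization forward pass for x of shape (N, C).
    r and d correct the minibatch statistics toward the moving averages;
    in training they are treated as constants (stop-gradient)."""
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    r = np.clip(sigma / moving_std, 1.0 / rmax, rmax)
    d = np.clip((mu - moving_mean) / moving_std, -dmax, dmax)
    return (x - mu) / sigma * r + d
```

When the moving averages already match the minibatch statistics, r = 1 and d = 0 and this reduces to ordinary batchnorm.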

https://openreview.net/forum?id=B1fUVMzKg&noteId=B1fUVMzKg Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

We conjecture that instance normalization performs style normalization. Based on this hypothesis, we propose a new normalization scheme named adaptive instance normalization (AdaIN). AdaIN takes two feature maps as inputs, and simply adjusts the channel-wise mean and variance of the content feature map to match those of the style feature map.
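A minimal NumPy sketch of AdaIN (shapes and eps are illustrative): instance-normalize the content features, then rescale and shift them with the style features' channel-wise statistics.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """AdaIN: match the channel-wise mean and std of the content feature
    map to those of the style feature map. content, style: (N, C, H, W)."""
    c_mean = content.mean(axis=(2, 3), keepdims=True)
    c_std = content.std(axis=(2, 3), keepdims=True) + eps
    s_mean = style.mean(axis=(2, 3), keepdims=True)
    s_std = style.std(axis=(2, 3), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean
```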

https://arxiv.org/abs/1702.05870 Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

To bound the dot product and decrease the variance, we propose to use cosine similarity instead of the dot product in neural networks, which we call cosine normalization. Our experiments show that cosine normalization in fully-connected neural networks notably reduces the test error with lower divergence, compared to other normalization techniques. Applied to convolutional networks, cosine normalization also significantly enhances classification accuracy and accelerates training.
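A sketch of a cosine-normalized layer in NumPy (shapes and the small epsilon are assumptions): the pre-activation is the cosine of the angle between the input and each weight vector, so it is bounded in [-1, 1].

```python
import numpy as np

def cosine_layer(x, w):
    """Cosine normalization: replace the dot product w.x with the cosine
    of the angle between w and x. x: (N, D), w: (D, H)."""
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)   # (N, 1)
    w_norm = np.linalg.norm(w, axis=0, keepdims=True)   # (1, H)
    return (x @ w) / (x_norm * w_norm + 1e-8)
```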

https://arxiv.org/abs/1603.09025v5 Recurrent Batch Normalization

We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks. Whereas previous works only apply batch normalization to the input-to-hidden transformation of RNNs, we demonstrate that it is both possible and beneficial to batch-normalize the hidden-to-hidden transition, thereby reducing internal covariate shift between time steps. We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization.

https://arxiv.org/pdf/1702.08591v1.pdf The Shattered Gradients Problem: If resnets are the answer, then what is the question?

We explore how batch normalization behaves differently in feedforward networks and resnets, and draw out facts that are relevant to the main results.

With batch normalization, neurons are co-active for 1/4 of distinct pairs of inputs, which is what would happen if activations were decided by unbiased coin flips. Without batch normalization, the co-active proportion climbs with depth, suggesting neuronal responses are increasingly redundant. Resnets with batch normalization behave the same as feedforward nets.

https://arxiv.org/pdf/1611.06013v3.pdf Improving training of deep neural networks via Singular Value Bounding

Our research is inspired by theoretical and empirical results that use orthogonal matrices to initialize networks, but we are interested in investigating how orthogonal weight matrices perform when network training converges. To this end, we propose to constrain the solutions of the weight matrices to the orthogonal feasible set during the whole process of network training, and achieve this by a simple yet effective method called Singular Value Bounding (SVB). In SVB, all singular values of each weight matrix are simply bounded in a narrow band around the value of 1. Based on the same motivation, we also propose Bounded Batch Normalization (BBN), which improves Batch Normalization by removing its potential risk of an ill-conditioned layer transform.
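A minimal NumPy sketch of the SVB projection itself (the band width and when to apply it during training are illustrative assumptions): clip all singular values of a weight matrix into a narrow band around 1.

```python
import numpy as np

def singular_value_bound(w, eps=0.05):
    """Singular Value Bounding: project a weight matrix so that every
    singular value lies in the band [1/(1+eps), 1+eps] around 1."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s = np.clip(s, 1.0 / (1.0 + eps), 1.0 + eps)
    return u @ np.diag(s) @ vt
```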

https://arxiv.org/abs/1606.03498 Improved Techniques for Training GANs

Virtual batch normalization. Batch normalization greatly improves optimization of neural networks, and was shown to be highly effective for DCGANs [3]. However, it causes the output of a neural network for an input example x to be highly dependent on several other inputs x′ in the same minibatch. To avoid this problem we introduce virtual batch normalization (VBN), in which each example x is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, and on x itself. The reference batch is normalized using only its own statistics. VBN is computationally expensive because it requires running forward propagation on two minibatches of data, so we use it only in the generator network.
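A rough NumPy sketch of the VBN computation (the paper's exact weighting of the reference batch versus the current example is not reproduced; here the example is simply pooled with the reference batch): each example is normalized with statistics that do not involve the rest of its minibatch.

```python
import numpy as np

def virtual_batch_norm(x, ref, eps=1e-5):
    """VBN sketch: normalize each example using the statistics of a fixed
    reference batch plus that example alone, so the output does not depend
    on the other examples in the current minibatch. x: (N, C), ref: (M, C)."""
    out = np.empty_like(x)
    for i, xi in enumerate(x):
        combined = np.vstack([ref, xi[None, :]])  # reference batch + x_i
        mu = combined.mean(axis=0)
        var = combined.var(axis=0)
        out[i] = (xi - mu) / np.sqrt(var + eps)
    return out
```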

https://arxiv.org/pdf/1412.6614.pdf IN SEARCH OF THE REAL INDUCTIVE BIAS : ON THE ROLE OF IMPLICIT REGULARIZATION IN DEEP LEARNING

In order to try to gain an understanding at the possible inductive bias, we draw an analogy to matrix factorization and understand dimensionality versus norm control there. Based on this analogy we suggest that implicit norm regularization might be central also for deep learning, and also there we should think of infinite-sized bounded-norm models.

https://arxiv.org/abs/1705.08741v1 Train longer, generalize better: closing the generalization gap in large batch training of neural networks

We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a “random walk on random landscape” statistical model which is known to exhibit similar “ultra-slow” diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the “generalization gap” stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named “Ghost Batch Normalization” which enables significant decrease in the generalization gap without increasing the number of updates.
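A minimal NumPy sketch of Ghost Batch Normalization (2-D input and sizes are illustrative): the large batch is split into small "ghost" batches that are normalized independently, restoring the statistics noise of small-batch training.

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Ghost Batch Normalization: split a large batch into small ghost
    batches and normalize each one with its own statistics.
    x: (N, C), with N divisible by ghost_size."""
    chunks = x.reshape(-1, ghost_size, x.shape[1])
    mu = chunks.mean(axis=1, keepdims=True)
    var = chunks.var(axis=1, keepdims=True)
    return ((chunks - mu) / np.sqrt(var + eps)).reshape(x.shape)
```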

https://github.com/kevinzakka/research-paper-notes/blob/master/snn.md Self-Normalizing Neural Networks

The authors introduce self-normalizing neural networks (SNNs) whose layer activations automatically converge towards zero mean and unit variance and are robust to noise and perturbations. Significance: Removes the need for the finicky batch normalization and permits training deeper networks with a robust training scheme.

https://arxiv.org/abs/1711.00489v1 Don't Decay the Learning Rate, Increase the Batch Size

Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times.

https://arxiv.org/abs/1803.08494 Group Normalization

In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization.
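A NumPy sketch of GN (shapes and group count are illustrative): per sample, the channels are split into groups and each group is normalized over its channels and spatial positions, with no dependence on the batch dimension.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Group Normalization: per sample, normalize each group of channels
    over (channels-in-group, H, W). x: (N, C, H, W), C divisible by groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
```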

https://github.com/switchablenorms/Switchable-Normalization

https://arxiv.org/abs/1706.05350 L2 Regularization versus Batch and Weight Normalization

L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate.
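A small NumPy demonstration of the scale invariance behind this claim (layer sizes and the scale factor are arbitrary): a batch-normalized layer's output is unchanged when its weights are rescaled, so L2 decay only shrinks the weight norm and thereby changes the effective learning rate.

```python
import numpy as np

def bn(z, eps=1e-12):
    """Plain batchnorm over the batch axis, no affine parameters."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w = rng.normal(size=(8, 4))

# Scaling w by any positive constant scales both the pre-activation and
# its batch statistics, so the normalized output is identical.
out = bn(x @ w)
out_scaled = bn(x @ (10.0 * w))
```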