**Name**

**Intent**

**Motivation**

**Sketch**

<Diagram>

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

http://www.kdnuggets.com/2016/03/must-know-tips-deep-learning-part-1.html

Now we have obtained a large number of training samples (images/crops), but please do not hurry! Actually, it is necessary to do pre-processing on these images/crops. In this section, we will introduce several approaches for pre-processing.

The first and simple pre-processing approach is zero-center the data, and then normalize them, which is presented as two lines Python codes as follows:

X -= np.mean(X, axis = 0) # zero-center

X /= np.std(X, axis = 0) # normalize

where, X is the input data (NumIns×NumDim). Another form of this pre-processing normalizes each dimension so that the min and max along the dimension is -1 and 1 respectively. It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.

Another pre-processing approach similar to the first one is PCA Whitening. In this process, the data is first centered as described above. Then, you can compute the covariance matrix that tells us about the correlation structure in the data:

X -= np.mean(X, axis = 0) # zero-center

cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis:

U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix

Xrot = np.dot(X, U) # decorrelate the data

The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the eigenvalue to normalize the scale:

Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)

Note that here it adds 1e-5 (or a small constant) to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. This can in practice be mitigated by stronger smoothing (i.e., increasing 1e-5 to be a larger number).

Please note that, we describe these pre-processing here just for completeness. In practice, these transformations are not used with Convolutional Neural Networks. However, it is also very important to zero-center the data, and it is common to see normalization of every pixel as well.

Automatically standardize the data with feature scaling, setting the mean to 0 and the standard deviation to 1. This helps to ensure that each feature contributes the proper amount to the final model, regardless of its original units and distribution. http://www.lauradhamilton.com/10-tips-for-better-deep-learning-models

http://www.kdnuggets.com/2016/05/dont-just-assume-data-interval-scale.html

http://arxiv.org/pdf/1606.04934v1.pdf

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf Local Response Normalization

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by a i x,y the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, the response-normalized activity b i x,y is given by the expression

b i x,y = a i x,y/ k + α min(N X−1,i+n/2) j=max(0,i−n/2) (a j x,y) 2 β

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; we used k = 2, n = 5, α = 10−4 , and β = 0.75. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5). This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization3 .

https://scirate.com/arxiv/1603.01431 Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

We exploit the observation that the pre-activation before Rectified Linear Units follow Gaussian distribution in deep networks, and that once the first and second order statistics of any given dataset are normalized, we can forward propagate this normalization without the need for recalculating the approximate statistics for hidden layers.

http://arxiv.org/pdf/1603.06042v2.pdf Globally Normalized Transition-Based Neural Networks

In this work we demonstrate that simple feed-forward networks without any recurrence can achieve comparable or better accuracies than LSTMs, as long as they are globally normalized.