batch normalization in neural network

Posted 2019-03-27 16:42

Question:

I'm still fairly new to ANNs and I was just reading the Batch Normalization paper (http://arxiv.org/pdf/1502.03167.pdf), but I'm not sure I understand what they are doing (and, more importantly, why it works).

So let's say I have two layers, L1 and L2, where L1 produces outputs and sends them to the neurons in L2. Does batch normalization just take all the outputs from L1 (i.e. every single output from every single neuron, giving an overall vector of |L1| x |L2| numbers for a fully connected network), normalize them to have a mean of 0 and an SD of 1, and then feed them to their respective neurons in L2 (plus applying the linear transformation with gamma and beta discussed in the paper)?

If this is indeed the case, how does this help the NN? What's so special about a constant distribution?

Answer 1:

During standard SGD training of a network, the distribution of inputs to a hidden layer keeps changing, because the parameters of the layers before it are being updated at every step. The paper calls this internal covariate shift, and it can be a problem; see, for instance, here.

It is known that neural networks converge faster if the training data is "whitened", that is, linearly transformed so that each component has zero mean and unit variance and is decorrelated from the other components. See the papers (LeCun et al., 1998b) and (Wiesler & Ney, 2011) cited in the paper.

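The authors' idea is to apply this kind of normalization not only to the input layer, but to the input of every intermediate layer as well. It would be too expensive to do this over the entire input dataset, so instead they do it batch-wise. Full whitening is also costly, so the paper simplifies it: each activation is normalized independently to zero mean and unit variance, using the mean and variance computed over the current mini-batch, and is then scaled and shifted by the learned parameters gamma and beta. They claim that this can vastly speed up the training process and also acts as a sort of regularization.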
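To make the batch-wise step concrete, here is a minimal sketch of the training-time forward pass in plain Python. It is an illustration under stated assumptions, not the paper's reference implementation: the function name `batch_norm_forward`, the list-of-rows layout, and the default `eps` value are all made up for this example, and gamma and beta are passed in as fixed numbers rather than learned. The point to notice is that the mean and variance are computed per neuron output (per column) across the examples in the mini-batch, not across all outputs of the layer at once.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the mini-batch, then scale and shift.

    batch: list of rows, one row per example, one column per neuron output.
    gamma, beta: per-feature scale and shift (learned in a real network,
                 fixed here for illustration).
    eps: small constant for numerical stability (value is an assumption).
    """
    m = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(m)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / m
        # Mini-batch variance (divides by m, not m - 1).
        var = sum((x - mean) ** 2 for x in column) / m
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(m):
            x_hat = (column[i] - mean) * inv_std   # zero mean, unit variance
            out[i][j] = gamma[j] * x_hat + beta[j]  # scale and shift
    return out

# Toy mini-batch: 4 examples, 2 neuron outputs with very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma set to 1 and beta set to 0, both columns of `normalized` end up with roughly zero mean and unit variance, even though the raw columns differ in scale by a factor of 100. Other values of gamma and beta let the network move away from that fixed distribution if doing so helps training.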