Is it normal to use batch normalization in RNN/LSTM?

Posted 2020-02-23 07:31

Question:

I am a beginner in deep learning. I know that in regular neural nets people use batch norm before the activation, and that it reduces the reliance on good weight initialization. I wonder whether it would do the same for an RNN/LSTM when I use it there. Does anyone have any experience with it? Thank you.

Answer 1:

No, you cannot use Batch Normalization on a recurrent neural network: the statistics are computed per batch, which does not account for the recurrent part of the network. Weights are shared in an RNN, and the activation response for each "recurrent loop" might have completely different statistical properties.

Other techniques similar to Batch Normalization that take these limitations into account have been developed, for example Layer Normalization. There are also reparametrizations of the LSTM layer that allow Batch Normalization to be used, for example as described in Recurrent Batch Normalization by Cooijmans et al., 2016.
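For intuition, here is a minimal NumPy sketch (not the exact formulation from either paper) of layer normalization inside a vanilla RNN step: the statistics are computed per sample over the feature dimension, so they do not depend on the batch or on the time step. All names and shapes below are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its feature dimension (no batch statistics).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rnn_step_with_layer_norm(x_t, h_prev, W_x, W_h, gamma, beta):
    # Vanilla RNN step: normalize the pre-activation per sample, so the
    # statistics are the same whether this is step 1 or step 100.
    pre_act = x_t @ W_x + h_prev @ W_h
    return np.tanh(layer_norm(pre_act, gamma, beta))

# Toy shapes: batch of 4, input dim 8, hidden dim 16.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 8))
h_prev = np.zeros((4, 16))
W_x = rng.normal(size=(8, 16)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1
gamma, beta = np.ones(16), np.zeros(16)
h_t = rnn_step_with_layer_norm(x_t, h_prev, W_x, W_h, gamma, beta)  # (4, 16)
```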



Answer 2:

Batch normalization applied to RNNs is similar to batch normalization applied to CNNs: you compute the statistics in such a way that the recurrent/convolutional properties of the layer still hold after BN is applied.

For CNNs, this means computing the relevant statistics not just over the mini-batch, but also over the two spatial dimensions; in other words, the normalization is applied over the channels dimension.
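For example (a small NumPy sketch, assuming an NCHW layout with made-up shapes):

```python
import numpy as np

# Toy NCHW activation tensor: batch of 8, 3 channels, 5x5 spatial.
x = np.random.randn(8, 3, 5, 5)

# Statistics are pooled over the batch and both spatial axes,
# leaving one mean/variance per channel.
mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, 3, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_bn = (x - mean) / np.sqrt(var + 1e-5)
```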

For RNNs, this means computing the relevant statistics over the mini-batch and the time/step dimension, so the normalization is applied only over the feature (depth) dimension. This also means that you only batch normalize the transformed input (so in the vertical direction, e.g. BN(W_x * x)), since the horizontal (across-time) connections are time-dependent and shouldn't just be plainly averaged.
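The RNN counterpart of the sketch above (again with made-up shapes), where only the input-to-hidden term is normalized:

```python
import numpy as np

# Toy input-to-hidden pre-activations W_x * x for a whole sequence:
# batch of 8, 20 time steps, 16 features.
wx_x = np.random.randn(8, 20, 16)

# Statistics are pooled over batch and time, leaving one mean/variance
# per feature; the recurrent W_h * h term is left un-normalized.
mean = wx_x.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, 16)
var = wx_x.var(axis=(0, 1), keepdims=True)
wx_x_bn = (wx_x - mean) / np.sqrt(var + 1e-5)
```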



Answer 3:

In any non-recurrent network (convnet or not), when you do BN, each layer gets to adjust the incoming scale and mean, so the incoming distribution for each layer doesn't keep changing (which is what the authors of the BN paper claim is the advantage of BN).

The problem with doing this for the recurrent outputs of an RNN is that the parameters of the incoming distribution are now shared across all timesteps (which are effectively layers in backpropagation through time, or BPTT). So the distribution ends up being fixed across the temporal layers of BPTT. This may not make sense, since there may be structure in the data that varies (in a non-random way) across the time series. For example, if the time series is a sentence, certain words are much more likely to appear at the beginning or the end. Fixing the distribution across timesteps might therefore reduce the effectiveness of BN.
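To make this concrete, here is a small NumPy sketch with made-up data whose mean drifts over time: the single pooled statistic that BN would share across timesteps sits far from the per-timestep statistics at both ends of the sequence.

```python
import numpy as np

# Toy sequences whose mean drifts from -1 to +1 across 20 time steps.
rng = np.random.default_rng(0)
timesteps = 20
x = rng.normal(loc=np.linspace(-1.0, 1.0, timesteps)[None, :, None],
               size=(64, timesteps, 8))

per_step_mean = x.mean(axis=(0, 2))   # one mean per time step
pooled_mean = x.mean()                # single mean shared over (batch, time)
print(per_step_mean[:3], per_step_mean[-3:], pooled_mean)
# Early and late steps are far from the pooled mean, so one shared
# normalization mismatches both ends of the sequence.
```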



Answer 4:

It is not commonly used, though I found a paper from 2017 showing that using batch normalization in the input-to-hidden and hidden-to-hidden transformations trains faster and generalizes better on some problems.

Also, check out Cross Validated on Stack Exchange for more machine-learning-oriented Q&A.



Answer 5:

The answer is Yes and No.

Why yes? The Layer Normalization paper explicitly discusses the usage of BN in RNNs.

Why no? The output distribution at each timestep has to be calculated and stored to conduct BN. Imagine that you pad the sequence inputs so all examples have the same length; if a prediction case is longer than all training cases, then at some time step you have no mean/std of the output distribution summarized from the SGD training procedure.
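A tiny sketch of the problem, with hypothetical per-timestep running statistics:

```python
# Per-timestep BN at inference: running statistics exist only for the
# time steps seen during training.
train_len = 50
running_mean = {t: 0.0 for t in range(train_len)}  # filled during training
running_var = {t: 1.0 for t in range(train_len)}

test_len = 60  # longer than anything seen in training
for t in range(test_len):
    if t not in running_mean:
        # Nothing to normalize with beyond step 49.
        print(f"no BN statistics for time step {t}")
        break
```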

Meanwhile, at least in Keras, I believe the BN layer only considers normalization in the vertical direction, i.e., the sequence output. The horizontal direction, i.e., the hidden state and cell state, is not normalized. Correct me if I am wrong here.
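For illustration, a minimal tf.keras sketch (layer sizes are arbitrary); BatchNormalization here normalizes the LSTM's returned sequence over the feature axis, pooling statistics over batch and time, while the LSTM's internal hidden/cell states are untouched:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True),   # (batch, time, 64)
    tf.keras.layers.BatchNormalization(),               # vertical direction only
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.build(input_shape=(None, None, 32))   # (batch, time, features)
```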

In multiple-layer RNNs, you may consider using layer normalization tricks.
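For example, a sketch with tf.keras.layers.LayerNormalization between stacked LSTM layers (sizes are arbitrary); since it uses per-sample statistics, it avoids the batch/timestep issues discussed above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LayerNormalization(),   # per-sample, over features
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.build(input_shape=(None, None, 32))
```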