I am building an RNN for classification (there is a softmax layer after the RNN). There are so many options for what to regularize, and I am not sure whether to just try all of them. Would the effect be the same? Which components should I regularize in which situation?
The components being:
- Kernel weights (layer input)
- Recurrent weights
- Bias
- Activation function (layer output)
Which regularizers work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cut to rule them all, but there are do's and (especially) don'ts, as well as systematic means of determining what will work best - via careful introspection and evaluation.
How does RNN regularization work?
Perhaps the best approach to understanding it is information-based. First, see "How does 'learning' work?" and "RNN: Depth vs. Width". To understand RNN regularization, one must understand how an RNN handles information and learns, which the referred sections describe (though not exhaustively). Now to answer the question:
RNN regularization's goal is any regularization's goal: maximizing information utility and traversal of the test loss function. The specific methods, however, tend to differ substantially for RNNs per their recurrent nature - and some work better than others; see below.
RNN regularization methods:
WEIGHT DECAY
General: shrinks the norm ('average') of the weight matrix
- sigmoid, tanh, but less so relu
- sigmoid, tanh grads flatten out for large activations - linearizing enables neurons to keep learning
Recurrent weights: default activation='sigmoid'
Kernel weights: for many-to-one (return_sequences=False), they work similarly to weight decay on a typical layer (e.g. Dense). For many-to-many (return_sequences=True), however, kernel weights operate on every timestep, so pros & cons similar to above will apply.
Dropout:
- 0.2 in practice. Problem: tends to introduce too much noise, and erase important context information, especially in problems w/ limited timesteps.
- (recurrent_dropout): the recommended dropout
Batch Normalization:
Weight Constraints: set hard upper-bound on weights l2-norm; possible alternative to weight decay.
Activity Constraints: don't bother; for most purposes, if you have to manually constrain your outputs, the layer itself is probably learning poorly, and the solution is elsewhere.
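To make the methods above concrete, here is a minimal sketch of where each one attaches on a Keras LSTM. It assumes TensorFlow's bundled Keras; the function name `build_regularized_rnn`, the layer sizes, and every numeric value are placeholders chosen for illustration, not recommendations from the text.

```python
import tensorflow as tf

def build_regularized_rnn(timesteps=50, features=8, units=32, num_classes=4):
    """Hypothetical many-to-one LSTM classifier; all values are placeholders."""
    L2 = tf.keras.regularizers.L2
    inputs = tf.keras.Input(shape=(timesteps, features))
    x = tf.keras.layers.LSTM(
        units,
        return_sequences=False,          # many-to-one
        kernel_regularizer=L2(1e-4),     # L2 penalty on kernel (layer input) weights
        recurrent_regularizer=L2(1e-4),  # L2 penalty on recurrent weights
        bias_regularizer=None,           # bias left unregularized in this sketch
        dropout=0.1,                     # dropout on the layer inputs
        recurrent_dropout=0.2,           # dropout on the recurrent state
        kernel_constraint=tf.keras.constraints.MaxNorm(3.0),  # hard bound on weight l2-norm
    )(inputs)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_regularized_rnn()
```

Note that `activity_regularizer` is deliberately left unset, in line with the "don't bother" advice on constraining outputs. The L2 regularizers here add a penalty to the loss, which is the form of weight decay Keras exposes per layer.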
What should I do? Lots of info - so here's some concrete advice:
Weight decay: try 1e-3, 1e-4, see which works better. Do not expect the same value of decay to work for kernel and recurrent_kernel, especially depending on architecture. Check weight shapes - if one is much smaller than the other, apply smaller decay to the former.
Dropout: try 0.1. If you see improvement, try 0.2 - else, scrap it.
Recurrent Dropout: start with 0.2. Improvement --> 0.4. Improvement --> 0.5, else 0.3.
BatchNormalization, however, you can set use_bias=False
Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.
BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
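To illustrate why an adaptive optimizer can blunt weight decay, here is a toy NumPy sketch of a single Adam step. It compares an L2 penalty folded into the gradient ("coupled") with decay applied directly to the weights ("decoupled", the AdamW idea). The function `adam_step` and all values are hypothetical, and scaling the decoupled decay by the learning rate is one common convention assumed here.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update with optional coupled L2 or decoupled weight decay."""
    g = grad + l2 * w                      # coupled: penalty enters the gradient
    m = b1 * m + (1 - b1) * g              # first-moment estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    adaptive = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * w          # decoupled: shrink weights directly
    return w - adaptive - decay, m, v

w0 = np.array([1.0, 1.0])
grad = np.array([10.0, 0.1])               # toy loss gradients of very different scale
zeros = np.zeros_like(w0)

w_plain, _, _ = adam_step(w0, grad, zeros, zeros, t=1)
w_coupled, _, _ = adam_step(w0, grad, zeros, zeros, t=1, l2=0.1)
w_decoupled, _, _ = adam_step(w0, grad, zeros, zeros, t=1, decoupled_wd=0.1)

print("no decay:  ", w_plain)
print("coupled L2:", w_coupled)
print("decoupled: ", w_decoupled)
```

On this first step the coupled penalty is almost entirely normalized away by the adaptive denominator, so the coupled result is nearly identical to the run with no decay, while the decoupled variant shrinks both weights by the same relative amount. This is an extreme single-step case; over many steps the effect is less stark, but the coupled penalty is still rescaled per parameter.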
This is too much! Agreed - welcome to Deep Learning. Two tips here:
Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it).
Introspection Code:
Gradients: see this answer
Weights: see this answer
Weights l2 norm
Activations: see this answer
Weights: use .get_weights(), organize to plot in histograms, per-gate. No code yet, but may link a future Q&A of mine.
How does 'learning' work?
The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:
Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:
Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.
Information utility. Dead neurons, and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning.
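The information-utility symptoms above (dead neurons, extreme activations, a single dominating neuron) can be roughly quantified from a batch of layer outputs. The sketch below is one possible set of diagnostics, not a standard metric: the function `activation_health`, its thresholds, and the toy input are all assumptions, and it presumes activations in a tanh-like range of [-1, 1].

```python
import numpy as np

def activation_health(acts, dead_tol=1e-6, sat_thresh=0.99):
    """Summarize a (samples, units) array of tanh-range activations."""
    acts = np.asarray(acts, dtype=float)
    mag = np.abs(acts)
    dead = np.all(mag < dead_tol, axis=0)           # unit is ~0 on every sample
    saturated = np.mean(mag > sat_thresh, axis=0)   # per-unit fraction of samples near +/-1
    unit_mag = mag.mean(axis=0)                     # average magnitude per unit
    total = unit_mag.sum()
    share = unit_mag.max() / total if total > 0 else 0.0
    return {
        "dead_fraction": float(dead.mean()),        # share of units that never fire
        "mean_saturation": float(saturated.mean()), # average saturation across units
        "max_unit_share": float(share),             # dominance of the strongest unit
    }

# Toy batch: unit 0 is dead, unit 1 is fully saturated, units 2-3 are moderate.
toy_acts = np.array([[0.0, 1.0, 0.5, -0.25]] * 3)
report = activation_health(toy_acts)
print(report)
```

Tracking numbers like these before and after changing a regularizer gives a more direct view of its effect than validation loss alone.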
How does regularization work? read above first
In a nutshell, via maximizing NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two exactly alike - see "RNN regularizers".
RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".
Update:
Here is an example of a near-ideal RNN gradient propagation for 170+ timesteps:
This is rare, and was achieved via careful regularization, normalization, and hyperparameter tuning. Usually we see a large gradient for the last few timesteps, which drops off sharply toward left - as here. Also, since the model is stateful and fits 7 equivalent windows, gradient effectively spans 1200 timesteps.
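The usual pattern described above, a large gradient at the last timesteps that drops off sharply toward earlier ones, can be reproduced in a toy setting. The sketch below is not the model from the update: it assumes a vanilla tanh RNN with h_t = tanh(W_rec h_{t-1} + x_t), random inputs, and recurrent weights scaled to spectral norm 0.5, and the function `bptt_gradient_norms` is a hypothetical helper.

```python
import numpy as np

def bptt_gradient_norms(W_rec, hs):
    """Spectral norm of d h_T / d h_t for every timestep of a vanilla tanh RNN.

    hs: (T, units) hidden states; W_rec: (units, units) recurrent weights.
    """
    T, n = hs.shape
    J = np.eye(n)                                   # d h_T / d h_T
    norms = np.empty(T)
    norms[T - 1] = np.linalg.norm(J, 2)
    for t in range(T - 1, 0, -1):
        # d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_rec
        step = (1.0 - hs[t] ** 2)[:, None] * W_rec
        J = J @ step                                # chain rule: d h_T / d h_{t-1}
        norms[t - 1] = np.linalg.norm(J, 2)
    return norms

rng = np.random.default_rng(0)
units, T = 16, 40
Q, _ = np.linalg.qr(rng.standard_normal((units, units)))
W_rec = 0.5 * Q                                     # orthogonal matrix scaled to norm 0.5
xs = rng.standard_normal((T, units))

hs = np.empty((T, units))
h = np.zeros(units)
for t in range(T):                                  # forward pass
    h = np.tanh(W_rec @ h + xs[t])
    hs[t] = h

norms = bptt_gradient_norms(W_rec, hs)
print(norms[-5:])                                   # largest at the final timesteps
```

Because every per-step Jacobian here has norm at most 0.5, the gradient norm shrinks at least geometrically going backward in time, which is the sharp drop-off toward the left. Plotting `norms` per timestep for a real model is one way to check how far a given regularization setup moves it away from this pattern.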