RNN Regularization: Which Component to Regularize?

I am building an RNN for classification (there is a softmax layer after the RNN). There are so many options for what to regularize that I am not sure whether to just try all of them. Would the effect be the same? Which components should I regularize, and in what situation?

The components being:

Kernel weights (layer input)
Recurrent weights
Bias
Activation function (layer output)

Regularizers that'll work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cut to rule all, but there are do's and (especially) don't's, as well as systematic means of determining what'll work best - via careful introspection and evaluation.

How does RNN regularization work?

Perhaps the best approach to understanding it is information-based. First, see "How does 'learning' work?" and "RNN: Depth vs. Width". To understand RNN regularization, one must understand how an RNN handles information and learns, which the referred sections describe (though not exhaustively). Now to answer the question:

RNN regularization's goal is any regularization's goal: maximizing information utility and traversal of the test loss function. The specific methods, however, tend to differ substantially for RNNs per their recurrent nature - and some work better than others; see below.

RNN regularization methods:

Weight Decay:

General: shrinks the norm ('average') of the weight matrix
- Linearization, depending on activation; e.g. sigmoid, tanh, but less so relu
- Gradient boost, depending on activation; e.g. sigmoid, tanh grads flatten out for large activations - linearizing enables neurons to keep learning
Recurrent weights: default recurrent_activation='sigmoid' (the gate activation in tf.keras LSTM/GRU)
- Pros: linearizing can help BPTT (remedy vanishing gradient), hence also learning long-term dependencies, as recurrent information utility is increased
- Cons: linearizing can harm representational power - however, this can be offset by stacking RNNs
Kernel weights: for many-to-one (return_sequences=False), they work similarly to weight decay on a typical layer (e.g. Dense). For many-to-many (return_sequences=True), however, kernel weights operate on every timestep, so pros & cons similar to above will apply.
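To make the "linearization" and "gradient boost" points concrete, here is a minimal NumPy sketch. It uses synthetic data and a single tanh layer rather than a real RNN, and `mean_tanh_grad` is a hypothetical helper, so treat it as an illustration of the mechanism only: shrinking the weight norm pulls pre-activations toward the near-linear region of tanh, where the derivative is larger.

```python
import numpy as np

def mean_tanh_grad(weights, inputs):
    """Mean derivative of tanh over all pre-activations `inputs @ weights`."""
    pre_activations = inputs @ weights
    return np.mean(1.0 - np.tanh(pre_activations) ** 2)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 64))            # (samples, features), synthetic
large_weights = rng.normal(size=(64, 32))      # stands in for undecayed weights
small_weights = 0.1 * large_weights            # same direction, 10x smaller norm

grad_large = mean_tanh_grad(large_weights, inputs)
grad_small = mean_tanh_grad(small_weights, inputs)

print(f"mean tanh'(z) with large weights: {grad_large:.3f}")
print(f"mean tanh'(z) with small weights: {grad_small:.3f}")
```

The smaller-norm weights keep most units out of saturation, which is the sense in which decay lets neurons "keep learning"; the cost is that a near-linear layer is less expressive.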

Dropout:

Activations (kernel): can benefit, but only if limited; values are usually kept less than 0.2 in practice. Problem: tends to introduce too much noise, and erase important context information, especially in problems w/ limited timesteps.
Recurrent activations (recurrent_dropout): the recommended dropout

Batch Normalization:

Activations (kernel): worth trying. Can benefit substantially, or not.
Recurrent activations: should work better; see Recurrent Batch Normalization. No Keras implementations yet as far as I know, but I may implement it in the future.

Weight Constraints: set hard upper-bound on weights l2-norm; possible alternative to weight decay.
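As a rough sketch of what such a constraint does, the NumPy function below rescales any weight column whose l2-norm exceeds a bound and leaves the others essentially untouched. The name `max_norm` and the bound of 2.0 are illustrative assumptions, and the column-wise (axis=0) convention is chosen to match the norm computation in the introspection code further down.

```python
import numpy as np

def max_norm(weights, max_value=2.0, axis=0, eps=1e-7):
    """Rescale weights so the l2-norm along `axis` is at most `max_value`."""
    norms = np.sqrt(np.sum(np.square(weights), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return weights * (desired / (eps + norms))

rng = np.random.default_rng(0)
# Four columns with very different scales (synthetic weights)
weights = rng.normal(size=(8, 4)) * np.array([0.1, 1.0, 3.0, 10.0])

constrained = max_norm(weights, max_value=2.0)
norms_before = np.linalg.norm(weights, axis=0)
norms_after = np.linalg.norm(constrained, axis=0)

print("norms before:", np.round(norms_before, 3))
print("norms after: ", np.round(norms_after, 3))
```

Unlike weight decay, which pulls every weight toward zero a little on each step, this only acts on columns that cross the bound.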

Activity Constraints: don't bother; for most purposes, if you have to manually constrain your outputs, the layer itself is probably learning poorly, and the solution is elsewhere.

What should I do? Lots of info - so here's some concrete advice:

Weight decay: try 1e-3, 1e-4, see which works better. Do not expect the same value of decay to work for kernel and recurrent_kernel, especially depending on architecture. Check weight shapes - if one is much smaller than the other, apply smaller decay to the former.
Dropout: try 0.1. If you see improvement, try 0.2 - else, scrap it
Recurrent Dropout: start with 0.2. Improvement --> 0.4. Improvement --> 0.5, else 0.3.
Batch Normalization: try. Improvement --> keep it - else, scrap it.
Recurrent Batchnorm: same as Batch Normalization above.
Weight constraints: advisable w/ higher learning rates to prevent exploding gradients - else use higher weight decay
Activity constraints: probably not (see above)
Residual RNNs: introduce significant changes, along with a regularizing effect. See application in IndRNNs
Biases: put simply, I don't know. No one seems to bother with them, so I haven't experimented much either. With BatchNormalization, however, you can set use_bias=False
Zoneout: don't know, never tried, might work - see paper.
Layer Normalization: some report it working better than BN for RNNs - but my application found it otherwise; paper
Data shuffling: is a strong regularizer. Also shuffle batch samples (samples in batch). See relevant info on stateful RNNs
Optimizer: can be an inherent regularizer. Don't have a full explanation, but in my application, Nadam (& NadamW) has stomped every other optimizer - worth trying.
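A starting configuration along the lines of the list above might look like the sketch below. It assumes the Keras bundled with TensorFlow; `build_rnn_classifier` is a hypothetical helper, the layer sizes are arbitrary, and the regularizer values are just the starting points suggested above, not tuned results. Note that the `kernel_regularizer` / `recurrent_regularizer` arguments add an L2 penalty to the loss, which is how "weight decay" is exposed on the layer here.

```python
import numpy as np
from tensorflow import keras

def build_rnn_classifier(timesteps, channels, n_classes, units=32,
                         kernel_decay=1e-4, recurrent_decay=1e-4,
                         dropout=0.1, recurrent_dropout=0.2):
    """LSTM + softmax classifier with separate kernel / recurrent regularizers."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, channels)),
        keras.layers.LSTM(
            units,
            kernel_regularizer=keras.regularizers.L2(kernel_decay),
            recurrent_regularizer=keras.regularizers.L2(recurrent_decay),
            dropout=dropout,                      # dropout on layer inputs
            recurrent_dropout=recurrent_dropout,  # dropout on recurrent state
        ),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")
    return model

model = build_rnn_classifier(timesteps=50, channels=8, n_classes=3)
x = np.random.randn(4, 50, 8).astype("float32")   # synthetic batch
probs = model.predict(x, verbose=0)
print(probs.shape)
```

Keeping the kernel and recurrent decay as separate arguments makes it easy to search them independently, since the same value is not expected to suit both.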

Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.

BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
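A simplified NumPy sketch of why this happens, looking only at the first update step with zero initial moments (the function names are illustrative, and this is not the linked implementation): when the decay term is folded into the gradient, Adam's normalization rescales it, so large and small weights shrink by about the same absolute amount. Decoupling the decay keeps the shrinkage proportional to the weight.

```python
import numpy as np

def adam_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction for step t=1, starting from zero moments."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad ** 2) / (1 - beta2)
    return m_hat / (np.sqrt(v_hat) + eps)

def step_adam_l2(w, loss_grad, lr, decay):
    """Adam with the decay term added to the gradient (classic L2 penalty)."""
    return w - lr * adam_direction(loss_grad + decay * w)

def step_adamw(w, loss_grad, lr, decay):
    """Adam with decay applied directly to the weights (decoupled)."""
    return w - lr * (adam_direction(loss_grad) + decay * w)

w = np.array([10.0, 0.1])        # one large and one small weight
loss_grad = np.zeros_like(w)     # zero loss gradient isolates the decay effect
lr, decay = 1e-3, 1e-2

shrink_l2 = w - step_adam_l2(w, loss_grad, lr, decay)
shrink_adamw = w - step_adamw(w, loss_grad, lr, decay)

print("shrinkage, Adam + L2:", shrink_l2)
print("shrinkage, AdamW:    ", shrink_adamw)
```

With L2 inside Adam, both weights move by roughly `lr` regardless of size; with the decoupled form, each moves by `lr * decay * w`.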

This is too much! Agreed - welcome to Deep Learning. Two tips here:

Bayesian Optimization; will save you time especially on prohibitively expensive training.
Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it).
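To see how much a strided convolution cuts the sequence length before it reaches the RNN, here is a small NumPy sketch of a 1D convolution with no padding. The kernel size, stride, and channel counts are arbitrary assumptions, and `strided_conv1d` is a hypothetical helper rather than any library's API.

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Unpadded 1D convolution (cross-correlation) over the time axis.

    x:      (timesteps, in_channels)
    kernel: (kernel_size, in_channels, out_channels)
    """
    kernel_size = kernel.shape[0]
    n_out = (x.shape[0] - kernel_size) // stride + 1
    # Gather the input window for every output timestep: (n_out, kernel_size, in_channels)
    windows = np.stack([x[i * stride : i * stride + kernel_size] for i in range(n_out)])
    return np.einsum("tkc,kco->to", windows, kernel)

rng = np.random.default_rng(0)
x = rng.normal(size=(1200, 4))         # 1200 timesteps, 4 channels (synthetic)
kernel = rng.normal(size=(8, 4, 16))   # kernel_size=8, 16 output filters

out = strided_conv1d(x, kernel, stride=4)
print(x.shape, "->", out.shape)
```

The RNN that follows then has to backpropagate through roughly a quarter of the original timesteps, while each of its inputs already summarizes a short window.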

Introspection Code:

Gradients: see this answer

Weights: see this answer

Weights l2 norm

import numpy as np

# `rnn_layer` is an already-built Keras RNN layer taken from your model
rnn_weights = rnn_layer.get_weights() # returns [kernel, recurrent_kernel, bias], in order
# column-wise l2 norms (one per output column), each of shape (1, n_columns)
kernel_l2norm    = np.sqrt(np.sum(np.square(rnn_weights[0]), axis=0, keepdims=True))
recurrent_l2norm = np.sqrt(np.sum(np.square(rnn_weights[1]), axis=0, keepdims=True))
max_kernel_l2norm    = np.max(kernel_l2norm)    # `kernel_constraint`    will check this
max_recurrent_l2norm = np.max(recurrent_l2norm) # `recurrent_constraint` will check this

Activations: see this answer

Weights: use .get_weights(), organize to plot in histograms, per-gate. No code yet, but may link a future Q&A of mine.
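Until then, a minimal sketch of the per-gate organization could look like this. It assumes a Keras-style LSTM layout, where the last axis of kernel and recurrent_kernel has size 4 * units with gates concatenated in input, forget, cell, output order; the arrays below are random stand-ins for `rnn_layer.get_weights()`, and `split_lstm_gates` is a hypothetical helper.

```python
import numpy as np

def split_lstm_gates(weights, names=("input", "forget", "cell", "output")):
    """Split a (..., 4 * units) LSTM weight array into one array per gate."""
    return dict(zip(names, np.split(weights, len(names), axis=-1)))

units, input_dim = 16, 8
rng = np.random.default_rng(0)
kernel = rng.normal(size=(input_dim, 4 * units))        # stand-in for get_weights()[0]
recurrent_kernel = rng.normal(size=(units, 4 * units))  # stand-in for get_weights()[1]

kernel_gates = split_lstm_gates(kernel)
recurrent_gates = split_lstm_gates(recurrent_kernel)

for name, gate_weights in kernel_gates.items():
    counts, bin_edges = np.histogram(gate_weights, bins=10)
    print(f"{name:>6}: shape={gate_weights.shape}, counts={counts.tolist()}")
```

The `counts` and `bin_edges` can be handed to any plotting library; the point is to compare distributions gate by gate rather than over the whole matrix.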

How does 'learning' work?

The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:

Train set global optimum can lie very far from test set global optimum
Local optima are unimportant, and irrelevant:
- A train-set local optimum is almost always a better test-set optimum than the train-set global optimum
- Actual local optima are almost impossible for high-dimensional problems; even for the case of the "saddle", you'd need the gradients w.r.t. all of the millions of parameters to equal zero at once, and a true local minimum additionally requires the loss to curve upward in every direction
- Local attractors are a lot more relevant; the analogy then shifts from "falling into a pit" to "gravitating into a strong field"; once in that field, your loss surface topology is bound to that set up by the field, which defines its own local optima; high LR can help exit a field, much like "escape velocity"

Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:

Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.
Information utility. Dead neurons, and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning.
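Two quick numbers that help gauge information utility are the fraction of dead units and the fraction of saturated activations. The NumPy sketch below runs on synthetic activations of shape (samples, units); the helper names and the 0.99 saturation threshold are arbitrary choices for illustration, and RNN outputs of shape (samples, timesteps, units) would first need reshaping to two dimensions.

```python
import numpy as np

def dead_fraction(relu_outputs):
    """Fraction of units that output zero for every sample."""
    return np.mean(np.all(relu_outputs <= 0.0, axis=0))

def saturated_fraction(tanh_outputs, threshold=0.99):
    """Fraction of individual activations with |a| above `threshold`."""
    return np.mean(np.abs(tanh_outputs) > threshold)

rng = np.random.default_rng(0)

# ReLU layer where 3 of 10 units are pushed far negative, so they never fire
pre_relu = rng.normal(size=(128, 10))
pre_relu[:, :3] -= 100.0
dead = dead_fraction(np.maximum(pre_relu, 0.0))

# tanh layer with large vs. small pre-activation scale
pre_tanh = rng.normal(size=(128, 10))
sat_large = saturated_fraction(np.tanh(5.0 * pre_tanh))
sat_small = saturated_fraction(np.tanh(0.1 * pre_tanh))

print(f"dead ReLU units: {dead:.2f}")
print(f"saturated tanh activations: large scale {sat_large:.2f}, small scale {sat_small:.2f}")
```

Tracking these before and after changing a regularizer gives a more direct read on its effect than validation loss alone.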

How does regularization work? read above first

In a nutshell, via maximizing NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two exactly alike - see "RNN regularization methods".

RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".

RNN width is defined by (1) # of input channels; (2) # of cell's filters (output channels). As with CNN, each RNN filter is an independent feature extractor: more is suited for higher-complexity information, including but not limited to: dimensionality, modality, noise, frequency.
RNN depth is defined by (1) # of stacked layers; (2) # of timesteps. Specifics will vary by architecture, but from an information standpoint, unlike CNNs, RNNs are dense: every timestep influences the ultimate output of a layer, hence the ultimate output of the next layer - so it again isn't as simple as "more nonlinearity"; stacked RNNs exploit both spatial and temporal information.

Update:

Here is an example of a near-ideal RNN gradient propagation for 170+ timesteps:

This is rare, and was achieved via careful regularization, normalization, and hyperparameter tuning. Usually we see a large gradient for the last few timesteps, which drops off sharply toward the left - as here. Also, since the model is stateful and fits 7 equivalent windows, the gradient effectively spans about 1200 timesteps (7 windows of 170+ timesteps each).
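For intuition on the usual "large at the end, vanishing toward the left" pattern, here is a toy NumPy sketch of backpropagation through time in a plain tanh RNN. It is not the model from the plot: the weights and inputs are random, the recurrent matrix is deliberately scaled to a spectral norm of 0.5 so the decay is guaranteed, and `run_rnn` / `backprop_norms` are hypothetical helpers.

```python
import numpy as np

def run_rnn(W, U, inputs):
    """Forward pass of h_t = tanh(W @ h_{t-1} + U @ x_t); returns all hidden states."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return np.array(states)

def backprop_norms(W, hidden_states):
    """Norm of the gradient w.r.t. each h_t, starting from a unit gradient at the last step."""
    n_steps, units = hidden_states.shape
    grad = np.ones(units) / np.sqrt(units)
    norms = np.empty(n_steps)
    norms[-1] = np.linalg.norm(grad)
    for t in range(n_steps - 1, 0, -1):
        # d h_t / d h_{t-1} = diag(1 - h_t^2) @ W, so multiply by its transpose
        grad = W.T @ ((1.0 - hidden_states[t] ** 2) * grad)
        norms[t - 1] = np.linalg.norm(grad)
    return norms

rng = np.random.default_rng(0)
units, channels, n_steps = 16, 4, 50
W = rng.normal(size=(units, units))
W *= 0.5 / np.linalg.norm(W, 2)          # force spectral norm to 0.5
U = rng.normal(size=(units, channels))
inputs = rng.normal(size=(n_steps, channels))

norms = backprop_norms(W, run_rnn(W, U, inputs))
print("gradient norm every 10th timestep:", norms[::10])
```

Each step back multiplies the gradient by a matrix with norm at most 0.5 here, so it shrinks geometrically toward the first timestep; regularization and normalization that keep this per-step factor near 1 are what make a plot like the one above possible.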