In machine learning, especially deep learning, what does it mean to warm up?
I've heard sometimes that in some models, warming up is a phase in training, but honestly, I don't know what it is because I'm very new to ML. Until now I've never used or come across it, but I want to know about it because I think it might be useful for me. So:
What is learning rate warm-up and when do we need it?
Thanks in advance.
If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.
Many models afford this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is `p` and the warm-up period is `n`, then the first batch iteration uses `1*p/n` for its learning rate; the second uses `2*p/n`, and so on: iteration `i` uses `i*p/n`, until we hit the nominal rate at iteration `n`.

This means that the first iteration gets only `1/n` of the primacy effect. This does a reasonable job of balancing that influence.
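A minimal sketch of that schedule in plain Python (the function and argument names here are my own, for illustration, not from any particular library):

```python
def warmup_lr(iteration, target_lr, warmup_iters):
    """Linear warm-up: iteration i (1-based) trains with i * target_lr / warmup_iters,
    then holds the nominal rate once the warm-up period is over."""
    if iteration < warmup_iters:
        return iteration * target_lr / warmup_iters
    return target_lr

# e.g. with target_lr = 0.1 and a 1,000-iteration warm-up:
# warmup_lr(1, 0.1, 1000)    -> 0.0001
# warmup_lr(500, 0.1, 1000)  -> 0.05
# warmup_lr(1000, 0.1, 1000) -> 0.1
```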
Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
It means that if you specify your learning rate to be, say, 2e-5, then during training the learning rate will be increased linearly from approximately 0 to 2e-5 over, say, the first 10,000 steps.
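For instance, in PyTorch (just one possible framework; the 10,000-step figure is the example value from above), this can be sketched with `torch.optim.lr_scheduler.LambdaLR`, which scales the base learning rate by a per-step factor:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

warmup_steps = 10_000

# LambdaLR multiplies the base lr (2e-5) by the factor returned for each step,
# so the lr ramps linearly from ~0 up to 2e-5 and then stays there.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

# In the training loop, call scheduler.step() after each optimizer.step():
#   loss.backward(); optimizer.step(); scheduler.step()
```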