TensorFlow: nr. of epochs vs. nr. of training steps

Posted 2020-07-26 15:47

Question:

I have recently experimented with Google's seq2seq to set up a small NMT system. I managed to get everything working, but I am still wondering about the exact difference between the number of epochs and the number of training steps of a model.

If I am not mistaken, one epoch consists of multiple training steps and is complete once your whole training data set has been processed once. I do not understand, however, how the two relate when I look at the documentation in Google's own tutorial on NMT. Note the last line of the following snippet.

export DATA_PATH=

export VOCAB_SOURCE=${DATA_PATH}/vocab.bpe.32000
export VOCAB_TARGET=${DATA_PATH}/vocab.bpe.32000
export TRAIN_SOURCES=${DATA_PATH}/train.tok.clean.bpe.32000.en
export TRAIN_TARGETS=${DATA_PATH}/train.tok.clean.bpe.32000.de
export DEV_SOURCES=${DATA_PATH}/newstest2013.tok.bpe.32000.en
export DEV_TARGETS=${DATA_PATH}/newstest2013.tok.bpe.32000.de

export DEV_TARGETS_REF=${DATA_PATH}/newstest2013.tok.de
export TRAIN_STEPS=1000000

It seems to me that you can only define the number of training steps of your model, not the number of epochs. Is it possible that the terminology overlaps and that it is therefore not necessary to define a number of epochs?

Answer 1:

An epoch consists of going through all your training samples once, and one step (or iteration) refers to training on a single minibatch. So if you have 1,000,000 training samples and use a batch size of 100, one epoch is equivalent to 10,000 steps, with 100 samples per step.
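The arithmetic above can be written as a small helper. This is only a sketch: the function name `steps_per_epoch` is made up for illustration and is not part of TensorFlow or seq2seq. It also assumes that a final, smaller batch still counts as one step, which the answer does not state.

```python
def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of minibatch steps needed to see every training sample once.

    Hypothetical helper for illustration; not a TensorFlow or seq2seq API.
    """
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    # Ceiling division: assume a trailing partial batch still takes one step.
    return (num_samples + batch_size - 1) // batch_size


# The numbers from the answer: 1,000,000 samples, batch size 100.
print(steps_per_epoch(1_000_000, 100))  # 10000
```

If the framework drops the last incomplete batch instead, plain floor division (`num_samples // batch_size`) would be the matching formula.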

A high-level neural network framework may let you set either the number of epochs or the total number of training steps. But you can't set both independently, since for a given data set size and batch size one directly determines the value of the other.
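To make that dependency concrete, here is a rough sketch of converting between the two quantities. The function names are hypothetical. The data set size of 1,000,000 samples and batch size of 100 are the example values from this answer, not the actual figures of the NMT tutorial, which the question does not give.

```python
def epochs_to_steps(num_epochs: int, num_samples: int, batch_size: int) -> int:
    """Total training steps needed to run the given number of epochs.

    Hypothetical helper; assumes a trailing partial batch counts as a step.
    """
    per_epoch = (num_samples + batch_size - 1) // batch_size
    return num_epochs * per_epoch


def steps_to_epochs(train_steps: int, num_samples: int, batch_size: int) -> float:
    """Approximate number of passes over the data made in train_steps steps."""
    return train_steps * batch_size / num_samples


# Assumed example values (not taken from the tutorial's data set).
NUM_SAMPLES = 1_000_000
BATCH_SIZE = 100

print(epochs_to_steps(100, NUM_SAMPLES, BATCH_SIZE))        # 1000000
print(steps_to_epochs(1_000_000, NUM_SAMPLES, BATCH_SIZE))  # 100.0
```

Under these assumed values, a setting like `TRAIN_STEPS=1000000` would correspond to about 100 epochs; with a different data set size or batch size the same step count maps to a different number of epochs.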