I wanted to save multiple models for my experiment, but I noticed that the tf.train.Saver() constructor could not save more than 5 models. Here is a simple example:
import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
saver = tf.train.Saver()
sess = tf.Session()
for i in range(10):
    sess.run(tf.initialize_all_variables())
    saver.save(sess, '/home/eneskocabey/Desktop/model' + str(i))
When I ran this code, I saw only 5 models on my Desktop. Why is this? How can I save more than 5 models with the same tf.train.Saver() constructor?
The tf.train.Saver() constructor takes an optional argument called max_to_keep, which defaults to keeping only the 5 most recent checkpoints of your model. To save more models, simply pass a larger value for that argument:
import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
# Keep up to 10 recent checkpoints instead of the default 5.
saver = tf.train.Saver(max_to_keep=10)
sess = tf.Session()
for i in range(10):
    # Note: tf.initialize_all_variables() is deprecated; in newer TF 1.x
    # releases use tf.global_variables_initializer() instead.
    sess.run(tf.initialize_all_variables())
    saver.save(sess, '/home/eneskocabey/Desktop/model' + str(i))
To keep all checkpoints, pass max_to_keep=None to the saver constructor.
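For example, a minimal sketch of a saver that never deletes old checkpoints (the variable is just a placeholder):

import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
# With max_to_keep=None, no checkpoint files are ever deleted.
saver = tf.train.Saver(max_to_keep=None)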
- If you use your own tf.Session() for the training: in order to keep intermediate checkpoints, and not just the last 5, you need to change 2 parameters in tf.train.Saver():
max_to_keep - indicates the maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, no checkpoints are deleted from the filesystem, but only the last one is kept in the checkpoint file. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
keep_checkpoint_every_n_hours - in addition to keeping the most recent max_to_keep checkpoint files, you might want to keep one checkpoint file for every N hours of training. This can be useful if you want to later analyze how a model progressed during a long training session. For example, passing keep_checkpoint_every_n_hours=2 ensures that you keep one checkpoint file for every 2 hours of training. The default value of 10,000 hours effectively disables the feature.
So with the following, the saver keeps the 10 most recent checkpoints and, in addition, permanently retains one checkpoint for every 2 hours of training; once 10 recent checkpoints exist, the oldest is deleted each time a new one is saved:
saver = tf.train.Saver(max_to_keep=10, keep_checkpoint_every_n_hours=2)
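To load one of those checkpoints later, restore it by the same path prefix you passed to save(). A minimal sketch, assuming a checkpoint was previously written under the placeholder path below:

import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore() assigns the saved values directly, so no separate
    # initializer call is needed for the restored variables.
    saver.restore(sess, '/home/eneskocabey/Desktop/model7')
    print(sess.run(x))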
If you use tf.estimator.Estimator(), then the saving of the checkpoint is done by the Estimator itself. That's why you need to pass it a tf.estimator.RunConfig() with some of the following parameters:
keep_checkpoint_max - the maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
save_checkpoints_steps - save checkpoints every this many steps. Cannot be specified with save_checkpoints_secs.
save_checkpoints_secs - save checkpoints every this many seconds. Cannot be specified with save_checkpoints_steps. Defaults to 600 seconds if neither save_checkpoints_steps nor save_checkpoints_secs is set in the constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, then checkpoints are disabled.
So if you do the following, you will store a checkpoint every 100 iterations, and once the total number of saved checkpoints reaches 10, the oldest checkpoint will be deleted each time a new one is saved:
# Keep at most 10 checkpoints and write one every 100 training steps.
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=10,
                                save_checkpoints_steps=100)
classifier = tf.estimator.Estimator(
    model_fn=model_fn, model_dir=model_dir, config=run_config)
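For completeness, a minimal sketch of how training would then produce those checkpoints; model_fn, model_dir, and train_input_fn are placeholders assumed to be defined elsewhere:

# Placeholder input function; in practice it returns (features, labels).
def train_input_fn():
    ...

# During train(), the Estimator writes a checkpoint to model_dir every
# 100 steps and retains only the 10 most recent ones.
classifier.train(input_fn=train_input_fn, steps=1000)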