TensorFlow 1.10+ custom estimator early stopping w

Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to that of @simlmx's:

classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)

train_spec = tf.estimator.TrainSpec(
    input_fn = training_data_input_fn,
)

eval_spec = tf.estimator.EvalSpec(
    input_fn = validation_data_input_fn,
)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)

Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.

Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.

How can one (if possible not using tf.contrib functions) designate to terminate training after n number of evaluation calls (n * steps) where the evaluation loss does not improve and then save the "best" model / checkpoint (determined by validation dataset) to a unique file name (e.g. best_validation.checkpoint)

I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):

max_steps_without_decrease: int, maximum number of training steps with no decrease in the given metric.

eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.

Looking through the code of the hook (version 1.11), though, you find:

def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""

    eval_results = read_eval_metrics(eval_dir) #<<<<<<<<<<<<<<<<<<<<<<<

    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items(): #<<<<<<<<<<<<<<<<<<<<<<<
      if step < min_steps:
        continue
      val = metrics[metric_name]
      if best_val is None or is_lhs_better(val, best_val):
        best_val = val
        best_val_step = step
      if step - best_val_step >= max_steps_without_improvement: #<<<<<
        tf_logging.info(
            'No %s in metric "%s" for %s steps, which is greater than or equal '
            'to max steps (%s) configured for early stopping.',
            increase_or_decrease, metric_name, step - best_val_step,
            max_steps_without_improvement)
        return True
    return False

What the code does is load the evaluation results (produced with your EvalSpec parameters) and extract the eval results and the global_step (or whichever other custom step you use to count) associated with the specific evaluation record.

This is the source of the training steps part of the docs: the early stopping is not triggered according to the number of non-improving evaluations, but to the number of non-improving evals in a certain step range (which IMHO is a bit counter-intuitive).

So, to recap: Yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen in that number of steps.

Examples with numbers to hopefully clarify more

Let's assume you're training indefinitely long having an evaluation every 1k steps. The specifics of how the evaluation runs is not relevant, as long as it runs every 1k steps producing a metric we want to monitor.

If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.

Since you're running 1 eval every 1k steps, this boils down to early-stopping if there's a sequence of 10 consecutive evals without any improvement. If then you decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.

Keeping the best model

First of all, an important note: this has nothing to do with early stopping, the issue of keeping a copy of the best model through the training and the one of stopping the training once performance start degrading are completely unrelated.

Keeping the best model can be done very easily defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):

  serving_input_receiver_fn = ... # define your serving_input_receiver_fn
  exporter = tf.estimator.BestExporter(
      name="best_exporter",
      serving_input_receiver_fn=serving_input_receiver_fn,
      exports_to_keep=5) # this will keep the 5 best checkpoints

  eval_spec = [tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)]

If you don't know how to define the serving_input_fn have a look here

This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).