Tensorflow Estimator - High evaluation values on t

2019-08-17 13:25发布

问题:

I'm using Tensorflow 1.10 with a custom Estimator. To test my training/evaluation loop, I just feed the same image/label into the network every time, so I expected the network to converge fast, which it does.

I'm also using the same image for evaluation, but get a much bigger loss value than when training. After training 2000 steps the loss is:

INFO:tensorflow:Loss for final step: 0.01181452

but evaluates to:

Eval loss at step 2000: 0.41252694

This seems wrong to me. It looks like the same problem as in this thread. Is there something special to consider, when using the evaluate method of Estimator?


Some more details about my code:

I've defined my model (FeatureNet) like here as an inheritance of tf.keras.Model with init and call method.

My model_fn looks like this:

def model_fn(features, labels, mode):

    resize_shape = (180, 320)
    num_dimensions = 16

    model = featurenet.FeatureNet(resize_shape, num_dimensions=num_dimensions)

    training = (mode == tf.estimator.ModeKeys.TRAIN)
    seg_pred = model(features, training)

    predictions = {
       # Generate predictions (for PREDICT mode)
       "seg_pred": seg_pred
    }
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    seg_loss = tf.reduce_mean(tf.keras.backend.binary_crossentropy(labels['seg_true'], seg_pred))
    loss = seg_loss

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.MomentumOptimizer(learning_rate=1e-4, momentum=0.9)

        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)

Then in the main-part I train and evaluate with an custom Estimator:

# Create the Estimator
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="/tmp/discriminative_model"
    )

def input_fn():
    features, labels = create_synthetic_image()

    training_data = tf.data.Dataset.from_tensors((features, labels))
    training_data = training_data.repeat(None)
    training_data = training_data.batch(1)
    training_data = training_data.prefetch(1)
    return training_data

estimator.train(input_fn=lambda: input_fn(), steps=2000)
eval_results = estimator.evaluate(input_fn=lambda: input_fn(), steps=50)
print('Eval loss at step %d: %s' % (eval_results['global_step'], eval_results['loss']))

Where create_synthetic_image creates the same image/label every time.

回答1:

I've found, that the handling of BatchNormalization can cause such errors, like described here.

The usage of the control_dependencies in the model-fn solved the issue for me (see here).

if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.MomentumOptimizer(learning_rate=1e-4, momentum=0.9)

    with tf.control_dependencies(model.get_updates_for(features)):
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)