How to use TensorFlow BatchNormalization with GradientTape

Posted 2020-06-24 05:20

Question:

Suppose we have a simple Keras model that uses BatchNormalization:

model = tf.keras.Sequential([
                     tf.keras.layers.InputLayer(input_shape=(1,)),
                     tf.keras.layers.BatchNormalization()
])

How do I actually use it with GradientTape? The following doesn't seem to work, because it doesn't update the moving averages:

# model training... we want the output values to be close to 150
for i in range(1000):
  x = np.random.randint(100, 110, 10).astype(np.float32)
  with tf.GradientTape() as tape:
    y = model(np.expand_dims(x, axis=1))
    loss = tf.reduce_mean(tf.square(y - 150))
  grads = tape.gradient(loss, model.variables)
  opt.apply_gradients(zip(grads, model.variables))

In particular, if you inspect the moving averages, they remain unchanged (inspect model.variables: the moving mean and variance are always 0 and 1). I know one can use .fit() and .predict(), but I would like to use GradientTape and I'm not sure how to do this. Some versions of the documentation suggest running update_ops, but that doesn't seem to work in eager mode.

In particular, the following code will not output anything close to 150 after the above training.

x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))

Answer 1:

In GradientTape mode, the BatchNormalization layer should be called with the argument training=True.

Example:

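from tensorflow.keras import layers as KL
from tensorflow.keras import models as KM

inp = KL.Input( (64,64,3) )
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
# training=True is fixed when the layer is called here, so this layer
# always normalizes with batch statistics and updates its moving averages
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)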

The moving statistics are then updated properly:

>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=array([-0.00062087,  0.00015137, -0.00013239], dtype=float32)>
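To check this on the one-layer model from the question, here is a minimal sketch. It assumes TensorFlow 2.x and uses tf.keras.Input in place of the question's InputLayer; the variable names (bn, before, after_inference, after_training) are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Same shape as the model in the question: one scalar feature, one BN layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.BatchNormalization(),
])
bn = model.layers[-1]

x = np.random.randint(100, 110, size=(10, 1)).astype(np.float32)

before = bn.moving_mean.numpy().copy()

# Inference-mode call: the moving statistics are read but not updated.
model(x, training=False)
after_inference = bn.moving_mean.numpy().copy()

# Training-mode call: batch statistics are used and the moving mean is updated.
model(x, training=True)
after_training = bn.moving_mean.numpy().copy()

print(before, after_inference, after_training)
```

Only the call with training=True should move the stored mean away from its initial value.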


Answer 2:

I just give up. I spent quite a bit of time trying to make sense of a model that looks like:

model = tf.keras.Sequential([
                     tf.keras.layers.BatchNormalization(),
])

And I do give up because the result looks like this:

My intuition was that BatchNorm these days is not as straightforward as it used to be, and that is why it scales the original distribution but not so much a new distribution (which is a shame), but ain't nobody got time for that.

Edit: the reason for that behavior is that BN only calculates moments and normalizes with batch statistics during training. During training it maintains running averages of the mean and variance, and once you switch to evaluation, those running averages are used as constants. That is, evaluation should not depend on batch statistics, because evaluation can be run even on a single input. Since the constants were calculated on a different distribution, you get a higher error during evaluation.
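The train/evaluation split described above can be sketched without TensorFlow. TinyBatchNorm is a hypothetical helper written for this illustration, not a library class; its momentum of 0.99 and epsilon of 1e-3 are assumed to match the Keras defaults, and gamma and beta are left at 1 and 0.

```python
import math


class TinyBatchNorm:
    """Batch normalization over one scalar feature (hypothetical helper)."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0  # same initial values the question observes
        self.moving_var = 1.0
        self.gamma = 1.0
        self.beta = 0.0

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the batch moments and update the averages.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Evaluation: the stored averages are used as constants.
            mean, var = self.moving_mean, self.moving_var
        scale = self.gamma / math.sqrt(var + self.epsilon)
        return [scale * (v - mean) + self.beta for v in batch]


bn = TinyBatchNorm()
batch = [100.0, 102.0, 104.0, 106.0, 108.0]

untrained_eval = bn(batch, training=False)  # still uses mean 0 and variance 1
for _ in range(1000):
    bn(batch, training=True)
trained_eval = bn(batch, training=False)    # centered near 0
shifted_eval = bn([v + 100.0 for v in batch], training=False)  # far from 0
```

Inputs drawn from the training range come out centered, while inputs shifted by 100 do not, which is the higher evaluation error on a new distribution that this answer describes.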



Answer 3:

In GradientTape mode, you would usually compute gradients like this:

with tf.GradientTape() as tape:
    y_pred = model(features)
    loss = your_loss_function(y_pred, y_true)
    gradients = tape.gradient(loss, model.trainable_variables)

train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))

However, if your model contains a BatchNormalization or Dropout layer (or any layer that behaves differently in the train and test phases), then TensorFlow may fail to build the graph unless the phase is specified explicitly.

A good practice is to pass the training argument explicitly when calling the model. When optimizing, use model(features, training=True); when predicting, use model(features, training=False). This explicitly selects the train/test phase for such layers.

For PREDICT and EVAL phase, use

training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, training=training)

For TRAIN phase, use

with tf.GradientTape() as tape:
    y_pred = model(features, training=training)
    loss = your_loss_function(y_pred, y_true)
    gradients = tape.gradient(loss, model.trainable_variables)

train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Note that iperov's answer works as well, except that you will need to set the training phase manually for each of those layers:

x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)

x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)

I'd recommend having a single get_model function that returns the model, and switching the phase with the training argument when calling the model.

Note:

If you use model.variables when computing gradients, you'll get this warning:

Gradients do not exist for variables 
['layer_1_bn/moving_mean:0', 
'layer_1_bn/moving_variance:0', 
'layer_2_bn/moving_mean:0', 
'layer_2_bn/moving_variance:0'] 
when minimizing the loss.

This can be resolved by computing gradients only with respect to the trainable variables: replace model.variables with model.trainable_variables.
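Putting the pieces of this answer together, a training loop for the model in the question might look like the sketch below. It assumes TensorFlow 2.x in eager mode. The SGD optimizer, its learning rate of 0.1, the 200 iterations, and the train_step helper are assumptions made for the illustration; the question does not say which optimizer it used.

```python
import numpy as np
import tensorflow as tf


def get_model():
    # One function builds the model; the phase is chosen at call time.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.BatchNormalization(),
    ])


model = get_model()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # assumed optimizer


def train_step(x, target):
    with tf.GradientTape() as tape:
        # training=True: use batch statistics and update the moving averages.
        y = model(x, training=True)
        loss = tf.reduce_mean(tf.square(y - target))
    # Differentiate only with respect to gamma and beta; the moving
    # mean and variance are not trainable and have no gradient.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


for _ in range(200):
    x = np.random.randint(100, 110, size=(10, 1)).astype(np.float32)
    loss = train_step(x, 150.0)

# training=False: normalize with the stored moving averages.
x_eval = np.random.randint(100, 110, size=(100, 1)).astype(np.float32)
y_eval = model(x_eval, training=False)
print(float(tf.reduce_mean(y_eval)))
```

Because the gradients are taken against model.trainable_variables, the "Gradients do not exist" warning should not appear, and the moving averages are still updated by the training=True call itself rather than by the optimizer.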