I'm trying to build a Keras model that makes numerical predictions from pictures. My model has a densenet121 convolutional base with a couple of additional layers on top. All layers except the last two are set to layer.trainable = False. My loss is mean squared error, since this is a regression task. During training I get loss: ~3, while evaluation on the very same batch of data gives loss: ~30:
model.fit(x=dat[0],y=dat[1],batch_size=32)
Epoch 1/1
32/32 [==============================] - 0s 11ms/step - loss: 2.5571
model.evaluate(x=dat[0],y=dat[1])
32/32 [==============================] - 2s 59ms/step
29.276123046875
I feed exactly the same 32 pictures during training and evaluation. I also calculated the loss from the predicted values, y_pred=model.predict(dat[0]), by computing the mean squared error with numpy. The result was the same as what I got from evaluation (i.e. 29.276123...).
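For reference, a manual check of this kind could look like the sketch below. The arrays are made-up stand-ins for dat[1] and model.predict(dat[0]), not the actual data from this question.

```python
import numpy as np

# Hypothetical targets and predictions (stand-ins for dat[1] and y_pred).
y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.5], [2.0], [2.0]])

# Both arrays must have the same shape; mixing (N, 1) with (N,) would
# broadcast to (N, N) and silently give a wrong value.
assert y_true.shape == y_pred.shape

# Mean squared error over all samples.
manual_mse = np.mean((y_pred - y_true) ** 2)
print(manual_mse)
```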
There was a suggestion that this behavior might be due to the BatchNormalization layers in the convolutional base (discussion on github). Of course, all BatchNormalization layers in my model have been set to layer.trainable=False as well. Maybe somebody has encountered this problem and figured out a solution?
Looks like I found the solution. As I suggested, the problem is with the BatchNormalization layers. They do three things: 1) subtract the mean and normalize by the std, 2) collect statistics on the mean and std using a running average, and 3) train two additional parameters (two per node). When one sets trainable to False, these two parameters are frozen and the layer also stops collecting statistics on the mean and std. But it looks like the layer still performs normalization at training time using the statistics of the training batch. Most likely it's a bug in Keras, or maybe they did it on purpose for some reason. (In TensorFlow 2.x versions of Keras, a frozen BatchNormalization layer runs in inference mode instead, so this depends on your version.) As a result, the forward-propagation calculations at training time differ from those at prediction time, even though the trainable attribute is set to False.
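To make the difference between training and prediction time concrete, here is a minimal NumPy sketch of a batch normalization forward pass. It is a toy illustration, not the Keras implementation: the function batchnorm_forward and all values are made up for this example, and gamma, beta and the moving statistics are simply held fixed to mimic a frozen layer.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, moving_mean, moving_var, training, eps=1e-3):
    """Toy batch normalization over the batch axis (hypothetical helper)."""
    if training:
        # Training mode: normalize with the statistics of the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Inference mode: normalize with the stored moving statistics.
        mean, var = moving_mean, moving_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# A batch whose statistics differ a lot from the "pretrained" moving statistics.
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)              # frozen scale and shift
moving_mean, moving_var = np.zeros(4), np.ones(4)  # frozen moving statistics

out_train = batchnorm_forward(x, gamma, beta, moving_mean, moving_var, training=True)
out_infer = batchnorm_forward(x, gamma, beta, moving_mean, moving_var, training=False)

print("train-mode mean per feature:    ", out_train.mean(axis=0))
print("inference-mode mean per feature:", out_infer.mean(axis=0))
```

The same input and the same frozen parameters give different activations in the two modes, so every layer after it sees different inputs, which is enough to produce different losses on identical data.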
There are two possible solutions I can think of:
- Set all BatchNormalization layers to trainable. In this case the layers will collect statistics from your dataset instead of using the pretrained ones (which can be significantly different!), so all the BatchNorm layers are adjusted to your custom dataset during training.
- Split the model into two parts, model=model_base+model_top. Then use model_base to extract features with model_base.predict(), feed these features into model_top, and train only model_top.
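The second option can be sketched without Keras as follows. This is only a toy under stated assumptions: base_predict is a hypothetical frozen feature extractor standing in for model_base.predict(), the "top" is a single linear layer, and it is fitted with least squares rather than with an optimizer such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "base": a fixed random projection followed by ReLU.
# Its weights are never updated, and it has no train/inference difference.
W_base = rng.normal(size=(100, 16))

def base_predict(x):
    return np.maximum(x @ W_base, 0.0)

def top_mse(w, b, feats, y):
    # Loss of the linear "top" model on precomputed features.
    return float(np.mean((feats @ w + b - y) ** 2))

x = rng.random((32, 100))
y = rng.random((32, 1))

# Step 1: extract the features once with the frozen base.
features = base_predict(x)

# Step 2: train only the top (here: least squares with a bias column).
A = np.hstack([features, np.ones((len(features), 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w_top, b_top = params[:-1], params[-1]

# The forward pass is identical in both cases, so the losses agree.
train_loss = top_mse(w_top, b_top, features, y)
eval_loss = top_mse(w_top, b_top, base_predict(x), y)
print(train_loss, eval_loss)
```

Because the base is only ever used in prediction mode, there is no way for batch statistics to leak into training, which is why this approach avoids the mismatch.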
I've just tried the first solution and it looks like it's working:
model.fit(x=dat[0],y=dat[1],batch_size=32)
Epoch 1/1
32/32 [==============================] - 1s 28ms/step - loss: 3.1053
model.evaluate(x=dat[0],y=dat[1])
32/32 [==============================] - 0s 10ms/step
2.487905502319336
This was after some training: one needs to wait until enough statistics on the mean and std have been collected.
I haven't tried the second solution yet, but I'm pretty sure it will work, since forward propagation during training and prediction will be the same.
Update: I found a great blog post that discusses this issue in full detail. Check it out here
But dropout layers usually create the opposite effect, making the loss on evaluation lower than the loss during training.
Not necessarily! Although some of the neurons are dropped in a dropout layer, bear in mind that the output is scaled back according to the dropout rate. At inference time (i.e. test time) dropout is removed entirely, and considering that you have trained your model for just one epoch, the behavior you saw may happen. Don't forget that, since you are training the model for just one epoch, only a portion of the neurons have been dropped in the dropout layer, but all of them are present at inference time.
If you continue training the model for more epochs, you might expect the training loss and the test loss (on the same data) to become more or less the same.
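Try it yourself: disable the Dropout layer(s), for example by setting their rate to 0 or removing them, and see whether this still happens. (Setting the trainable parameter of a Dropout layer to False is not enough: the layer has no weights, so trainable does not change its behavior.)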
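The scaling mentioned above can be illustrated with a small NumPy sketch of "inverted" dropout. It is a toy version written for this answer, not the Keras implementation, and dropout_forward is a hypothetical helper.

```python
import numpy as np

def dropout_forward(x, rate, training, rng):
    """Toy inverted dropout (hypothetical helper)."""
    if not training:
        # Inference: dropout is removed entirely, the input passes through.
        return x
    # Training: drop each unit with probability `rate` and scale the
    # survivors by 1 / (1 - rate) so the expected activation is unchanged.
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))

out_train = dropout_forward(x, rate=0.5, training=True, rng=rng)
out_infer = dropout_forward(x, rate=0.5, training=False, rng=rng)

print("train-mode mean:    ", out_train.mean())
print("inference-mode mean:", out_infer.mean())
```

On average the two modes agree, but any single training batch sees a different random subset of units, so the per-batch loss during training need not match the loss at inference time.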
One may be confused (as I was) to see that, after one epoch of training, the training loss is not equal to the evaluation loss on the same batch of data. This is not specific to models with Dropout or BatchNormalization layers. Consider this example:
from keras import layers, models
import numpy as np
model = models.Sequential()
model.add(layers.Dense(1000, activation='relu', input_dim=100))
model.add(layers.Dense(1))
model.compile(loss='mse', optimizer='adam')
x = np.random.rand(32, 100)
y = np.random.rand(32, 1)
print("Training:")
model.fit(x, y, batch_size=32, epochs=1)
print("\nEvaluation:")
loss = model.evaluate(x, y)
print(loss)
The output:
Training:
Epoch 1/1
32/32 [==============================] - 0s 7ms/step - loss: 0.1520
Evaluation:
32/32 [==============================] - 0s 2ms/step
0.7577340602874756
So why are the losses different if they have been computed over the same data, i.e. 0.1520 != 0.7577?
If you ask this, it's because you, like me, have not paid enough attention: 0.1520 is the loss before updating the parameters of the model (i.e. before the backward pass, or backpropagation), and 0.7577 is the loss after the weights of the model have been updated. Even though the data used is the same, the state of the model when computing those loss values is not. (Another question: why has the loss increased after backpropagation? Simply because you have trained for just one epoch, so the weight updates are not stable enough yet.)
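The ordering can be sketched with a single hand-written gradient step in NumPy. This is an illustration only: it uses a plain linear model and vanilla gradient descent with an arbitrarily chosen learning rate instead of the Dense network and Adam from the example, so here the loss happens to go down rather than up. The point is just that the two numbers are computed on different weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 100))
y = rng.random((32, 1))
w = rng.normal(scale=0.1, size=(100, 1))  # initial weights of a toy linear model

def mse(weights):
    return float(np.mean((x @ weights - y) ** 2))

# What fit() reports for a single batch: the loss from the forward pass,
# computed BEFORE the weights are updated.
loss_during_fit = mse(w)

# One gradient descent step (learning rate is an assumption for this sketch).
lr = 0.01
grad = 2.0 * x.T @ (x @ w - y) / len(x)
w = w - lr * grad

# What evaluate() reports: the loss of the UPDATED weights on the same data.
loss_from_evaluate = mse(w)

print(loss_during_fit, loss_from_evaluate)
```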
To confirm this, you can also use the same data batch as the validation data:
model.fit(x, y, batch_size=32, epochs=1, validation_data=(x,y))
If you run the code above with this modified line, you will get an output like this (the exact values may differ for you):
Training:
Train on 32 samples, validate on 32 samples
Epoch 1/1
32/32 [==============================] - 0s 15ms/step - loss: 0.1273 - val_loss: 0.5344
Evaluation:
32/32 [==============================] - 0s 89us/step
0.5344240665435791
You see that the validation loss and the evaluation loss are exactly the same, because validation is performed at the end of the epoch (i.e. when the model weights have already been updated).