I have seen a few different mean squared error loss functions in various posts for regression models in Tensorflow:
loss = tf.reduce_sum(tf.pow(prediction - Y,2))/(n_instances)
loss = tf.reduce_mean(tf.squared_difference(prediction, Y))
loss = tf.nn.l2_loss(prediction - Y)
What are the differences between these?
The first and the second loss functions calculate the same thing, but in a slightly different way. The third function calculates something completely different. You can see this by running the three of them side by side:
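A minimal sketch in TF 1.x style (the shape and the random inputs are arbitrary placeholders), evaluating all three expressions on the same pair of tensors:

import numpy as np
import tensorflow as tf

shape = (100, 6, 12)                    # arbitrary example shape
prediction = tf.random_normal(shape=shape)
Y = tf.random_normal(shape=shape)
n_instances = float(np.prod(shape))     # total number of elements

loss1 = tf.reduce_sum(tf.pow(prediction - Y, 2)) / n_instances
loss2 = tf.reduce_mean(tf.squared_difference(prediction, Y))
loss3 = tf.nn.l2_loss(prediction - Y)

with tf.Session() as sess:
    # loss1 and loss2 print (almost) the same value,
    # while loss3 is on a completely different scale
    print(sess.run([loss1, loss2, loss3]))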
Now you can verify that the 1st and the 2nd calculate the same thing (in theory) by noticing that tf.pow(a - b, 2) is the same as tf.squared_difference(a, b), and that reduce_mean is the same as reduce_sum / number_of_elements.
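As a quick sanity check of both claims (again a TF 1.x sketch with arbitrary shapes):

import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(3, 4), dtype=tf.float32)
b = tf.constant(np.random.rand(3, 4), dtype=tf.float32)

with tf.Session() as sess:
    p, s = sess.run([tf.pow(a - b, 2), tf.squared_difference(a, b)])
    print(np.allclose(p, s))    # True: the two expressions are element-wise identical
    m, r = sess.run([tf.reduce_mean(a), tf.reduce_sum(a) / (3 * 4)])
    print(np.isclose(m, r))     # True for a tensor this small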
The thing is that computers can't calculate everything exactly. To see what numerical instabilities can do to your calculations, take a look at the example below: the answer should obviously be 1, but you will get something like [1.0, 0.26843545].
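A sketch of the kind of experiment that shows this (TF 1.x again; the tensor size is arbitrary and the exact numbers you see will depend on your hardware and TensorFlow version):

import tensorflow as tf

# A float32 vector of 10**8 ones: the mean is obviously 1.
a = tf.ones([10**8], dtype=tf.float32)

with tf.Session() as sess:
    # Accumulating that many float32 values loses precision, so the
    # sum-then-divide result drifts well below 1, while reduce_mean
    # stays at (or very close to) 1.
    print(sess.run([tf.reduce_mean(a), tf.reduce_sum(a) / 10**8]))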
Regarding your last function, the documentation for tf.nn.l2_loss says that it computes half the L2 norm of a tensor without the sqrt: output = sum(t ** 2) / 2. So if you want it to calculate the same thing (in theory) as the first one, you need to scale it appropriately:
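Something along these lines (a sketch, reusing the variable names from the question):

# 2 * l2_loss undoes the built-in factor of 1/2; dividing by n_instances
# applies the same normalisation as the first formula.
loss = 2 * tf.nn.l2_loss(prediction - Y) / n_instances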
I would say that the third equation is different, while the 1st and 2nd are formally the same but behave differently due to numerical concerns.
I think that the 3rd equation (using l2_loss) is just returning 1/2 of the squared Euclidean norm, that is, the sum of the element-wise squares of the input, which is x = prediction - Y. You are not dividing by the number of samples anywhere. Thus, if you have a very large number of samples, the computation may overflow (returning Inf).
The other two are formally the same, computing the mean of the element-wise squared x tensor. However, while the documentation does not specify it explicitly, it is very likely that reduce_mean uses an algorithm adapted to avoid overflowing with a very large number of samples. In other words, it likely does not try to sum everything first and then divide by N, but uses some kind of rolling mean that can adapt to an arbitrary number of samples without necessarily causing an overflow.
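For illustration, a rolling mean of the kind speculated about here could look like the following sketch (plain NumPy, purely hypothetical; it is not claimed to be what reduce_mean actually does internally):

import numpy as np

def rolling_mean(values):
    # Incremental update: mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k.
    # The estimate stays on the scale of the data itself, so it cannot
    # overflow the way an explicit running sum can.
    mean = np.float32(0.0)
    for k, v in enumerate(values, start=1):
        mean += (v - mean) / np.float32(k)
    return mean

# Values large enough that their float32 sum exceeds the float32 maximum.
x = np.full(10**5, 1e35, dtype=np.float32)

print(x.sum())           # inf: summing first overflows float32
print(rolling_mean(x))   # ~1e35: the rolling mean stays finite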