I have seen a few different mean squared error loss functions in various posts for regression models in Tensorflow:
loss = tf.reduce_sum(tf.pow(prediction - Y,2))/(n_instances)
loss = tf.reduce_mean(tf.squared_difference(prediction, Y))
loss = tf.nn.l2_loss(prediction - Y)
What are the differences between these?
I would say that the third equation is different, while the first and second are formally the same but may behave differently due to numerical concerns.
I think that the third equation (using l2_loss) is just returning 1/2 of the squared Euclidean norm, that is, half the sum of the element-wise squares of the input x = prediction - Y. You are not dividing by the number of samples anywhere. Thus, if you have a very large number of samples, the computation may overflow (returning Inf).
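If you want to convince yourself of this, a quick check along these lines (TF 1.x API, matching the question's snippets) should print two numbers that match up to float rounding:

import tensorflow as tf

x = tf.random_normal((3, 4))
half_sq_norm = tf.reduce_sum(tf.square(x)) / 2  # 1/2 of the squared Euclidean norm

with tf.Session() as sess:
    # Both tensors are evaluated from the same random draw within one run() call.
    print(sess.run([tf.nn.l2_loss(x), half_sq_norm]))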
The other two are formally the same, computing the mean of the element-wise squared x tensor. The documentation does not say so explicitly, but it is very likely that reduce_mean uses an algorithm adapted to avoid overflowing with a very large number of samples. In other words, it probably does not sum everything first and then divide by N, but uses some kind of rolling mean that can adapt to an arbitrary number of samples without necessarily causing an overflow.
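I have not checked how reduce_mean is actually implemented, but the underlying float32 issue is easy to reproduce in plain NumPy: a naive running sum stops growing once the accumulator gets large, while a mean computed with a wider accumulator stays accurate:

import numpy as np

n = 50000000
ones = np.ones(n, dtype=np.float32)

# Naive left-to-right accumulation in float32: once the running total reaches
# 2**24 (16777216.0), adding another 1.0 no longer changes it.
naive_sum = np.cumsum(ones)[-1]
print(naive_sum, naive_sum / n)      # 16777216.0 and ~0.34, not 50000000.0 and 1.0

# Accumulating in float64 gives the correct mean.
print(ones.mean(dtype=np.float64))   # 1.0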
The first and the second loss functions calculate the same thing, but in a slightly different way. The third function calculates something completely different. You can see this by executing the following code:
import tensorflow as tf
from functools import reduce  # reduce() lives in functools on Python 3

shape_obj = (100, 6, 12)  # any shape works, e.g. (5, 5)
n_elements = reduce(lambda x, y: x * y, shape_obj)  # total number of elements

Y1 = tf.random_normal(shape=shape_obj)
Y2 = tf.random_normal(shape=shape_obj)

loss1 = tf.reduce_sum(tf.pow(Y1 - Y2, 2)) / n_elements
loss2 = tf.reduce_mean(tf.squared_difference(Y1, Y2))
loss3 = tf.nn.l2_loss(Y1 - Y2)

with tf.Session() as sess:
    print(sess.run([loss1, loss2, loss3]))
# when I run it I got: [2.0291963, 2.0291963, 7305.1069]
Now you can verify that the first and the second calculate the same thing (in theory) by noticing that tf.pow(a - b, 2) is the same as tf.squared_difference(a, b). Also, reduce_mean is the same as reduce_sum divided by the number of elements.
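You can also check the element-wise equivalence directly; this small snippet (made-up constant tensors, same TF 1.x API as above) should print the same matrix twice:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[0.5, 2.5], [2.0, 6.0]])

with tf.Session() as sess:
    print(sess.run(tf.pow(a - b, 2)))             # [[0.25, 0.25], [1.0, 4.0]]
    print(sess.run(tf.squared_difference(a, b)))  # same values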
The thing is that computers can't calculate everything exactly. To see what numerical instabilities can do to your calculations, take a look at this:
import tensorflow as tf
from functools import reduce

shape_obj = (5000, 5000, 10)
n_elements = reduce(lambda x, y: x * y, shape_obj)

Y1 = tf.zeros(shape=shape_obj)
Y2 = tf.ones(shape=shape_obj)

loss1 = tf.reduce_sum(tf.pow(Y1 - Y2, 2)) / n_elements
loss2 = tf.reduce_mean(tf.squared_difference(Y1, Y2))

with tf.Session() as sess:
    print(sess.run([loss1, loss2]))
It is easy to see that the answer should be 1, but you will get something like this: [1.0, 0.26843545].
Regarding your last function, the documentation says:

Computes half the L2 norm of a tensor without the sqrt: output = sum(t ** 2) / 2
So if you want it to calculate the same thing (in theory) as the first one, you need to scale it appropriately:
loss3 = tf.nn.l2_loss(Y1 - Y2) * 2 / n_elements
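As a quick self-contained check (TF 1.x API as above, arbitrary shape), the rescaled l2_loss should agree with the mean-based version up to float32 rounding:

import tensorflow as tf
from functools import reduce

shape_obj = (100, 6, 12)
n_elements = reduce(lambda x, y: x * y, shape_obj)

Y1 = tf.random_normal(shape=shape_obj)
Y2 = tf.random_normal(shape=shape_obj)

mse = tf.reduce_mean(tf.squared_difference(Y1, Y2))
scaled_l2 = tf.nn.l2_loss(Y1 - Y2) * 2 / n_elements

with tf.Session() as sess:
    # Both values come from the same random draw within one run() call,
    # so they should match up to floating-point rounding.
    print(sess.run([mse, scaled_l2]))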