Question:
I'm using TensorFlow and I modified the tutorial example to take my RGB images.
The algorithm works flawlessly out of the box on the new image set, until suddenly (usually while still converging, at around 92% accuracy) it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no obvious reason, the error is thrown. Adding
print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())
as debug code to each loop yields the following output:
Step 8600
max W values: 0.759422 0.295087 0.344725 0.583884
max b values: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W values: 0.75947 0.295084 0.344723 0.583893
max b values: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W values: 0.759521 0.295101 0.34472 0.5839
max b values: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
Since none of my values is very high, the only way a NaN can happen is through a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no explanation other than that this comes from the internal TF code.
I'm clueless about what to do with this. Any suggestions? The algorithm was converging nicely; its accuracy on my validation set was steadily climbing and had just reached 92.5% at iteration 8600.
Answer 1:
Actually, it turned out to be something stupid. I'm posting this in case anyone else runs into a similar error.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved all my problems.
Answer 2:
Actually, clipping is not a good idea, as it will stop the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
Answer 3:
A bias-free alternative.
Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity itself, not the region near it.
Specific Answer
def cross_entropy(x, y, axis=-1):
    safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
    return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
    return cross_entropy(x, x, axis)
But did it work?
x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512, 0.60943794, 0., -0.64332503], dtype=float32)
# Yay! No NaN.
General Recipe
Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf-generating function such that no inf can be created. Then use a second tf.where to always select the valid code path. That is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.
In Python code, the recipe is:
Instead of this:
tf.where(x_ok, f(x), safe_f(x))
Do this:
safe_x = tf.where(x_ok, x, safe_x)
tf.where(x_ok, f(safe_x), safe_f(x))
(Here the fallback value substituted for x, e.g. tf.ones_like(x) in the example below, can be anything on which f is finite.)
Example
Suppose you wish to compute:
f(x) = { 1/x, x!=0
{ 0, x=0
A naive implementation results in NaNs in the gradient, i.e.,
def f(x):
    x_ok = tf.not_equal(x, 0.)
    f = lambda x: 1. / x
    safe_f = tf.zeros_like
    return tf.where(x_ok, f(x), safe_f(x))
Does it work?
x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1., nan, -1.], dtype=float32)
# ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.
The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:
def safe_f(x):
    x_ok = tf.not_equal(x, 0.)
    f = lambda x: 1. / x
    safe_f = tf.zeros_like
    safe_x = tf.where(x_ok, x, tf.ones_like(x))
    return tf.where(x_ok, f(safe_x), safe_f(x))
But did it work?
x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1., 0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).
Answer 4:
If y_conv is the result of a softmax, say y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:
y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)
Answer 5:
You are trying to calculate the cross-entropy using the standard formula. Not only is the value undefined when x=0, it is also numerically unstable.
It is better to use tf.nn.softmax_cross_entropy_with_logits, or, if you really want to use the hand-crafted formula, to tf.clip_by_value the zeros to a very small number inside the log.
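For example, a minimal sketch (assuming logits are the raw, pre-softmax outputs of the network and y_ holds the one-hot labels, as in the tutorial code):
# Softmax and cross-entropy are fused into one numerically stable op,
# so no explicit log(0) can occur.
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))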
Answer 6:
Sometimes you use the tf.sqrt() function without adding a small constant such as 1e-10 inside it, and that induces this nan problem.
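For instance, a minimal sketch (x stands for whatever tensor can reach zero):
# The derivative of sqrt blows up at 0 and can surface as nan downstream;
# shifting the argument by a tiny constant avoids it.
safe_sqrt = tf.sqrt(x + 1e-10)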
Answer 7:
Here is the implementation of the binary (sigmoid) and categorical (softmax) cross-entropy losses in TensorFlow 1.1:
- https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_impl.py#L159
- https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_ops.py#L1609
As one can see, in the binary case they consider some special cases to achieve numerical stability:
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(relu_logits - logits * labels,
                    math_ops.log1p(math_ops.exp(neg_abs_logits)),
                    name=name)
Answer 8:
I used an LSTM for long sequences and got nan gradients. None of these answers helped me, but I came up with three solutions of my own. I hope they will be useful to other people who came here from a Google search.
Gradient clipping didn't help me because the gradients turned nan within a single batch update. In that case, you can replace the nans with zeros with lines like these:
opt = tf.train.AdamOptimizer(args.lr)
grads = opt.compute_gradients(loss)
grads2 = [(tf.where(tf.is_nan(grad), tf.zeros(grad.shape), grad), var) for grad, var in grads]
opt_op = opt.apply_gradients(grads2)
If you want to track whether nans appeared, you can use this code:
was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(g)) for g in grads]))
Replace LSTMCell with LayerNormBasicLSTMCell, an LSTM cell with layer normalization (something similar to batch norm between timesteps).
If you use regular recurrent state dropout, you can replace it with "Recurrent Dropout without Memory Loss". Code:
LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8)
Note that you can also turn on the dropout feature alone without layer normalization:
LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8)
Answer 9:
Besides all the great answers above, I will add mine. It's a less common scenario to run into, but it does cause NaN: division by zero.
In my network for an NLP task, there is a layer that does average pooling. Namely, each data point is a sequence of tokens. My layer embeds the tokens and then calculates the average of the embedded vectors.
The average calculation is coded as
tf.reduce_sum(embedded)/tf.reduce_sum(tf.not_equal(input, pad))
Here pad is some dummy token I use in batch processing.
Now, if some data point contains an empty token list (for whatever reason), its length (the denominator in the code snippet above) would be 0. That causes a division-by-zero issue, and the NaN will persist through all the following layers / optimization steps.
In case anyone runs into this issue: I used tf.where to smooth those lengths:
sum_embedding = tf.reduce_sum(embedded, 1)
embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32), axis=1, keep_dims=True)
embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0), embedding_length, tf.ones(tf.shape(embedding_length)))
avg_embedding = sum_embedding / embedding_length_smoothed
Essentially, this treats every data point with a 0-length token list as having length 1, and avoids the NaN issue.
Answer 10:
I was getting nans sometimes, and not at other times, while working on a standard feed-forward network. I had previously used similar TensorFlow code and it worked fine.
It turned out that I had imported the variable names by accident. So, as soon as the first row (the variable names) was selected in a batch, the nan losses started. Maybe keep an eye out for that? A quick check is sketched below.
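A sanity check along those lines (a sketch, assuming the data is loaded with pandas; train.csv is a hypothetical path):
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical input file
# If a header row slipped in as data, the affected columns show up
# as dtype 'object' instead of a numeric dtype.
print(df.dtypes)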
Answer 11:
I will add here one of my previous problems with NaNs. I was using the sigmoid function as the activation of the last layer of my network. However, the sigmoid activation function is computed with the exponential function, and some really big numbers were entering the sigmoid.
It resulted in infinite gradients, and some NaNs started to appear.
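If that sigmoid output feeds a log-loss, one way to sidestep the overflow (a sketch, not the poster's code; logits and y_ stand for the raw pre-sigmoid outputs and the 0/1 targets) is to let TensorFlow fuse the two steps:
# Sigmoid and cross-entropy are computed together in a numerically
# stable way, so huge logits no longer produce inf/nan.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))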
Answer 12:
I've been using TensorFlow Estimators, which I believe account for those division-by-zero and other numerical stability issues, and I occasionally get this error (ERROR:tensorflow:Model diverged with loss = NaN during training). Most of the time when I get it, it is because my inputs include nans. So: be sure that your input dataframes (or whatever you use) don't have NaN values hidden somewhere in them.
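A quick pre-flight check (a sketch, assuming the features live in a pandas DataFrame called df):
# Fail loudly before training if any feature or label contains NaN.
assert not df.isnull().values.any(), "input contains NaN values"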