TensorFlow NaN bug?

Posted 2020-01-23 15:45

Question:

I'm using TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new image set until, while still converging (usually at around 92% accuracy), it suddenly crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding

print("max W vales: %g %g %g %g" % (
    tf.reduce_max(tf.abs(W_conv1)).eval(),
    tf.reduce_max(tf.abs(W_conv2)).eval(),
    tf.reduce_max(tf.abs(W_fc1)).eval(),
    tf.reduce_max(tf.abs(W_fc2)).eval()))
print("max b vales: %g %g %g %g" % (
    tf.reduce_max(tf.abs(b_conv1)).eval(),
    tf.reduce_max(tf.abs(b_conv2)).eval(),
    tf.reduce_max(tf.abs(b_fc1)).eval(),
    tf.reduce_max(tf.abs(b_fc2)).eval()))

as debug code to each loop, yields the following output:

Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values is very high, the only way a NaN can happen is by a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no other explanation than that this comes from the internal TF code.

I'm clueless on what to do with this. Any suggestions? The algorithm is converging nicely, its accuracy on my validation set was steadily climbing and just reached 92.5% at iteration 8600.

Answer 1:

Actually, it turned out to be something stupid. I'm posting this in case anyone else would run into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. For some samples, certain classes can be ruled out with certainty after a while, resulting in y_conv=0 for that sample and class. That's normally not a problem since you're not interested in those classes, but the way cross_entropy is written there, it yields 0*log(0) = 0*(-inf) for that particular sample/class, which is NaN.

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.
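
The failure mode and the clipping fix can be reproduced without TensorFlow. The following is a minimal pure-Python sketch, not the tutorial's code: ieee_log, naive_cross_entropy and clipped_cross_entropy are hypothetical helper names, and ieee_log only emulates the assumption that the log of 0 evaluates to -inf instead of raising an error.

```python
import math

def ieee_log(p):
    # Assumption: emulate a tensor log, where log(0) is -inf (math.log(0) would raise).
    return math.log(p) if p > 0.0 else float("-inf")

def naive_cross_entropy(labels, probs):
    # 0 * (-inf) is NaN, so a single (label=0, prob=0) pair poisons the whole sum.
    return -sum(t * ieee_log(p) for t, p in zip(labels, probs))

def clipped_cross_entropy(labels, probs, lo=1e-10, hi=1.0):
    # Clip each probability into [lo, hi] before taking the log.
    return -sum(t * math.log(min(max(p, lo), hi)) for t, p in zip(labels, probs))

labels = [1.0, 0.0, 0.0]
probs = [1.0, 0.0, 0.0]   # the network is certain about this sample
nan_loss = naive_cross_entropy(labels, probs)
safe_loss = clipped_cross_entropy(labels, probs)
```

Here nan_loss is NaN even though the prediction is perfect, while safe_loss is zero.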



Answer 2:

Actually, clipping is not a good idea, as it stops the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))


Answer 3:

A bias-free alternative.

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity, not the region near it.

Specific Answer

def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)

But did it work?

x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512,  0.60943794, 0., -0.64332503], dtype=float32)  Yay! No NaN.


General Recipe

Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf generating function such that no inf can be created. Then use a second tf.where to always select the valid code-path. That is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.

In Python code, the recipe is:

Instead of this:

tf.where(x_ok, f(x), safe_f(x))

Do this:

safe_x = tf.where(x_ok, x, safe_value)  # safe_value: any input where f is finite, e.g. tf.ones_like(x)
tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you wish to compute:

f(x) = { 1/x, x!=0
       { 0,   x=0

A naive implementation results in NaNs in the gradient, i.e.,

def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))

Does it work?

x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1.,  nan,  -1.], dtype=float32)
#  ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:

def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1.,  0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).
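
Why the naive version yields NaN becomes visible when the chain rule is written out by hand. The following is a minimal pure-Python sketch for a single scalar, not TensorFlow's actual autodiff: the names recip_grad, naive_grad and double_where_grad are made up for illustration, and it assumes that where routes an upstream gradient of 1 to the selected branch and 0 to the other.

```python
import math

INF = float("inf")

def recip_grad(x):
    # d/dx (1/x) = -1/x**2; at x == 0 use the IEEE result, -inf.
    return -INF if x == 0.0 else -1.0 / (x * x)

def naive_grad(x):
    # The unselected 1/x branch still receives a gradient of 0,
    # and 0 * (-inf) is NaN at x == 0.
    mask = 1.0 if x != 0.0 else 0.0
    return mask * recip_grad(x)

def double_where_grad(x):
    ok = x != 0.0
    mask = 1.0 if ok else 0.0
    safe_x = x if ok else 1.0          # inner where: never feed 0 into 1/x
    # Outer where contributes mask, 1/x contributes a finite value,
    # inner where contributes mask again (d safe_x / dx).
    return mask * recip_grad(safe_x) * mask

grads_naive = [naive_grad(v) for v in (-1.0, 0.0, 1.0)]
grads_safe = [double_where_grad(v) for v in (-1.0, 0.0, 1.0)]
```

The point of the sketch is that masking the output is not enough: the masked branch must also be evaluated at a point where its derivative is finite.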


Answer 4:

If y_conv is the result of a softmax, say, y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:

y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)
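
The reason log_softmax behaves better is that the log is never applied to a probability that has underflowed to 0. A common way to compute it is the log-sum-exp trick; the sketch below shows that idea in plain Python and is not TensorFlow's implementation (the function names are illustrative).

```python
import math

def log_softmax(logits):
    # Subtract the maximum before exponentiating so exp() never overflows,
    # then subtract the log of the normalizer instead of dividing and taking a log.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def cross_entropy_from_logits(labels, logits):
    return -sum(t * lp for t, lp in zip(labels, log_softmax(logits)))

# Extreme logits: softmax would give exactly 0 for the true class here.
extreme_loss = cross_entropy_from_logits([0.0, 0.0, 1.0], [1000.0, 0.0, -1000.0])
```

With these logits the loss is large but finite, whereas taking the log of the softmax output would produce -inf.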


Answer 5:

You are trying to calculate cross-entropy using the standard formula. Not only is the value undefined when x=0, the formula is also numerically unstable.

It is better to use tf.nn.softmax_cross_entropy_with_logits or, if you really want to use a hand-crafted formula, to use tf.clip_by_value to replace zeros with a very small number inside the log.



Answer 6:

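A missing small constant in tf.sqrt() can also cause this NaN problem: the gradient of sqrt(x) is 1/(2*sqrt(x)), which is infinite at x=0. Adding a small constant such as 1e-10 inside the square root avoids it.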
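
A small pure-Python sketch of the effect, with a hypothetical helper sqrt_grad that evaluates the derivative of sqrt(x + eps) by hand (this is not a TensorFlow call):

```python
import math

def sqrt_grad(x, eps=0.0):
    # d/dx sqrt(x + eps) = 1 / (2 * sqrt(x + eps)); infinite when the root is 0.
    root = math.sqrt(x + eps)
    return float("inf") if root == 0.0 else 0.5 / root

grad_at_zero = sqrt_grad(0.0)            # inf
poisoned = 0.0 * grad_at_zero            # chain rule with a zero upstream gradient: NaN
grad_with_eps = sqrt_grad(0.0, 1e-10)    # large but finite
```

The infinite derivative only turns into NaN once it is multiplied by zero (or combined with another infinity) further along the chain rule, which is why the failure can show up far from the sqrt itself.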



Answer 7:

Here is the implementation of the binary (sigmoid) and categorical (softmax) cross-entropy losses in TensorFlow 1.1:

  • https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_impl.py#L159
  • https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_ops.py#L1609

As one can see in the binary case they consider some special cases to achieve numerical stability:

# The logistic loss formula from above is
#   x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
#   -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
#   max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(relu_logits - logits * labels,
                    math_ops.log1p(math_ops.exp(neg_abs_logits)),
                    name=name)
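
The combined formula from the comments above can be checked in plain Python. This is a scalar sketch for one logit x and one label z, with illustrative function names, not the TensorFlow source itself:

```python
import math

def stable_sigmoid_cross_entropy(x, z):
    # max(x, 0) - x * z + log(1 + exp(-|x|)): the exponent is never positive,
    # so exp() cannot overflow for any finite logit.
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def naive_sigmoid_cross_entropy(x, z):
    # x - x * z + log(1 + exp(-x)): exp(-x) overflows for very negative x.
    # In plain Python math.exp raises OverflowError there; in float tensor
    # code the same expression would overflow to inf instead.
    return x - x * z + math.log(1.0 + math.exp(-x))

loss_neg = stable_sigmoid_cross_entropy(-1000.0, 1.0)
loss_pos = stable_sigmoid_cross_entropy(1000.0, 0.0)
```

For moderate logits the two functions agree; they differ only where the naive form overflows.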


Answer 8:

I used an LSTM on long sequences and got NaN gradients. None of these answers helped me, but I came up with three solutions of my own. I hope they will be useful to other people who come here from a Google search.

  1. Gradient clipping didn't help me because the gradients turned NaN within a single batch update. In this case, you can replace NaNs with zeros using lines like these:

    opt = tf.train.AdamOptimizer(args.lr)
    grads = opt.compute_gradients(loss)
    grads2 = [(tf.where(tf.is_nan(grad), tf.zeros_like(grad), grad), var) for grad, var in grads if grad is not None]
    opt_op = opt.apply_gradients(grads2)
    

    If you want to track whether NaNs appeared, you can use this code:

    was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(g)) for g, _ in grads if g is not None]))
    
  2. Replace LSTMCell with LayerNormBasicLSTMCell - an LSTM cell with layer norm - something similar to batch norm between timesteps.

  3. If you use regular recurrent state dropout you can replace it with "Recurrent Dropout without Memory Loss". Code:

    LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8)
    

    Note that you can also turn on the dropout feature alone without layer normalization:

    LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8)
    


Answer 9:

Besides all the great answers above, I will add mine. It's a scenario less common to run into, but does cause NaN: divide by zero.

In my network for an NLP task, there is a layer that does average pooling. Each example is a sequence of tokens; the layer embeds the tokens and then calculates the average of the embedded vectors.

The average calculation is coded as

tf.reduce_sum(embedded, 1) / tf.reduce_sum(tf.cast(tf.not_equal(input, pad), tf.float32), axis=1, keep_dims=True)

Here pad is some dummy token I use in batch processing.

Now if some example contains an empty token list (for whatever reason), its length (the denominator in the code snippet above) is 0. That causes a division by zero, and the NaN persists through all the following layers and optimization steps.

In case anyone runs into this issue, I used tf.where to smooth those lengths:

sum_embedding = tf.reduce_sum(embedded, 1)
embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32), axis=1, keep_dims=True)
embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0), embedding_length, tf.ones(tf.shape(embedding_length)))
avg_embedding = sum_embedding / embedding_length_smoothed

Essentially, this treats every example with a 0-length token list as having length 1, which avoids the NaN issue.
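
The same guard, reduced to scalars per example, looks like this in plain Python. It is only a sketch of the idea: masked_mean is a hypothetical helper, each row holds one number per token instead of an embedding vector, and the mask uses 1 for real tokens and 0 for padding.

```python
def masked_mean(rows, masks):
    means = []
    for row, mask in zip(rows, masks):
        length = sum(mask)
        # Treat an all-padding row as length 1 so the division is always defined.
        safe_length = length if length > 0 else 1
        total = sum(v for v, m in zip(row, mask) if m)
        means.append(total / safe_length)
    return means

batch = [[1.0, 3.0], [5.0, 7.0]]
masks = [[1, 1], [0, 0]]      # the second example contains only padding
pooled = masked_mean(batch, masks)
```

The all-padding example gets an average of zero instead of a NaN.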



Answer 10:

I was getting NaNs intermittently while working on a standard feed-forward network, even though I had previously used similar TensorFlow code and it worked fine.

It turned out that I had accidentally imported the header row (the variable names) as data. So, as soon as that first row was selected in a batch, the NaN losses started. Maybe keep an eye out for that?



Answer 11:

I will add one of my previous problems with NaNs here. I was using the sigmoid function as the activation of the last layer of my network. However, the sigmoid is computed with the exponential function, and some really big numbers were entering it.

That resulted in infinite gradients, and NaNs started to appear.
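
One common way to keep the exponential from overflowing is to choose the form of the sigmoid by the sign of the input. The sketch below shows that technique in plain Python; it is not necessarily what the author ended up doing, and stable_sigmoid is an illustrative name.

```python
import math

def stable_sigmoid(x):
    # For x >= 0, exp(-x) is at most 1; for x < 0, exp(x) is below 1.
    # Either way the argument to exp() is never positive, so it cannot overflow.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

outputs = [stable_sigmoid(v) for v in (-1000.0, 0.0, 1000.0)]
```

The textbook form 1 / (1 + exp(-x)) would evaluate exp(1000) for the first input, which overflows.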



Answer 12:

I've been using TensorFlow Estimator, which I believe accounts for division by zero and other numerical stability issues, and I still occasionally get this error (ERROR:tensorflow:Model diverged with loss = NaN during training). Most of the time, the cause is that my inputs include NaNs. So: be sure that your input dataframes (or whatever you use) don't have NaN values hidden somewhere in them.
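
A quick check before training can be sketched as follows, assuming the inputs are available as rows of plain floats; rows_with_nan is a hypothetical helper, not part of TensorFlow or Estimator.

```python
import math

def rows_with_nan(rows):
    # Return the indices of rows that contain at least one NaN value.
    return [i for i, row in enumerate(rows) if any(math.isnan(v) for v in row)]

data = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0]]
bad_rows = rows_with_nan(data)
```

Rows flagged this way can then be dropped or imputed before they reach the model.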