tf.keras loss becomes NaN

I'm programming a neural network in tf.keras, with 3 layers. My dataset is the MNIST dataset. I decreased the number of examples in the dataset, so the runtime is lower. This is my code:

import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

!git clone https://github.com/DanorRon/data
%cd data
!ls

batch_size = 32
epochs = 10
alpha = 0.0001
lambda_ = 0
h1 = 50

train = pd.read_csv('/content/first-repository/mnist_train.csv.zip')
test = pd.read_csv('/content/first-repository/mnist_test.csv.zip')

train = train.loc['1':'5000', :]
test = test.loc['1':'2000', :]

train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)

x_train = train.loc[:, '1x1':'28x28']
y_train = train.loc[:, 'label']

x_test = test.loc[:, '1x1':'28x28']
y_test = test.loc[:, 'label']

x_train = x_train.values
y_train = y_train.values

x_test = x_test.values
y_test = y_test.values

nb_classes = 10
targets = y_train.reshape(-1)
y_train_onehot = np.eye(nb_classes)[targets]

nb_classes = 10
targets = y_test.reshape(-1)
y_test_onehot = np.eye(nb_classes)[targets]

model = tf.keras.Sequential()
model.add(layers.Dense(784, input_shape=(784,)))
model.add(layers.Dense(h1, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(lambda_)))
model.add(layers.Dense(10, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l2(lambda_)))

model.compile(optimizer=tf.train.GradientDescentOptimizer(alpha), 
             loss = 'categorical_crossentropy',
             metrics = ['accuracy'])

model.fit(x_train, y_train_onehot, epochs=epochs, batch_size=batch_size)

Whenever I run it, one of 3 things happens:

The loss decreases and the accuracy increases for a few epochs, until the loss becomes NaN for no apparent reason and the accuracy plummets.
The loss and accuracy stay the same for each epoch. Usually the loss is 2.3025 and the accuracy is 0.0986.
The loss starts at NaN(and stays that way), while the accuracy stays low.

Most of the time, the model does one of these things, but sometimes it does something random. It seems like the type of erratic behavior that occurs is completely random. I have no idea what the problem is. How do I fix this problem?

Edit: Sometimes, the loss decreases, but the accuracy stays the same. Also, sometimes the loss decreases and the accuracy increases, then after a while the accuracy decreases while the loss still decreases. Or, the loss decreases and the accuracy increases, then it switches and the loss goes up fast while the accuracy plummets, eventually ending with loss: 2.3025 acc: 0.0986.

Edit 2: This is an example of something that sometimes happens:

Epoch 1/100
49999/49999 [==============================] - 5s 92us/sample - loss: 1.8548 - acc: 0.2390

Epoch 2/100
49999/49999 [==============================] - 5s 104us/sample - loss: 0.6894 - acc: 0.8050

Epoch 3/100
49999/49999 [==============================] - 4s 90us/sample - loss: 0.4317 - acc: 0.8821

Epoch 4/100
49999/49999 [==============================] - 5s 104us/sample - loss: 2.2178 - acc: 0.1345

Epoch 5/100
49999/49999 [==============================] - 5s 90us/sample - loss: 2.3025 - acc: 0.0986

Epoch 6/100
49999/49999 [==============================] - 4s 90us/sample - loss: 2.3025 - acc: 0.0986

Epoch 7/100
49999/49999 [==============================] - 4s 89us/sample - loss: 2.3025 - acc: 0.0986

Edit 3: I changed the loss to mean squared error and the network works well now. Is there a way to keep it in cross entropy without it converging to a local minimum?

回答1:

I changed the loss to mean squared error and the network works well now

MSE is not the appropriate loss function for such classification problems; you should certainly stick to loss = 'categorical_crossentropy'.

Most probably, the issue is due to your MNIST data being not normalized; you should normalize your final variables as

x_train = x_train.values/255
x_test = x_test.values/255

Not normalizing input data is a known cause of exploding gradient problems, which is probably what is happening here.

Other advice: set activation='relu' for your first dense layer, and get rid of both the regularizer & initializer arguments from all layers (the default glorot_uniform is actually a better initializer, while regularization here may actually be harmful for the performance).

As a general advice, try not to reinvent the wheel - start with a Keras example using the built-in MNIST data...

回答2:

The frustration your feeling towards the seemly random output of your code is understandable and correctly identified. Every time the model begins training it randomly initializes the weights. Depending on this initialization you see one of your three output scenarios.

The issue is most likely due to vanishing gradients. It's a phenomenon that occurs when the backpropagation causes very small weights to be multiplied by a small number to create an almost infinitely small value. The solution is to add small jitter (1e-10) to each of your gradients (from within the cost function) so that they never reach zero.

There are tons of more detailed blogs about vanishing gradients online and for an implementation example checkout line 217 of this TensorFlow Network