I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and train a simple NN on top of these features.
The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., which are automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should achieve a high accuracy score.
Input:
def 1
for 2
in 2
True 1
): 3
,: 1
...
Output: python
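To make the encoding concrete, here's a minimal sketch of how a snippet could be turned into such a vector. The tokenizer and the vocabulary below are simplified placeholders for illustration, not the ones used in the actual script:

import re
from collections import Counter
import numpy as np

# Hypothetical vocabulary of "distinctive" tokens (the real one is extracted from the corpus).
VOCAB = ['def', 'self', 'function', '->', 'const', '#include', 'for', 'in', 'True']

def bag_of_tokens(snippet, vocab=VOCAB):
    """Encode a code snippet as a fixed-length vector of token counts."""
    tokens = re.findall(r'\S+', snippet)              # naive whitespace tokenizer
    counts = Counter(tokens)
    return np.array([counts[token] for token in vocab], dtype=np.float32)

print(bag_of_tokens('def f(x):\n    for i in range(x):\n        yield True'))
# -> [1. 0. 0. 0. 0. 0. 1. 1. 1.]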
Setup
I got 99% accuracy pretty quickly and decided that was a sign it was working just as expected. Here's the model (a full runnable script is here):
# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg,
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
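For reference, the training loop for this graph looks roughly like the sketch below. The optimizer choice, learning rate, and the batches iterator are assumptions for illustration rather than the exact code from the script:

# Training loop (sketch; optimizer, learning rate and `batches` are assumptions)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss + tf.losses.get_regularization_loss())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i, (x_batch, y_batch) in enumerate(batches, start=1):   # `batches` is a hypothetical iterator
        _, loss_val, acc_val = sess.run([train_op, loss, accuracy],
                                        feed_dict={x: x_batch, y: y_batch, training: True})
        if i % 5 == 0:
            print('iteration=%d loss=%.3f train-acc=%.5f' % (i, loss_val, acc_val))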
The output is pretty encouraging:
iteration=5 loss=2.580 train-acc=0.34277
iteration=10 loss=2.029 train-acc=0.69434
iteration=15 loss=2.054 train-acc=0.92383
iteration=20 loss=1.934 train-acc=0.98926
iteration=25 loss=1.942 train-acc=0.99609
Files.VAL mean accuracy = 0.99121 <-- After just 1 epoch!
iteration=30 loss=1.943 train-acc=0.99414
iteration=35 loss=1.947 train-acc=0.99512
iteration=40 loss=1.946 train-acc=0.99707
iteration=45 loss=1.946 train-acc=0.99609
iteration=50 loss=1.944 train-acc=0.99902
iteration=55 loss=1.946 train-acc=0.99902
Files.VAL mean accuracy = 0.99414
Test accuracy was also around 1.0. Everything looked perfect.
Mysterious ReLU
But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before the softmax, because they simply indicate classes with low probability. Clamping them at zero only makes those classes artificially more probable, which would be a mistake. Getting rid of the activation should only make the model more robust and more confident in the correct class.
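A tiny made-up example illustrates what I expected: clamping negative logits at zero shifts probability mass toward the unlikely classes.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, -2.0, -5.0])        # one strong class, two unlikely ones
print(softmax(logits))                      # ~[0.993, 0.007, 0.000]
print(softmax(np.maximum(logits, 0.0)))     # after ReLU: ~[0.909, 0.045, 0.045]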
That's what I thought.
So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:
iteration=5 loss=5.236 train-acc=0.16602
iteration=10 loss=4.068 train-acc=0.18750
iteration=15 loss=3.110 train-acc=0.37402
iteration=20 loss=5.149 train-acc=0.14844
iteration=25 loss=2.880 train-acc=0.18262
Files.VAL mean accuracy = 0.28711
iteration=30 loss=3.136 train-acc=0.25781
iteration=35 loss=2.916 train-acc=0.22852
iteration=40 loss=2.156 train-acc=0.39062
iteration=45 loss=1.777 train-acc=0.45312
iteration=50 loss=2.726 train-acc=0.33105
Files.VAL mean accuracy = 0.29362
The accuracy got better with training but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularization, extra layers, anything), and always got the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely reached 90% after 50 epochs. According to TensorBoard, there was no big difference in the weight distributions: the gradients didn't die out and both models learned normally.
How is this possible? How can the final ReLU make a model so much better? Especially if this ReLU is a bug?