I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and train a simple NN on top of these features.
The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., which are automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should achieve a high accuracy score.
Input:
def 1
for 2
in 2
True 1
): 3
,: 1
...
Output: python
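To make the encoding concrete, here's a minimal sketch of how a snippet could be turned into such a vector. The tokenizer and the vocabulary below are simplified placeholders for illustration, not the ones used in the actual script:

import re
from collections import Counter
import numpy as np

# Hypothetical vocabulary of "distinctive" tokens (the real one is extracted from the corpus).
VOCAB = ['def', 'self', 'function', '->', 'const', '#include', 'for', 'in', 'True']

def bag_of_tokens(snippet, vocab=VOCAB):
    """Encode a code snippet as a fixed-length vector of token counts."""
    tokens = re.findall(r'\S+', snippet)              # naive whitespace tokenizer
    counts = Counter(tokens)
    return np.array([counts[token] for token in vocab], dtype=np.float32)

print(bag_of_tokens('def f(x):\n    for i in range(x):\n        yield True'))
# -> [1. 0. 0. 0. 0. 0. 1. 1. 1.]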
Setup
I got 99% accuracy pretty quickly and decided that was a sign it was working just as expected. Here's the model (a full runnable script is here):
# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg,
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
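For reference, the training loop for this graph looks roughly like the sketch below. The optimizer choice, learning rate, and the batches iterator are assumptions for illustration rather than the exact code from the script:

# Training loop (sketch; optimizer, learning rate and `batches` are assumptions)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss + tf.losses.get_regularization_loss())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i, (x_batch, y_batch) in enumerate(batches, start=1):   # `batches` is a hypothetical iterator
        _, loss_val, acc_val = sess.run([train_op, loss, accuracy],
                                        feed_dict={x: x_batch, y: y_batch, training: True})
        if i % 5 == 0:
            print('iteration=%d loss=%.3f train-acc=%.5f' % (i, loss_val, acc_val))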
The output is pretty encouraging:
iteration=5 loss=2.580 train-acc=0.34277
iteration=10 loss=2.029 train-acc=0.69434
iteration=15 loss=2.054 train-acc=0.92383
iteration=20 loss=1.934 train-acc=0.98926
iteration=25 loss=1.942 train-acc=0.99609
Files.VAL mean accuracy = 0.99121 <-- After just 1 epoch!
iteration=30 loss=1.943 train-acc=0.99414
iteration=35 loss=1.947 train-acc=0.99512
iteration=40 loss=1.946 train-acc=0.99707
iteration=45 loss=1.946 train-acc=0.99609
iteration=50 loss=1.944 train-acc=0.99902
iteration=55 loss=1.946 train-acc=0.99902
Files.VAL mean accuracy = 0.99414
Test accuracy was also around 1.0. Everything looked perfect.
Mysterious ReLU
But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before the softmax, because they simply indicate classes with low probability. Clamping them at zero only makes those classes artificially more probable, which would be a mistake. Getting rid of the activation should only make the model more robust and more confident in the correct class.
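A tiny made-up example illustrates what I expected: clamping negative logits at zero shifts probability mass toward the unlikely classes.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, -2.0, -5.0])        # one strong class, two unlikely ones
print(softmax(logits))                      # ~[0.993, 0.007, 0.000]
print(softmax(np.maximum(logits, 0.0)))     # after ReLU: ~[0.909, 0.045, 0.045]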
That's what I thought.
So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:
iteration=5 loss=5.236 train-acc=0.16602
iteration=10 loss=4.068 train-acc=0.18750
iteration=15 loss=3.110 train-acc=0.37402
iteration=20 loss=5.149 train-acc=0.14844
iteration=25 loss=2.880 train-acc=0.18262
Files.VAL mean accuracy = 0.28711
iteration=30 loss=3.136 train-acc=0.25781
iteration=35 loss=2.916 train-acc=0.22852
iteration=40 loss=2.156 train-acc=0.39062
iteration=45 loss=1.777 train-acc=0.45312
iteration=50 loss=2.726 train-acc=0.33105
Files.VAL mean accuracy = 0.29362
The accuracy got better with training but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularization, extra layers, anything), and always got the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely reached 90% after 50 epochs. According to TensorBoard, there was no big difference in the weight distributions: the gradients didn't die out and both models learned normally.
How is this possible? How can the final ReLU make a model so much better? Especially if this ReLU is a bug?