I'm building a convolutional neural network with TensorFlow (I'm new to both) to recognize letters. I've run into very strange behavior with the dropout layer: if I leave it out (i.e. keep_proba at 1), the network performs quite well and learns (see the TensorBoard screenshots of accuracy and loss below, with training in blue and testing in orange).
However, when I enable the dropout layer during the training phase (I tried 0.8 and 0.5), the network learns nothing: the loss quickly drops to around 3 or 4 and then stops moving (I also noticed that the network always predicts the same values, regardless of the input image). Here are the same graphs:
What could be causing this strange behavior? I've read that dropout is a good way to avoid overfitting. Am I using it wrong?
Here's my network architecture, in case it's useful:
CONVOLUTION -> MAX_POOL -> RELU -> CONVOLUTION -> MAX_POOL -> RELU -> FC 1024 neurons -> DROPOUT -> OUTPUT LAYER
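In case the wiring matters, here's a simplified sketch of roughly how I build the graph (the placeholder names, layer sizes, input shape, and the 26-class output are illustrative, not my exact code):

```python
import tensorflow as tf  # TF 1.x-style API

# keep_proba is fed 0.8 / 0.5 during training, 1.0 for evaluation
x = tf.placeholder(tf.float32, [None, 28, 28, 1])   # assumed input size
keep_proba = tf.placeholder(tf.float32)

conv1 = tf.layers.conv2d(x, filters=32, kernel_size=5, padding='same')
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
act1 = tf.nn.relu(pool1)

conv2 = tf.layers.conv2d(act1, filters=64, kernel_size=5, padding='same')
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)
act2 = tf.nn.relu(pool2)

flat = tf.layers.flatten(act2)
fc = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
dropped = tf.nn.dropout(fc, keep_proba)          # the dropout layer in question
logits = tf.layers.dense(dropped, 26)            # one class per letter (assumed)
```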
Thanks a lot for any help or ideas.
Dropout is a regularization technique that makes a network fit the data more robustly by randomly removing neurons at train time, with a probability set by the dropout rate. It can be a powerful tool for mitigating over-fitting in a neural network.
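To make the mechanics concrete, here is a minimal, self-contained sketch in the 1.x-style API your keep_proba suggests (the names and sizes are assumptions, not your code): at train time a fraction 1 - keep_prob of the activations is zeroed out and the survivors are rescaled by 1/keep_prob, while feeding keep_prob = 1.0 turns dropout off entirely.

```python
import numpy as np
import tensorflow as tf  # TF 1.x-style API

x = tf.placeholder(tf.float32, [None, 64])
keep_prob = tf.placeholder(tf.float32)          # dropout rate = 1 - keep_prob

hidden = tf.layers.dense(x, 128)                # linear layer, just for illustration
dropped = tf.nn.dropout(hidden, keep_prob)      # zeroes units at random, rescales the rest

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(4, 64).astype(np.float32)
    # Train-time behaviour: roughly half the activations are zeroed
    train_out = sess.run(dropped, {x: batch, keep_prob: 0.5})
    # Test-time behaviour: keep_prob = 1.0 disables dropout
    test_out = sess.run(dropped, {x: batch, keep_prob: 1.0})
    print("fraction zeroed (train):", (train_out == 0).mean())
    print("fraction zeroed (test): ", (test_out == 0).mean())
```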
There are really no hard-and-fast rules for how much dropout regularization to use. With the right training data you might not need any dropout at all, while in other cases its absence will result in serious overfitting pathologies. In your case, it appears that 50% or 80% dropout rates may be excessive (over-regularization generally leads to under-fitting).
The typical indicator of over-fitting is divergence between the train and test curves: often both improve for a while, then the training error keeps going down while the test error starts heading in the opposite direction. While your training error is clearly lower than your test error, the test error never deteriorates over the training period (which would be an unambiguous indicator of overfitting). There may still be an opportunity to trade some training error for better out-of-sample prediction error (which is typically the ultimate goal) with a modest amount of dropout. The only way to know is to test with more modest dropout rates (I'd start with something like 20% and explore around that value) and see whether the training error recovers; if it does not, you can reduce the dropout rate even further. In the best case, your out-of-sample test error gets better at the expense of some increase in training error (or slower convergence of the training error). If you are over-regularized, though, you'll see degradation of both, which is pretty clearly evident in the second set of plots.
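One low-effort way to run that experiment is to sweep a few keep probabilities and compare the resulting train/test numbers. A sketch is below; build_and_train is a hypothetical helper that rebuilds your graph with the given keep probability, trains for a fixed budget, and returns the final accuracies:

```python
# build_and_train is hypothetical: rebuild the graph with the given keep_prob,
# train for a fixed number of epochs, and return (train_accuracy, test_accuracy).
results = {}
for keep_prob in [1.0, 0.9, 0.8, 0.7]:          # dropout rates of 0%, 10%, 20%, 30%
    results[keep_prob] = build_and_train(keep_prob=keep_prob)

for keep_prob, (train_acc, test_acc) in sorted(results.items(), reverse=True):
    print("keep_prob=%.1f  train=%.3f  test=%.3f" % (keep_prob, train_acc, test_acc))
```

You're looking for the largest dropout rate that still improves (or at least doesn't hurt) the test accuracy.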
As others have noted, you may find that dropout is more effective in the convolutional layers (or not; it's hard to say without trying it). The space of model structures and hyperparameter settings is far too big to search exhaustively, and there isn't much theory to guide our choices. In general, it's best to start from recipes that have been demonstrated to work well on similar problems (based on published results) and to test and experiment from there.
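If you do try dropout on the convolutional feature maps, the same keep_prob mechanism applies; a common variant drops entire channels at once via the noise_shape argument (a sketch under the assumption of an NHWC layout, not something I've verified on your data):

```python
import tensorflow as tf  # TF 1.x-style API

images = tf.placeholder(tf.float32, [None, 28, 28, 1])   # assumed input size, NHWC
keep_prob = tf.placeholder(tf.float32)

conv = tf.layers.conv2d(images, filters=32, kernel_size=5,
                        padding='same', activation=tf.nn.relu)

# Element-wise dropout on the feature maps...
conv_drop = tf.nn.dropout(conv, keep_prob)

# ...or "spatial" dropout that zeroes whole channels, via noise_shape
shape = tf.shape(conv)
spatial_drop = tf.nn.dropout(conv, keep_prob,
                             noise_shape=[shape[0], 1, 1, shape[3]])
```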
Working effectively with neural networks has a lot to do with learning to read these dynamics from the train/test metrics, so that you can tell whether a change to the model structure or hyperparameters (including the dropout rate) is actually an improvement.