I am using a CNN for a regression task. I use TensorFlow, and the optimizer is Adam. The network seems to converge perfectly fine until a point where the loss suddenly increases along with the validation error. Here are the loss plots for the label loss and the weight loss shown separately (the optimizer is run on the sum of them).
I use an L2 loss for weight regularization and also for the labels. I apply some randomness to the training data. I am currently trying RMSProp to see if the behavior changes, but it takes at least 8 hours to reproduce the error.
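For reference, the training objective is roughly of this form (a minimal sketch, assuming TensorFlow 2 / Keras; the architecture, layer sizes, and the regularization strength `LAMBDA_REG` are placeholders rather than my actual values):

```python
import tensorflow as tf

# Sketch only: L2 loss on the labels plus an L2 penalty on the weights,
# with the optimizer minimizing their sum.
LAMBDA_REG = 1e-4  # placeholder regularization strength

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1),  # regression output
])
optimizer = tf.keras.optimizers.Adam()  # default learning rate 1e-3

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = tf.squeeze(model(x, training=True), axis=-1)
        label_loss = tf.reduce_mean(tf.square(pred - y))      # L2 loss on labels
        weight_loss = tf.add_n(
            [tf.nn.l2_loss(w) for w in model.trainable_weights])
        total_loss = label_loss + LAMBDA_REG * weight_loss    # optimizer sees the sum
    grads = tape.gradient(total_loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return label_loss, weight_loss
```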
I would like to understand how this can happen. Hope you can help me.
My experience over the last few months is the following:
Adam is very easy to use because you don't have to play with the initial learning rate very much, and it almost always works. However, near convergence Adam does not really settle on a solution but keeps jiggling around at higher iteration counts, whereas SGD gives an almost perfectly shaped loss plot and seems to converge much better at higher iterations.
But changing little parts of the setup requires adjusting the SGD parameters, or you will end up with NaNs... For experiments on architectures and general approaches I favor Adam, but if you want to get the best version of one chosen architecture, you should use SGD and at least compare the solutions.
I also noticed that a good initial SGD setup (learning rate, weight decay, etc.) converges as fast as Adam, at least for my setup.
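To make the comparison concrete, this is roughly how I set up the two optimizers side by side (a sketch only, assuming the Keras optimizer API; the numbers are illustrative and would need retuning for any real architecture):

```python
import tensorflow as tf

# Adam mostly works out of the box with its default learning rate.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

# SGD usually needs its learning rate and momentum retuned whenever
# the setup changes, otherwise it can diverge to NaNs.
sgd = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)

# Decaying the SGD learning rate over training often helps it settle
# into the smooth late-stage convergence mentioned above.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2, decay_steps=10000, decay_rate=0.9)
sgd_with_decay = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```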
Hope this may help some of you!
EDIT: Please note that the effects in my initial question are NOT normal, even with Adam. It seems I had a bug, but I can't really remember what the issue was.