Blog posts about transfer learning often say to remove the last layer, or the last two layers, i.e. the output layer and the last hidden layer.
If the transfer also implies changing the cost function, e.g. from cross-entropy to mean squared error, I understand that the output layer has to change from 1001 softmax values to a Dense(1) layer which outputs a float, but:
- why also change the last hidden layer?
- what weights do the two new layers get initialized with when using Keras and one of the predefined CNN models with imagenet weights? He initialized or 0 initialized?
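For concreteness, this is roughly the setup I have in mind (just a sketch, assuming tf.keras and ResNet50 as the pretrained base; any of the predefined models would do):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# Pretrained base with the softmax classification head removed
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Replace the softmax outputs with a single-float regression head
output = layers.Dense(1)(base.output)
model = models.Model(inputs=base.input, outputs=output)

# Cost function changed from cross-entropy to mean squared error
model.compile(optimizer="adam", loss="mse")
```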
Why remove layers?
If you're only trying to change the cost function, you're not doing transfer learning by most people's definition. Transfer learning is primarily about moving to a new application domain: for images, taking a dog identifier/detector and transferring it to be a bird identifier/detector, not a dog age/weight guesser. (Or taking your 1001-class general-purpose object detector and using it to look only at security camera footage, etc.)
Most of the literature says that the lower layers of a CNN learn low-level, fairly general-purpose features spanning only a few pixels. The middle layers act as part detectors, corresponding to things like an eyeball or a nose, and the top layers encode the highest-level features, such as where those mid-level parts sit in relation to each other. The final softmax just says which species of dog it is. Those last, highest-level features are probably not relevant to the new task.
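In Keras terms, that means keeping the pretrained lower and middle layers and swapping in a new task-specific head, roughly like this (a sketch, assuming tf.keras, ResNet50, and a hypothetical 200-class bird problem):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# Keep the general-purpose lower/middle layers, drop the dog-specific top
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the transferred features, at least initially

# New head for the new domain, e.g. 200 bird species
outputs = layers.Dense(200, activation="softmax")(base.output)
model = models.Model(base.input, outputs)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Here include_top=False removes only the final classifier; if the highest-level convolutional features also turn out not to transfer, you can cut further down the stack and rebuild from an earlier layer instead.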
Here's a couple of other explanations: https://machinelearningmastery.com/transfer-learning-for-deep-learning/
https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab
What should the new layers be initialized to?
In your original question you asked "He initialized or 0 initialized?". Again, I think this is more of an engineering question, in that there's evidence that some choices work better than others, but I don't know that there's yet a widely accepted proof guaranteeing optimal performance of one over the other. Except don't initialize everything to zero; that's definitely wrong, as you can see in the first post I link to below.

Also keep in mind this is just initialization. Even if my knowledge is slightly out of date, all it should cost you is some extra epochs of training rather than outright failure or junk answers. Depending on your problem that may be a large cost or a small one, which should dictate how much time you spend investigating the options and trying some out on a small scale.
http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are/13362
https://stats.stackexchange.com/questions/229885/whats-the-recommended-weight-initialization-strategy-when-using-the-elu-activat
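For the Keras-specific part of your question: a freshly added layer does not inherit any of the imagenet weights, it just gets that layer's default initializer, which for Dense is Glorot (Xavier) uniform for the kernel and zeros for the bias (a sketch, assuming tf.keras):

```python
from tensorflow.keras import layers

# New layers added on top of a pretrained base are NOT zero-initialized:
# Dense defaults to a Glorot (Xavier) uniform kernel and a zero bias.
new_head = layers.Dense(1)
print(new_head.kernel_initializer)  # a GlorotUniform instance

# If you want He initialization instead (often paired with ReLU), say so explicitly:
he_head = layers.Dense(1, kernel_initializer="he_normal")
```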