Transfer learning: why remove the last hidden layer?

Posted 2020-05-04 07:14

Question:

Blogs about transfer learning often say to remove the last layer, or the last two layers; that is, remove the output layer and the last hidden layer.

So if the transfer learning also implies changing the cost function, e.g. from cross-entropy to mean squared error, I understand that you need to change the output layer from a 1001-unit softmax layer to a Dense(1) layer that outputs a float (roughly what I sketch below), but:

  1. why also change the last hidden layer?
  2. what weights do the two last new layers get initialized with, if using Keras and one of the predefined CNN models with ImageNet weights? He-initialized or zero-initialized?
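To make that concrete, here is roughly the kind of change I have in mind (just a sketch; the base model, input shape, and optimizer are placeholders, not part of the question):

```python
# Rough sketch of the change I mean (placeholders, not a working recipe):
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Pretrained convolutional base without the original softmax classification head
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

x = GlobalAveragePooling2D()(base.output)   # the "last hidden layer" the blogs talk about
out = Dense(1)(x)                           # single float output instead of class probabilities
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="mse") # cost function changed to mean squared error
```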

Answer 1:

Why remove layers?

If you're only trying to change the cost function, you're not doing transfer learning by most people's definition. Transfer learning is primarily about moving to a new application domain. For images, that means taking a dog identifier/detector and transferring it to be a bird identifier/detector, not a dog age/weight guesser (or taking your 1001-class general-purpose object classifier and using it to look only at security-camera footage, etc.).

Most of the literature says that the lower levels of a CNN learn low-level concepts a few pixels in size, which are fairly general purpose. The middle layers act as part detectors, responding to things like an eyeball or a nose, and the top layers represent the highest-level features, such as how those mid-level parts are arranged relative to each other. The final softmax just says which species of dog it is. Those last, highest-level features are probably not relevant to the new task.

This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet becomes progressively more specific to the details of the classes contained in the original dataset.

from: http://cs231n.github.io/transfer-learning/
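As a rough illustration of that idea (a sketch only; the choice of base network, which layers to freeze, and the size of the new head are all things you would tune for your own problem):

```python
# Keep the generic lower/mid-level features, drop the task-specific top,
# and train a new head for the new domain. Numbers here are illustrative.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=False)  # discard the original classifier
base.trainable = False                               # freeze the generic feature extractor

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)                 # new "last hidden layer" for the new task
out = Dense(10, activation="softmax")(x)             # e.g. 10 bird species instead of dog breeds
model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```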

Here are a couple of other explanations: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab

What should the new layers be initialized to?

In your original question you asked "He initialized or 0 initialized?". Again, I think this is more of an engineering question: there's evidence that some choices work better than others, but I don't know of a widely accepted proof guaranteeing optimal performance of one over the other. Except don't initialize everything to zero; that's definitely wrong, as you can see in the first post I link to below. Also keep in mind this is just initialization, so even if my knowledge is slightly out of date, all it should cost you is some extra epochs of training rather than outright failure or junk answers. Depending on your problem, that may be a large cost or a small one, which should dictate how much time you spend investigating the options and trying some out at a small scale.
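If you do want to try the alternatives, the initializer is just a per-layer argument in Keras, so switching between them on the new layers is cheap (layer sizes here are arbitrary):

```python
from tensorflow.keras.layers import Dense

# Per-layer initializer choices for the newly added layers:
he_layer = Dense(256, activation="relu", kernel_initializer="he_normal")
glorot_layer = Dense(256, activation="relu")  # Keras default is "glorot_uniform"
zero_layer = Dense(256, activation="relu", kernel_initializer="zeros")  # don't do this
```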

http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are/13362

https://stats.stackexchange.com/questions/229885/whats-the-recommended-weight-initialization-strategy-when-using-the-elu-activat



Answer 2:

  1. In Keras, for Inception v3, the last hidden layer is also removed if you want to change the output layer. By default the last hidden layer is global average pooling, but depending on the problem domain, either global average pooling or global max pooling might be preferred (see the sketch after this list).
  2. By default, Keras initializes Dense layers with the Glorot uniform initializer, also called the Xavier uniform initializer.
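Putting both points together, a sketch of how that looks in Keras (argument names are the standard keras.applications ones; double-check them against your Keras version):

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# include_top=False drops the softmax head and its global pooling layer;
# pooling="avg" (or "max") adds back whichever pooling you prefer.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

# New Dense head: with no kernel_initializer given, Keras uses
# Glorot (Xavier) uniform by default.
out = Dense(1)(base.output)
model = Model(base.input, out)
```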