How does input image size influence size and shape

Posted 2020-05-09 23:03

Question:

I am reading a lot of tutorials that state two things.

  1. "[Replacing fully connected layers with convolutional layers] casts them into fully convolutional networks that take input of any size and output classification maps." Fully Convolutional Networks for Semantic Segmentation, Shelhamer et al.
  2. A traditional CNN can't do this because it has a fully connected layer whose shape is determined by the input image size.

Based on these statements, my questions are the following:

  1. Whenever I've made an FCN, I could only get it to work with a fixed dimension of input images for both training and testing. But in the paper's abstract, they note: "Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning." How is this possible? The first layer has a fixed number of weights, and an input image of a different size would not properly link to these weights.
  2. How exactly does the input image size determine the fully connected layer? I tried looking online, but couldn't find a direct answer.

Answer 1:

It seems like you are confusing the spatial dimensions (height and width) of an image/feature map with the "channel dimension", which is the dimension of the information stored per pixel.

An input image can have arbitrary height and width, but will always have a fixed "channel" dimension of 3; that is, each pixel has a fixed dimension of 3, holding the RGB values of that pixel's color.
Let's denote the input shape as 3xHxW (3 RGB channels, by height H by width W).

Applying a convolution with kernel_size=5 and output_channel=64, means that you have 64 filters of size 3x5x5. For each filter you take all overlapping 3x5x5 windows in the image (RGB by 5 by 5 pixels) and output a single number per filter which is the weighted sum of the input RGB values. Doing so for all 64 filters will give you 64 channels per sliding window, or an output feature map of shape 64x(H-4)x(W-4).

An additional convolution layer with, say, kernel_size=3 and output_channels=128 will have 128 filters of shape 64x3x3, applied to all 3x3 sliding windows in the input feature map of shape 64x(H-4)x(W-4), resulting in an output feature map of shape 128x(H-6)x(W-6).
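As a quick sanity check of the shape arithmetic above, here is a small sketch in plain Python (the helper name `conv_out_shape` is just for illustration; it assumes stride 1 and no padding, i.e. "valid" convolution):

```python
def conv_out_shape(in_shape, out_channels, kernel_size):
    """Output shape of a stride-1, no-padding ('valid') convolution.

    in_shape is (channels, height, width); each spatial dimension
    shrinks by kernel_size - 1, and the channel dimension becomes
    the number of filters.
    """
    _, h, w = in_shape
    return (out_channels, h - (kernel_size - 1), w - (kernel_size - 1))

# A 3xHxW input, e.g. H = W = 224:
shape = (3, 224, 224)
shape = conv_out_shape(shape, 64, 5)   # 64x(H-4)x(W-4) -> (64, 220, 220)
shape = conv_out_shape(shape, 128, 3)  # 128x(H-6)x(W-6) -> (128, 218, 218)
print(shape)
```

The same two calls with a different H and W would give the same channel counts with correspondingly different spatial sizes.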

You can continue in a similar way with additional convolution and even pooling layers.
This post has a very good explanation on how convolution/pooling layers affect the shapes of the feature maps.

To recap: as long as you do not change the number of input channels, you can apply a fully convolutional net to images of arbitrary spatial dimensions, resulting in different spatial shapes of the output feature maps, but always the same number of channels.

As for a fully connected (aka inner-product/linear) layer: this layer does not care about spatial or channel dimensions. The input to a fully connected layer is "flattened", and the number of weights is then determined by the number of input elements (channel and spatial combined) and the number of outputs.
For instance, in a VGG network, when training on 3x224x224 images, the last convolution layer outputs a feature map of shape 512x7x7, which is then flattened to a 25,088-dimensional vector and fed into a fully connected layer with 4,096 outputs.

If you were to feed VGG input images of different spatial dimensions, say 3x256x256, your last convolution layer would output a feature map of shape 512x8x8 -- note how the channel dimension, 512, did not change, but the spatial dimensions grew from 7x7 to 8x8. Now, if you were to "flatten" this feature map, you would have a 32,768-dimensional input vector for your fully connected layer, but alas, your fully connected layer expects a 25,088-dimensional input: you will get a RuntimeError.
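The mismatch is just arithmetic on the flattened lengths (numbers taken from the VGG example above):

```python
# Flattened length of the last conv feature map for a 3x224x224 input:
print(512 * 7 * 7)   # 25088 -- what the fully connected layer was built for
# ...and for a 3x256x256 input:
print(512 * 8 * 8)   # 32768 -- a different length, hence the runtime error
```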

If you were instead to convert your fully connected layer to a convolutional layer with kernel_size=7 and output_channels=4096, it would perform exactly the same mathematical operation on the 512x7x7 input feature map, producing a 4096x1x1 output feature map.
However, when you feed it a 512x8x8 feature map, it will not produce an error, but rather output a 4096x2x2 feature map -- spatial dimensions adjusted, number of channels fixed.
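The same shape arithmetic shows why the convolutional replacement adapts (a sketch of the arithmetic only, not a full network; `conv_out_spatial` is an illustrative helper assuming stride 1 and no padding):

```python
def conv_out_spatial(h, w, kernel_size):
    """Spatial output size of a stride-1, no-padding convolution."""
    return (h - kernel_size + 1, w - kernel_size + 1)

# The FC layer re-cast as a convolution with kernel_size=7, output_channels=4096:
print(conv_out_spatial(7, 7, 7))  # (1, 1): a 4096x1x1 output, same math as the FC layer
print(conv_out_spatial(8, 8, 7))  # (2, 2): a 4096x2x2 output instead of an error
```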



Answer 2:

  1. Images have to be of a pre-defined size during training and testing. For the fully connected layer, you can have as many nodes as you want, and that number doesn't depend on the input image size or the convolution layers' output dimensions.
  2. The input image size and the convolutions determine the shapes of the convolution layers and the final flattened output, which is fed to a fully connected layer. The fully connected layer can have any dimension and is not dependent on the input image. Below is sample code.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3,3), activation='relu', input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3,3), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(128, (3,3), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(256, (3,3), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(256, (3,3), activation='relu'))
    model.add(MaxPooling2D())
    model.add(BatchNormalization())
    model.add(Flatten())
    model.add(Dense(512, activation='sigmoid')) # This is the fully connected layer, whose width (512) is independent of the previous layers' spatial dimensions