I am reading a lot of tutorials that state two things.
- "[Replacing fully connected layers with convolutional layers] casts them into fully convolutional networks that take input of any size and output classification maps." Fully Convolutional Networks for Semantic Segmentation, Shelhamer et al.
- A traditional CNN can't do this because it has a fully connected layer, and its shape is determined by the input image size.
Based on these statements, my questions are the following:
- Whenever I've made an FCN, I could only get it to work with a fixed input image size for both training and testing. But in the paper's abstract, they note: "Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning." How is this possible? The first layer has a fixed number of weights, and input images of different sizes would not properly link to these weights.
- How exactly does the input image size determine the fully connected layer? I tried looking online, but couldn't find a direct answer.
It seems like you are confusing the spatial dimensions (height and width) of an image/feature map with the "channel dimension", which is the dimension of the information stored per pixel.
An input image can have arbitrary height and width, but it will always have a fixed "channel" dimension of 3. That is, each pixel is a vector of 3 values: the RGB values of its color.
Let's denote the input shape as 3xHxW (3 RGB channels, by height H, by width W).
Applying a convolution with kernel_size=5 and output_channels=64 means that you have 64 filters of size 3x5x5. For each filter you take all overlapping 3x5x5 windows in the image (RGB by 5 by 5 pixels) and, for each window, output a single number, which is the weighted sum of the input RGB values in that window. Doing so for all 64 filters gives you 64 channels per sliding window, or an output feature map of shape 64x(H-4)x(W-4) (assuming stride 1 and no padding).
An additional convolution layer with, say, kernel_size=3 and output_channels=128 will have 128 filters of shape 64x3x3, applied to all 3x3 sliding windows in the input feature map of shape 64x(H-4)x(W-4), resulting in an output feature map of shape 128x(H-6)x(W-6).
You can continue in a similar way with additional convolution and even pooling layers.
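The shape arithmetic above can be sketched in a few lines of plain Python. This is only an illustration: the helper names conv_output_shape and conv_weight_count are made up for this example, and it assumes stride 1 and no padding, as in the text.

```python
def conv_output_shape(in_shape, out_channels, kernel_size):
    """Shape after a stride-1, unpadded convolution on a (C, H, W) input."""
    _, h, w = in_shape
    return (out_channels, h - kernel_size + 1, w - kernel_size + 1)


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of filter weights (biases ignored). Note: no H or W here."""
    return out_channels * in_channels * kernel_size * kernel_size


# Two inputs with different spatial sizes but the same 3 channels.
for h, w in [(32, 32), (100, 60)]:
    x = (3, h, w)
    y1 = conv_output_shape(x, out_channels=64, kernel_size=5)    # 64 x (H-4) x (W-4)
    y2 = conv_output_shape(y1, out_channels=128, kernel_size=3)  # 128 x (H-6) x (W-6)
    print(x, "->", y1, "->", y2)

# The weights depend only on channel counts and kernel size.
print(conv_weight_count(3, 64, 5))    # 4800
print(conv_weight_count(64, 128, 3))  # 73728
```

The same 4,800 and 73,728 weights are reused for both inputs; only the output's height and width change.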
This post has a very good explanation on how convolution/pooling layers affect the shapes of the feature maps.
To recap, as long as you do not change the number of input channels, you can apply a fully convolutional net to images of arbitrary spatial dimensions, resulting in different spatial shapes of the output feature maps, but always with the same number of channels.
As for a fully connected (aka inner-product/linear) layer: this layer does not care about spatial dimensions or channel dimensions. The input to a fully connected layer is "flattened", and the number of weights is then determined by the number of input elements (channel and spatial combined) and the number of outputs.
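To make the dependence on input size concrete, here is a rough sketch that counts the weights of a fully connected layer placed on top of a 128-channel feature map. The helper fc_weight_count and the choice of 10 outputs are assumptions made up for illustration.

```python
from math import prod


def fc_weight_count(in_shape, out_features):
    """Weights of a fully connected layer fed a flattened (C, H, W) map."""
    in_features = prod(in_shape)  # channels and spatial positions combined
    return in_features * out_features


# Same number of channels, different spatial sizes.
small = fc_weight_count((128, 26, 26), out_features=10)
large = fc_weight_count((128, 94, 54), out_features=10)

print(small)  # 865280
print(large)  # 6497280
```

Because the weight count changes with height and width, a layer built for one input size cannot be reused as-is for another.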
For instance, in a VGG network, when training on 3x224x224 images, the last convolution/pooling stage outputs a feature map of shape 512x7x7, which is then flattened to a 25,088-dimensional vector and fed into a fully connected layer with 4,096 outputs.
If you were to feed VGG with input images of different spatial dimensions, say 3x256x256, your last convolution/pooling stage will output a feature map of shape 512x8x8. Note how the channel dimension, 512, did not change, but the spatial dimensions grew from 7x7 to 8x8. Now, if you were to "flatten" this feature map, you would have a 32,768-dimensional input vector for your fully connected layer, but alas, your fully connected layer expects a 25,088-dimensional input: you will get a RuntimeError.
If you were to convert your fully connected layer to a convolutional layer with kernel_size=7 and output_channels=4096, it will do exactly the same mathematical operation on the 512x7x7 input feature map, to produce a 4096x1x1 output feature.
However, when you feed it a 512x8x8 feature map, it will not produce an error, but rather output a 4096x2x2 feature map: spatial dimensions adjusted, number of channels fixed.
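Here is a toy check of that equivalence in plain Python. It is a sketch under stated assumptions, not how any framework implements it: the sizes are shrunk (2 channels, a 3x3 kernel and 4 outputs stand in for 512, 7x7 and 4096), and conv2d, fully_connected and random_map are hypothetical helpers written just for this example.

```python
import random


def conv2d(x, weight):
    """Naive stride-1, unpadded convolution.

    x is [C][H][W], weight is [O][C][K][K]; returns [O][H-K+1][W-K+1].
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    return [
        [
            [
                sum(
                    filt[ch][di][dj] * x[ch][i + di][j + dj]
                    for ch in range(c) for di in range(k) for dj in range(k)
                )
                for j in range(w - k + 1)
            ]
            for i in range(h - k + 1)
        ]
        for filt in weight
    ]


def fully_connected(x, fc_weight):
    """Flatten x in (channel, row, column) order, then apply fc_weight [O][C*H*W]."""
    flat = [v for plane in x for row in plane for v in row]
    if len(flat) != len(fc_weight[0]):
        raise ValueError(f"expected {len(fc_weight[0])} inputs, got {len(flat)}")
    return [sum(wi * xi for wi, xi in zip(w_row, flat)) for w_row in fc_weight]


def random_map(rng, c, h, w):
    return [[[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)] for _ in range(c)]


rng = random.Random(0)
C, K, O = 2, 3, 4  # toy stand-ins for 512, 7 and 4096

fc_weight = [[rng.uniform(-1, 1) for _ in range(C * K * K)] for _ in range(O)]
# Reshape each FC weight row into a C x K x K filter: same numbers, same order.
conv_weight = [
    [[row[(ch * K + i) * K:(ch * K + i) * K + K] for i in range(K)] for ch in range(C)]
    for row in fc_weight
]

# On the size the FC layer was built for, both give the same O numbers.
x_small = random_map(rng, C, K, K)
fc_out = fully_connected(x_small, fc_weight)
conv_out = conv2d(x_small, conv_weight)  # shape O x 1 x 1
same = all(abs(fc_out[o] - conv_out[o][0][0]) < 1e-9 for o in range(O))
print(same)  # True

# On a larger map, the FC layer fails but the convolution just slides.
x_large = random_map(rng, C, K + 1, K + 1)
try:
    fully_connected(x_large, fc_weight)
except ValueError as err:
    print("fully connected:", err)
conv_large = conv2d(x_large, conv_weight)
shape = (len(conv_large), len(conv_large[0]), len(conv_large[0][0]))
print(shape)  # (4, 2, 2)
```

The converted layer reuses exactly the fully connected weights; the only change is that they are applied at every valid window position instead of once.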