I am reading a lot of tutorials that state two things.
- "[Replacing fully connected layers with convolutional layers] casts them into fully convolutional networks that take input of any size and output classification maps." Fully Convolutional Networks for Semantic Segmentation, Shelhamer et al.
- A traditional CNN can't do this because it has a fully connected layer, and its shape is decided by the input image size.
Based on these statements, my questions are the following:
- Whenever I've made an FCN, I could only get it to work with a fixed input image size for both training and testing. But in the paper's abstract, they note: "Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning." How is this possible? The first layer has a fixed number of weights, and an input image of a different size would not properly link to these weights.
- How exactly does the input image size determine the fully connected layer? I tried looking online, but couldn't find a direct answer.
It seems like you are confusing the spatial dimensions (height and width) of an image/feature map with the "channel dimension", which is the dimension of the information stored per pixel.
An input image can have arbitrary height and width, but will always have a fixed "channel" dimension of 3; that is, each pixel holds 3 values, the RGB values of its color.
Let's denote the input shape as 3xHxW (3 RGB channels, by height H by width W).
Applying a convolution with kernel_size=5 and output_channel=64 means that you have 64 filters of size 3x5x5. For each filter you take all overlapping 3x5x5 windows in the image (RGB by 5 by 5 pixels) and output a single number per filter, which is the weighted sum of the input RGB values. Doing so for all 64 filters will give you 64 channels per sliding window, or an output feature map of shape 64x(H-4)x(W-4).
An additional convolution layer with, say, kernel_size=3 and output_channels=128 will have 128 filters of shape 64x3x3, applied to all 3x3 sliding windows in the input feature map of shape 64x(H-4)x(W-4), resulting in an output feature map of shape 128x(H-6)x(W-6).
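The shape arithmetic above can be checked with a small sketch. It assumes unpadded, stride-1 convolutions unless told otherwise, and the helper names `conv_output_shape` and `conv_weight_count` are made up for this illustration, not taken from any library. Note that the weight count never involves H or W, which is why the same filters fit images of any size.

```python
def conv_output_shape(in_shape, out_channels, kernel_size, stride=1, padding=0):
    """Return the (C, H, W) shape after a 2D convolution on a (C, H, W) input."""
    _, h, w = in_shape
    out_h = (h + 2 * padding - kernel_size) // stride + 1
    out_w = (w + 2 * padding - kernel_size) // stride + 1
    return (out_channels, out_h, out_w)


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights (biases ignored): channels and kernel size only."""
    return out_channels * in_channels * kernel_size * kernel_size


# Two differently sized RGB inputs pass through the same two layers.
for h, w in [(32, 32), (100, 60)]:
    x = (3, h, w)
    y1 = conv_output_shape(x, out_channels=64, kernel_size=5)   # 64 x (H-4) x (W-4)
    y2 = conv_output_shape(y1, out_channels=128, kernel_size=3)  # 128 x (H-6) x (W-6)
    print(x, "->", y1, "->", y2)

# The weights are the same for both inputs: 64*3*5*5 and 128*64*3*3.
print(conv_weight_count(3, 64, 5), conv_weight_count(64, 128, 3))
```

The 32x32 and 100x60 input sizes are arbitrary examples.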
You can continue in a similar way with additional convolution and even pooling layers.
This post has a very good explanation on how convolution/pooling layers affect the shapes of the feature maps.
To recap, as long as you do not change the number of input channels, you can apply a fully convolutional net to images of arbitrary spatial dimensions. The output feature maps will have different spatial shapes, but always the same number of channels.
As for a fully connected (aka inner-product/linear) layer: this layer does not distinguish between spatial dimensions and channel dimensions. The input to a fully connected layer is "flattened", and the number of weights is then determined by the number of input elements (channel and spatial combined) and the number of outputs.
For instance, in a VGG network, when training on 3x224x224 images, the last convolution layer outputs a feature map of shape 512x7x7, which is then flattened to a 25,088 dimensional vector and fed into a fully connected layer with 4,096 outputs.
If you were to feed VGG with input images of different spatial dimensions, say 3x256x256, your last convolution layer will output a feature map of shape 512x8x8. Note how the channel dimension, 512, did not change, but the spatial dimensions grew from 7x7 to 8x8. Now, if you were to "flatten" this feature map you will have a 32,768 dimensional input vector for your fully connected layer, but alas, your fully connected layer expects a 25,088 dimensional input: you will get a RuntimeError.
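The mismatch can be reproduced with plain arithmetic. This is a rough sketch that assumes the usual VGG-16 layout (3x3 convolutions padded so they keep the spatial size, and five 2x2 max-pooling steps with stride 2); `vgg_last_feature_map` and `flattened_size` are hypothetical helper names.

```python
def vgg_last_feature_map(height, width, channels=512, num_pools=5):
    """Shape of the last feature map, assuming only the pools shrink H and W."""
    for _ in range(num_pools):
        height //= 2  # each 2x2, stride-2 pool halves the spatial size
        width //= 2
    return (channels, height, width)


def flattened_size(shape):
    """Number of elements after flattening a (C, H, W) feature map."""
    c, h, w = shape
    return c * h * w


# Size the fully connected layer was built for (224x224 training images).
fc_in_features = flattened_size(vgg_last_feature_map(224, 224))
fc_out_features = 4096
fc_weights = fc_in_features * fc_out_features  # fixed once the layer is created

# Size it actually receives for a 256x256 image.
other = flattened_size(vgg_last_feature_map(256, 256))

print(fc_in_features, other, fc_weights)
print("compatible:", other == fc_in_features)
```

Because the weight matrix is allocated for exactly `fc_in_features` inputs, any other flattened size cannot be multiplied with it.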
If you were to convert your fully connected layer to a convolutional layer with kernel_size=7 and output_channels=4096, it will do exactly the same mathematical operation on the 512x7x7 input feature map, to produce a 4096x1x1 output feature.
However, when you feed it a 512x8x8 feature map it will not produce an error, but rather output a 4096x2x2 output feature map: spatial dimensions adjusted, number of channels fixed.
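A toy version of this equivalence can be sketched in pure Python. The sizes here (2 channels, kernel 3, 4 outputs) are small stand-ins for 512, 7 and 4096, chosen only to keep the loops fast, and all function names are invented for the illustration. The same weight lists are used once as a fully connected layer and once as a convolution.

```python
import random


def fully_connected(x_flat, weights):
    """weights[o] holds one weight per element of the flattened input."""
    return [sum(w * x for w, x in zip(row, x_flat)) for row in weights]


def conv2d_valid(x, weights, channels, k):
    """Unpadded, stride-1 convolution; weights[o] is read as (channels, k, k)."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for row in weights:
        fmap = []
        for i in range(h - k + 1):
            fmap_row = []
            for j in range(w - k + 1):
                s = 0.0
                for c in range(channels):
                    for di in range(k):
                        for dj in range(k):
                            s += row[(c * k + di) * k + dj] * x[c][i + di][j + dj]
                fmap_row.append(s)
            fmap.append(fmap_row)
        out.append(fmap)
    return out


def random_map(channels, h, w, rng):
    return [[[rng.random() for _ in range(w)] for _ in range(h)] for _ in range(channels)]


rng = random.Random(0)
C, K, OUT = 2, 3, 4  # toy stand-ins for 512, 7, 4096
weights = [[rng.uniform(-1, 1) for _ in range(C * K * K)] for _ in range(OUT)]

# Input whose spatial size equals the kernel size: conv output is OUT x 1 x 1.
x = random_map(C, K, K, rng)
x_flat = [v for channel in x for r in channel for v in r]
fc_out = fully_connected(x_flat, weights)
conv_out = conv2d_valid(x, weights, C, K)
same = all(abs(fc_out[o] - conv_out[o][0][0]) < 1e-9 for o in range(OUT))

# One pixel larger in each direction: no error, output becomes OUT x 2 x 2.
bigger = conv2d_valid(random_map(C, K + 1, K + 1, rng), weights, C, K)
bigger_shape = (len(bigger), len(bigger[0]), len(bigger[0][0]))

print("FC and conv agree:", same)
print("output shape on larger input:", bigger_shape)
```

On the kernel-sized input each filter sees exactly one window, which is the whole flattened input, so the two layers compute the same numbers. On the larger input the filter simply slides to more positions.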
- Images have to be of a predefined size during training and testing. For the fully connected layer, you can have as many nodes as you want, and that number doesn't depend on the input image size or on the convolution layers' output dimensions.
- The input image size and the convolutions determine the shape of the convolution layers' outputs and of the final flattened output, which is fed to a fully connected layer. The fully connected layer can have any number of nodes, and that number is not dependent on the input image.
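Below is some sample code.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Flatten, MaxPooling2D

# Assumed example size (height, width, channels); the original post does not define input_shape.
input_shape = (64, 64, 3)

model = Sequential()
model.add(Conv2D(32, (3,3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(Conv2D(64, (3,3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3,3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(256, (3,3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(256, (3,3), activation='relu'))
model.add(MaxPooling2D())
model.add(BatchNormalization())
# Flatten fixes the vector length from input_shape and the layers above.
model.add(Flatten())
# Fully connected layer: the number of units (512) is a free choice,
# but its weight matrix is sized by the flattened input.
model.add(Dense(512, activation='sigmoid'))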
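To see how the two statements fit together, here is a rough sketch of the shape arithmetic for the model above. It assumes the Keras defaults (unpadded 3x3 convolutions with stride 1, and a 2x2 max pool with stride 2); the helper names and the 32 and 64 pixel input sizes are hypothetical. The number of Dense units stays at 512 either way, but the number of weights behind those units follows the flattened size.

```python
def dense_input_size(height, width):
    """Length of the Flatten() output for the model above."""
    # Five unpadded 3x3 convolutions each trim 2 pixels from height and width.
    h, w = height - 10, width - 10
    # A 2x2 max pool with stride 2 halves each spatial dimension (rounding down).
    h, w = h // 2, w // 2
    # The last Conv2D has 256 filters, so there are 256 channels to flatten.
    return h * w * 256


def dense_param_count(height, width, units=512):
    """Weights plus biases of the Dense layer for a given input image size."""
    return dense_input_size(height, width) * units + units


for size in (32, 64):
    print(size, dense_input_size(size, size), dense_param_count(size, size))
```

So the choice of 512 units is independent of the image, while a model built for one `input_shape` cannot reuse its Dense weights for another.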