I'm currently trying to modify the VGG16 network architecture so that it's able to accept 400x400 px images.
Based on the literature I've read, the way to do it would be to convert the fully connected (FC) layers into convolutional (CONV) layers. This would essentially "allow the network to efficiently 'slide' across a larger input image and make multiple evaluations of different parts of the image, incorporating all available contextual information." Afterwards, an average pooling layer is used to "average the multiple feature vectors into a single feature vector that summarizes the input image".
I've done this using this function, and have come up with the following network architecture:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 400, 400] 1,792
ReLU-2 [-1, 64, 400, 400] 0
Conv2d-3 [-1, 64, 400, 400] 36,928
ReLU-4 [-1, 64, 400, 400] 0
MaxPool2d-5 [-1, 64, 200, 200] 0
Conv2d-6 [-1, 128, 200, 200] 73,856
ReLU-7 [-1, 128, 200, 200] 0
Conv2d-8 [-1, 128, 200, 200] 147,584
ReLU-9 [-1, 128, 200, 200] 0
MaxPool2d-10 [-1, 128, 100, 100] 0
Conv2d-11 [-1, 256, 100, 100] 295,168
ReLU-12 [-1, 256, 100, 100] 0
Conv2d-13 [-1, 256, 100, 100] 590,080
ReLU-14 [-1, 256, 100, 100] 0
Conv2d-15 [-1, 256, 100, 100] 590,080
ReLU-16 [-1, 256, 100, 100] 0
MaxPool2d-17 [-1, 256, 50, 50] 0
Conv2d-18 [-1, 512, 50, 50] 1,180,160
ReLU-19 [-1, 512, 50, 50] 0
Conv2d-20 [-1, 512, 50, 50] 2,359,808
ReLU-21 [-1, 512, 50, 50] 0
Conv2d-22 [-1, 512, 50, 50] 2,359,808
ReLU-23 [-1, 512, 50, 50] 0
MaxPool2d-24 [-1, 512, 25, 25] 0
Conv2d-25 [-1, 512, 25, 25] 2,359,808
ReLU-26 [-1, 512, 25, 25] 0
Conv2d-27 [-1, 512, 25, 25] 2,359,808
ReLU-28 [-1, 512, 25, 25] 0
Conv2d-29 [-1, 512, 25, 25] 2,359,808
ReLU-30 [-1, 512, 25, 25] 0
MaxPool2d-31 [-1, 512, 12, 12] 0
Conv2d-32 [-1, 4096, 1, 1] 301,993,984
ReLU-33 [-1, 4096, 1, 1] 0
Dropout-34 [-1, 4096, 1, 1] 0
Conv2d-35 [-1, 4096, 1, 1] 16,781,312
ReLU-36 [-1, 4096, 1, 1] 0
Dropout-37 [-1, 4096, 1, 1] 0
Conv2d-38 [-1, 3, 1, 1] 12,291
AdaptiveAvgPool2d-39 [-1, 3, 1, 1] 0
Softmax-40 [-1, 3, 1, 1] 0
================================================================
Total params: 333,502,275
Trainable params: 318,787,587
Non-trainable params: 14,714,688
----------------------------------------------------------------
Input size (MB): 1.83
Forward/backward pass size (MB): 696.55
Params size (MB): 1272.21
Estimated Total Size (MB): 1970.59
----------------------------------------------------------------
My question is simple: Is the use of the average pooling layer at the end necessary? It seems that by the last convolutional layer, we already get a 1x1 feature map with 3 channels. Doing average pooling on that would seem to have no effect.
If there is anything amiss in my logic/architecture, kindly feel free to point it out. Thanks!
First Approach
The problem with the VGG-style architecture is that we hardcode the number of input and output features in our Linear layers, i.e. the first Linear layer expects 25,088 input features.
If we pass an image of size (3, 224, 224) through vgg.features, the output feature map will have dimensions (512, 7, 7), and 512 × 7 × 7 = 25,088. If we change the input image size to (3, 400, 400) and pass it through vgg.features, the output feature map will have dimensions (512, 12, 12), which no longer matches.
One way to fix this issue is to use nn.AdaptiveAvgPool2d in place of nn.AvgPool2d. Adaptive pooling lets you specify the output size of the layer, which then remains constant irrespective of the size of the input coming out of the vgg.features layer.
You can read more about adaptive pooling here.
Second Approach
If you use the technique described here to convert your Linear layers into convolutional layers, you don't have to worry about the input dimension; however, you do have to change your weight initialisation technique because of the change in the number of parameters.
No, not in this case. The layer is not changing the size of the input feature map, hence it is not doing an average over a set of values.
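You can verify this directly: on a 1x1 spatial input, adaptive average pooling to 1x1 averages a single value per channel and returns the tensor unchanged.

```python
import torch
import torch.nn as nn

# Averaging over a single spatial location is a no-op.
x = torch.randn(1, 3, 1, 1)
pool = nn.AdaptiveAvgPool2d((1, 1))
print(torch.equal(pool(x), x))  # True
```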
The purpose of AdaptiveAvgPool2d is to make the convnet work on inputs of any arbitrary size (and produce an output of fixed size). In your case, since the input size is fixed at 400x400, you probably do not need it.
I think this paper might give you a better idea of this method: https://arxiv.org/pdf/1406.4729v3.pdf