Question:

I'm currently trying to modify the VGG16 network architecture so that it's able to accept 400x400 px images.

Based on the literature I've read, the way to do this is to convert the fully connected (FC) layers into convolutional (CONV) layers. This would essentially "allow the network to efficiently “slide” across a larger input image and make multiple evaluations of different parts of the image, incorporating all available contextual information." Afterwards, an average pooling layer is used to "average the multiple feature vectors into a single feature vector that summarizes the input image".

I've done this using this function, and have come up with the following network architecture:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 400, 400]           1,792
              ReLU-2         [-1, 64, 400, 400]               0
            Conv2d-3         [-1, 64, 400, 400]          36,928
              ReLU-4         [-1, 64, 400, 400]               0
         MaxPool2d-5         [-1, 64, 200, 200]               0
            Conv2d-6        [-1, 128, 200, 200]          73,856
              ReLU-7        [-1, 128, 200, 200]               0
            Conv2d-8        [-1, 128, 200, 200]         147,584
              ReLU-9        [-1, 128, 200, 200]               0
        MaxPool2d-10        [-1, 128, 100, 100]               0
           Conv2d-11        [-1, 256, 100, 100]         295,168
             ReLU-12        [-1, 256, 100, 100]               0
           Conv2d-13        [-1, 256, 100, 100]         590,080
             ReLU-14        [-1, 256, 100, 100]               0
           Conv2d-15        [-1, 256, 100, 100]         590,080
             ReLU-16        [-1, 256, 100, 100]               0
        MaxPool2d-17          [-1, 256, 50, 50]               0
           Conv2d-18          [-1, 512, 50, 50]       1,180,160
             ReLU-19          [-1, 512, 50, 50]               0
           Conv2d-20          [-1, 512, 50, 50]       2,359,808
             ReLU-21          [-1, 512, 50, 50]               0
           Conv2d-22          [-1, 512, 50, 50]       2,359,808
             ReLU-23          [-1, 512, 50, 50]               0
        MaxPool2d-24          [-1, 512, 25, 25]               0
           Conv2d-25          [-1, 512, 25, 25]       2,359,808
             ReLU-26          [-1, 512, 25, 25]               0
           Conv2d-27          [-1, 512, 25, 25]       2,359,808
             ReLU-28          [-1, 512, 25, 25]               0
           Conv2d-29          [-1, 512, 25, 25]       2,359,808
             ReLU-30          [-1, 512, 25, 25]               0
        MaxPool2d-31          [-1, 512, 12, 12]               0
           Conv2d-32           [-1, 4096, 1, 1]     301,993,984
             ReLU-33           [-1, 4096, 1, 1]               0
          Dropout-34           [-1, 4096, 1, 1]               0
           Conv2d-35           [-1, 4096, 1, 1]      16,781,312
             ReLU-36           [-1, 4096, 1, 1]               0
          Dropout-37           [-1, 4096, 1, 1]               0
           Conv2d-38              [-1, 3, 1, 1]          12,291
AdaptiveAvgPool2d-39              [-1, 3, 1, 1]               0
          Softmax-40              [-1, 3, 1, 1]               0
================================================================
Total params: 333,502,275
Trainable params: 318,787,587
Non-trainable params: 14,714,688
----------------------------------------------------------------
Input size (MB): 1.83
Forward/backward pass size (MB): 696.55
Params size (MB): 1272.21
Estimated Total Size (MB): 1970.59
----------------------------------------------------------------

My question is simple: Is the use of the average pooling layer at the end necessary? It seems like by the last convolutional layer, we get a 1x1 image with 3 channels. Doing an average pooling on that would seem to not have any effect.

If there is anything amiss in my logic or architecture, kindly feel free to point it out. Thanks!

Answer 1:

How to convert VGG to accept an input size of 400 x 400?

First Approach

The problem with VGG-style architectures is that the number of input and output features is hardcoded in the Linear layers, i.e.

vgg.classifier[0]: Linear(in_features=25088, out_features=4096, bias=True)

It is expecting 25,088 input features.

If we pass an image of size (3, 224, 224) through vgg.features the output feature map will be of dimensions:

(512, 7, 7) => 512 * 7 * 7 => 25,088

If we change the input image size to (3, 400, 400) and pass through vgg.features the output feature map will be of dimensions:

(512, 12, 12) => 512 * 12 * 12 =>  73,728

This throws a `size mismatch` error, because the first Linear layer still expects 25,088 input features.
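
The two sizes above follow from VGG16's five 2x2, stride-2 max-pool layers, each of which halves the spatial size (rounding down), while the padded 3x3 convolutions leave it unchanged. A minimal sketch of that arithmetic in plain Python; the helper names are made up for illustration:

```python
def vgg16_feature_side(input_px: int, num_pools: int = 5) -> int:
    """Spatial side length of the vgg.features output for a square input.

    Only the 2x2, stride-2 max-pool layers change the size; each one
    floor-divides the side length by 2.
    """
    side = input_px
    for _ in range(num_pools):
        side //= 2
    return side


def flattened_features(input_px: int, channels: int = 512) -> int:
    """Number of values the first Linear layer would receive after flattening."""
    side = vgg16_feature_side(input_px)
    return channels * side * side


for px in (224, 400):
    print(px, vgg16_feature_side(px), flattened_features(px))
```

For 400 px the sides go 400, 200, 100, 50, 25, 12, which matches the MaxPool2d rows in the summary in the question.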

One way to fix this issue is to use nn.AdaptiveAvgPool2d in place of a fixed-kernel pooling layer such as nn.AvgPool2d. AdaptiveAvgPool2d lets you define the output size of the layer, which then stays constant irrespective of the size of the input coming through vgg.features.

For example:

vgg.features[30] = nn.AdaptiveAvgPool2d(output_size=(7, 7))

will make sure the final feature maps have a dimension of `(512, 7, 7)` irrespective of the input size.
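
As a quick check of that claim, the sketch below pushes random tensors of two different spatial sizes through nn.AdaptiveAvgPool2d. The sizes 14 and 25 are what the layer just before vgg.features[30] should see for 224 px and 400 px inputs (four halvings of each); the random tensors are only stand-ins for real feature maps.

```python
import torch
import torch.nn as nn

# Adaptive pooling picks its own window sizes so the output is always 7x7.
pool = nn.AdaptiveAvgPool2d(output_size=(7, 7))

shapes = {}
for side in (14, 25):
    feature_map = torch.randn(1, 512, side, side)  # stand-in for real features
    shapes[side] = tuple(pool(feature_map).shape)
    print(side, "->", shapes[side])
```

Both inputs come out as (1, 512, 7, 7), so the flattened size stays at 25,088 and the existing Linear layers can be reused unchanged.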

You can read more about adaptive pooling here.

Second Approach

If you use the technique here to convert your Linear layers to convolutional layers, you don't have to worry about the input dimensions. However, you have to change the weight initialisation, because the number of parameters changes.
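
To illustrate the idea behind this approach, here is a toy-sized sketch (not the linked technique itself). The sizes 8, 3 and 16 are stand-ins for VGG16's 512 channels, 7x7 feature map and 4096 hidden units. It assumes the conv kernel matches the feature-map size the Linear layer was built for; in that special case the parameter count is unchanged and the weights can simply be reshaped. With a different kernel size the count does change, and the weights would need to be re-initialised as described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

channels, side, hidden = 8, 3, 16  # toy stand-ins for 512, 7, 4096

# Original fully connected layer acting on a flattened (C, H, W) feature map.
fc = nn.Linear(channels * side * side, hidden)

# Equivalent convolution: one kernel per output unit, covering the whole map.
conv = nn.Conv2d(channels, hidden, kernel_size=side)
with torch.no_grad():
    conv.weight.copy_(fc.weight.reshape(hidden, channels, side, side))
    conv.bias.copy_(fc.bias)

# On the native input size the two layers give the same result.
x = torch.randn(2, channels, side, side)
fc_out = fc(torch.flatten(x, 1))
conv_out = conv(x).flatten(1)
same_on_native_size = torch.allclose(fc_out, conv_out, atol=1e-5)

# On a larger input the conv layer "slides" and yields a grid of outputs,
# which average pooling then reduces to one vector per image.
big = torch.randn(2, channels, 5, 5)
score_map = conv(big)                                    # shape (2, 16, 3, 3)
pooled = nn.AdaptiveAvgPool2d(1)(score_map).flatten(1)   # shape (2, 16)

print(same_on_native_size, tuple(score_map.shape), tuple(pooled.shape))
```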

Is the use of the average pooling layer at the end necessary?

No, not in this case. It is not changing the size of the input feature map, hence it is not doing an average over a set of nodes.
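
The parameter counts in the question's summary back this up. A Conv2d layer has out_channels * in_channels * k * k weights plus out_channels biases, so the 301,993,984 parameters of Conv2d-32 imply k = 12. This is inferred from the count; the question does not state the kernel size. A 12x12 kernel on a 12x12 map leaves a single position, so there is nothing left to average. The helper names below are illustrative.

```python
def conv2d_params(in_ch: int, out_ch: int, k: int) -> int:
    """Weights plus biases of a Conv2d layer with a square k x k kernel."""
    return out_ch * in_ch * k * k + out_ch


def conv_output_side(in_side: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a convolution on a square input."""
    return (in_side + 2 * padding - k) // stride + 1


# Kernel size implied by the summary: matches the Conv2d-32 row.
print(conv2d_params(512, 4096, 12))  # 301993984
print(conv_output_side(12, 12))      # 1

# Kernel size matching the original 25,088-feature Linear layer.
print(conv2d_params(512, 4096, 7))   # 102764544
print(conv_output_side(12, 7))       # 6
```

With a 7x7 kernel the same 12x12 map would give a 6x6 grid of predictions, and a final average pooling layer would then do real work.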

Answer 2:

The purpose of AdaptiveAvgPool2d is to make the convnet work on inputs of arbitrary size (and produce an output of fixed size). In your case, since the input size is fixed at 400x400, you probably do not need it.
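
A small sketch of what this means in practice, using random tensors as stand-ins for the network's final score maps: on a 1x1 map (the 400x400 case in the question) the layer returns its input unchanged, while on a larger map it reduces each channel to its spatial mean. The 6x6 size is just an example of a larger map.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)

# 1x1 score map: averaging a single value changes nothing.
single = torch.randn(4, 3, 1, 1)
is_noop = torch.allclose(pool(single), single)

# Larger score map: each channel is reduced to its spatial mean.
grid = torch.randn(4, 3, 6, 6)
pooled = pool(grid)
matches_mean = torch.allclose(
    pooled, grid.mean(dim=(2, 3), keepdim=True), atol=1e-6
)

print(is_noop, matches_mean, tuple(pooled.shape))
```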

I think this paper might give you a better idea of this method - https://arxiv.org/pdf/1406.4729v3.pdf