I can't work out the correct number of parameters for AlexNet or VGG Net.
For example, to calculate the number of parameters of a conv3-256 layer of VGG Net, I get 0.59M = (3*3)*(256*256), that is (kernel size) * (number of channels in the previous layer) * (number of channels in the current layer). However, adding things up that way, I can't reach the 138M parameters.
So could you please show me where my calculation goes wrong, or show me the right calculation procedure?
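In code form, my per-layer calculation looks like this (plain Python, just mirroring the formula above):

```python
# my count for one conv3-256 layer: (kernel size) * (channels in) * (channels out), no biases
print((3 * 3) * (256 * 256))   # 589824, i.e. roughly 0.59M
```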
If you refer to the 16-layer VGG Net (table 1, column D), then 138M refers to the total number of parameters of this network, i.e. including all convolutional layers, but also the fully connected ones.

Looking at the 3rd convolutional stage, composed of 3 x conv3-256 layers:

the first one has N=128 input planes and F=256 output planes,
the two other ones have N=256 input planes and F=256 output planes.

The convolution kernel is 3x3 for each of these layers. In terms of parameters this gives:

128x3x3x256 (weights) + 256 (biases) = 295,168 parameters for the 1st one,
256x3x3x256 (weights) + 256 (biases) = 590,080 parameters for each of the two other ones.

As explained above, you have to do that for all layers, but also the fully-connected ones, and sum these values to obtain the final 138M number.
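As a quick sanity check, here is a minimal Python sketch (plain arithmetic, no framework; the helper name is mine) that reproduces the stage-3 numbers above:

```python
def conv_params(k, c_in, c_out, bias=True):
    # k x k kernel, c_in input channels, c_out output channels (one bias per output channel)
    return k * k * c_in * c_out + (c_out if bias else 0)

# 3rd convolutional stage of VGG-16 (3 x conv3-256):
print(conv_params(3, 128, 256))   # 295168 for the 1st layer (128 -> 256)
print(conv_params(3, 256, 256))   # 590080 for the 2nd and 3rd layers (256 -> 256)
```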
UPDATE: the breakdown among layers gives:

conv3-64  x 2 : 38,720
conv3-128 x 2 : 221,440
conv3-256 x 3 : 1,475,328
conv3-512 x 3 : 5,899,776
conv3-512 x 3 : 7,079,424
fc1           : 102,764,544
fc2           : 16,781,312
fc3           : 4,097,000
TOTAL         : 138,357,544

In particular for the fully-connected layers (fc):

fc1 (x): (512x7x7)x4,096 (weights) + 4,096 (biases) = 102,764,544
fc2    : 4,096x4,096 (weights) + 4,096 (biases) = 16,781,312
fc3    : 4,096x1,000 (weights) + 1,000 (biases) = 4,097,000
(x) see section 3.2 of the article: the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers).
Details about fc1
As noted above, the spatial resolution right before feeding the fully-connected layers is 7x7 pixels. This is because this VGG Net uses spatial padding before convolutions, as detailed in section 2.1 of the paper:
[...] the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.
With such padding, and working with a 224x224 pixel input image, the resolution decreases as follows along the layers: 112x112, 56x56, 28x28, 14x14 and 7x7 after the last convolution/pooling stage, which has 512 feature maps.
This gives a feature vector passed to fc1 with dimension: 512x7x7.
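To make the fc1 arithmetic concrete, here is a small plain-Python sketch (variable names are mine) that follows the resolution halving and the weight count described above:

```python
# spatial resolution after each of the 5 max-pooling stages, starting from a 224x224 input
res = 224
for _ in range(5):
    res //= 2            # each 2x2, stride-2 max-pool halves the resolution
print(res)               # 7 -> the 512 feature maps fed to fc1 are 7x7

# fc1 maps the flattened 512x7x7 volume to 4096 units
fc1_params = (512 * 7 * 7) * 4096 + 4096    # weights + biases
print(fc1_params)        # 102764544
```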
I know this is an old post; nevertheless, I think the accepted answer by @deltheil contains a mistake. If not, I would be happy to be corrected. The convolution layer should not have a bias, i.e. 128x3x3x256 (weights) + 256 (biases) = 295,168 should be 128x3x3x256 (weights) = 294,912.
Thanks
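Either way, the two candidate counts differ only by the 256 bias terms. A quick check with a framework default (this sketch assumes PyTorch, which is not otherwise used in this thread) shows where each number comes from:

```python
import torch.nn as nn

# 3x3 conv with 128 input and 256 output channels (bias=True is the library default)
conv = nn.Conv2d(128, 256, kernel_size=3, padding=1)
weights = conv.weight.numel()              # 256 * 128 * 3 * 3 = 294912
biases = conv.bias.numel()                 # 256
print(weights, biases, weights + biases)   # 294912 256 295168
```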
The VGG-16 architecture below is in the original paper, as highlighted by @deltheil in (table 1, column D), and I quote from there.
Using the above, note:
you can simply multiply the entries of the respective activation shape column to get the activation size
CONV3: means a 3*3 filter will convolve over the input!
MAXPOOL3-2: means the 3rd pooling layer, with a 2*2 filter, stride=2, padding=0 (pretty standard in pooling layers)
Stage-3: means multiple CONV layers stacked, with the same padding=1, stride=1, and 3*3 filter
Cin: means the depth, a.k.a. the number of channels, coming from the input layer!
Cout: means the outgoing depth, a.k.a. the number of channels (you configure it differently, to learn more complex features!)
Cin and Cout are the numbers of filters you stack together to learn multiple features at different scales; for example, in the first layer you might want to learn vertical edges, horizontal edges and edges at, say, 45 degrees: 64 possible different filters, each for a different kind of edge!
n: input dimension without depth, e.g. n=224 in the case of the INPUT image!
p: padding for each layer
s: stride used for each layer
f: filter size, i.e. 3*3 for CONV and 2*2 for MAXPOOL layers (n, p, s and f feed the activation-shape formula sketched right after these notes)
After MAXPOOL5-2, you simply flatten the volume and interface it with the first FC layer!
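These n, p, s and f values feed the standard activation-shape formula, floor((n + 2p - f)/s) + 1. Here is a minimal sketch (the helper name is mine):

```python
def out_size(n, f, p, s):
    # output spatial size of a conv/pool layer: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(out_size(224, f=3, p=1, s=1))   # 224 -> a 3*3 conv with padding=1 preserves the resolution
print(out_size(224, f=2, p=0, s=2))   # 112 -> a 2*2, stride-2 max-pool halves it
```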
We get the table:
Finally, if you add up all the weights calculated in the last column, you end up with 138,357,544 (about 138 million) parameters to train for VGG-16!
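For reference, here is a short plain-Python sketch (the layer list is written out by hand from table 1, column D; helper names are mine) that reproduces the 138,357,544 total:

```python
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out    # weights + one bias per output channel

def fc_params(n_in, n_out):
    return n_in * n_out + n_out            # weights + one bias per output unit

# (Cin, Cout) for the 13 conv layers of VGG-16, all with 3x3 kernels
convs = [(3, 64), (64, 64),
         (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]

total = sum(conv_params(3, c_in, c_out) for c_in, c_out in convs)
total += fc_params(512 * 7 * 7, 4096)      # FC-4096
total += fc_params(4096, 4096)             # FC-4096
total += fc_params(4096, 1000)             # FC-1000
print(total)                               # 138357544
```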
A great breakdown of the calculation for the VGG-16 network is also given in the CS231n lecture notes.
Here is how to compute the number of parameters in each CNN layer.

Some definitions:

n -- width of the filter
m -- height of the filter
k -- number of input feature maps
L -- number of output feature maps

Then the number of parameters is # = (n*m*k + 1)*L, in which the n*m*k term comes from the weights and the +1 from the bias.
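In code (the function name is mine), and checked against the first conv3-256 layer from the accepted answer:

```python
def cnn_layer_params(n, m, k, L):
    # (n*m*k + 1) * L: n*m*k weights per output map plus 1 bias, times L output maps
    return (n * m * k + 1) * L

print(cnn_layer_params(3, 3, 128, 256))   # 295168, the first conv3-256 layer of VGG-16
```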