I am reading through the residual learning paper, and I have a question: what is the "linear projection" mentioned in section 3.2? It looks pretty simple once you get it, but I couldn't get the idea...

I am basically not a computer science person, so I would very much appreciate it if someone could provide a simple example.
First up, it's important to understand what `x`, `y` and `F` are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.

`x` is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. `F` is usually a conv layer (conv+relu+batchnorm in this paper), and `y` combines the two together (forming the output channel). The result of `F` is also of rank 4, and most of its dimensions will be the same as in `x`, except for one. That's exactly what the transformation should patch.

For example, the shape of `x` might be `(64, 32, 32, 3)`, where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. `F(x)` might be `(64, 32, 32, 16)`: the batch size never changes and, for simplicity, a ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters, say 16.

So, in order for `y = F(x) + x` to be a valid operation, `x` must be "reshaped" from `(64, 32, 32, 3)` to `(64, 32, 32, 16)`.
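(A quick illustration, sketched in NumPy with the shapes above, of why the raw addition needs this.)

```python
import numpy as np

x  = np.zeros((64, 32, 32, 3))   # block input: 3 channels
Fx = np.zeros((64, 32, 32, 16))  # F(x): 16 channels

try:
    y = Fx + x                   # shapes don't match on the last axis
except ValueError as e:
    print(e)                     # operands could not be broadcast together ...
```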
I'd like to stress that "reshaping" here is not what `numpy.reshape` does. Instead, `x[3]` (the channel dimension) is padded with 13 zeros, like this:
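(A minimal sketch of that padding in TensorFlow, assuming the example shapes above.)

```python
import tensorflow as tf

x = tf.zeros([64, 32, 32, 3])

# Pad only the last (channel) axis: 0 zeros in front, 13 zeros behind,
# turning the 3 input channels into 3 + 13 = 16. Other axes are untouched.
x_padded = tf.pad(x, paddings=[[0, 0], [0, 0], [0, 0], [0, 13]])

print(x_padded.shape)  # (64, 32, 32, 16) -- now addable to F(x)
```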
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other `x` dimensions are changed.

Here's the link to the code in Tensorflow that does this.
A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If `x` is the vector of `N` input features and `W` is an `M`-by-`N` matrix, then the matrix product `Wx` yields `M` new features, where each one is a linear projection of `x`. Each row of `W` is a set of weights that defines one of the `M` linear projections (i.e., each row of `W` contains the coefficients for one of the weighted sums of `x`).
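(A minimal sketch in NumPy; the sizes and values are made up for illustration.)

```python
import numpy as np

N, M = 3, 16                    # project N = 3 features up to M = 16
x = np.array([0.5, -1.0, 2.0])  # vector of N input features
W = np.random.randn(M, N)       # M-by-N matrix of projection weights

y = W @ x                       # y[i] = W[i, :] . x, one weighted sum per row
print(y.shape)                  # (16,)

# Zero-padding (as in the other answer) is the special case where the top
# N-by-N block of W is the identity and the remaining rows are all zeros.
W_pad = np.vstack([np.eye(N), np.zeros((M - N, N))])
print(W_pad @ x)                # [ 0.5 -1.   2.   0.   0. ...  0.]
```

In the ResNet paper itself, this projection shortcut is implemented as a 1x1 convolution, which applies the same `W` at every spatial position.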