I'm trying to develop a deconvolutional layer (or a transposed convolutional layer to be precise).
In the forward pass, I do a full convolution (convolution with zero padding).
In the backward pass, I do a valid convolution (convolution without padding) to pass the errors to the previous layer.
The gradients of the biases are easy to compute; it's simply a matter of averaging over the superfluous dimensions.
The problem is that I don't know how to update the weights of the convolutional filters. What are the gradients? I'm sure it is a convolution operation, but I don't see how. I tried a valid convolution of the inputs with the errors, but to no avail.
Deconvolution explained
First of all, deconvolution is a convolutional layer, only used for a different purpose, namely upsampling (why it's useful is explained in this paper).
For example, here a 2x2 input image (bottom image in blue) is upsampled to 4x4 (top image in green):
To make it a valid convolution, the input is first padded to 6x6, after which a 3x3 filter is applied without striding. Just like in an ordinary convolutional layer, you can choose different padding/striding strategies to produce the image size you want.
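The padding arithmetic above can be checked with a small NumPy sketch. The helper `conv2d_valid` and the all-ones filter are assumptions made only for illustration; they are not part of any library.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid (no padding, stride 1) cross-correlation of two 2-D arrays."""
    H, W = x.shape
    HH, WW = k.shape
    out = np.zeros((H - HH + 1, W - WW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + HH, j:j + WW] * k)
    return out

x = np.arange(1.0, 5.0).reshape(2, 2)  # 2x2 input: [[1, 2], [3, 4]]
k = np.ones((3, 3))                     # 3x3 filter (illustrative values)

# Pad by filter_size - 1 = 2 on every side: 2x2 -> 6x6.
x_pad = np.pad(x, 2, mode='constant', constant_values=0)

# A valid 3x3 pass over the 6x6 padded input yields a 4x4 output.
y = conv2d_valid(x_pad, k)
print(x_pad.shape, y.shape)
```

Padding by the filter size minus one is what turns a valid convolution into a full one, which is why the output is larger than the input.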
Backward pass
Now it should be clear that the backward pass for deconvolution is a special case of the backward pass for a convolutional layer, with a particular stride and padding. I think you've done it already, but here's a naive (and not very efficient) implementation for any stride and padding:
import numpy as np

# input: x, w, b, stride, pad, d_out
# output: dx, dw, db <- gradients with respect to x, w, and b
N, C, H, W = x.shape
F, C, HH, WW = w.shape
_, _, H_out, W_out = d_out.shape  # d_out has shape (N, F, H_out, W_out)
x_pad = np.pad(x, pad_width=((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)
db = np.sum(d_out, axis=(0, 2, 3))  # sum over the batch and spatial dimensions
dw = np.zeros_like(w)
dx = np.zeros_like(x_pad)
for n in range(N):
    for f in range(F):
        filter_w = w[f, :, :, :]
        for out_i in range(H_out):
            for out_j in range(W_out):
                # top-left corner of the current window in x_pad
                i, j = out_i * stride, out_j * stride
                dw[f, :, :, :] += d_out[n, f, out_i, out_j] * x_pad[n, :, i:i+HH, j:j+WW]
                dx[n, :, i:i+HH, j:j+WW] += filter_w * d_out[n, f, out_i, out_j]
dx = dx[:, :, pad:pad+H, pad:pad+W]  # strip the padding for any pad value
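One way to convince yourself that a backward pass like this is right is a numerical gradient check. The sketch below is an assumption-laden illustration: `conv_forward_naive`, `conv_backward_naive`, and `numerical_grad` are hypothetical helper names, and the shapes, stride, and padding are arbitrary.

```python
import numpy as np

def conv_forward_naive(x, w, b, stride, pad):
    """Naive forward pass: x is (N, C, H, W), w is (F, C, HH, WW), b is (F,)."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    H_out = (H + 2 * pad - HH) // stride + 1
    W_out = (W + 2 * pad - WW) // stride + 1
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    out = np.zeros((N, F, H_out, W_out))
    for n in range(N):
        for f in range(F):
            for oi in range(H_out):
                for oj in range(W_out):
                    i, j = oi * stride, oj * stride
                    out[n, f, oi, oj] = np.sum(x_pad[n, :, i:i + HH, j:j + WW] * w[f]) + b[f]
    return out

def conv_backward_naive(x, w, b, stride, pad, d_out):
    """Naive backward pass; returns gradients (dx, dw, db)."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    _, _, H_out, W_out = d_out.shape
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    db = d_out.sum(axis=(0, 2, 3))
    dw = np.zeros_like(w)
    dx_pad = np.zeros_like(x_pad)
    for n in range(N):
        for f in range(F):
            for oi in range(H_out):
                for oj in range(W_out):
                    i, j = oi * stride, oj * stride
                    dw[f] += d_out[n, f, oi, oj] * x_pad[n, :, i:i + HH, j:j + WW]
                    dx_pad[n, :, i:i + HH, j:j + WW] += w[f] * d_out[n, f, oi, oj]
    dx = dx_pad[:, :, pad:pad + H, pad:pad + W]
    return dx, dw, db

def numerical_grad(forward, a, d_out, eps=1e-5):
    """Central-difference gradient of sum(forward() * d_out) with respect to a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + eps
        plus = forward()
        a[idx] = old - eps
        minus = forward()
        a[idx] = old  # restore the original value
        grad[idx] = np.sum((plus - minus) * d_out) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
stride, pad = 2, 1
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
d_out = rng.standard_normal((2, 4, 3, 3))  # matches the forward output shape

dx, dw, db = conv_backward_naive(x, w, b, stride, pad, d_out)
forward = lambda: conv_forward_naive(x, w, b, stride, pad)
dx_num = numerical_grad(forward, x, d_out)
dw_num = numerical_grad(forward, w, d_out)
db_num = numerical_grad(forward, b, d_out)

# Largest absolute differences between analytic and numerical gradients.
print(np.abs(dx - dx_num).max(), np.abs(dw - dw_num).max(), np.abs(db - db_num).max())
```

If the analytic and numerical gradients disagree by more than rounding error, the usual suspects are the window indexing and the slice that strips the padding.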
The same backward pass can be done more efficiently using im2col and col2im, but it's just an implementation detail. Another fun fact: the backward pass for a convolution operation (for both the data and the weights) is again a convolution, but with spatially flipped filters.
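To make the "backward pass is again a convolution" remark concrete, here is a minimal single-channel, stride-1 sketch. It assumes the forward pass is a valid cross-correlation of `x` with `w`; the helper `corr2d_valid` is hypothetical. Under those assumptions, the weight gradient is a valid cross-correlation of the input with the errors, and the input gradient is a full cross-correlation of the errors with the spatially flipped filter.

```python
import numpy as np

def corr2d_valid(a, k):
    """Valid cross-correlation (stride 1, no padding) of two 2-D arrays."""
    kh, kw = k.shape
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
w = rng.standard_normal((3, 3))
d_out = rng.standard_normal((3, 3))  # same shape as corr2d_valid(x, w)
HH, WW = w.shape

# Weight gradient: valid cross-correlation of the input with the errors.
dw = corr2d_valid(x, d_out)

# Input gradient: pad the errors by filter_size - 1, then correlate
# with the filter flipped along both spatial axes.
d_out_pad = np.pad(d_out, ((HH - 1, HH - 1), (WW - 1, WW - 1)), mode='constant')
dx = corr2d_valid(d_out_pad, w[::-1, ::-1])

# Reference: accumulate the same gradients window by window.
dw_ref = np.zeros_like(w)
dx_ref = np.zeros_like(x)
for i in range(d_out.shape[0]):
    for j in range(d_out.shape[1]):
        dw_ref += d_out[i, j] * x[i:i + HH, j:j + WW]
        dx_ref[i:i + HH, j:j + WW] += d_out[i, j] * w

print(np.allclose(dw, dw_ref), np.allclose(dx, dx_ref))
```

If the forward pass pads the input (as in a full convolution), the same formula for `dw` applies to the padded input rather than the raw one, which may be why a valid convolution of the unpadded inputs with the errors gives the wrong shape or values.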
Here's how it's applied (plain simple SGD):
# backward_msg is the gradient message from the next layer, usually ReLU
# conv_cache holds (x, w, b, conv_params), i.e. the info from the forward pass
backward_msg, dW, db = conv_backward(backward_msg, conv_cache)
w = w - learning_rate * dW
b = b - learning_rate * db
As you can see, it's pretty straightforward; you just need to understand that you're applying the same old convolution.
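For a feel of the update rule in action, here is a toy sketch that fits a single 3x3 filter with plain SGD. Everything in it is an assumption chosen for illustration: the single-channel setup, the squared-error loss, the target filter `w_true`, the learning rate, and the helper `corr2d_valid`.

```python
import numpy as np

def corr2d_valid(a, k):
    """Valid cross-correlation (stride 1, no padding) of two 2-D arrays."""
    kh, kw = k.shape
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w_true = rng.standard_normal((3, 3))  # hypothetical filter to recover
y = corr2d_valid(x, w_true)           # targets produced by the true filter

w = np.zeros((3, 3))
learning_rate = 0.005
losses = []
for step in range(300):
    out = corr2d_valid(x, w)           # forward pass
    d_out = out - y                    # gradient of 0.5 * sum((out - y)**2)
    losses.append(0.5 * np.sum(d_out ** 2))
    dw = corr2d_valid(x, d_out)        # weight gradient is itself a convolution
    w = w - learning_rate * dw         # plain SGD step

print(losses[0], losses[-1])
```

The learning rate here is hand-picked for this toy problem; a real network would need its own tuning.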