How does 2D convolution for images work?

Posted 2020-04-27 07:35

Question:

I am studying image processing these days and I am a beginner to the subject. I got stuck on the topic of convolution and how to implement it for images. Let me be brief: there is a general formula of convolution for images, like so:

$$x(n_1, n_2) = \sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} h(k_1, k_2)\, y(n_1 - k_1, n_2 - k_2)$$

x(n1,n2) represents a pixel in the output image, but I do not know what k1 and k2 stand for. That is actually what I would like to learn. In order to implement this in some programming language, I need to know what k1 and k2 stand for. Can someone explain this to me or point me to an article? I would really appreciate any help.

Answer 1:

Convolution in this case deals with extracting patches of image pixels that surround a target image pixel. You perform image convolution with what is known as a mask, point spread function, or kernel, and this is usually much smaller than the image itself.

For each target pixel in the output image, you grab a neighbourhood of pixel values from the input, centred on the pixel at the same coordinates in the input. The size of this neighbourhood is exactly the size of the mask. At that point, you rotate the mask by 180 degrees, then do an element-by-element multiplication of each value in the mask with the pixel value that coincides with it in the neighbourhood. You add all of these products up, and that sum is the output for the target pixel in the target image.

For example, let's say I had this small image:

 1   2   3   4   5
 6   7   8   9  10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25

And let's say I wanted to perform an averaging within a 3 x 3 window, so my mask would be:

    [1  1  1]
1/9*[1  1  1]
    [1  1  1]
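
In code, that mask is simply (NumPy is my choice of library here, not something the question specifies):

import numpy as np

# 3 x 3 averaging mask: every weight is 1/9, so summing over a
# neighbourhood gives its mean.
mask = np.ones((3, 3)) / 9.0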

To perform 2D image convolution, rotating this mask by 180 degrees gives us back the same mask, so let's say I wanted to find the output at row 2, column 2. The 3 x 3 neighbourhood I would extract, centred on row 2, column 2, is:

1  2  3
6  7  8
11 12 13

To find the output, I would multiply each value in the mask by the pixel value at the same location in the neighbourhood:

[1  2  3 ]           [1 1 1]
[6  7  8 ]  ** (1/9)*[1 1 1]
[11 12 13]           [1 1 1]

Performing a point-by-point multiplication and adding up the values gives us:

1(1/9) + 2(1/9) + 3(1/9) + 6(1/9) + 7(1/9) + 8(1/9) + 11(1/9) + 12(1/9) + 13(1/9) = 63/9 = 7

The output at location (2,2) in the output image would be 7.
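
If you want to check that arithmetic in code, here is a minimal NumPy sketch of just this one output pixel (the names img and mask are my own):

import numpy as np

img = np.arange(1, 26).reshape(5, 5)    # the 5 x 5 example image above
mask = np.ones((3, 3)) / 9.0            # 3 x 3 averaging mask

# Neighbourhood centred at row 2, column 2 (1-based), i.e. img[0:3, 0:3].
neighbourhood = img[0:3, 0:3]

# Rotate the mask by 180 degrees (a no-op for this symmetric mask),
# multiply element by element, and sum.
print(np.sum(neighbourhood * np.rot90(mask, 2)))   # 7.0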

Bear in mind that I didn't tackle the case where the mask goes out of bounds. Specifically, if I tried to find the output at row 1, column 1, for example, there would be five locations where the mask goes out of bounds. There are many ways to handle this. Some people consider the pixels outside the image to be zero. Others like to replicate the image border, so that the border pixels are copied outside of the image dimensions. Some people pad the image using more sophisticated techniques, like symmetric padding, where the pixels beyond the border are a mirror reflection of what's inside the image, or circular padding, where the pixels beyond the border are copied from the opposite side of the image.

That's beyond the scope of this post, but in your case, start with the simplest option: any pixels that fall outside the bounds of the image when you're collecting neighbourhoods are set to zero.
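
Putting all of this together, here is a minimal, unoptimized sketch in Python/NumPy of 2D convolution with zero padding (the function name conv2d is mine; in practice you would reach for something like scipy.signal.convolve2d):

import numpy as np

def conv2d(img, mask):
    # Naive 2D convolution with zero padding; output is the same size
    # as the input. Assumes the mask has odd dimensions.
    m, n = mask.shape
    pm, pn = m // 2, n // 2
    flipped = np.rot90(mask, 2)                      # rotate mask by 180 degrees
    padded = np.pad(img, ((pm, pm), (pn, pn)),
                    mode='constant')                 # zeros outside the image
    out = np.zeros(img.shape, dtype=float)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            neighbourhood = padded[r:r + m, c:c + n]
            out[r, c] = np.sum(neighbourhood * flipped)
    return out

img = np.arange(1, 26).reshape(5, 5).astype(float)
mask = np.ones((3, 3)) / 9.0
print(conv2d(img, mask)[1, 1])   # 7.0 -- row 2, column 2 from the example

For the other border strategies mentioned above, np.pad also supports mode='edge' (replication), mode='symmetric' (mirror reflection) and mode='wrap' (circular).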


Now, what do k1 and k2 mean? k1 and k2 denote the offsets with respect to the centre of the neighbourhood and mask. Notice that n1 - k1 and n2 - k2 are what appear in the sum. The output position is denoted by n1 and n2, so n1 - k1 and n2 - k2 are the offsets from this centre in the horizontal sense (n1 - k1) and the vertical sense (n2 - k2). If we had a 3 x 3 mask, the centre would be k1 = k2 = 0, the top-left corner would be k1 = k2 = -1, and the bottom-right corner would be k1 = k2 = 1. The sums run to infinity only to make sure we cover all of the mask elements; since masks are finite in size, everything outside the mask contributes nothing, and the sum reduces to the point-by-point summation I was talking about earlier.
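
That offset interpretation translates directly into code. Here is a sketch for a single output pixel with a 3 x 3 mask, assuming (n1, n2) is far enough from the border (the names h and img are mine; h[0][0] holds the weight for the offset k1 = k2 = -1):

def output_pixel(img, h, n1, n2):
    # Sum h(k1, k2) * img(n1 - k1, n2 - k2) over all mask offsets.
    total = 0.0
    for k1 in (-1, 0, 1):
        for k2 in (-1, 0, 1):
            total += h[k1 + 1][k2 + 1] * img[n1 - k1][n2 - k2]
    return total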


Here's a better illustration where the mask is a vertical Sobel filter which finds vertical gradients in an image:

Source: http://blog.saush.com/2011/04/20/edge-detection-with-the-sobel-operator-in-ruby/

As you can see, for each output pixel in the target image, we look at a neighbourhood of pixels at the same spatial location in the input image (3 x 3 in this case), perform a weighted element-by-element sum between the mask and the neighbourhood, and set the output pixel to that total. Bear in mind that this example does not rotate the mask by 180 degrees; strictly speaking it computes a correlation, and the rotation is what you do when it comes to convolution.
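
To see that distinction in code, scipy exposes both operations; a small sketch (the toy image and the exact Sobel sign convention are my assumptions):

import numpy as np
from scipy.signal import convolve2d, correlate2d

# One common form of the vertical Sobel mask.
sobel = np.array([[-1, -2, -1],
                  [ 0,  0,  0],
                  [ 1,  2,  1]], dtype=float)

img = np.arange(1, 26).reshape(5, 5).astype(float)   # toy test image

corr = correlate2d(img, sobel, mode='same')  # slides the mask as-is
conv = convolve2d(img, sobel, mode='same')   # rotates the mask by 180 degrees first

# Rotating this particular mask by 180 degrees just flips its sign,
# so the two results differ only by sign here.
print(np.allclose(conv, -corr))   # True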


Hope this helps!



Answer 2:

$k_1$ and $k_2$ are variables that range over the whole definition area (the support) of your kernel. Check out Wikipedia for a further description: http://en.wikipedia.org/wiki/Kernel_%28image_processing%29