The average filter is a windowed filter of the linear class that smooths a signal (image). It works as a low-pass filter. The basic idea is to replace each element of the signal (image) with the average taken across its neighborhood.
If we have an m x n matrix and want to apply an average filter of size k to it, then for each point p:(i,j) in the matrix, the new value of the point is the average of all points in the square neighborhood around it.
The figure shows a square filter kernel of size 2: the yellow box is the pixel to be averaged, and the whole grid is the square of neighboring pixels whose average becomes the pixel's new value.
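To make the definition concrete, here is a minimal CPU reference sketch (host-only code that also compiles in a `.cu` file). It assumes that "size k" means a radius, so the window is (2k+1) x (2k+1), and that pixels outside the image are replaced by the nearest edge pixel; neither choice is stated in the question, and the name `boxFilterReference` is made up for illustration.

```cuda
#include <algorithm>
#include <vector>

// Reference (CPU) average filter over a row-major, single-channel image.
// Assumptions: "size k" is a radius, so the window is (2k+1) x (2k+1);
// out-of-range coordinates are clamped to the nearest edge pixel.
std::vector<float> boxFilterReference(const std::vector<float>& src,
                                      int width, int height, int k)
{
    std::vector<float> dst(src.size());
    const float area = float((2 * k + 1) * (2 * k + 1));
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int di = -k; di <= k; ++di) {
                for (int dj = -k; dj <= k; ++dj) {
                    int y = std::min(std::max(i + di, 0), height - 1);
                    int x = std::min(std::max(j + dj, 0), width - 1);
                    sum += src[y * width + x];
                }
            }
            dst[i * width + j] = sum / area;
        }
    }
    return dst;
}
```

Written this way the cost is proportional to m * n * (2k+1)^2, which is why it gets slow on large images.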
The problem is that this algorithm is very slow, especially on large images, so I thought about using GPGPU.
The question now is: how can this be implemented in CUDA, if it's possible at all?
If the filter's size is normal and not humongous, the average filter is a very good case for implementing with CUDA. You can set this up using square blocks and every thread of the block is responsible for the calculation of the value of one pixel, by summing and averaging its neighbors.
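A minimal sketch of that one-thread-per-pixel layout, reading straight from global memory. The names `boxFilterGlobal` and `launchBoxFilterGlobal`, the 16x16 block size, the float single-channel image, the reading of "size k" as a radius, and the clamp-to-edge border handling are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// One thread per output pixel; the image lives in global memory.
// Assumptions: single-channel float image, row-major, "size k" is a radius,
// borders handled by clamping coordinates to the image edge.
__global__ void boxFilterGlobal(const float* src, float* dst,
                                int width, int height, int k)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge do nothing

    float sum = 0.0f;
    for (int dy = -k; dy <= k; ++dy) {
        for (int dx = -k; dx <= k; ++dx) {
            int yy = min(max(y + dy, 0), height - 1);
            int xx = min(max(x + dx, 0), width - 1);
            sum += src[yy * width + xx];
        }
    }
    dst[y * width + x] = sum / float((2 * k + 1) * (2 * k + 1));
}

// Hypothetical host wrapper: d_src and d_dst are device pointers.
void launchBoxFilterGlobal(const float* d_src, float* d_dst,
                           int width, int height, int k)
{
    dim3 block(16, 16);  // square block, as suggested above
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    boxFilterGlobal<<<grid, block>>>(d_src, d_dst, width, height, k);
}
```

Every thread re-reads pixels that its neighbors also read, which is the redundancy the shared-memory variant tries to remove.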
If you store the image in Global Memory then it can be programmed easily. One possible optimization is that you load blocks of the image into the block's Shared Memory. Using phantom elements (so that you won't exceed the shared block's dimensions when looking for neighboring pixels) you can calculate the average of the pixels within a block.
The only thing that you have to be careful of is how the "stitching" will be done in the end, because the shared memory blocks will overlap (because of the extra "padding" pixels) and you don't want to calculate their values twice.
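One way to sketch the shared-memory idea: each block loads its tile plus a halo of padding pixels, but writes only the pixels of its own tile, so the overlap between neighboring patches never produces a value twice. The names `boxFilterShared` and `launchBoxFilterShared`, the tile size of 16, the compile-time radius `R`, and the clamp-to-edge border handling are assumptions for this example, not part of the answer above.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;  // output pixels per block side (assumed); blocks are TILE x TILE

// Each block loads a (TILE + 2R) x (TILE + 2R) patch into shared memory:
// its own tile plus a halo of R padding pixels on every side.
// Halo pixels are only read, never written, so no output is computed twice.
template <int R>
__global__ void boxFilterShared(const float* src, float* dst, int width, int height)
{
    constexpr int PATCH = TILE + 2 * R;
    __shared__ float patch[PATCH][PATCH];

    int x0 = blockIdx.x * TILE - R;  // image coordinates of the patch origin
    int y0 = blockIdx.y * TILE - R;

    // Cooperative load: threads stride over the patch, clamping at the image borders.
    for (int py = threadIdx.y; py < PATCH; py += TILE) {
        for (int px = threadIdx.x; px < PATCH; px += TILE) {
            int gx = min(max(x0 + px, 0), width - 1);
            int gy = min(max(y0 + py, 0), height - 1);
            patch[py][px] = src[gy * width + gx];
        }
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // The output pixel sits at patch[threadIdx.y + R][threadIdx.x + R],
    // so its window starts at patch[threadIdx.y][threadIdx.x].
    float sum = 0.0f;
    for (int dy = 0; dy <= 2 * R; ++dy)
        for (int dx = 0; dx <= 2 * R; ++dx)
            sum += patch[threadIdx.y + dy][threadIdx.x + dx];
    dst[y * width + x] = sum / float((2 * R + 1) * (2 * R + 1));
}

// Hypothetical host wrapper: d_src and d_dst are device pointers.
template <int R>
void launchBoxFilterShared(const float* d_src, float* d_dst, int width, int height)
{
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    boxFilterShared<R><<<grid, block>>>(d_src, d_dst, width, height);
}
```

A call such as `launchBoxFilterShared<2>(d_src, d_dst, width, height)` would apply a 5x5 window. The radius is a template parameter here only so that the shared array can have a static size.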
This is a classic case of an embarrassingly parallel image processing problem that can be mapped very easily to the CUDA framework. The averaging filter is known as the Box Filter in the image processing domain.
The easiest approach would be to use CUDA textures for the filtering process as the boundary conditions can be handled very easily by textures.
Assuming you have source and destination pointers allocated on the host, the procedure would be something like this:
Sample Implementation Of Box Filter
Kernel
Wrapper Function:
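A minimal sketch of what a texture-based kernel and its wrapper could look like. It uses texture objects (`cudaCreateTextureObject`), since the older texture reference API is no longer available in recent toolkits. The names `boxFilterTex` and `boxFilter8u`, the 8-bit single-channel format, the reading of "size k" as a radius, and the rounding of the result are assumptions; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Kernel: one thread per pixel, reading neighbors through a texture object.
// With cudaAddressModeClamp, out-of-range coordinates return the edge pixel,
// so the reads need no explicit bounds checks.
__global__ void boxFilterTex(cudaTextureObject_t tex, unsigned char* dst,
                             size_t dstPitch, int width, int height, int k)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -k; dy <= k; ++dy)
        for (int dx = -k; dx <= k; ++dx)
            sum += tex2D<unsigned char>(tex, x + dx + 0.5f, y + dy + 0.5f);

    float area = float((2 * k + 1) * (2 * k + 1));
    dst[y * dstPitch + x] = (unsigned char)(sum / area + 0.5f);  // round to nearest
}

// Wrapper: src and dst are host pointers to tightly packed width x height images.
void boxFilter8u(const unsigned char* src, unsigned char* dst,
                 int width, int height, int k)
{
    unsigned char *d_src = nullptr, *d_dst = nullptr;
    size_t srcPitch = 0, dstPitch = 0;
    cudaMallocPitch(&d_src, &srcPitch, width, height);
    cudaMallocPitch(&d_dst, &dstPitch, width, height);
    cudaMemcpy2D(d_src, srcPitch, src, width, width, height, cudaMemcpyHostToDevice);

    // Describe the pitched source image as a 2D texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = d_src;
    res.res.pitch2D.desc = cudaCreateChannelDesc<unsigned char>();
    res.res.pitch2D.width = width;
    res.res.pitch2D.height = height;
    res.res.pitch2D.pitchInBytes = srcPitch;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;  // boundary handling done by the texture unit
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxFilterTex<<<grid, block>>>(tex, d_dst, dstPitch, width, height, k);

    // The blocking copy back to the host also waits for the kernel to finish.
    cudaMemcpy2D(dst, width, d_dst, dstPitch, width, height, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(d_src);
    cudaFree(d_dst);
}
```

The clamp address mode is what makes the boundary handling easy here: the kernel has no special case for pixels near the image edge.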
The good news is that you don't have to implement the filter yourself. The CUDA Toolkit comes with a free signal and image processing library, NVIDIA Performance Primitives (NPP), made by NVIDIA. NPP uses CUDA-enabled GPUs to accelerate processing, and the averaging filter is already implemented in it. The current version of NPP (5.0) supports 8-bit, 1-channel and 4-channel images. The functions are:
nppiFilterBox_8u_C1R for 1-channel images.
nppiFilterBox_8u_C4R for 4-channel images.
Some basic thoughts/steps:
You should be able to scale this pretty easily with 2D-memory and multidimensional kernel-calls.