I am having some difficulties to understand the batch loading as in the comments is referred. In order to compute the convolution in a pixel the mask whose size is 5 must become centered on this specific pixel. The image is divided into tiles. These tiles after applying the convolution mask are the final output tiles whose size is TILE_WIDTH*TILE_WIDTH
. For the pixels that belong to the border of the output tile the mask must borrow some pixels from the neighbor tile, when this tile belong to the borders of the image. Otherwise, these borrowed values are assigned to zero. These two steps are depicted in
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
N_ds[destY][destX] = I[src];
else
N_ds[destY][destX] = 0;
For that reason the shared memory array has TILE_WIDTH + Mask_width - 1
dimension in each side. The following parts of the code are unclear to me.
- The
destY
anddestX
index. Dividing the output index by the input tile width what does it means? The
srcY
addsrcX
index. WhydestY
anddestX
index take part insrcY
addsrcX
index?srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
- Why in the second loading we use the offset
TILE_WIDTH * TILE_WIDTH
? - Generally, what is the intuitive explanation of having two loadings?
- Can all these question followed by an intuitive example based on the image bellow?
- Thank you!
EDIT: Image added. In green there are the output tiles and in red we have the mask centered in 114 index. It is obvious that the mask borrows elements from different tiles. Finally, this image refers to one channel.
Example: Based on the image below I have tryied to wrote an example. The output tile has blockIdx.x=1
and blockIdx.y=1
based on that destY=0
and destX=0
. Also,
srcY = 1*6+0-3=3
, srcX = 3
and src = (3*18+3)*3+0=171
. Based on the calculations and the image example we do not have a match. In the first shared memory possision the value that should be stored is that with global index 57
. What is wrong with the abovementioned calculations? Can any one help please?
#define Mask_width 5
#define Mask_radius Mask_width/2
#define TILE_WIDTH 16
#define w (TILE_WIDTH + Mask_width - 1)
#define clamp(x) (min(max((x), 0.0), 1.0))
__global__ void convolution(float *I, const float* __restrict__ M, float *P,
int channels, int width, int height) {
__shared__ float N_ds[w][w];
int k;
for (k = 0; k < channels; k++) {
// First batch loading
int dest = threadIdx.y * TILE_WIDTH + threadIdx.x,
destY = dest / w, destX = dest % w,
srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius,
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius,
src = (srcY * width + srcX) * channels + k;
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
N_ds[destY][destX] = I[src];
else
N_ds[destY][destX] = 0;
// Second batch loading
dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
destY = dest / w, destX = dest % w;
srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
src = (srcY * width + srcX) * channels + k;
if (destY < w) {
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
N_ds[destY][destX] = I[src];
else
N_ds[destY][destX] = 0;
}
__syncthreads();
float accum = 0;
int y, x;
for (y = 0; y < Mask_width; y++)
for (x = 0; x < Mask_width; x++)
accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x];
y = blockIdx.y * TILE_WIDTH + threadIdx.y;
x = blockIdx.x * TILE_WIDTH + threadIdx.x;
if (y < height && x < width)
P[(y * width + x) * channels + k] = clamp(accum);
__syncthreads();
}
}