I am attempting to implement a shared memory based matrix multiplication kernel as outlined in the CUDA C Programming Guide. The following is the kernel:
__global__ void matrixMultiplyShared(float * A, float * B, float * C,
                                     int ARows, int AColumns,
                                     int BRows, int BColumns,
                                     int CRows, int CColumns) {

    float * CSub = &C[CColumns * 16 * blockIdx.y + 16 * blockIdx.x];

    float CValue = 0;

    for (int k = 0; k < (AColumns / 16); ++k) {

        float * ASub = &A[AColumns * 16 * blockIdx.y + 16 * k];
        float * BSub = &B[AColumns*16*k + 16*blockIdx.y];

        __shared__ float As[16][16];
        __shared__ float Bs[16][16];

        As[threadIdx.y][threadIdx.x] = ASub[threadIdx.y*AColumns+threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = BSub[threadIdx.y*AColumns+threadIdx.x];

        __syncthreads();

        for (int n = 0; n < 16; ++n)
            CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];

        __syncthreads();
    }

    CSub[threadIdx.x*CColumns+threadIdx.y] = CValue;
}
While the following is the call to the kernel:
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid;
    dimGrid.x = (CColumns + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;

    matrixMultiplyShared<<<dimGrid, dimBlock>>>(deviceA, deviceB, deviceC,
                                                ARows, AColumns,
                                                BRows, BColumns,
                                                CRows, CColumns);
Unfortunately this seems to produce incorrect results.
Any assistance/explanations would be greatly appreciated.
There are at least 2 basic errors in your kernel, both relatively trivial. Where you have this:

    float * BSub = &B[AColumns*16*k + 16*blockIdx.y];

You should use this:

    float * BSub = &B[AColumns*16*k + 16*blockIdx.x];

The B sub-tile for a given output tile is selected by the block's x index (the output column), not its y index. And where you have this:

    CSub[threadIdx.x*CColumns+threadIdx.y] = CValue;

You should use this:

    CSub[threadIdx.y*CColumns+threadIdx.x] = CValue;

In row-major storage the y (row) index must be scaled by the row width; as written, your kernel stores the result transposed.
This should allow you to get basic correctness under the following conditions:

1. square matrices
2. matrix dimensions that are evenly divisible by the tile dimension (16, here)
Fixing the square matrix limitation is not difficult. Fixing the requirement that the matrix dimensions be evenly divisible by the tile dimension involves considerable changes to the kernel, in order to:

- avoid reading or writing out-of-range elements in the "border" tiles
- zero-fill the out-of-range entries of the shared memory tiles, so that the partial tiles still produce correct dot products
Since your code doesn't account for any of this, I wasn't sure whether you were asking about it, and chose not to address those issues specifically.
I was able to get the following adaptation of your code working as a basic example. (Note that, to keep the code size down, I have dispensed with the usual CUDA error checking. Please don't treat this as a representative example of good coding; do proper error checking. The point of my answer is not to demonstrate good CUDA error checking, but to show an algorithmically correct example.)
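The exact listing is not reproduced above, so what follows is my own sketch of what such a generalized adaptation typically looks like: the `TILE_DIM` name is mine, and the guards implement the zero-fill / bounds-check scheme described earlier, allowing non-square matrices and dimensions not divisible by the tile size.

```cuda
#define TILE_DIM 16

__global__ void matrixMultiplyShared(const float *A, const float *B, float *C,
                                     int ARows, int AColumns,
                                     int BRows, int BColumns,
                                     int CRows, int CColumns) {
    float CValue = 0;

    int Row = blockIdx.y * TILE_DIM + threadIdx.y;   // output row this thread owns
    int Col = blockIdx.x * TILE_DIM + threadIdx.x;   // output column this thread owns

    __shared__ float As[TILE_DIM][TILE_DIM];
    __shared__ float Bs[TILE_DIM][TILE_DIM];

    // Round the number of tile phases up so partial border tiles are covered
    for (int k = 0; k < (AColumns + TILE_DIM - 1) / TILE_DIM; ++k) {

        // Load phase: zero-fill out-of-range elements of the A and B tiles
        if (Row < ARows && k*TILE_DIM + threadIdx.x < AColumns)
            As[threadIdx.y][threadIdx.x] = A[Row*AColumns + k*TILE_DIM + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;

        if (k*TILE_DIM + threadIdx.y < BRows && Col < BColumns)
            Bs[threadIdx.y][threadIdx.x] = B[(k*TILE_DIM + threadIdx.y)*BColumns + Col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        for (int n = 0; n < TILE_DIM; ++n)
            CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];

        __syncthreads();
    }

    // Store phase: only in-range threads write a result
    if (Row < CRows && Col < CColumns)
        C[Row*CColumns + Col] = CValue;
}
```

The launch configuration from the question (grid dimensions rounded up from `CColumns`/`CRows`) works unchanged with this kernel, since out-of-range threads simply skip the final store.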