Can someone please help me convert a nested for loop into a CUDA kernel? Here is the function I am trying to convert into a CUDA kernel:
// Convolution on Host
void conv(int* A, int* B, int* out) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}
I have tried very hard to parallelize this code.
Here is my attempt:
__global__ void conv_Kernel(int* A, int* B, int* out) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    __shared__ int temp[N];
    __syncthreads();
    temp[i + j] = A[i] * B[j];
    __syncthreads();
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += temp[k];
    out[i + j] = sum;
}
Your cpu conv function appears to be doing this (for N = 4, as an example):
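Each product A[i]*B[j] lands in column i+j, so the work forms N rows and (2*N)-1 columns whose column sums are the out values. Sketched for N = 4:

A0B0  A0B1  A0B2  A0B3
      A1B0  A1B1  A1B2  A1B3
            A2B0  A2B1  A2B2  A2B3
                  A3B0  A3B1  A3B2  A3B3
----  ----  ----  ----  ----  ----  ----
out0  out1  out2  out3  out4  out5  out6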
Your convolution is (to me) distinguished by the fact that it is convolving 2 signals of equal length. Since the GPU likes to work on "large" problems, this implies N should be large. However, one immediate problem with your conv_Kernel realization is that it implies the block dimension will be used to index into A, and the thread dimension will be used to index into B. But the thread dimension (threadIdx.x) is limited to 512 or 1024 for current CUDA GPUs. This relegates us to solving only pretty small problems.

There are various other problems with your realization. One is that the shared memory allocation is not large enough to cover the i+j range (which can run from 0 to 2*(N-1)). That is trivial to fix, of course, but the more serious issue is that I don't see a way to map your arithmetic onto anything resembling the desired pattern above. After spending a little while thinking about your kernel, I discarded it.

The convolution problem has a great deal of research associated with it and can be optimized in various ways for massively parallel architectures like the GPU. Therefore I will focus on two very simple realizations that immediately suggest themselves based on the diagram above.
The first realization is simply to re-create the diagram above. We will create an intermediate temp array to store all the individual AxBy products, calculating and storing these products in the conv_Kernel. We will then launch a second kernel (sum_Kernel) which simply sums columns of the temp array to produce the various out values. The first kernel requires N threads, which successively calculate each row of the diagram, in slanting fashion, as we iterate through N for-loop iterations, one per row. The second kernel requires (2*N)-1 threads, one for each column/out value. A sketch of both kernels follows.
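In outline (a sketch, not the exact listing I timed; it assumes N is a compile-time constant as in your code, and that temp is laid out as N rows of (2*N)-1 columns, zero-initialized on the device):

__global__ void conv_Kernel(const int* A, const int* B, int* temp) {
    int j = threadIdx.x + blockDim.x * blockIdx.x;       // one thread per element of B
    if (j < N) {
        int myB = B[j];
        for (int i = 0; i < N; i++)                      // one row of the diagram per iteration
            temp[i * (2 * N - 1) + i + j] = A[i] * myB;  // row i's product lands in column i+j
    }
}

__global__ void sum_Kernel(const int* temp, int* out) {
    int k = threadIdx.x + blockDim.x * blockIdx.x;       // one thread per column/out value
    if (k < (2 * N) - 1) {
        int sum = 0;
        for (int i = 0; i < N; i++)                      // sum column k down the N rows
            sum += temp[i * (2 * N - 1) + k];
        out[k] = sum;
    }
}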
My second realization (conv_Kernel2) dispenses with the need for a temp array, and just assigns one thread to each column/out value, iterating through the N rows, computing the necessary products row-by-row and summing those products "on-the-fly". The sum result is then stored directly in the out array. A sketch follows.
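A sketch of that second realization (again assuming a compile-time N; launch and indexing details are illustrative):

__global__ void conv_Kernel2(int* A, int* B, int* out) {
    int k = threadIdx.x + blockDim.x * blockIdx.x;       // one thread per column/out value
    if (k < (2 * N) - 1) {
        int sum = 0;
        for (int i = 0; i < N; i++)                      // walk down the N rows
            if ((k - i) >= 0 && (k - i) < N)             // row i contributes A[i]*B[k-i] when in range
                sum += A[i] * B[k - i];
        out[k] = sum;                                    // store the column sum directly
    }
}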
Considering only the calculations, and not the time required for data movement/initialization, the GPU realizations begin to be faster than the naive single-threaded CPU implementation at around N = 512 on a K20x GPU, which is what I happened to be using. The second realization is also commended by the fact that the only data movement required is for A, B, and the result. The first realization additionally requires the temp array to be allocated and initialized to all zeroes. The size of the temp array is proportional to N*N, so the second realization also has the benefit that it does not require this temporary storage.

Here's a fully worked test case, running and timing the CPU realization you provided plus the two slightly different GPU realizations that I created:
(note that even just using the -O3 parameter makes a significant difference in the CPU code execution)
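The harness was along these lines (a sketch rather than the exact listing: the initialization values, block size of 256, and clock()-based timing are illustrative), using the conv, conv_Kernel, sum_Kernel and conv_Kernel2 definitions shown above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

// conv, conv_Kernel, sum_Kernel and conv_Kernel2 as shown above

int main() {
    int *h_A   = (int *)malloc(N * sizeof(int));
    int *h_B   = (int *)malloc(N * sizeof(int));
    int *h_out = (int *)calloc(2 * N, sizeof(int));       // CPU result, zeroed for +=
    int *h_gpu = (int *)malloc(2 * N * sizeof(int));      // GPU result copied back
    for (int i = 0; i < N; i++) { h_A[i] = rand() % 10; h_B[i] = rand() % 10; }

    clock_t t0 = clock();
    conv(h_A, h_B, h_out);                                // CPU reference
    double cpu_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    int *d_A, *d_B, *d_out, *d_temp;
    cudaMalloc((void **)&d_A, N * sizeof(int));
    cudaMalloc((void **)&d_B, N * sizeof(int));
    cudaMalloc((void **)&d_out, 2 * N * sizeof(int));
    cudaMalloc((void **)&d_temp, N * (2 * N - 1) * sizeof(int));
    cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_temp, 0, N * (2 * N - 1) * sizeof(int)); // temp must start at zero

    // realization 1: products into temp, then column sums
    t0 = clock();
    conv_Kernel<<<(N + 255) / 256, 256>>>(d_A, d_B, d_temp);
    sum_Kernel<<<(2 * N + 255) / 256, 256>>>(d_temp, d_out);
    cudaDeviceSynchronize();
    double gpu1_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    // realization 2: one thread per out value, summing on the fly
    t0 = clock();
    conv_Kernel2<<<(2 * N + 255) / 256, 256>>>(d_A, d_B, d_out);
    cudaDeviceSynchronize();
    double gpu2_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    cudaMemcpy(h_gpu, d_out, 2 * N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < (2 * N) - 1; k++)                 // verify GPU result against CPU
        if (h_gpu[k] != h_out[k]) { printf("mismatch at %d\n", k); return 1; }
    printf("cpu: %fs  gpu1: %fs  gpu2: %fs\n", cpu_s, gpu1_s, gpu2_s);
    return 0;
}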
As I mentioned, I would consider both of my examples to be quite "naive" for GPU code as well (neither uses shared memory, for example), but they may give you some ideas for how to get started.
For brevity of presentation, I have dispensed with CUDA error checking. However, I would suggest that any time you are having trouble with a CUDA code, you perform proper CUDA error checking. In the case of your conv_Kernel, I believe it would have indicated some errors (if you tried to run it). As a quick test, you can always run any CUDA code with cuda-memcheck to see if any API errors are occurring.

EDIT: I tried a simple shared memory version of my conv_Kernel2, but it wasn't any faster. I believe the reason for this is that these data sets (at N = 4096, A and B are 16 Kbytes each, and out is approximately 32 Kbytes) are small enough to fit easily in the GPU L2 cache, with no thrashing. However, for newer architectures (cc 3.5 and newer) the CUDA compiler can sometimes make additional optimizations if the read-only input data is properly identified as such to the kernel. Therefore, if we change the conv_Kernel2 definition to mark the inputs as read-only (sketched below), I witness slightly improved execution times in my case.
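That is, marking the read-only inputs with const and __restrict__, roughly:

__global__ void conv_Kernel2(const int * __restrict__ A,
                             const int * __restrict__ B,
                             int* out) {
    int k = threadIdx.x + blockDim.x * blockIdx.x;
    if (k < (2 * N) - 1) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            if ((k - i) >= 0 && (k - i) < N)
                sum += A[i] * B[k - i];                  // reads may now go through the read-only cache
        out[k] = sum;
    }
}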
I created a modified version of the code which does the following:

- N is specified on the command line
- the cpu conv and the gpu conv_Kernel2 are included
- a typedef ... mytype; is provided so that the code can be re-compiled easily to test behavior with various datatypes

modified code:
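A skeleton of such a modified version might look like the following (illustrative only; the actual listing differs in details such as initialization, timing, and the full main body):

#include <stdio.h>
#include <stdlib.h>

typedef float mytype;   // example choice; swap in int, double, etc. to test other datatypes

// CPU reference, now parameterized on the runtime size n
void conv(const mytype *A, const mytype *B, mytype *out, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            out[i + j] += A[i] * B[j];
}

__global__ void conv_Kernel2(const mytype * __restrict__ A,
                             const mytype * __restrict__ B,
                             mytype *out, int n) {
    int k = threadIdx.x + blockDim.x * blockIdx.x;       // one thread per out value
    if (k < (2 * n) - 1) {
        mytype sum = 0;
        for (int i = 0; i < n; i++)
            if ((k - i) >= 0 && (k - i) < n)
                sum += A[i] * B[k - i];
        out[k] = sum;
    }
}

int main(int argc, char *argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 4096;           // N taken from the command line
    printf("running with n = %d\n", n);
    // ... allocate, initialize, run conv and conv_Kernel2, time, and compare,
    //     as in the harness above, with N replaced by n and int by mytype ...
    return 0;
}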