I was reading about CUDA and tried to write a simple program that generates every possible permutation of the array {a,b,c,d}, but I don't know how to do it the CUDA way (all the examples I have read are of the form a[blockIdx.x] = b[blockIdx.x] + c[blockIdx.x]).
Any help will be appreciated.
Below is a somewhat naive implementation of a parallel permutation generator in CUDA; the example generates all possible permutations of ABCD.
All permutations can be produced by fixing the first symbol to, say, X and appending every permutation of the remaining symbols, then changing the first symbol to, say, Y and repeating the process. The simple idea behind the code is therefore to assign 4 threads to the job, each thread handling a different initial symbol. The permutations of the remaining symbols are evaluated the canonical way, namely by recursion.
Obviously, the code below can be made more general and perhaps improved, but it should give you a first rough idea to work on.
#include <stdio.h>
#include <stdlib.h>

// Simple CUDA error checking: print the error string, file, and line, then abort
inline void GPUassert(cudaError_t code, const char *file, int line, bool Abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (Abort) exit(code);
    }
}
#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }

// Swap two characters; callable from both host and device code
__host__ __device__ void swap(char *x, char *y)
{
    char temp = *x;
    *x = *y;
    *y = temp;
}

// Recursively permute a[i..n]; when a complete permutation is reached, print it
__device__ void permute_device(char *a, int i, int n, int tid, int *count)
{
    if (i == n) {
        // a points one past the fixed first symbol; step back one position to recover all 4 symbols
        char b[5]; char *c = a - 1;
        b[0] = c[0]; b[1] = c[1]; b[2] = c[2]; b[3] = c[3]; b[4] = '\0'; // null-terminate for %s
        printf("Permutation nr. %i from thread nr. %i is equal to %s\n", count[0], tid, b);
        count[0] = count[0] + 1;
    }
    else {
        for (int j = i; j <= n; j++) {
            swap(a + i, a + j);
            permute_device(a, i + 1, n, tid, count);
            swap(a + i, a + j); // backtrack
        }
    }
}

// One thread per initial symbol: fix that symbol in front, then permute the remaining three recursively
__global__ void permute_kernel(char *d_A, int size)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int count[1]; count[0] = 0;                        // per-thread permutation counter
    char local_array[4];
    for (int i = 0; i < size; i++) local_array[i] = d_A[i];
    swap(local_array + threadIdx.x, local_array);      // move this thread's symbol to the front
    permute_device(local_array + 1, 0, 2, tid, count); // permute the remaining 3 symbols
}

int main()
{
    char h_a[] = "ABCD";
    char *d_a;
    GPUerrchk(cudaMalloc((void**)&d_a, sizeof(h_a)));
    GPUerrchk(cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice));
    printf("\n\n Permutations on GPU\n");
    permute_kernel<<<1, 4>>>(d_a, 4);
    GPUerrchk(cudaPeekAtLastError());
    GPUerrchk(cudaDeviceSynchronize());
    GPUerrchk(cudaFree(d_a));
    getchar();
    return 0;
}
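Note that both device-side printf and device-side recursion require a card of compute capability 2.0 or higher, so compile for such a target, e.g. nvcc -arch=sm_20 permute.cu (the file name here is just an example).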
I wonder if the combination you want is actually pairwise.
- Think about the length of the array (n).
- Think about the combination complexity (n^2 for the pairwise case).
- Choose a way to compute it in parallel (for example, use the block id as the offset and the thread id as the base element: array(threadid) + array(threadid + offset)).
So your kernel could be written like this:
template<typename T>
__global__ void combination(T* in, T* out, int arrlen) {
    int tid = threadIdx.x;      // base element index
    int bid = blockIdx.x + 1;   // pair offset, 1 .. gridDim.x
    // each block handles one offset; each thread combines its element with the one 'bid' positions ahead (wrapping around)
    out[blockIdx.x * blockDim.x + tid] = in[tid] + in[(tid + bid) % arrlen]; // 0+1, 1+2, ..., 0+2, 1+3
}
You can then launch the kernel with arrlen - 1 blocks (one per offset) of arrlen threads each (one per base element), so out must hold (arrlen - 1) * arrlen results.
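For completeness, here is a minimal host-side sketch of how such a kernel might be launched; the array length 4, the int element type, and the h_in/h_out/d_in/d_out names are just for illustration:
#include <stdio.h>

int main()
{
    const int arrlen = 4;
    int h_in[arrlen] = {1, 2, 3, 4};
    int h_out[(arrlen - 1) * arrlen];                  // one row of arrlen sums per offset

    int *d_in, *d_out;
    cudaMalloc((void**)&d_in, sizeof(h_in));
    cudaMalloc((void**)&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    combination<int><<<arrlen - 1, arrlen>>>(d_in, d_out, arrlen); // one block per offset, one thread per element

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < (arrlen - 1) * arrlen; i++) printf("%d ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}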