I am new to Thrust (CUDA) and I want to do some array operations, but I can't find any similar example on the internet.
I have the following two arrays (2D):
a = { {1, 2, 3}, {4} }
b = { {5}, {6, 7} }
I want Thrust to compute this array:
c = { {1, 2, 3, 5}, {1, 2, 3, 6, 7}, {4, 5}, {4, 6, 7} }
I know how to do it in C/C++ but not how to tell Thrust to do it.
Here is my idea of how it might work:
Thread 1:
Take a[0] -> expand it with b.
Write it to c.
Thread 2:
Take a[1] -> expand it with b.
Write it to c.
But I have no idea how to do that. I could write the arrays a and b into 1D arrays like this:
thrust::device_vector<int> dev_a;
dev_a.push_back(3); // size of first array
dev_a.push_back(1);
dev_a.push_back(2);
dev_a.push_back(3);
dev_a.push_back(1); // size of second array
dev_a.push_back(4);
thrust::device_vector<int> dev_b;
dev_b.push_back(1); // size of first array
dev_b.push_back(5);
dev_b.push_back(2); // size of second array
dev_b.push_back(6);
dev_b.push_back(7);
And the pseudo-function:
struct expand
{
    __host__ __device__
    ?? ?? (const array ai, const array *b) {
        for bi in b: // each array in the 2d array
        {
            c.push_back(bi[0] + ai[0]); // write down the array count
            for i in ai: // each element in the ai array
                c.push_back(i);
            for i in bi: // each element in the bi array
                c.push_back(i);
        }
    }
};
Does anyone have an idea?
I guess you won't get any speed increase on the GPU for this kind of operation, since it needs a lot of memory accesses, which are slow on a GPU.
But if you want to implement it anyway:
For the reason given above, I doubt Thrust will help you with a ready-to-use algorithm. This means you need to write your own kernel; however, you can leave the memory management to Thrust.
It is generally faster to create arrays in CPU memory and, when ready, copy the whole array to the GPU. (CPU<->GPU copies are faster on long contiguous pieces of data.)
Keep in mind that a GPU runs hundreds of threads in parallel. Each thread needs to know what to read and where to write.
Global memory operations are slow (300-400 clocks). Avoid having a thread read the whole array from global memory only to find out that it needed just the last few bytes.
So, here is how I see your program:
Make your arrays 1D in CPU memory, like this:
float array1[] = { 1, 2, 3, 4};
float array2[] = { 5, 6, 7};
int arr1offsets[] = {0, 3, 3, 1}; // (position of the first element, length) pair for each subarray
int arr2offsets[] = {0, 1, 1, 2};
Copy your arrays and offsets to the GPU and allocate memory for the result and its offsets. I guess you'll have to compute the max length of one joined subarray and allocate memory for the worst case.
Run the kernel.
Collect the results.
The kernel may look like this (if I understood your idea correctly):
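The host-side preparation in the steps above can be sketched in plain C++ as follows. This is only an illustration under assumptions: Flat, flatten and longestSubarray are made-up names, not part of Thrust or CUDA, and the copy to the GPU is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container (not from the post): a jagged 2D array stored flat.
struct Flat {
    std::vector<float> values;   // all subarrays, back to back
    std::vector<int>   offsets;  // one (begin, length) pair per subarray
};

// Flatten a nested array into values plus (begin, length) pairs.
Flat flatten(const std::vector<std::vector<float>>& nested) {
    Flat out;
    for (const auto& sub : nested) {
        out.offsets.push_back(static_cast<int>(out.values.size())); // begin
        out.offsets.push_back(static_cast<int>(sub.size()));        // length
        out.values.insert(out.values.end(), sub.begin(), sub.end());
    }
    assert(out.offsets.size() == 2 * nested.size());
    return out;
}

// Length of the longest subarray described by a (begin, length) offsets array.
int longestSubarray(const std::vector<int>& offsets) {
    assert(offsets.size() % 2 == 0);
    int longest = 0;
    for (std::size_t i = 1; i < offsets.size(); i += 2)
        longest = std::max(longest, offsets[i]);
    return longest;
}
```

For the arrays from the question, the longest subarrays have lengths 3 and 2, so a worst-case slot of 3 + 2 = 5 floats per joined subarray is enough. The result buffer then needs 5 floats times the number of joined subarrays.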
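// MAX_SUBARRAY_LEN is assumed to be a compile-time constant defined elsewhere
// (the worst-case length of one joined subarray).
// Launch exactly one thread per subarray pair: there is no bounds check on idx.
__global__ void kernel(float* arr1, int* arr1offset,
                       float* arr2, int* arr2offset,
                       float* result, int* resultoffset)
{
    // Thread idx joins subarray idx of arr1 with subarray idx of arr2.
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int a1beg = arr1offset[idx*2];     // offsets hold (begin, length) pairs
    int a2beg = arr2offset[idx*2];
    int a1len = arr1offset[idx*2+1];
    int a2len = arr2offset[idx*2+1];
    // Each result gets a fixed-size slot of MAX_SUBARRAY_LEN elements.
    resultoffset[idx*2] = idx*MAX_SUBARRAY_LEN;
    resultoffset[idx*2+1] = a1len+a2len;
    for (int k = 0; k < a1len; ++k) result[idx*MAX_SUBARRAY_LEN+k] = arr1[a1beg+k];
    for (int k = 0; k < a2len; ++k) result[idx*MAX_SUBARRAY_LEN+a1len+k] = arr2[a2beg+k];
}
This code is not perfect, but it should do the right thing.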
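Note that the kernel above joins subarray idx of arr1 with subarray idx of arr2. If the goal is instead every combination of a subarray from a with one from b, one possible mapping is i = idx / n2 and j = idx % n2, where n2 is the number of subarrays in the second array. Below is a rough sketch of that idea in plain C++ on the host, with a loop standing in for the kernel launch. joinOne, joinAll and maxLen are names I made up for illustration, and I have not measured how this would perform on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work of one (simulated) thread: join subarray i of arr1 with subarray j of
// arr2, where i = idx / n2 and j = idx % n2. Same data layout as the kernel.
void joinOne(int idx, int n2, int maxLen,
             const float* arr1, const int* arr1offset,
             const float* arr2, const int* arr2offset,
             float* result, int* resultoffset)
{
    int i = idx / n2;
    int j = idx % n2;
    int a1beg = arr1offset[i*2], a1len = arr1offset[i*2+1];
    int a2beg = arr2offset[j*2], a2len = arr2offset[j*2+1];
    assert(a1len + a2len <= maxLen);  // result must fit in its fixed-size slot

    resultoffset[idx*2]   = idx * maxLen;
    resultoffset[idx*2+1] = a1len + a2len;
    for (int k = 0; k < a1len; ++k) result[idx*maxLen + k] = arr1[a1beg + k];
    for (int k = 0; k < a2len; ++k) result[idx*maxLen + a1len + k] = arr2[a2beg + k];
}

// Host loop standing in for the kernel launch; unpacks the result for inspection.
std::vector<std::vector<float>> joinAll(const std::vector<float>& v1, const std::vector<int>& o1,
                                        const std::vector<float>& v2, const std::vector<int>& o2,
                                        int maxLen)
{
    int n1 = static_cast<int>(o1.size() / 2);
    int n2 = static_cast<int>(o2.size() / 2);
    std::vector<float> result(static_cast<std::size_t>(n1) * n2 * maxLen);
    std::vector<int> resultoffset(static_cast<std::size_t>(n1) * n2 * 2);
    for (int idx = 0; idx < n1 * n2; ++idx)  // on the GPU: one thread per idx
        joinOne(idx, n2, maxLen, v1.data(), o1.data(), v2.data(), o2.data(),
                result.data(), resultoffset.data());

    std::vector<std::vector<float>> c;
    for (int idx = 0; idx < n1 * n2; ++idx) {
        const float* beg = result.data() + resultoffset[idx*2];
        c.emplace_back(beg, beg + resultoffset[idx*2+1]);
    }
    return c;
}
```

With the values { 1, 2, 3, 4 } and { 5, 6, 7 }, the offsets {0, 3, 3, 1} and {0, 1, 1, 2}, and maxLen = 5, joinAll returns { {1, 2, 3, 5}, {1, 2, 3, 6, 7}, {4, 5}, {4, 6, 7} }. The body of joinOne is what a kernel thread would run, given a guard so that only threads with idx < n1 * n2 do any work.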