I've been messing around with this for a while now, but can't seem to get it right. I'm trying to copy objects that contain arrays into CUDA device memory (and back again, but I'll cross that bridge when I come to it):
struct MyData {
float *data;
int dataLen;
};
void copyToGPU() {
// Create dummy objects to copy
int N = 10;
MyData *h_items = new MyData[N];
for (int i=0; i<N; i++) {
h_items[i].dataLen = 100;
h_items[i].data = new float[100];
}
// Copy objects to GPU
MyData *d_items;
int memSize = N * sizeof(MyData);
cudaMalloc((void**)&d_items, memSize);
cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);
// Run the kernel
MyFunc<<<100,100>>>(d_items);
}
__global__
static void MyFunc(MyData *data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (int i=0; i<data[idx].dataLen; i++) {
// Do something with data[idx].data[i]
}
}
When I call MyFunc(d_items), I can access data[idx].dataLen just fine. However, data[idx].data has not been copied yet.
I can't use d_items[i].data in copyToGPU as the destination of cudaMalloc/cudaMemcpy calls, since host code cannot dereference a device pointer.
What to do?
The code you provide copies the MyData structures only: a host address and an integer. To be overly clear, you are copying the pointer and not the data - you have to explicitly copy the data yourself.
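As a sketch of what explicitly copying would look like with the question's layout (keeping the question's names; not necessarily the best approach, as the rest of this answer argues), you would allocate a device buffer for each element's data, patch the pointer in a host-side staging copy of the struct, and only then copy the array of structs:
// Sketch: deep-copy each element's data first, then copy structs whose
// .data members already point at device memory.
MyData *h_staging = new MyData[N];
MyData *d_items;
for (int i = 0; i < N; i++) {
    h_staging[i].dataLen = h_items[i].dataLen;
    cudaMalloc((void**)&h_staging[i].data, h_items[i].dataLen * sizeof(float));
    cudaMemcpy(h_staging[i].data, h_items[i].data,
               h_items[i].dataLen * sizeof(float), cudaMemcpyHostToDevice);
}
cudaMalloc((void**)&d_items, N * sizeof(MyData));
cudaMemcpy(d_items, h_staging, N * sizeof(MyData), cudaMemcpyHostToDevice);
That works, but it costs one cudaMalloc/cudaMemcpy pair per element, which is why the layouts below are usually preferable.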
If the data is always the same LENGTH, then you probably just want to make one big array (first sketch below). If it needs to be in a struct with other data, then an embedded fixed-size array works the same way (second sketch below).
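A rough sketch of the one-big-array version (LENGTH, the fill step, and the kernel body are placeholders):
#define LENGTH 100
float *h_data = new float[N * LENGTH];  // element i lives at h_data[i*LENGTH .. i*LENGTH + LENGTH-1]
// ... fill h_data ...
float *d_data;
cudaMalloc((void**)&d_data, N * LENGTH * sizeof(float));
cudaMemcpy(d_data, h_data, N * LENGTH * sizeof(float), cudaMemcpyHostToDevice);
__global__
static void MyFunc(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < LENGTH; i++) {
        // Do something with data[idx * LENGTH + i]
    }
}
And a sketch of the struct variant, where the array is embedded in the struct instead of pointed to, so the single cudaMalloc/cudaMemcpy of N structs from the question copies everything:
struct MyData {
    float data[LENGTH];  // embedded array, nothing to dereference
    int otherField;      // whatever else needs to travel with each element
};
// The copy is then exactly as in the question:
// cudaMalloc((void**)&d_items, N * sizeof(MyData));
// cudaMemcpy(d_items, h_items, N * sizeof(MyData), cudaMemcpyHostToDevice);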
But I am assuming your data comes in a variety of lengths. One solution is to set LENGTH to the maximum length (and just waste some space), and then do it the same way as above. That might be the easiest way to start, and you can optimize later.
If you can't afford the lost memory and transfer time, then I would have three arrays: one with all the data, one with offsets, and one with lengths, for both the host and device:
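A sketch of that layout (totalLen, the fill step, and the variable names are assumptions):
// Host side: one flat array of all the data, plus per-element offsets and lengths.
float *h_data    = new float[totalLen];  // every element's floats, back to back
int   *h_offsets = new int[N];           // h_offsets[i] = start index of element i in h_data
int   *h_lengths = new int[N];           // h_lengths[i] = number of floats in element i
// ... fill h_data, h_offsets, h_lengths ...
// Device side: mirror the three arrays and copy each one over.
float *d_data;    cudaMalloc((void**)&d_data,    totalLen * sizeof(float));
int   *d_offsets; cudaMalloc((void**)&d_offsets, N * sizeof(int));
int   *d_lengths; cudaMalloc((void**)&d_lengths, N * sizeof(int));
cudaMemcpy(d_data,    h_data,    totalLen * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_offsets, h_offsets, N * sizeof(int),          cudaMemcpyHostToDevice);
cudaMemcpy(d_lengths, h_lengths, N * sizeof(int),          cudaMemcpyHostToDevice);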
Now in thread i you can find the data that starts at d_data[d_offsets[i]] and has a length of d_lengths[i].