I've been messing around with this for a while now, but can't seem to get it right. I'm trying to copy objects that contain arrays into CUDA device memory (and back again, but I'll cross that bridge when I come to it):
struct MyData {
    float *data;
    int dataLen;
};
void copyToGPU() {
    // Create dummy objects to copy
    int N = 10;
    MyData *h_items = new MyData[N];
    for (int i = 0; i < N; i++) {
        h_items[i].dataLen = 100;
        h_items[i].data = new float[100];
    }
    // Copy objects to GPU
    MyData *d_items;
    int memSize = N * sizeof(MyData);
    cudaMalloc((void**)&d_items, memSize);
    cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);
    // Run the kernel
    MyFunc<<<100,100>>>(d_items);
}
__global__
static void MyFunc(MyData *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < data[idx].dataLen; i++) {
        // Do something with data[idx].data[i]
    }
}
When I call MyFunc(d_items), I can access data[idx].dataLen just fine. However, data[idx].data has not been copied yet.
I can't use d_items[i].data in copyToGPU as the destination for cudaMalloc/cudaMemcpy operations, since host code cannot dereference a device pointer.
What to do?
- Allocate device data for all structures as a single array.
- Copy the contiguous data from host to GPU.
- Adjust the GPU pointers: point each struct's data member at its slice of the device array before copying the structs over.

Example:
float *d_data;
cudaMalloc((void**)&d_data, N*100*sizeof(float));
for (...) {
    h_items[i].data = d_data + i*100;
}
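Putting those steps together on the host, a minimal sketch of the pointer fix-up (the fixupPointers helper is my own naming; a plain array stands in here for the cudaMalloc'd buffer so the sketch runs without a GPU, but any base pointer works, including one returned by cudaMalloc):

```cpp
#include <cstring>

struct MyData {
    float *data;
    int dataLen;
};

// Aim each struct's data member at its own slice of one big
// allocation *before* the structs are copied to the device,
// so the embedded pointers are already device-valid on arrival.
void fixupPointers(MyData *h_items, int n, float *d_data, int len) {
    for (int i = 0; i < n; i++) {
        h_items[i].dataLen = len;
        h_items[i].data = d_data + i * len;
    }
}
```

After this loop, a single cudaMemcpy of the h_items array (plus one copy of the packed float data into d_data) moves everything across.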
The code you provided copies only the MyData structures themselves: a host address and an integer each. To be overly clear: you are copying the pointer, not the data it points to. You have to copy the data explicitly.
If the data is always the same LENGTH, then you probably just want to make one big array:
float *d_data;
memSize = N * LENGTH * sizeof(float);
cudaMalloc((void**) &d_data, memSize);
//and a single copy
cudaMemcpy(d_data, h_data, memSize, cudaMemcpyHostToDevice);
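That single copy assumes h_data is already one contiguous block. If the host data currently lives in separate per-object allocations, as in the question, it can be packed first. A sketch, with flatten as a hypothetical helper name:

```cpp
#include <cstring>

// Pack n equal-length per-object arrays into one contiguous host
// buffer, so a single cudaMemcpy can move all of it at once.
void flatten(float *const *src, int n, int len, float *h_data) {
    for (int i = 0; i < n; i++) {
        std::memcpy(h_data + i * len, src[i], len * sizeof(float));
    }
}
```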
If it needs to be in a struct with other data, then:
struct MyData {
    float data[LENGTH];
    int other_data;
};
MyData *d_items;
memSize = N * sizeof(MyData);
cudaMalloc((void**) &d_items, memSize);
//and again a single copy
cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);
But, I am assuming you have data that is a variety of lengths. One solution is to set LENGTH to be the maximum length (and just waste some space), and then do it the same way as above. That might be the easiest way to start, and then you can optimize later.
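A host-side sketch of that pad-to-maximum packing (packPadded is an illustrative name; maxLen corresponds to LENGTH above):

```cpp
#include <cstring>

// Pack variable-length arrays into fixed-size rows of maxLen floats,
// zero-padding the unused tail of each row. The device then indexes
// object i's data at d_data + i * maxLen, exactly as in the
// fixed-length case; the real lengths still travel separately.
void packPadded(float *const *src, const int *lengths, int n,
                int maxLen, float *dst) {
    for (int i = 0; i < n; i++) {
        std::memcpy(dst + i * maxLen, src[i],
                    lengths[i] * sizeof(float));
        std::memset(dst + i * maxLen + lengths[i], 0,
                    (maxLen - lengths[i]) * sizeof(float));
    }
}
```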
If you can't afford the wasted memory and transfer time, then I would use three arrays: one with all the data, one with offsets, and one with lengths, on both the host and device:
//host memory
float *h_data;
int h_offsets[N], h_lengths[N]; //or allocate these dynamically if necessary
int totalLength;
//device memory
float *d_data;
int *d_offsets, *d_lengths;
/* calculate totalLength, allocate h_data, and fill the three arrays */
//allocate device memory
cudaMalloc((void**) &d_data, totalLength * sizeof(float));
cudaMalloc((void**) &d_offsets, N * sizeof(int));
cudaMalloc((void**) &d_lengths, N * sizeof(int));
//and now three copies
cudaMemcpy(d_data, h_data, totalLength * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_offsets, h_offsets, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_lengths, h_lengths, N * sizeof(int), cudaMemcpyHostToDevice);
Now in thread i you can find the data that starts at d_data[d_offsets[i]] and has a length of d_lengths[i].
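On the host, the offsets and totalLength fall out of one prefix-sum pass over the lengths; a sketch, with buildOffsets as an illustrative name:

```cpp
// Compute each object's starting offset within the packed buffer
// (an exclusive prefix sum of the lengths) and return the total
// number of floats to allocate and copy.
int buildOffsets(const int *lengths, int n, int *offsets) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        offsets[i] = total;   // object i's data starts here
        total += lengths[i];
    }
    return total;             // this is totalLength
}
```

Element j of object i then lives at d_data[d_offsets[i] + j] on the device.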