So, I'm trying to write some code that uses NVIDIA's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device.
As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.
So, I have device functions that want to access a device allocated array.
Ideally, I could do something like:
__device__ float* global_data;
main()
{
cudaMalloc(global_data);
kernel1<<<blah>>>(blah); //access global data
kernel2<<<blah>>>(blah); //access global data again
}
However, I haven't figured out how to create a dynamic array. I figured out a workaround by declaring the array as follows:
__device__ float global_data[REALLY_LARGE_NUMBER];
And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.
Something like this should probably work.
Give it a whirl.
Spend some time focusing on the copious documentation offered by NVIDIA.
From the Programming Guide:
That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:
So now you can invoke them like so:
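The snippets this answer refers to are not shown here, so the following is only a minimal sketch of the pass-a-pointer approach it describes, not the original code. The kernel bodies, the `n` parameter, the launch configuration, and the helper name `run_with_pointer_args` are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels: each receives the device pointer as an argument.
__global__ void kernel1(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = static_cast<float>(i);   // fill with the index
}

__global__ void kernel2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                   // reuse the same buffer
}

// Hypothetical helper: allocates once, runs both kernels on the same buffer,
// and returns the last element (or -1.0f if any CUDA call fails).
float run_with_pointer_args(int n) {
    float* devPtr = nullptr;
    if (cudaMalloc((void**)&devPtr, n * sizeof(float)) != cudaSuccess) return -1.0f;

    int threads = 256;                            // assumed block size
    int blocks = (n + threads - 1) / threads;
    kernel1<<<blocks, threads>>>(devPtr, n);
    kernel2<<<blocks, threads>>>(devPtr, n);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    float last = -1.0f;
    cudaError_t err = cudaMemcpy(&last, devPtr + (n - 1), sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    return err == cudaSuccess ? last : -1.0f;
}
```

The data stays on the device between the two launches; only the pointer value travels with each kernel call, which is cheap compared with copying the array itself.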
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
EDIT:
Ok, the fundamental problem is this: your kernels can only access device memory and the only global-scope pointers that they can use are GPU ones. When calling a kernel from your CPU, behind the scenes what happens is that the pointers and primitives get copied into GPU registers and/or shared memory before the kernel gets executed.
So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.
Don't forget '--host-compilation=c++' for this example.
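For reference, here is a rough sketch of what the cudaMemcpyToSymbol() suggestion could look like with a global-scope device pointer. It is an illustration under assumptions, not the original example: the kernel bodies, the `n` parameter, the launch configuration, and the helper name `run_with_global_pointer` are made up. The idea is to allocate with cudaMalloc() into an ordinary host-side pointer variable, then copy that pointer value (not the data) into the `__device__` symbol.

```cuda
#include <cuda_runtime.h>

// Global-scope device pointer, as in the question. The variable itself lives
// in device memory, so host code must not assign or dereference it directly.
__device__ float* global_data;

// Hypothetical kernels: neither takes the data pointer as an argument.
__global__ void kernel1(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) global_data[i] = static_cast<float>(i);
}

__global__ void kernel2(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) global_data[i] += 1.0f;
}

// Hypothetical helper: returns the last element after both kernels,
// or -1.0f if any CUDA call fails.
float run_with_global_pointer(int n) {
    float* devPtr = nullptr;                      // host variable holding a device address
    if (cudaMalloc((void**)&devPtr, n * sizeof(float)) != cudaSuccess) return -1.0f;

    // Copy the pointer value into the __device__ symbol.
    if (cudaMemcpyToSymbol(global_data, &devPtr, sizeof(devPtr)) != cudaSuccess) {
        cudaFree(devPtr);
        return -1.0f;
    }

    int threads = 256;                            // assumed block size
    int blocks = (n + threads - 1) / threads;
    kernel1<<<blocks, threads>>>(n);
    kernel2<<<blocks, threads>>>(n);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    float last = -1.0f;
    cudaError_t err = cudaMemcpy(&last, devPtr + (n - 1), sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    return err == cudaSuccess ? last : -1.0f;
}
```

The host keeps its own copy of the address in `devPtr` so it can still call cudaMemcpy() and cudaFree() on the buffer; the `global_data` symbol is only for device code.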
Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.
What if the kernel operates on a large const structure consisting of arrays? Using the so-called constant memory is not an option, because it's very limited in size. So then you have to put it in global memory?
I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.
The good news is that it does work :)
However, I think it confuses the compiler as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.
Anyway, for the record: I have looked at many of the examples and did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example where they do dynamic global device memory allocation, I would still like to know.
Erm, it was exactly that problem of moving devPtr to global scope that was my problem.
I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.
I have read the documentation fairly closely, and hit up the NVIDIA forums (and searched Google for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).