CUDA global (as in C) dynamic arrays allocated to device memory

Posted 2020-02-26 10:01

So, I'm writing some code that uses NVIDIA's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I'm trying to keep a large amount of data resident on the device.

As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.

So, I have device functions that want to access a device allocated array.

Ideally, I could do something like:

__device__ float* global_data;

int main()
{
  cudaMalloc(global_data);
  kernel1<<<blah>>>(blah); //access global data
  kernel2<<<blah>>>(blah); //access global data again
}

However, I haven't figured out how to create a dynamic array this way. I found a workaround by declaring the array as follows:

__device__ float global_data[REALLY_LARGE_NUMBER];

And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.

Tags: cuda nvidia
6 Answers
叛逆
#2 · 2020-02-26 10:09

Something like this should probably work.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUT_CHECK_ERROR(errorMessage) do {                                    \
        cudaDeviceSynchronize();                                              \
        cudaError_t err = cudaGetLastError();                                 \
        if (cudaSuccess != err) {                                             \
            fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)


// Global-scope device pointer, visible to every kernel in this file.
__device__ float *devPtr;

// kernel1 "publishes" an allocation by storing its address in the global.
__global__
void kernel1(float *some_neat_data)
{
    devPtr = some_neat_data;
}

// kernel2 then reaches the data through the global, with no argument passed.
__global__
void kernel2(void)
{
    devPtr[threadIdx.x] *= .3f;
}


int main(int argc, char *argv[])
{
    // Allocate and zero a device buffer, then let kernel1 store its
    // address in the global devPtr so kernel2 can find it.
    float *otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    CUT_CHECK_ERROR("kernel1");

    kernel2<<<1,128>>>();
    CUT_CHECK_ERROR("kernel2");

    cudaFree(otherDevPtr);
    return 0;
}

Give it a whirl.
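
If launching kernel1 just to store the pointer feels roundabout, the same effect can be had entirely from the host: cudaMemcpyToSymbol() can write the cudaMalloc'd address directly into the __device__ symbol. A minimal sketch of that variant (my own, under the assumption that the runtime calls behave as documented):

#include <cuda_runtime.h>

__device__ float *devPtr;   // same global-scope device pointer as above

int main()
{
    float *d_data = nullptr;
    cudaMalloc((void**)&d_data, 256 * sizeof(*d_data));

    // Copy the pointer *value* into the device symbol. Note the size is
    // sizeof(d_data), the pointer itself, not the buffer it points at.
    cudaMemcpyToSymbol(devPtr, &d_data, sizeof(d_data));

    // kernel2 from above could now run without the kernel1 setter.
    return 0;
}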

Deceive 欺骗
#3 · 2020-02-26 10:20

Spend some time focusing on the copious documentation offered by NVIDIA.

From the Programming Guide:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}

So now you can invoke them like so:

float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));

kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
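
None of these calls report failure on their own, so it is worth checking return codes. The following error-handling sketch is my own addition, not part of the original answer (cudaDeviceSynchronize() is the current-API name; older toolkits used cudaThreadSynchronize()):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

int main()
{
    float *devPtr = nullptr;
    cudaError_t err = cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    kernel1<<<1,128>>>(devPtr);
    err = cudaGetLastError();              // catches launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // catches errors raised during execution
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel1: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(devPtr);
    return 0;
}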

"As this data is used in numerous functions, I would like it to be global."

There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
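
For completeness, here is one way that exercise might look; this is my sketch, not the answerer's code. The device pointer lives in a host-side global, so any host function can launch kernels with it:

#include <cuda_runtime.h>

float *devPtr = nullptr;   // host-side global holding the device address

__global__ void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

void setup(void)
{
    cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
    cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
}

void work(void)
{
    kernel1<<<1,128>>>(devPtr);   // no pointer passed in by the caller
}

int main()
{
    setup();
    work();
    cudaFree(devPtr);
    return 0;
}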

EDIT:

OK, the fundamental problem is this: your kernels can only access device memory, and the only global-scope pointers they can use are GPU ones. When you call a kernel from the CPU, behind the scenes the pointers and primitives are copied into GPU registers and/or shared memory before the kernel executes.

So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.

#include <algorithm>
#include <cuda_runtime.h>

// 4 KB of constant memory, filled from the host with cudaMemcpyToSymbol().
__constant__ float devPtr[1024];

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}


int main(int argc, char *argv[])
{
    float some_data[256];
    for (int i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
    {
        some_data[i] = i * 2;
    }
    cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr) ));
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    kernel2<<<1,128>>>(otherDevPtr);

    return 0;
}

Don't forget '--host-compilation=c++' for this example; nvcc toolkits of that era needed the flag to compile host code (std::min here) as C++, while recent nvcc does so by default.

看我几分像从前
#4 · 2020-02-26 10:22

Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.

Animai°情兽
#5 · 2020-02-26 10:22

"As this data is used in numerous functions, I would like it to be global."

"There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving devPtr to a global scope."

What if the kernel operates on a large const structure consisting of arrays? Using so-called constant memory is not an option, because it is very limited in size (64 KB). So then you have to put it in global memory?
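
For what it's worth, a sketch of that case (under my own assumptions: BigTables is a hypothetical structure, sizes chosen arbitrarily). Data too large for the 64 KB of constant memory is simply allocated in global memory with cudaMalloc() and filled with cudaMemcpy(), and the kernel reads it through a pointer:

#include <cuda_runtime.h>

// Hypothetical large read-only structure; at 512 KB it cannot fit
// in the 64 KB of constant memory.
struct BigTables {
    float a[65536];
    float b[65536];
};

__global__ void kernel(const BigTables *tables, float *out)
{
    out[threadIdx.x] = tables->a[threadIdx.x] * tables->b[threadIdx.x];
}

int main()
{
    BigTables *h_tables = new BigTables();
    for (int i = 0; i < 65536; i++) {
        h_tables->a[i] = (float)i;
        h_tables->b[i] = 2.0f * i;
    }

    // One allocation and one copy put the whole structure in global memory.
    BigTables *d_tables = nullptr;
    cudaMalloc((void**)&d_tables, sizeof(BigTables));
    cudaMemcpy(d_tables, h_tables, sizeof(BigTables), cudaMemcpyHostToDevice);

    float *d_out = nullptr;
    cudaMalloc((void**)&d_out, 128 * sizeof(*d_out));
    kernel<<<1,128>>>(d_tables, d_out);

    cudaFree(d_out);
    cudaFree(d_tables);
    delete h_tables;
    return 0;
}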

贼婆χ
#6 · 2020-02-26 10:30

I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple __global__ function similar to kernel1.

The good news is that it does work :)

However, I think it confuses the compiler, as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.

Anyway, for the record: I have looked at many of the examples and did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example that does dynamic global device memory allocation, I would still like to know.

查看更多
狗以群分
#7 · 2020-02-26 10:31

Erm, moving devPtr to global scope is exactly my problem.

I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.

I have read the documentation fairly closely and hit up the NVIDIA forums (and searched Google for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).
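
Since the thread never quite lands on one, here is my attempt at a complete version of the pattern under discussion (a hedged sketch, assuming the runtime API behaves as documented): a dynamically sized, global-scope device array, set up once from the host and used by two kernels without any pointer arguments.

#include <cstdio>
#include <cuda_runtime.h>

__device__ float *global_data;   // the global, dynamically allocated array

__global__ void fill(void)  { global_data[threadIdx.x] = (float)threadIdx.x; }
__global__ void scale(void) { global_data[threadIdx.x] *= 0.5f; }

int main()
{
    const int n = 128;

    // Allocate the array dynamically, then publish its address to the symbol.
    float *d_buf = nullptr;
    cudaMalloc((void**)&d_buf, n * sizeof(*d_buf));
    cudaMemcpyToSymbol(global_data, &d_buf, sizeof(d_buf));

    fill<<<1,n>>>();    // both kernels find the array through global_data
    scale<<<1,n>>>();

    // Read the result back to verify the kernels really touched the buffer.
    float h_buf[n];
    cudaMemcpy(h_buf, d_buf, n * sizeof(*d_buf), cudaMemcpyDeviceToHost);
    printf("h_buf[10] = %f (expected 5.0)\n", h_buf[10]);

    cudaFree(d_buf);
    return 0;
}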
