I am currently going through the tutorial examples on http://code.google.com/p/stanford-cs193g-sp2010/ to learn CUDA. The code that demonstrates __global__ functions is given below. It creates two arrays, one on the CPU and one on the GPU, fills the GPU array with the number 7, and copies the GPU array data into the CPU array.
#include <stdlib.h>
#include <stdio.h>

__global__ void kernel(int *array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array[index] = 7;
}

int main(void)
{
    int num_elements = 256;
    int num_bytes = num_elements * sizeof(int);

    // pointers to host & device arrays
    int *device_array = 0;
    int *host_array = 0;

    // malloc a host array
    host_array = (int*)malloc(num_bytes);

    // cudaMalloc a device array
    cudaMalloc((void**)&device_array, num_bytes);

    int block_size = 128;
    int grid_size = num_elements / block_size;
    kernel<<<grid_size,block_size>>>(device_array);

    // download and inspect the result on the host:
    cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);

    // print out the result element by element
    for(int i=0; i < num_elements; ++i)
    {
        printf("%d ", host_array[i]);
    }

    // deallocate memory
    free(host_array);
    cudaFree(device_array);
}
My question is: why is the cudaMalloc((void**)&device_array, num_bytes); statement written with a double pointer? Even the definition of cudaMalloc() says the first argument is a double pointer.
Why not simply return a pointer to the beginning of the allocated memory on the GPU, just like the malloc function does on the CPU?
All CUDA API functions return an error code (or cudaSuccess if no error occurred). Any other results are passed back by reference. However, plain C has no references, so you have to pass the address of the variable in which you want the result stored. Since the result here is a pointer, you need to pass a pointer to that pointer (a double pointer).
Another well-known function that operates on addresses for the same reason is the scanf function. How many times have you forgotten to write the & before the variable you want to store the value in? ;)
int i;
scanf("%d",&i);
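The same pattern can be sketched in plain C without a GPU. This is only an illustration under stated assumptions: my_status, my_alloc, and demo_out_parameter are hypothetical names that mimic the shape of cudaMalloc (status as the return value, address through an output parameter), and host malloc stands in for the device allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, standing in for cudaError_t. */
typedef enum { MY_SUCCESS = 0, MY_ERROR_OUT_OF_MEMORY = 1 } my_status;

/* Hypothetical allocator with the same shape as cudaMalloc:
   the status is the return value, the address goes through out_ptr. */
my_status my_alloc(void **out_ptr, size_t num_bytes)
{
    void *p = malloc(num_bytes);   /* host malloc stands in for the device allocation */
    if (p == NULL) {
        *out_ptr = NULL;
        return MY_ERROR_OUT_OF_MEMORY;
    }
    *out_ptr = p;                  /* write the address into the caller's pointer */
    return MY_SUCCESS;
}

/* Returns 1 if the allocation succeeded and the caller's pointer is usable. */
int demo_out_parameter(void)
{
    void *raw = NULL;
    if (my_alloc(&raw, 16 * sizeof(int)) != MY_SUCCESS) {
        return 0;
    }
    int *array = raw;              /* void * converts implicitly to int * in C */
    array[0] = 7;
    int ok = (array[0] == 7);
    free(raw);
    return ok;
}
```

The caller checks the return value first and only then trusts the pointer, which is the order the CUDA API expects as well.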
This is simply a horrible, horrible API design. The problem with passing double pointers to an allocation function that obtains abstract (void *) memory is that you have to make a temporary variable of type void * to hold the result, then assign it to the real pointer of the correct type you want to use. Casting, as in (void**)&device_array, is invalid C and results in undefined behavior. You should simply write a wrapper function that behaves like normal malloc and returns a pointer, as in:
void *fixed_cudaMalloc(size_t len)
{
    void *p;
    // cudaSuccess is the CUDA runtime's "no error" code
    if (cudaMalloc(&p, len) == cudaSuccess) return p;
    return 0;
}
We cast it to a double pointer because it is a pointer to a pointer: it has to point to a pointer to GPU memory. cudaMalloc() allocates memory on the GPU and stores the address of that memory in the pointer that the first argument points to.
The problem: you have to return two values, a return code and a pointer to memory (valid only if the return code indicates success). So one of the two must be returned through a pointer parameter. You can choose between a pointer to int (for the error code) and a pointer to pointer (for the memory address). One solution is as good as the other, and the second one yields the pointer to pointer. (I prefer this term to "double pointer", which sounds more like a pointer to a double-precision floating-point number.)
With malloc you have the nice property that a null pointer can indicate an error, so you need just one return value. I am not sure whether this is possible with a pointer to device memory, as there might be no null value, or a different one (remember: this is CUDA and not ANSI C). It could be that the null pointer on the host is entirely different from the null used on the device, in which case returning a null pointer to indicate errors would not work and the API has to be designed this way (that would also mean there is no common NULL across host and device).
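For comparison, here is a minimal sketch of the other choice described above: return the pointer and report the error code through a pointer to int. The names alloc_with_status and demo_alloc_with_status and the 0/1 status convention are assumptions made for illustration; this is not how any CUDA function is declared.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical alternative shape: the pointer is the return value,
   the status is written through status_out. */
void *alloc_with_status(size_t num_bytes, int *status_out)
{
    void *p = malloc(num_bytes);
    *status_out = (p != NULL) ? 0 : 1;   /* assumed convention: 0 = success, 1 = failure */
    return p;
}

/* Returns the status reported by the allocator (0 on success). */
int demo_alloc_with_status(void)
{
    int status = -1;
    int *array = alloc_with_status(4 * sizeof(int), &status);
    if (status != 0) {
        return status;
    }
    array[3] = 7;                        /* the memory is usable */
    free(array);
    return status;
}
```

Both shapes carry the same two pieces of information; they differ only in which one travels through the return value.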
In C/C++, you can allocate a block of memory dynamically at runtime by calling the malloc function.
int *h_array;
h_array = (int *)malloc(sizeof(int));
The malloc function returns the address of the allocated memory block, which can be stored in a pointer variable.
Memory allocation in CUDA differs in two ways:
- cudaMalloc returns an integer error code (of type cudaError_t) instead of a pointer to the memory block.
- In addition to the number of bytes to allocate, cudaMalloc also requires a double void pointer as its first parameter.
int *d_array;
cudaMalloc((void **)&d_array, sizeof(int));
The reason behind the first difference is that all CUDA API functions follow the convention of returning an integer error code. So, for consistency, cudaMalloc also returns an error code.
The requirement for a double pointer as the function's first argument can be understood in two steps.
First, since cudaMalloc already returns an error code, its return value can no longer carry the address of the allocated memory. In C, the only other way for a function to hand back a result is through a pointer (an address) passed to it. The function can change the value stored at that address, and the caller can read the changed value afterwards through the same address.
How the double pointer works
The following pseudocode illustrates how it works with the double pointer.
int cudaMalloc(void **d_array, size_t type_size) {
    // conceptual sketch only: the real function allocates device memory
    *d_array = malloc(type_size);
    return return_code;
}
Why do we need the double pointer? Why doesn't a single pointer work?
I normally live in the Python world, so I also struggled to understand why the following does not work.
int cudaMalloc(void *d_array, size_t type_size) {
    d_array = malloc(type_size);   // changes only the local copy
    ...
    return error_status;
}
So why doesn't it work? Because in C, arguments are passed by value: when cudaMalloc is called, a local variable named d_array is created and initialized with the value of the first argument. Assigning to that local variable changes only the copy, and there is no way to retrieve its value outside the function's scope. That is why we need a pointer to a pointer here.
int cudaMalloc(void **d_array, size_t type_size) {
    *d_array = malloc(type_size);   // writes through to the caller's pointer
    ...
    return return_code;
}
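The difference between the two versions can be checked on the host with a small C sketch. The function names below are hypothetical, and ordinary malloc is used in place of a device allocation; the point is only to show which version changes the caller's pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Receives a copy of the caller's pointer; the assignment changes only the copy. */
void alloc_by_value(void *ptr, size_t num_bytes)
{
    ptr = malloc(num_bytes);
    free(ptr);   /* freed here, since the caller can never see this address */
}

/* Receives the address of the caller's pointer and writes through it. */
void alloc_by_address(void **ptr, size_t num_bytes)
{
    *ptr = malloc(num_bytes);
}

/* Returns 1 if the caller's pointer is still NULL after the by-value call. */
int demo_by_value_leaves_null(void)
{
    void *p = NULL;
    alloc_by_value(p, sizeof(int));
    return p == NULL;
}

/* Returns 1 if the caller's pointer was set by the by-address call. */
int demo_by_address_sets_pointer(void)
{
    void *p = NULL;
    alloc_by_address(&p, sizeof(int));
    int was_set = (p != NULL);
    free(p);
    return was_set;
}
```

Only the second version leaves the caller holding the allocated address, which is why the first parameter of cudaMalloc has type void **.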