I am in the process of implementing multithreading on an NVIDIA GeForce GT 650M GPU for a simulation I have created. To make sure everything works properly, I have written some side code to test it. At one point I need to update a vector of variables (each element can be updated independently of the others).
Here is the gist of it:
```
__device__
int doComplexMath(float x, float y)
{
    return x+y;
}
```
```
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y, vector<complex<long double> > *z)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        z[i] = doComplexMath(*x, *y);
}
```
```
int main(void)
{
    int iGAMAf = 1<<10;
    float *x, *y;
    vector<complex<long double> > VEL(iGAMAf,zero);

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, sizeof(float));
    cudaMallocManaged(&y, sizeof(float));
    cudaMallocManaged(&VEL, iGAMAf*sizeof(vector<complex<long double> >));

    // initialize x and y on the host
    *x = 1.0f;
    *y = 2.0f;

    // Run kernel on iGAMAf (1<<10) elements on the GPU
    int blockSize = 256;
    int numBlocks = (iGAMAf + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(iGAMAf, x, y, *VEL);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    return 0;
}
```
I am trying to allocate unified memory (memory accessible from the GPU and CPU). When compiling using nvcc, I get the following error:
`error: no instance of overloaded function "cudaMallocManaged" matches the argument list argument types are: (std::__1::vector<std::__1::complex<long double>, std::__1::allocator<std::__1::complex<long double>>> *, unsigned long)`
How can I overload the function properly in CUDA to use this type with multithreading?
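It isn't possible to do what you are trying to do. `cudaMallocManaged` allocates raw memory: it takes the address of a pointer (`void**`, or `T**` through the C++ template overload) plus a size in bytes, and stores the address of the new allocation in that pointer. `&VEL` is a `std::vector<...>*`, not a pointer to a pointer, so no overload matches. Even if it did, the vector's elements live in a separate heap buffer owned by the vector's allocator, so this would not make them accessible from the GPU.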
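If the container itself isn't essential, the simplest fix is to drop `std::vector` and hand `cudaMallocManaged` a raw pointer. The following is a minimal sketch of the question's program along those lines. It assumes double precision is acceptable and swaps `std::complex<long double>` for `thrust::complex<double>`, because `std::complex` member functions are not `__device__` functions and nvcc does not support `long double` arithmetic on the GPU. `run_raw_managed_demo` is a hypothetical helper name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/complex.h>

// Assumption: thrust::complex<double> replaces std::complex<long double>;
// it is usable in both host and device code.
using cplx = thrust::complex<double>;

__device__ cplx doComplexMath(float x, float y)
{
    return cplx(x + y, 0.0);  // placeholder math, mirroring the question
}

// Grid-stride loop writing through a raw pointer instead of a std::vector.
__global__ void add(int n, const float *x, const float *y, cplx *z)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        z[i] = doComplexMath(*x, *y);
}

// Hypothetical helper: returns true if every element ends up as 3 + 0i.
bool run_raw_managed_demo()
{
    const int n = 1 << 10;
    float *x = nullptr, *y = nullptr;
    cplx *vel = nullptr;

    // Each call gets the address of a raw pointer and a size in bytes.
    bool ok = cudaMallocManaged(&x, sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&y, sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&vel, n * sizeof(cplx)) == cudaSuccess;
    if (ok) {
        *x = 1.0f;  // initialize on the host
        *y = 2.0f;
        int blockSize = 256;
        int numBlocks = (n + blockSize - 1) / blockSize;
        add<<<numBlocks, blockSize>>>(n, x, y, vel);
        // On pre-Pascal GPUs such as the GT 650M, the host must not touch
        // managed memory while a kernel is running, so synchronize first.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < n; ++i)
            ok = vel[i].real() == 3.0 && vel[i].imag() == 0.0;
    }
    cudaFree(x);  // cudaFree(nullptr) is a no-op
    cudaFree(y);
    cudaFree(vel);
    return ok;
}

int main()
{
    bool ok = run_raw_managed_demo();
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```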
To allocate a `std::vector` using managed memory, you would have to write your own allocator class that meets the standard Allocator requirements (so that `std::allocator_traits` can supply the remaining members) and whose `allocate`/`deallocate` call `cudaMallocManaged`/`cudaFree` under the hood. You can then instantiate a `std::vector` using your allocator class.

Also note that your CUDA kernel code is broken, because you can't use `std::vector` in device code.

Note that although the question has managed memory in view, this approach also applies to other types of CUDA allocation, such as pinned allocation.
As another alternative, suggested here, you could consider using a Thrust `thrust::host_vector` in lieu of `std::vector` and use a custom allocator with it. A worked example is here for the case of a pinned allocator (`cudaMallocHost`/`cudaHostAlloc`).
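The Thrust route can be sketched roughly as follows for the pinned case. This is an illustration under assumptions, not the linked worked example: `pinned_allocator`, `pinned_vector` and `run_pinned_demo` are hypothetical names, the allocator uses `cudaMallocHost`/`cudaFreeHost`, and the extra typedefs and `rebind` are included defensively in case the Thrust version in use expects them. It has not been checked against every Thrust release.

```cuda
#include <cstddef>
#include <cstdio>
#include <new>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

// Hypothetical page-locked (pinned) host allocator.
template <class T>
struct pinned_allocator {
    using value_type = T;
    using pointer = T*;
    using const_pointer = const T*;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    template <class U> struct rebind { using other = pinned_allocator<U>; };

    pinned_allocator() = default;
    template <class U>
    pinned_allocator(const pinned_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <class T, class U>
bool operator==(const pinned_allocator<T>&, const pinned_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const pinned_allocator<T>&, const pinned_allocator<U>&) { return false; }

using pinned_vector = thrust::host_vector<float, pinned_allocator<float>>;

bool run_pinned_demo() {
    const int n = 1 << 10;
    pinned_vector h(n, 1.5f);                            // pinned host storage
    thrust::device_vector<float> d(h.begin(), h.end());  // host-to-device copy
    float sum = thrust::reduce(d.begin(), d.end());      // 1024 * 1.5 = 1536
    thrust::sequence(d.begin(), d.end());                // d[i] = i on the GPU
    pinned_vector back(n);
    thrust::copy(d.begin(), d.end(), back.begin());      // device-to-host copy
    return sum == 1536.0f && back[10] == 10.0f && back[n - 1] == float(n - 1);
}

int main() {
    bool ok = run_pinned_demo();
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Here the pinned buffers serve as the source and destination of host-device transfers, which is the usual reason to pin host memory. A managed-memory variant would swap `cudaMallocHost`/`cudaFreeHost` for `cudaMallocManaged`/`cudaFree`.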