According to the official CUDA doc, we have
__host__ __device__ cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
which implies it is both a host and a device function. However, in the actual installation on my local Linux box, I am seeing the following in /usr/local/cuda/include/cuda_runtime_api.h:
/** CUDA Runtime API Version */
#define CUDART_VERSION 9000
// Many lines away...
extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream __dv(0));
which seems to imply it is strictly a host function.
I tried to compile a simple kernel that calls cudaMemcpyAsync(), and got the error
streaming.cu(338): error: calling a __host__
function("cudaMemcpyAsync") from a __global__
function("loopy_plus_one") is not allowed
which is further evidence that it is host-only.
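For context, a minimal kernel along these lines reproduces the error (a hypothetical reconstruction; only the kernel name loopy_plus_one comes from the error message):
__global__ void loopy_plus_one(int *dst, const int *src, size_t n)
{
    // The device-side call below is what nvcc flags as
    // "calling a __host__ function from a __global__ function"
    cudaMemcpyAsync(dst, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
}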
So I'm really confused: is the doc incorrect, or is my CUDA installation out of date?
EDIT: if I change my compilation command to explicitly specify sm_60, i.e., nvcc -arch=sm_60 -o out ./src.cu, then the compilation error is gone, but a new one pops up:
ptxas fatal : Unresolved extern function 'cudaMemcpyAsync'
There is a device implementation of cudaMemcpyAsync in the CUDA device runtime API, which is documented in the CUDA C Programming Guide. There, within the introductory section on Dynamic Parallelism, it notes:
Dynamic Parallelism is only supported by devices of compute capability
3.5 and higher
and within the documentation of the device runtime API memory functions it also notes:
Notes about all memcpy/memset functions:
- Only async memcpy/set functions are supported
- Only device-to-device memcpy is permitted
- May not pass in local or shared memory pointers
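A device-side call that respects these restrictions would look something like this (a minimal sketch; the kernel and buffer names are illustrative):
__global__ void copy_on_device(float *dst, const float *src, size_t n)
{
    // Both pointers refer to global device memory: the copy is
    // device-to-device, asynchronous, and passes no local or
    // shared memory pointers.
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
}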
You can also find exact instructions for how you must compile and link code which uses the device runtime API:
CUDA programs are automatically linked with the host runtime library
when compiled with nvcc, but the device runtime is shipped as a static
library which must explicitly be linked with a program which wishes to
use it.
The device runtime is offered as a static library (cudadevrt.lib on
Windows, libcudadevrt.a under Linux and MacOS), against which a GPU
application that uses the device runtime must be linked. Linking of
device libraries can be accomplished through nvcc and/or nvlink.
So to make this work you must do exactly three things (an example compile command follows the list):
- Choose a physical target architecture which is at least compute capability 3.5 when you are compiling
- Use separate compilation for device code when you are compiling
- Link the CUDA device runtime library
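Concretely, assuming a cc 6.0 device and a single source file src.cu (file names are just examples), either of the following should work:
nvcc -arch=sm_60 -rdc=true src.cu -o out -lcudadevrt
or, with compilation and linking as separate steps:
nvcc -arch=sm_60 -dc src.cu -o src.o
nvcc -arch=sm_60 src.o -o out -lcudadevrt
Here -rdc=true (or -dc, which implies it) enables separate compilation of device code, and -lcudadevrt links the static device runtime library.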
It is for these three reasons (i.e. not doing any of them) that you have seen the compilation and linking errors when trying to use cudaMemcpyAsync inside kernel code.
It seems to work once I specify the compute capability correctly:
nvcc -arch=compute_60 -o out src.cu