According to the official CUDA doc, we have
__host__ __device__ cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
which implies it is both a host and a device function. However, in the actual installation on my local Linux box, I am seeing the following in /usr/local/cuda/include/cuda_runtime_api.h:
/** CUDA Runtime API Version */
#define CUDART_VERSION 9000
// Many lines away...
extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream __dv(0));
which seems to imply it is strictly a host function.
I tried to compile a simple kernel that calls cudaMemcpyAsync(), and got the error
streaming.cu(338): error: calling a __host__
function("cudaMemcpyAsync") from a __global__
function("loopy_plus_one") is not allowed
which is further evidence that it is host-only.
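For context, a minimal kernel along these lines reproduces the error (a hypothetical reconstruction; only the kernel name loopy_plus_one comes from the error message):
__global__ void loopy_plus_one(int *dst, const int *src, size_t n)
{
    // The device-side call below is what nvcc flags as
    // "calling a __host__ function from a __global__ function"
    cudaMemcpyAsync(dst, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
}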
So I'm really confused: is the doc incorrect, or is my CUDA installation out of date?
EDIT: if I change my compilation command to explicitly specify sm_60, i.e., nvcc -arch=sm_60 -o out ./src.cu, then the compilation error is gone, but a new one pops up:
ptxas fatal : Unresolved extern function 'cudaMemcpyAsync'
There is a device implementation of cudaMemcpyAsync in the CUDA device runtime API, which is documented in the CUDA C Programming Guide. There, within the introductory section on Dynamic Parallelism, it notes:
Dynamic Parallelism is only supported by devices of compute capability
3.5 and higher
and within the documentation of the device runtime API memory functions it also notes:
Notes about all memcpy/memset functions:
- Only async memcpy/set functions are supported
- Only device-to-device memcpy is permitted
- May not pass in local or shared memory pointers
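A device-side call that respects these restrictions would look something like this (a minimal sketch; the kernel and buffer names are illustrative):
__global__ void copy_on_device(float *dst, const float *src, size_t n)
{
    // Both pointers refer to global device memory: the copy is
    // device-to-device, asynchronous, and passes no local or
    // shared memory pointers.
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice);
}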
You can also find exact instructions for how you must compile and link code which uses the device runtime API:
CUDA programs are automatically linked with the host runtime library
when compiled with nvcc, but the device runtime is shipped as a static
library which must explicitly be linked with a program which wishes to
use it.
The device runtime is offered as a static library (cudadevrt.lib on
Windows, libcudadevrt.a under Linux and MacOS), against which a GPU
application that uses the device runtime must be linked. Linking of
device libraries can be accomplished through nvcc and/or nvlink.
So to make this work you must do exactly three things (an example compile command follows the list):
- Choose a physical target architecture which is at least compute capability 3.5 when you are compiling
- Use separate compilation for device code when you are compiling
- Link the CUDA device runtime library
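Concretely, assuming a cc 6.0 device and a single source file src.cu (file names are just examples), either of the following should work:
nvcc -arch=sm_60 -rdc=true src.cu -o out -lcudadevrt
or, with compilation and linking as separate steps:
nvcc -arch=sm_60 -dc src.cu -o src.o
nvcc -arch=sm_60 src.o -o out -lcudadevrt
Here -rdc=true (or -dc, which implies it) enables separate compilation of device code, and -lcudadevrt links the static device runtime library.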
It is for these three reasons (i.e. not doing any of them) that you have seen the compilation and linking errors when trying to use cudaMemcpyAsync inside kernel code.
It seems to work once I specify the compute capability correctly:
nvcc -arch=compute_60 -o out src.cu