CUDA: How to check for the right compute capability

Published 2020-05-26 14:53

Question:

CUDA code compiled with a higher compute capability will execute perfectly for a long time on a device with lower compute capability, before silently failing one day in some kernel. I spent half a day chasing an elusive bug only to realize that the Build Rule had sm_21 while the device (Tesla C2050) was a 2.0.

Is there any CUDA API code I can add which can self-check if it is running on a device with compatible compute capability? I need to compile and work with devices of many compute capabilities. Is there any other action I can take to ensure such errors do not occur?

Answer 1:

In the runtime API, cudaGetDeviceProperties returns a structure containing two fields, major and minor, which give the compute capability of any given enumerated CUDA device. You can use these to check the compute capability of any GPU before establishing a context on it, to make sure it is the right architecture for what your code does. nvcc can also generate an object file containing multiple architectures from a single invocation using the -gencode option, for example:

nvcc -c -gencode arch=compute_20,code=sm_20  \
        -gencode arch=compute_13,code=sm_13  \
        source.cu

would produce an output object file with an embedded fatbinary object containing cubin files for GT200 and GF100 cards. The runtime API will automagically handle architecture detection and try loading suitable device code from the fatbinary object without any extra host code.
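The runtime check described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names (encode_cc, capability_ok) are hypothetical, and the real values for the device's major and minor would come from cudaGetDeviceProperties as described.

```cpp
// Hypothetical helper (not part of the CUDA API): encode a compute
// capability as major*10 + minor so that versions compare numerically
// (e.g. sm_20 -> 20, sm_13 -> 13).
static int encode_cc(int major, int minor) { return major * 10 + minor; }

// Returns true if a device of capability dev_major.dev_minor is at least
// the capability the code was built for. This catches the failure from
// the question: binary code built for sm_21 running on a 2.0 device.
//
// In a real program the device values come from the runtime API:
//   cudaDeviceProp prop;
//   cudaGetDeviceProperties(&prop, device_id);
//   // prop.major and prop.minor hold the compute capability
static bool capability_ok(int dev_major, int dev_minor,
                          int req_major, int req_minor) {
    return encode_cc(dev_major, dev_minor) >= encode_cc(req_major, req_minor);
}
```

With this check, a Tesla C2050 (capability 2.0) would be rejected for code built with sm_21, turning the silent kernel failure into an explicit startup error.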



Answer 2:

Run a device query to find the compute capability of every device in the system, then execute your code on the desired device with cudaSetDevice().
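The selection step could look like the sketch below. The helper name pick_best_device is hypothetical; in a real program the capability list would be filled by looping over cudaGetDeviceCount() devices, calling cudaGetDeviceProperties() for each, and passing the chosen index to cudaSetDevice().

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not a CUDA API call): given the (major, minor)
// compute capability of every device in the system, return the index of
// the device with the highest capability, or -1 if the list is empty.
static int pick_best_device(const std::vector<std::pair<int, int>>& caps) {
    int best = -1;
    int best_cc = -1;
    for (int i = 0; i < static_cast<int>(caps.size()); ++i) {
        // Encode major.minor as major*10 + minor for numeric comparison.
        int cc = caps[i].first * 10 + caps[i].second;
        if (cc > best_cc) {
            best_cc = cc;
            best = i;
        }
    }
    return best;
}
```

For example, in a system with devices of capability 2.0, 3.5, and 2.1, this would select index 1, which would then be passed to cudaSetDevice(1).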



Tags: cuda