CUDA atomicAdd for doubles definition error

2019-04-06 14:57发布

问题:

In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement this like here. With the new CUDA 8 RC, I run into troubles when I try to compile my code which includes such a function. I guess this is due to the fact that with Pascal and Compute Capability 6.0, a native double version of atomicAdd has been added, but somehow that is not properly ignored for previous Compute Capabilities.

The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:

test.cu(3): error: function "atomicAdd(double *, double)" has already been defined

But if I remove my implementation, I instead get this error:

test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *, double)

I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.

Edit: Also, it is specific for atomicAdd -- if I change the name, it works well.

It really looks like a compiler bug. Can someone else confirm that this is the case?

Example code:

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void kernel(double *a)
{
    double b=1.3;
    atomicAdd(a,b);
}

int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a,sizeof(double));

    kernel<<<1,1>>>(a);

    cudaFree(a);
    return 0;
}

Edit: I got an answer from Nvidia who recognize this problem, and here is what the developers say about it:

The sm_60 architecture, that is newly supported in CUDA 8.0, has native fp64 atomicAdd function. Because of the limitations of our toolchain and CUDA language, the declaration of this function needs to be present even when the code is not being specifically compiled for sm_60. This causes a problem in your code because you also define a fp64 atomicAdd function.

CUDA builtin functions such as atomicAdd are implementation-defined and can be changed between CUDA releases. Users should not define functions with the same names as any CUDA builtin functions. We would suggest you to rename your atomicAdd function to one that is not the same as any CUDA builtin functions.

回答1:

That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation of other compute capabilities guarding it using macro definition

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif

This macro named architecture identification macro is documented here:

5.7.4. Virtual Architecture Identification Macro

The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.

This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.

I assume NVIDIA did not place it for previous CC to avoid conflict for users defining it and not moving to Compute Capability >= 6.x. I would not consider it a BUG though, rather a release delivery practice.

EDIT: macro guard was incomplete (fixed) - here a complete example.

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif

__device__ double s_global ;
__global__ void kernel () { atomicAdd (&s_global, 1.0) ; }


int main (int argc, char* argv[])
{
        kernel<<<1,1>>> () ;
        return ::cudaDeviceSynchronize () ;
}

Compilation with:

$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26

Command lines (both successful):

$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35

You may find why it works with the include file: sm_60_atomic_functions.h, where the method is not declared if __CUDA_ARCH__ is lower than 600.