CUDA kernel function seems to show race conditions

Posted 2019-08-20 03:49

My CUDA kernel function is not returning the intended result (a sum of all elements in vector b) but is instead returning a single value from vector b. I tried memcheck and racecheck, but nothing came up:

[breecej@compute-0-32 newsum]$ cuda-memcheck mystock
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
[breecej@compute-0-32 newsum]$ cuda-memcheck --tool racecheck mystock
========= CUDA-MEMCHECK
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings) 
[breecej@compute-0-32 newsum]$ 

Here is the kernel function:

__global__ void AddDoubles(double *a, double *b, int count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if(id < count)
    {
        a[0] += b[id];
        __syncthreads();
    }
}

And here are the variables going into it and the way I'm calling it in main:

int count = vect.size();
//allocate memory for the sum and stock values for the host side
double *h_a = new double[1];
double *h_b = new double[count];
//array a set to 0, array b set to the values from the line
h_a[0] = 0;
for(int i = 0; i < count; i++)
{
    h_b[i] = vect.at(i);
}
//allocate memory for the sum and stock values arrays for the device side
double *d_a, *d_b;
if(cudaMalloc(&d_a, sizeof(double)) != cudaSuccess)
{
    cout << "Nope! a did not allocate correctly";
    return 0;
}
if(cudaMalloc(&d_b, sizeof(double) * count) != cudaSuccess)
{
    cout << "Nope! b did not allocate correctly";
    cudaFree(d_a);
    return 0;
}
//copy the host pointer to the device pointer
if(cudaMemcpy(d_a, h_a, sizeof(double), cudaMemcpyHostToDevice) != cudaSuccess)
{
    cout << "Could not copy!" << endl;
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
if(cudaMemcpy(d_b, h_b, sizeof(double) * count, cudaMemcpyHostToDevice) != cudaSuccess)
{
    cout << "Could not copy!" << endl;
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
//use AddDoubles to sum up all of the values in b and put them into a
AddDoubles<<<count / 256 + 1, 256>>>(d_a, d_b, count);

where "vect" is a vector of doubles.

Why does it seem like there is a race to change the value in the kernel function, but nothing comes up in racecheck?

1 Answer

Lonely孤独者° · answered 2019-08-20 04:44

Your code has a global memory race, but cuda-memcheck's racecheck tool won't detect it. Quoting from the documentation:

The racecheck tool is a run time shared memory data access hazard detector. The primary use of this tool is to help identify memory access race conditions in CUDA applications that use shared memory.

i.e. racecheck can only detect shared memory data races. Your code doesn't use shared memory, so it reports no hazards.
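For contrast (a hypothetical kernel, not part of the code in your question), an accumulation staged through shared memory is within racecheck's scope, and the unsynchronised read-modify-write below is exactly the kind of hazard it reports:

__global__ void SharedAccumulate(double *out, double *b, int count)
{
    __shared__ double s;                 // single per-block accumulator in shared memory
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) s = 0.0;
    __syncthreads();
    if (id < count)
        s += b[id];                      // unsynchronised shared memory read-modify-write: racecheck flags this
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = s;             // one (incorrect) partial sum per block
}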

The race itself can be made obvious if your kernel is written like this:

__global__ void AddDoubles(double *a, double *b, int count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if(id < count)
    {
        double x = a[0];  // 1. load a[0] to thread local register
        double y = b[id]; // 2. load b[id] to thread local register
        double z = x + y; // 3. perform addition in thread local register
        a[0] = z;         // 4. store thread local register sum to a[0] 
    }
}

This can only ever be correct if execution is serialised. If one thread stores to a[0] while another thread is between steps 1 and 4, the second thread's store will overwrite the first, and that update to a[0] is lost. In a massively parallel, pipelined execution model like CUDA, this will happen as a matter of course.

Note also that your use of __syncthreads() has no effect on this behaviour: it is only an execution barrier for the threads within a block and does nothing to order the conflicting global memory updates, so the kernel will malfunction identically with or without it.
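The simplest way to make the update itself correct is to serialise the conflicting read-modify-write with an atomic operation. A minimal sketch (assuming a device of compute capability 6.0 or newer, which atomicAdd on double requires):

__global__ void AddDoublesAtomic(double *a, double *b, int count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < count)
    {
        atomicAdd(&a[0], b[id]);   // hardware-serialised read-modify-write, no data race
    }
}

This gives the right answer, but every thread contends for the same address, so it is much slower than a proper parallel reduction.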

To get an understanding of how to do this type of reduction operation in parallel in CUDA, please refer to the CUDA reduction example, which includes an excellent white paper on the operation and performance tuning options.
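As a rough sketch of that approach (assuming a power-of-two block size and a hypothetical d_partial array with one element per block), a block-level tree reduction in shared memory looks like this:

__global__ void BlockSum(double *partial, const double *b, int count)
{
    extern __shared__ double s[];                    // one double per thread, sized at launch
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (id < count) ? b[id] : 0.0;
    __syncthreads();
    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];                  // one partial sum per block
}

It could be launched as BlockSum<<<count / 256 + 1, 256, 256 * sizeof(double)>>>(d_partial, d_b, count); the per-block partial sums then have to be combined by a second kernel launch or on the host.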
