CUDA, mutex and atomicCAS()

Recently I started to develop on CUDA and faced with the problem with atomicCAS(). To do some manipulations with memory in device code I have to create a mutex, so that only one thread could work with memory in critical section of code.

The device code below runs on 1 block and several threads.

__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...

    do 
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);

    //critical section
    //do some manipulations with objects in device memory

    *mutex = 0;

    ...
}

When first thread executes

atomicCAS(mutex, 0, 1 + i);

mutex is 1. After that first thread changes its status from Active to Inactive, and line

*mutex = 0;

is not executed. Other threads stays forever in loop. I have tried many variants of this cycle like while(){};, do{}while();, with temp variable = *mutex inside loop, even variant with if(){} and goto. But result is the same.

The host part of code:

...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);

I use Visual Studio 2012 with CUDA 5.5.

The device is NVidia GeForce GT 240 with compute capability 1.2.

Thanks in advance.

UPD: After some time working on my diploma project this spring, I found a solution for critical section on cuda. This is a combination of lock-free and mutex mechanisms. Here is working code. Used it to impelment atomic dynamic-resizable array.

// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex) 
{
    bool isSet = false; 
    do 
    {
        if (isSet = atomicCAS(mutex, 0, 1) == 0) 
        {
            // critical section goes here
        }
        if (isSet) 
        {
            mutex = 0;
        }
    } 
    while (!isSet);
}

The loop in question

do 
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);

would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.

However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp will execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleeps) for the divergent threads to finish. [1]

Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loops forever.

A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.

[1] CUDA Best Practices Guide - Branching and Divergence

Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.