Does atomic functions really make variables volati

I've written a very simple code ask thread 0 to update a global variable while other threads keep reading that variable.But I found other threads don't really get the value.

Code is here, it is quite simple. Could anyone give me any suggestion how to fix it? Thanks a lot

__global__ void addKernel(int *c)
{
int i = threadIdx.x;
int j = 0;
if (i == 0)
{
    while(*c < 2000){
        int temp = *c;
        printf("*c = %d\n",*c);
        atomicCAS(c,temp, temp+1);
    }       
}else{
    while(*c < 1000)
    {
        j++;
    }
}

}

I'd like to make an analogy: imagine for a second that atomic operations are mutexes: for a program to be well-defined, two threads accessing a shared resource must both agree to use the mutex to access the resource exclusively. If one of the threads accesses the resource without first holding the mutex, the result is undefined.

The same thing is true for atomics: if you decide to treat a particular location in memory as an atomic variable, then all threads accessing that location should agree and treat it as such for your program to have meaning. You should only be manipulating it through atomic loads and stores, not a combination of non-atomic and atomic operations.

In other words, this:

atomicCAS(c,temp, temp+1);

Contains an atomic load-compare-store. The resulting instruction will go all the way down to global memory to load c, do the comparison, and go all the way down to global memory to store the new value.

But this:

while(*c < 2000)

Is not atomic by any means. The compiler (and the hardware) has no idea that c may have been modified by another thread. So instead of going all the way down to global memory, it will simply read from the fastest available cache. Possibly the compiler will even put the variable in a register, because it doesn't see anyone else modifying it in the current thread.

What you would want is something like (imaginary):

while (atomicLoad(c) < 2000)

But to the best of my knowledge there is no such construct in CUDA at the time of writing.

In this regard, the volatile qualifier may help: it tells the compiler to not optimize the variable, and consider it as "modifiable from external sources". This will trigger a load for every read of the variable, although I am not sure this load bypasses all the caches. In practice, it may work, but in theory I don't think you should rely on it. Besides, this will also disable any optimizations on that variable (such as constant propagation or promoting the variable to a register for better performance).

You may want to try the following hack (I haven't tried it):

while(atomicAdd(c, 0) < 2000)

This will emit an atomic instruction that does load from global memory, and therefore should see the most recent value of c. However, it also introduces an (useless in this case) atomic store.