Busy spin in CUDA

How can I implement a busy spin mechanism of the form

while(variable == 0);

where variable is updated to 1 by some other CUDA thread after some event has occurred?

I tried to just write it like above but the code just seems to get ignored and the calling thread just runs past it without waiting at all. I'm absolutely sure that the value is 0, but the thread does not wait at all. Also, if I write:

while(variable == 0) __threadfence();

in order to not risk having the variable cached, the thread blocks indefinitely even though the variable gets set to 1 eventually. This is all very strange behavior to me, since replicating this code on the CPU produces the correct behavior.

Edit: Oddly, this seems to work correctly if I have blocks of 1 thread each, but not if I have several threads within one block. So threads from one block can see writes done by threads from other blocks, but not writes done by threads from the same block. Strange...

Busy-spinning requires a lot of attention and you have to be really careful about it!

Keep in mind that the 32 threads forming a warp execute in lock-step (at least on GPUs without independent thread scheduling, i.e. before Volta). When a warp reaches a branch, the threads that do not take it are disabled until the threads executing the branch leave it. That is why busy-spinning within a warp can lead to a deadlock: the 31 spinning threads keep the warp inside the loop and wait forever for the single, disabled thread that was supposed to set the variable.
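For threads of the same block, the usual way around this is not to spin at all but to use the block-wide barrier `__syncthreads()`. This swaps the busy-wait for a different technique. The following is only a minimal sketch under that assumption; the kernel name `same_block_handoff`, the host helper `run_same_block_handoff` and the value 7 are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 publishes a value, every thread of the block reads it.
__global__ void same_block_handoff(int *out)
{
    __shared__ int value;

    if (threadIdx.x == 0) {
        value = 7;                 // the "event": one thread produces the data
    }

    // Instead of `while (flag == 0);`, which can deadlock inside a warp,
    // all threads of the block wait at the barrier until thread 0 has written.
    __syncthreads();

    out[threadIdx.x] = value;      // now safe to read in every thread
}

// Hypothetical host helper: returns how many threads saw the value, or -1 on error.
int run_same_block_handoff()
{
    const int n = 64;              // two warps in a single block
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, n * sizeof(int));

    same_block_handoff<<<1, n>>>(d_out);

    int h_out[n];
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int seen = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] == 7) ++seen;
    }
    return seen;
}
```

The barrier only works if every thread of the block reaches it, so it must not sit inside a branch that some threads skip.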

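Secondly, if you try to synchronise between blocks, you must make sure that both blocks are actually running at the same time. If the waiting block occupies the hardware while the block that is supposed to signal it has not been scheduled yet, the wait never ends. In theory, you don't know how many blocks are resident at once; in practice, you can read the specs of your GPU and launch only as many blocks as it can hold simultaneously (there are some bugs in the driver and/or hardware which can cause problems too).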
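Rather than reading the number off a spec sheet, the runtime can be asked how many blocks of a given kernel fit on the device at once. A rough sketch, assuming device-wide capacity is simply blocks per multiprocessor times the number of multiprocessors; `worker` and `max_resident_blocks` are hypothetical names, while the occupancy and device-property calls are the CUDA runtime's own.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel whose resource usage is being measured.
__global__ void worker(volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *flag = 1;
    }
}

// Hypothetical helper: upper bound on the number of `worker` blocks that can
// be resident on the current device at the same time. Returns 0 on error.
int max_resident_blocks(int threads_per_block)
{
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 0;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;

    int blocks_per_sm = 0;
    // Last argument: dynamic shared memory per block, none in this sketch.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, worker, threads_per_block, 0) != cudaSuccess) {
        return 0;
    }

    return blocks_per_sm * prop.multiProcessorCount;
}
```

Launching at most `max_resident_blocks(threads_per_block)` blocks keeps the grid within what the hardware can hold, but treat the result as an upper bound: other work running on the device can lower the number of blocks that are really resident together.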

Thirdly, you have to remember that the CUDA compiler tries to optimise. You have to declare your shared or global variable as 'volatile' to ensure that it is really re-read from memory on every loop iteration, instead of being read once and kept in a register.
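Putting the three points together, a flag hand-off between two blocks might look like the sketch below. It assumes both blocks are resident at the same time (two small blocks should fit on any current GPU) and lets only one thread per block take part, so nothing spins inside a warp. The names `handoff`, `run_flag_handoff`, `flag`, `payload` and `result` are invented for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: block 0 produces a value and raises a flag,
// block 1 spins on the flag and then consumes the value.
__global__ void handoff(volatile int *flag, volatile int *payload, int *result)
{
    // Only one thread per block participates, so no thread ever spins
    // while waiting for another thread of its own warp.
    if (threadIdx.x != 0) return;

    if (blockIdx.x == 0) {
        *payload = 42;             // produce the data
        __threadfence();           // order the payload write before the flag write
        *flag = 1;                 // signal the waiting block
    } else {
        while (*flag == 0) { }     // volatile: the flag is re-read on every iteration
        __threadfence();           // order the flag read before the payload read
        *result = *payload;
    }
}

// Hypothetical host helper: returns the value seen by the waiting block, or -1 on error.
int run_flag_handoff()
{
    int *buf = nullptr;            // buf[0] = flag, buf[1] = payload, buf[2] = result
    if (cudaMalloc(&buf, 3 * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(buf, 0, 3 * sizeof(int));

    handoff<<<2, 32>>>(buf, buf + 1, buf + 2);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, buf + 2, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(buf);
    return (err == cudaSuccess) ? result : -1;
}
```

On a working device `run_flag_handoff()` should return 42. Without `volatile` on `flag`, the compiler is free to hoist the load out of the loop, which is one way the wait can end up never seeing the update.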