How can I implement a busy spin mechanism of the form
while(variable == 0);
where variable is set to 1 by some other CUDA thread after some event has occurred?
I tried writing it exactly as above, but the loop seems to be ignored: the calling thread runs straight past it without waiting at all. I'm absolutely sure that the value is 0 at that point. Also, if I write:
while(variable == 0) __threadfence();
in order to avoid the risk of the variable being cached, the thread blocks indefinitely even though the variable does eventually get set to 1. This is all very strange behavior to me, since the equivalent code on the CPU behaves correctly.
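To make the setup concrete, here is a minimal sketch of the kind of spin-wait described above. It is not the original code: the names `flag`, `done`, and `spinDemo` are hypothetical, and it assumes the flag lives in global memory and is declared `volatile`, so that the compiler has to re-read it on every iteration instead of removing the loop or hoisting the load out of it. It also assumes that both blocks are resident on the GPU at the same time, which the programming model does not guarantee in general but should hold in practice for two one-thread blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical names. volatile forces a fresh load of flag on every loop iteration.
__device__ volatile int flag = 0;
__device__ int done = 0;

__global__ void spinDemo()
{
    if (blockIdx.x == 0) {
        // Waiting thread: spin until the other block sets the flag.
        while (flag == 0) { }
        done = 1;              // only reached after the flag has been observed as set
    } else {
        // Signalling thread: the "event" is simply this write.
        flag = 1;
    }
}

int main()
{
    // Two blocks of one thread each, so the waiter and the signaller
    // are never in the same warp.
    spinDemo<<<2, 1>>>();

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int hostDone = 0;
    cudaMemcpyFromSymbol(&hostDone, done, sizeof(int));
    printf("waiter finished: %d\n", hostDone);
    return hostDone == 1 ? 0 : 1;
}
```

The launch configuration is deliberate: putting the waiting and signalling threads in separate blocks keeps them out of the same warp, so the spinning thread cannot hold up the thread that sets the flag.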
Edit: Oddly, this works correctly if each block contains a single thread, but not if there are several threads within one block. So threads can see writes made by threads in other blocks, but not writes made by threads in their own block. Strange...
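For comparison, a hedged sketch of a hand-off between threads of the same block that uses the block-wide barrier `__syncthreads()` instead of a spin loop. It is offered as an assumption about what may be relevant here, not a confirmed diagnosis: on architectures where the threads of a warp execute in lockstep, a thread spinning in a loop can keep the thread in the same warp that would set the variable from ever running. The names `blockHandoff` and `value` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 produces a value, every thread of the block consumes it.
__global__ void blockHandoff(int *out)
{
    __shared__ int value;

    if (threadIdx.x == 0) {
        value = 42;            // the "event": thread 0 publishes a value
    }

    // Barrier reached by all threads of the block; the shared-memory write
    // made before it is visible to every thread after it.
    __syncthreads();

    out[threadIdx.x] = value;
}

int main()
{
    const int n = 64;          // one block of 64 threads (two warps)
    int *devOut = nullptr;
    int hostOut[n];

    cudaMalloc(&devOut, n * sizeof(int));
    blockHandoff<<<1, n>>>(devOut);
    cudaError_t err = cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Check that every thread saw the value written by thread 0.
    int ok = 1;
    for (int i = 0; i < n; ++i) {
        if (hostOut[i] != 42) ok = 0;
    }
    printf("all threads saw the value: %d\n", ok);
    return ok ? 0 : 1;
}
```

The barrier only works if every thread of the block reaches it, so this fits cases where the waiting and signalling threads follow the same control flow up to the hand-off point.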