cuda, dummy/implicit block synchronization

Posted 2019-03-31 05:05

Question:

I am aware that block sync is not possible, the only way is launching a new kernel.

BUT, let's suppose that I launch X blocks, where X corresponds to the number of SMs on my GPU. I should expect that the scheduler will assign a block to each SM... right? And if the GPU is being used as a secondary graphics card (completely dedicated to CUDA), this means that, theoretically, no other process uses it... right?

My idea is the following: implicit synchronization.

Let's suppose that sometimes I need only one block, and sometimes I need all the X blocks. Well, in those cases where I need just one block, I can configure my code so that the first block (or the first SM) will work on the "real" data while the other X-1 blocks (or SMs) work on some "dummy" data, executing exactly the same instructions, just at some other offset.

That way, all of them will continue to be synchronized, until I need all of them again.
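Something like this minimal sketch, where real and scratch are just placeholder names, and scratch is a throwaway buffer sized so that each dummy block gets its own offset:

__global__ void mykernel ( float *real , float *scratch , int n ) {

    /* Block 0 works on the real data, the other X-1 blocks on
       dummy data, each at its own offset into the scratch buffer. */
    float *data = ( blockIdx.x == 0 ) ? real : &scratch[ ( blockIdx.x - 1 ) * n ];

    /* Exactly the same instructions for every block. */
    if ( threadIdx.x < n )
        data[ threadIdx.x ] *= 2.0f;

    }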

Is the scheduler reliable under these conditions? Or can you never be sure?

Answer 1:

You've got several questions in one, so I'll try to address them separately.

One block per SM

I asked this a while back on nVidia's own forums, as I was getting results that indicated that this is not what happens. Apparently, the block scheduler will not assign a block per SM if the number of blocks is equal to the number of SMs.
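If you want to check this on your own device, here is a small diagnostic sketch; the kernel name and output array are mine, and %smid is a PTX special register holding the ID of the SM a block is currently running on:

__global__ void which_sm ( int *smids ) {

    /* Read the PTX special register %smid. */
    int smid;
    asm( "mov.u32 %0, %%smid;" : "=r"( smid ) );

    /* Have one thread per block record its SM. */
    if ( threadIdx.x == 0 )
        smids[ blockIdx.x ] = smid;

    }

Launch this with as many blocks as you have SMs and check on the host whether the recorded IDs are all distinct.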

Implicit synchronization

No. First of all, you cannot guarantee that each block will have its own SM (see above). Secondly, not all blocks can access global memory at the same time. If they run synchronously at all, they will lose this synchrony as of the first memory read/write.

Block synchronization

Now for the good news: Yes, you can. The atomic instructions described in Section B.11 of the CUDA C Programming Guide can be used to create a barrier. Assume that you have N blocks executing concurrently on your GPU.

__device__ int barrier = N;

__global__ void mykernel ( ) {

    /* Do whatever it is that this block does. */
    ...

    /* Make sure all threads in this block are actually here. */
    __syncthreads();

    /* Once we're done, decrease the value of the barrier. */
    if ( threadIdx.x == 0 )
        atomicSub( &barrier , 1 );

    /* Now wait for the barrier to be zero. */
    if ( threadIdx.x == 0 )
        while ( atomicCAS( &barrier , 0 , 0 ) != 0 );

    /* Make sure everybody has waited for the barrier. */
    __syncthreads();

    /* Carry on with whatever else you wanted to do. */
    ...

    }

The instruction atomicSub(p,i) computes *p -= i atomically and is only called by the zeroth thread in the block, i.e. we only want to decrement barrier once. The instruction atomicCAS(p,c,v) sets *p = v iff *p == c and returns the old value of *p. This part just loops until barrier reaches 0, i.e. until all blocks have crossed it.

Note that you have to wrap this part in calls to __syncthreads() as the threads in a block do not execute in strict lock-step and you have to force them all to wait for the zeroth thread.

Just remember that if you call your kernel more than once, you should set barrier back to N.
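One way to do the reset from the host, a sketch using the runtime API (threads stands for whatever block size you use):

/* Reset the barrier to N before the next launch. */
int n = N;
cudaMemcpyToSymbol( barrier , &n , sizeof( int ) );
mykernel<<< N , threads >>>();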

Update

In reply to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to start more blocks than can be concurrently scheduled on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.
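If you would rather query this at runtime than use the spreadsheet, here is a sketch assuming CUDA 6.5 or later (threadsPerBlock is your chosen block size):

/* Ask how many blocks of mykernel can be resident on one SM,
   then scale by the SM count to get the largest safe grid size. */
int device , numBlocksPerSm;
cudaDeviceProp prop;
cudaGetDevice( &device );
cudaGetDeviceProperties( &prop , device );
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocksPerSm , mykernel , threadsPerBlock , 0 );
int maxBlocks = numBlocksPerSm * prop.multiProcessorCount;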

Judging by the original question, though, only as many blocks as there are SMs are being started, so this point is moot.



Answer 2:

@Pedro is definitely wrong!

Achieving global synchronization has been the subject of several research works recently, at least for non-Kepler architectures (I don't have one yet). The conclusion is always the same (or should be): it is not possible to achieve such global synchronization across the whole GPU.

The reason is simple: CUDA blocks cannot be preempted, so if you fully occupy the GPU, threads waiting at the barrier rendez-vous will never allow their block to terminate. Thus, the block will not be removed from the SM, and will prevent the remaining blocks from running.

As a consequence, you will simply freeze the GPU, which will never be able to escape from this deadlock state.
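To make the failure mode concrete, a sketch with hypothetical numbers, reusing the barrier kernel from the answer above:

/* Suppose at most K blocks can be resident at once and we launch
   N > K with barrier = N. The K resident blocks spin at the
   barrier waiting for N decrements that can never happen: the
   remaining N - K blocks are never scheduled because no resident
   block terminates, so barrier never reaches zero. */
mykernel<<< N , threads >>>();   /* Deadlocks whenever N > K. */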

-- edit to answer Pedro's remarks --

Such shortcomings have been noticed by other authors, for example the author of OpenCL in Action: http://www.openclblog.com/2011/04/eureka.html

-- edit to answer Pedro's second remarks --

The same conclusion is reached by @Jared Hoberock in this SO post: Inter-block barrier on CUDA