Numba lacks the CUDA-C gridsync() primitive, so there is no canned method for syncing across an entire grid. Only block-level syncs are available.
If cudaKernel1 has a very short execution time, then the following host-side loop could run ~1000x faster

    for i in range(10000):
        cudaKernel1[(100,100), (32,32)](X)   # updates X in place (kernels can't return values)

if you could put the loop inside the kernel itself, avoiding the GPU kernel setup time. But you can't, because every iteration requires the entire grid to finish before the next one can start, and there is no gridsync() command in Numba.
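For concreteness, this is the shape of the fused kernel I would like to write. Both doSomething() and gridSync() here are hypothetical placeholders; no gridSync() device function actually exists in Numba:

    from numba import cuda

    @cuda.jit
    def fusedKernel(X):
        i = cuda.grid(1)
        for h in range(10000):
            temp = doSomething(X, i)   # hypothetical per-element step
            gridSync()                 # hypothetical: wait for the whole grid
            X[i] = X[i] + temp
            gridSync()                 # hypothetical: sync again before the next read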
Here's an obvious way to fake a gridsync() in Numba, so you'd think people would use this method, but I can't find any examples of it.
However, I have found lots of comments on Stack Overflow stating, without explanation, that trying to use atomic counters to sync blocks across a grid is pointless, unsafe, or prone to deadlock. Instead they recommend exiting the kernel between the two steps. But if each step is very fast, it takes longer to invoke a kernel than to execute the step itself, so looping over the steps without exiting the kernel can be ~1000 times faster.
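To put a rough number on that launch-overhead claim, here is a sketch of the kind of micro-benchmark I have in mind (tinyKernel and the sizes are made up for illustration):

    import time
    import numpy as np
    from numba import cuda

    @cuda.jit('void(float32[:])')
    def tinyKernel(X):
        i = cuda.grid(1)
        if i < X.shape[0]:
            X[i] = X[i] + 1.0   # a trivially fast per-thread step

    X = cuda.to_device(np.zeros(32000, dtype=np.float32))
    tinyKernel[(1000,), (32,)](X)   # warm-up launch triggers compilation
    cuda.synchronize()

    t0 = time.time()
    for _ in range(10000):
        tinyKernel[(1000,), (32,)](X)
    cuda.synchronize()
    print('seconds per launch:', (time.time() - t0) / 10000)

The point is to compare the measured per-launch time against the kernel's tiny amount of actual work.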
I can't figure out what is unsafe about it, or why there would be a race condition that is a real pitfall. What is wrong with something like the following?
    import numpy as np
    from numba import cuda

    @cuda.jit('void(int32[:])')
    def gpu_initGridSync(Global_u):
        if cuda.threadIdx.x == 0:
            Global_u[0] = 0
            Global_u[1] = 0

    @cuda.jit('void(int32[:],int32)', device=True)
    def gpu_fakeGridSync(Global_u, i):
        # Wait till the entire grid has finished doSomething().
        # In CUDA-C we'd call gridsync(), but we lack that in Numba,
        # so we do the following instead. i is the 0-based count of
        # how many times this barrier has been called so far.
        # Sync the threads in the current block.
        cuda.syncthreads()
        # Increment the global counter, once per block.
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 0, 1)
        # Idle in a loop until every block has arrived.
        while Global_u[0] < (i + 1) * cuda.gridDim.x:   #2
            pass
        # Regroup the block threads after the slow global memory reads.
        cuda.syncthreads()
        # Now, to avoid a race condition of blocks re-entering the above
        # while loop before other blocks have exited, we do this global
        # sync a second time on the second counter.
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 1, 1)
        while Global_u[1] < (i + 1) * cuda.gridDim.x:
            pass
        # Regroup the block threads again.
        cuda.syncthreads()
This is then used like this:
    @cuda.jit('void(float32[:],int32[:])')
    def ReallyReallyFast(X, Global_u):
        i = cuda.grid(1)
        for h in range(0, 20000, 2):   # 10000 iterations, two syncs each
            temp = calculateSomething(X)
            gpu_fakeGridSync(Global_u, h)
            X[i] = X[i] + temp
            gpu_fakeGridSync(Global_u, h + 1)

    Global_u = cuda.to_device(np.zeros(2, dtype=np.int32))
    gpu_initGridSync[(1,), (1,)](Global_u)
    ReallyReallyFast[(1000,), (32,)](X, Global_u)
    @cuda.jit('float32(float32[:])', device=True)
    def calculateSomething(X):
        # A dummy example of a very fast per-thread operation.
        # (In the real module this is defined before ReallyReallyFast.)
        i = cuda.grid(1)
        if i > 0:
            return (X[i] - X[i - 1]) / 2.0
        return 0.0
It seems to me this is logically sound. There is one subtle step: initializing the global counters has to be done in its own kernel call to avoid a race condition (if each block zeroed the counters itself at kernel start, a late-starting block could reset them after other blocks had already incremented them). But after that I can freely call fakeGridSync without reinitializing it. I do have to keep track of which loop iteration I am calling it in (hence the parameter passed in to gridSync).
I admit I can see that there's some wasted effort, but is that a deal killer? For example, in statement #2, the while loop means all the threads in all the completed blocks are spinning their wheels, doing wasted work. I suppose that might mildly slow down the blocks that are still trying to execute doSomething(). I'm not sure how bad that wasted effort is, however. A second nitpick about statement #2 is that all the threads are contending for the same global memory location, so they will be slow to access it. That might even be a good thing, if it means the scheduler defers their execution and lets the useful threads execute more often. One could improve this naive code by having only thread 0 in each block poll the counter, if that contention turns out to be a problem, as sketched below.
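Here is a minimal sketch of that refinement, assuming the same Global_u counter array and call-index convention as above: only thread 0 of each block spins on the counter, and the rest of the block waits at a syncthreads().

    from numba import cuda

    @cuda.jit('void(int32[:],int32)', device=True)
    def gpu_fakeGridSync_thread0(Global_u, i):
        cuda.syncthreads()
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 0, 1)
            # only one thread per block polls the global counter
            while Global_u[0] < (i + 1) * cuda.gridDim.x:
                pass
        # the remaining threads wait here instead of hammering Global_u
        cuda.syncthreads()

The same change would apply to the second counter as well.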