Numba lacks the CUDA-C gridsync() primitive, so there is no canned method for syncing across an entire grid. Only block-level syncs are available.
If cudaKernel1 has a very short execution time, then the following host-side loop could run ~1000x faster

    for i in range(10000):
        cudaKernel1[(100,100), (32,32)](X)   # updates X in place (kernels can't return values)

if you could put the loop inside the kernel itself, avoiding the GPU kernel setup time. But you can't, because every iteration requires the entire grid to finish before the next one can start, and there is no gridsync() command in Numba.
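For concreteness, this is the shape of the fused kernel I would like to write. Both doSomething() and gridSync() here are hypothetical placeholders; no gridSync() device function actually exists in Numba:

    from numba import cuda

    @cuda.jit
    def fusedKernel(X):
        i = cuda.grid(1)
        for h in range(10000):
            temp = doSomething(X, i)   # hypothetical per-element step
            gridSync()                 # hypothetical: wait for the whole grid
            X[i] = X[i] + temp
            gridSync()                 # hypothetical: sync again before the next read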
Here's an obvious way to fake a gridsync() in Numba, so you'd think people would use this method, but I can't find any examples of it.
However, I have found lots of comments on Stack Overflow stating, without explanation, that trying to use atomic counters to sync blocks across a grid is pointless, unsafe, or prone to deadlock. Instead they recommend exiting the kernel between the two steps. But if each step is very fast, it takes longer to invoke a kernel than to execute the step itself, so looping over the steps without exiting the kernel can be ~1000 times faster.
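To put a rough number on that launch-overhead claim, here is a sketch of the kind of micro-benchmark I have in mind (tinyKernel and the sizes are made up for illustration):

    import time
    import numpy as np
    from numba import cuda

    @cuda.jit('void(float32[:])')
    def tinyKernel(X):
        i = cuda.grid(1)
        if i < X.shape[0]:
            X[i] = X[i] + 1.0   # a trivially fast per-thread step

    X = cuda.to_device(np.zeros(32000, dtype=np.float32))
    tinyKernel[(1000,), (32,)](X)   # warm-up launch triggers compilation
    cuda.synchronize()

    t0 = time.time()
    for _ in range(10000):
        tinyKernel[(1000,), (32,)](X)
    cuda.synchronize()
    print('seconds per launch:', (time.time() - t0) / 10000)

The point is to compare the measured per-launch time against the kernel's tiny amount of actual work.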
I can't figure out what is unsafe about it, or why there would be a race condition that is a real pitfall. What is wrong with something like the following?
    import numpy as np
    from numba import cuda

    @cuda.jit('void(int32[:])')
    def gpu_initGridSync(Global_u):
        if cuda.threadIdx.x == 0:
            Global_u[0] = 0
            Global_u[1] = 0

    @cuda.jit('void(int32[:],int32)', device=True)
    def gpu_fakeGridSync(Global_u, i):
        # Wait till the entire grid has finished doSomething().
        # In CUDA-C we'd call gridsync(), but we lack that in Numba,
        # so we do the following instead. i is the 0-based count of
        # how many times this barrier has been called so far.
        # Sync the threads in the current block.
        cuda.syncthreads()
        # Increment the global counter, once per block.
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 0, 1)
        # Idle in a loop until every block has arrived.
        while Global_u[0] < (i + 1) * cuda.gridDim.x:   #2
            pass
        # Regroup the block threads after the slow global memory reads.
        cuda.syncthreads()
        # Now, to avoid a race condition of blocks re-entering the above
        # while loop before other blocks have exited, we do this global
        # sync a second time on the second counter.
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 1, 1)
        while Global_u[1] < (i + 1) * cuda.gridDim.x:
            pass
        # Regroup the block threads again.
        cuda.syncthreads()
This is then used like this:
    @cuda.jit('void(float32[:],int32[:])')
    def ReallyReallyFast(X, Global_u):
        i = cuda.grid(1)
        for h in range(0, 20000, 2):   # 10000 iterations, two syncs each
            temp = calculateSomething(X)
            gpu_fakeGridSync(Global_u, h)
            X[i] = X[i] + temp
            gpu_fakeGridSync(Global_u, h + 1)

    Global_u = cuda.to_device(np.zeros(2, dtype=np.int32))
    gpu_initGridSync[(1,), (1,)](Global_u)
    ReallyReallyFast[(1000,), (32,)](X, Global_u)
    @cuda.jit('float32(float32[:])', device=True)
    def calculateSomething(X):
        # A dummy example of a very fast per-thread operation.
        # (In the real module this is defined before ReallyReallyFast.)
        i = cuda.grid(1)
        if i > 0:
            return (X[i] - X[i - 1]) / 2.0
        return 0.0
It seems to me this is logically sound. There is one subtle step: initializing the global counters has to be done in its own kernel call to avoid a race condition (if each block zeroed the counters itself at kernel start, a late-starting block could reset them after other blocks had already incremented them). But after that I can freely call fakeGridSync without reinitializing it. I do have to keep track of which loop iteration I am calling it in (hence the parameter passed in to gridSync).
I admit I can see that there's some wasted effort, but is that a deal killer? For example, in statement #2, the while loop means all the threads in all the completed blocks are spinning their wheels, doing wasted work. I suppose that might mildly slow down the blocks that are still trying to execute doSomething(). I'm not sure how bad that wasted effort is, however. A second nitpick about statement #2 is that all the threads are contending for the same global memory location, so they will be slow to access it. That might even be a good thing, if it means the scheduler defers their execution and lets the useful threads execute more often. One could improve this naive code by having only thread 0 in each block poll the counter, if that contention turns out to be a problem, as sketched below.
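Here is a minimal sketch of that refinement, assuming the same Global_u counter array and call-index convention as above: only thread 0 of each block spins on the counter, and the rest of the block waits at a syncthreads().

    from numba import cuda

    @cuda.jit('void(int32[:],int32)', device=True)
    def gpu_fakeGridSync_thread0(Global_u, i):
        cuda.syncthreads()
        if cuda.threadIdx.x == 0:
            cuda.atomic.add(Global_u, 0, 1)
            # only one thread per block polls the global counter
            while Global_u[0] < (i + 1) * cuda.gridDim.x:
                pass
        # the remaining threads wait here instead of hammering Global_u
        cuda.syncthreads()

The same change would apply to the second counter as well.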