I've searched a month for this problem. I cannot synchronize blocks in CUDA.
I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use an global array so a block could write on one element of global array. After this writing, a thread of block waits(i.e. trapped in a while loop) until all blocks write global array.
When I used 3 blocks my synchronization works well (because I have 3 SM). But using 3 blocks gives me 12% occupancy. So I need to use more blocks, but they can't be synchronized. The problem is: a block on a SM waits for other blocks, so the SM can't get another block.
What can I do? How can synchronize blocks when there are blocks more than the number of SMs?
CUDA-GPU specification: CC. 6.1, 3 SM, windows 10, VS2015, GeForce MX150 graphic card. Please help me for this problem. I used a lot of codes but none of them works.
The CUDA programming model methods to do inter-block synchronization are
(implicit) Use the kernel launch itself. Before the kernel launch, or after it completes, all blocks (in the launched kernel) are synchronized to a known state. This is conceptually true whether the kernel is launched from host code or as part of CUDA Dynamic Parallelism launch.
(explicit) Use a grid sync in CUDA Cooperative groups. This has a variety of requirements for support, which you are starting to explore in your other question. The simplest definition of support is if the appropriate property is set (
cooperativeLaunch
). You can query the property programmatically, usingcudaGetDeviceProperties()
.