As is well known, OpenCL's barrier() function works only within a single workgroup, and there is no direct way to synchronize across workgroups. If it is possible at all, what is the best approach for global synchronization today? Using atomics, OpenCL 2.0 features, etc.?
GitHub links and examples are welcome!
Thanks!
Global synchronization within a kernel is not possible. This is because work groups are not guaranteed to run at the same time. You can achieve a sort of global sync from the host application if you break your kernel into pieces. This is not suitable for many kernels, especially if you use a lot of local memory or have a bit of initialization code before your kernel does any real work.
Break your kernel into two parts, kernelA and kernelB for example. Global synchronization is then simply a matter of running the NDRange for kernelA, then finish(), and then the NDRange for kernelB. The global data will remain in memory between the two calls.
Again, not pretty and not necessarily high performance, but if you really must have global sync, this is the only way to get it.
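A minimal host-side sketch of that sequence, in case it helps. The helper name run_two_phase and its parameters are made up for illustration. It assumes both kernels are already built, their arguments already point at the same global buffers, and global_size is a multiple of local_size.

```c
#include <stddef.h>
#include <CL/cl.h>

/* Hypothetical helper: run kernelA over the whole NDRange, wait for it,
 * then run kernelB. Returns the first OpenCL error code encountered. */
cl_int run_two_phase(cl_command_queue queue,
                     cl_kernel kernelA, cl_kernel kernelB,
                     size_t global_size, size_t local_size)
{
    cl_int err;

    /* Phase 1: every work group of kernelA runs to completion. */
    err = clEnqueueNDRangeKernel(queue, kernelA, 1, NULL,
                                 &global_size, &local_size, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Global sync point: block the host until kernelA has finished. */
    err = clFinish(queue);
    if (err != CL_SUCCESS) return err;

    /* Phase 2: kernelB sees everything kernelA wrote to global memory. */
    err = clEnqueueNDRangeKernel(queue, kernelB, 1, NULL,
                                 &global_size, &local_size, 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    return clFinish(queue);
}
```

On an in-order command queue the second enqueue already waits for the first, so the clFinish() in the middle is kept here mainly to mirror the description above. Anything held in local or private memory is lost between the two kernels, so values kernelB needs must be written to global memory by kernelA.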
While global synchronization has no succinct in-kernel API call, if the compute device supports the OpenCL extension cl_khr_global_int32_base_atomics, it may be implemented using atomics.
Please see Xiao et al.'s paper that evaluates lock and lock-free approaches to global synchronization on GPUs.
http://synergy.cs.vt.edu/pubs/papers/xiao-ipdps2010-gpusync.pdf
This is also mentioned in another Stack Overflow post: OpenCL and GPU global synchronization
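To make the atomics idea concrete, here is a rough sketch of a counter-based barrier. It is not code from the paper. It is a plain C11 emulation in which POSIX threads stand in for work groups, so it can be run on a CPU. All names (global_barrier, run_barrier_demo, NUM_GROUPS, NUM_ROUNDS) are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NUM_GROUPS 4   /* stand-in for the number of work groups */
#define NUM_ROUNDS 3   /* how many times the barrier is crossed  */

static atomic_int arrived;                /* plays the role of a __global int counter */
static atomic_int phase_done[NUM_ROUNDS]; /* groups that finished each round's work   */
static atomic_int violations;             /* counts groups that got past a barrier early */

/* Counter barrier: each group adds 1, then spins until the counter reaches
 * goal. The goal grows by NUM_GROUPS per round, so the counter is never reset.
 * In a kernel the increment would be atomic_inc() on a volatile __global int
 * (atom_inc() under cl_khr_global_int32_base_atomics on OpenCL 1.0). */
static void global_barrier(int goal)
{
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < goal) {
        sched_yield();  /* CPU-only courtesy; a kernel would simply spin */
    }
}

static void *group_main(void *arg)
{
    (void)arg;
    for (int round = 0; round < NUM_ROUNDS; ++round) {
        atomic_fetch_add(&phase_done[round], 1);    /* this round's "work" */
        global_barrier((round + 1) * NUM_GROUPS);
        /* After the barrier, every group must have finished this round. */
        if (atomic_load(&phase_done[round]) != NUM_GROUPS)
            atomic_fetch_add(&violations, 1);
    }
    return NULL;
}

/* Runs the emulation and returns the number of barrier violations seen. */
int run_barrier_demo(void)
{
    pthread_t threads[NUM_GROUPS];

    atomic_store(&arrived, 0);
    atomic_store(&violations, 0);
    for (int r = 0; r < NUM_ROUNDS; ++r)
        atomic_store(&phase_done[r], 0);

    for (int i = 0; i < NUM_GROUPS; ++i) {
        int rc = pthread_create(&threads[i], NULL, group_main, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_GROUPS; ++i)
        pthread_join(threads[i], NULL);

    return atomic_load(&violations);
}
```

The big caveat on a real device: this only works if every work group is resident at the same time. If the NDRange has more work groups than the device can run concurrently, the resident groups spin forever waiting for groups that never get scheduled, and the kernel deadlocks. In a kernel you would typically have one work item per group do the atomic increment and the spin, with a local barrier() before and after it.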