I just started using OpenCL via the PyOpenCL interface from Python. I tried to write a very simple "recurrent" program in which, in every loop iteration, each kernel's result depends on the output another kernel produced in the previous iteration, but I am running into synchronization problems:
__kernel void part1(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (i < 9)
    {
        for (int t = 0; t < 2; t++)
        {
            c[i] = c[i+1] + a[i];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}
The host application is
import pyopencl as cl
from numpy import *
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
#read in the OpenCL source file as a string
f = open('recurrent.cl', 'r')
fstr = "".join(f.readlines())
#create the program
program = cl.Program(ctx, fstr).build()
mf = cl.mem_flags
#initialize client side (CPU) arrays
a = array(range(10), dtype=float32)
#create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
#execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = empty_like(a)
cl.enqueue_read_buffer(queue, dest_buf, c).wait()
print "a", a
print "c", c
The outcome is
a [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
c [ 0. 1. 5. 3. 4. 18. 13. 7. 8. 0.]
As you can see, some of the output values are correct, e.g. the third position: 5 = 3 + 2. But the second position is 1 = 0 + 1 instead of the expected 3 = 2 + 1, so the summation used results that other threads had produced at different points in time, despite the barrier being in place. I thought the barrier would make sure that all threads have reached it and have written their results to global memory?
It's probably something very simple and I would appreciate any hints and comments!
PS: I am running this on a Sandy Bridge CPU using the Intel SDK.
I think I have the answer now. The OpenCL code itself was actually completely fine. However, the barriers only kick in if all threads are in one workgroup, and that was not the case here. This is easy to check by reading out the local id with get_local_id(0), as suggested by Huseyin (see the small check below). In my case the host created a separate workgroup for every thread, instead of putting all threads into one workgroup. Performance-wise that makes sense, compare
Questions about global and local work size
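A minimal way to do that check (this diagnostic kernel is my own sketch, not part of the original code): write the local id into the output buffer and read it back; if every entry comes back as 0, each work-item sits alone in its own workgroup.

__kernel void check_ids(__global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = get_local_id(0);   // all zeros => every work-item is in its own workgroup
}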
In our case, however, we need the data to be synchronized between the threads, so all of them should be in one workgroup. To this end we need to change how part1 is enqueued.
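For reference, this is the enqueue line from the host code above:

program.part1(queue, a.shape, None, a_buf, dest_buf)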
The second argument is the global_size of the job (i.e. the total number of threads created), whereas the third is the local_size, i.e. the number of threads per workgroup. Thus, this line should read
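something like this (the exact tuples are my reconstruction, since the original snippet is not preserved; the point is that local_size equals global_size, so all 10 threads end up in one workgroup):

program.part1(queue, (10,), (10,), a_buf, dest_buf)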
This creates one workgroup containing all threads (but keep an eye on the maximum number of work-items allowed per workgroup!). Now the code still doesn't work. The last problem concerns the barriers in the OpenCL code: the last thread, with id = 9, never enters the if (i < 9) branch and therefore never reaches the barriers inside the loop, so all the other threads end up waiting for it (though I wonder why that doesn't throw an exception?). So we simply need to reduce the total number of threads (to get rid of the last one),
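for example by launching only 9 threads, still all in a single workgroup (again my reconstruction of the missing snippet):

program.part1(queue, (9,), (9,), a_buf, dest_buf)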
That works! Learned some lessons in the process...
Thanks again to Huseyin!

blue2script
Edit: user blue2script was right, it was an issue of the barrier not being reached by all threads of the workgroup. On top of that, a barrier can't synchronize between compute units / workgroups. My answer doesn't add anything and doesn't solve the problem here, so don't rely on the if handling in the kernel functions below; it's wrong, and the examples are incomplete.
Using global:
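The original code block under this heading is not preserved; the following is only a sketch of the idea, assuming all work-items run in a single workgroup and every work-item reaches every barrier (the kernel name part1_global is made up):

__kernel void part1_global(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);
    unsigned int n = get_global_size(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    for (int t = 0; t < 2; t++)
    {
        // read the neighbour's value from the previous iteration
        float next = (i + 1 < n) ? c[i + 1] : 0.0f;
        barrier(CLK_GLOBAL_MEM_FENCE);   // all reads finished before anyone writes
        c[i] = next + a[i];
        barrier(CLK_GLOBAL_MEM_FENCE);   // all writes finished before the next iteration
    }
}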
Using locals (for a workgroup size of 256 and a total work size that is a multiple of that):
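Again, the original code block is missing; this is a sketch of the idea, with synchronization done through local memory, so it only works between work-items of the same 256-wide workgroup (the kernel name part1_local is made up):

__kernel void part1_local(__global float* a, __global float* c)
{
    __local float tmp[256];

    unsigned int i  = get_global_id(0);
    unsigned int li = get_local_id(0);

    tmp[li] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int t = 0; t < 2; t++)
    {
        // the last work-item of the group has no right-hand neighbour in local memory
        float next = (li + 1 < 256) ? tmp[li + 1] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);    // all reads of the old values are done
        tmp[li] = next + a[i];
        barrier(CLK_LOCAL_MEM_FENCE);    // all writes are done before the next iteration
    }

    c[i] = tmp[li];
}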
Workgroup:
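The snippet that stood under this heading is also missing; presumably it showed the host-side enqueue that fixes the workgroup size. With PyOpenCL that would look something like this (the sizes and the kernel name are placeholders):

# 256 work-items per workgroup; the global size must be a multiple of 256
program.part1_local(queue, (1024,), (256,), a_buf, dest_buf)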
Using private:
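The original code block is missing here as well. As a sketch of the idea, each work-item first copies the inputs it needs into private variables; for this particular recurrence the two loop iterations can then be unrolled, so no barrier is needed at all (boundary handling at the end of the array is simplified, and the kernel name part1_private is made up):

__kernel void part1_private(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);
    unsigned int n = get_global_size(0);

    float own  = a[i];                           // private copy of a[i]
    float next = (i + 1 < n) ? a[i + 1] : 0.0f;  // private copy of a[i+1]

    // iteration 0: c[i] = 0      + a[i]
    // iteration 1: c[i] = a[i+1] + a[i]
    c[i] = own + next;
}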