OpenCL: no synchronization despite barrier

Posted 2019-07-16 03:34

Question:

I just started using OpenCL via the PyOpenCL interface from Python. I tried to create a very simple "recurrent" program in which, in every loop iteration, each work-item's result depends on another work-item's output from the previous iteration, but I am running into synchronization problems:

__kernel void part1(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (i < 9)
    {
        for(int t = 0; t < 2; t++){
            c[i] = c[i+1] + a[i];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}

The host application is

import pyopencl as cl
from numpy import *

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

#read in the OpenCL source file as a string
with open('recurrent.cl', 'r') as f:
    fstr = f.read()

#create the program
program = cl.Program(ctx, fstr).build()

mf = cl.mem_flags

#initialize client side (CPU) arrays
a = array(range(10), dtype=float32)

#create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

#execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = empty_like(a)
cl.enqueue_copy(queue, c, dest_buf).wait()

print("a", a)
print("c", c)

The outcome is

a [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
c [  0.   1.   5.   3.   4.  18.  13.   7.   8.   0.]

As you can see, some of the values are correct. E.g. the third position is 5 = 3 + 2, as expected, but e.g. the second position is 1 = 0 + 1, i.e. it was computed from a stale value. So the summation used the outcomes of other threads from different time points, despite the barrier being in place. I thought the barrier would make sure that all threads have reached it and have written their outcome to global memory?
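For reference, here is a tiny pure-Python sketch (my own illustration, not from the original post) of what the kernel should compute if every loop iteration were perfectly synchronized across all threads:

```python
# Sequential simulation of the kernel with a "perfect barrier":
# every iteration reads a snapshot of c[] from the previous iteration.
a = list(range(10))          # a = [0, 1, ..., 9]
c = [0.0] * 10               # c initialized to zero; c[9] is never updated
for t in range(2):
    snapshot = c[:]          # what the barrier is supposed to guarantee
    for i in range(9):       # the kernel's `if (i < 9)` guard
        c[i] = snapshot[i + 1] + a[i]
print(c)  # -> [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 8.0, 0.0]
```

Note that c[2] = 5 = 3 + 2 in this reference result, matching the "correct" entries of the actual output above.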

It's probably something very simple and I would appreciate any hints and comments!

PS: I am running this on a Sandy Bridge CPU using the Intel SDK.

Answer 1:

I think I have the answer now. The OpenCL code itself was completely fine. However, a barrier only synchronizes work-items within the same workgroup. That was not the case here, which is easy to check by reading out the local ID via get_local_id(0) (as Huseyin suggested). In my case the runtime put every thread into its own workgroup instead of putting all threads into one. Performance-wise that makes sense; compare

Questions about global and local work size

In our case, however, we need the data to be synchronized between the threads, so all of them have to be in one workgroup. To this end we need to change the kernel launch,

program.part1(queue, a.shape, None, a_buf, dest_buf)

The second argument is the global_size of the job (i.e. the total number of threads created), whereas the third seems to be the local_size, i.e. the number of threads per workgroup. Thus, this line should read

program.part1(queue, a.shape, a.shape, a_buf, dest_buf)

This creates a single workgroup containing all threads (but keep an eye on the maximum workgroup size the device allows!). Now, the code still doesn't work. The last problem concerns the barriers in the OpenCL code: the last thread, with id = 9, never reaches the barriers inside the loop, so all other threads wait for it forever. In OpenCL, a barrier must be reached by every work-item in the workgroup, otherwise the behavior is undefined (though I wonder why that doesn't throw an exception?). So we simply need to reduce the total number of threads (to get rid of the last one),

program.part1(queue, (a.shape[0]-1,), (a.shape[0]-1,), a_buf, dest_buf)

That works! Learned some lessons in the process...
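As a side illustration (a pure-Python analogy of my own, not OpenCL): Python's threading.Barrier makes the "one participant never arrives" failure mode tangible. Python at least reports it with BrokenBarrierError, whereas in OpenCL a barrier that is not reached by all work-items of the workgroup is simply undefined behavior.

```python
import threading

N = 4                                    # hypothetical workgroup size
bar = threading.Barrier(N, timeout=0.5)  # timeout so the demo fails fast

results = []

def worker(i):
    if i < N - 1:      # the last thread skips the barrier, like id 9 above
        try:
            bar.wait()
            results.append("passed")
        except threading.BrokenBarrierError:
            results.append("broken")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # every waiting thread sees a broken barrier
```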

Thanks again to Huseyin! blue2script



Answer 2:

Edit: user blue2script was right; the issue was that the barrier was not reached by all work-items of the workgroup. On top of that, a barrier cannot synchronize across compute units / workgroups.

My answer doesn't add anything and doesn't solve the problem here. So ignore the `if` guards in the kernel functions below; they are wrong (they cause exactly that barrier divergence).


Incomplete

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              c[i] = c[i+1] + a[i];//c[i+1] belongs to the neighbour thread
                                   //and there is no guarantee whether the
                                   //ith or the (i+1)st thread computes first,
                                   //so you need a copy of c[] first
              barrier(CLK_GLOBAL_MEM_FENCE);//that's why this line does not help
          }
      }
 }
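To see why the unsynchronized read of c[i+1] is a real race (again a pure-Python illustration of my own): with in-place updates and no snapshot, the result depends entirely on which neighbour happens to run first.

```python
a = list(range(10))

def run(order):
    """Update c[] in place in the given 'scheduling' order (no snapshot)."""
    c = [0.0] * 10
    for t in range(2):
        for i in order:          # pretend this is the order threads execute in
            c[i] = c[i + 1] + a[i]
    return c

left_to_right = run(range(9))
right_to_left = run(range(8, -1, -1))
print(left_to_right)
print(right_to_left)  # a different "schedule" gives a different answer
```

Here left-to-right happens to reproduce the synchronized result (each c[i+1] is still unmodified when c[i] reads it), while right-to-left feeds every thread its neighbour's already-updated value, which is exactly the nondeterminism seen in the question's output.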

Using global memory

 __kernel void part1(__global float* a, __global float* c,__global float* d)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      d[i]=c[i]; 
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[i] = c[i+1] + a[i];//it is guaranteed that no neighbour thread can
                                   //change this threads d[i] element before/after
                                   //execution
              barrier(CLK_GLOBAL_MEM_FENCE);
              c[i]=d[i];
              barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }
      barrier(CLK_GLOBAL_MEM_FENCE);

 }
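The double-buffer idea above can be sketched in plain Python (my own illustration, assuming 9 working threads over 10 elements): all writes go to d[], and only after a synchronization point is d[] copied back to c[], so no thread ever reads a half-updated neighbour.

```python
a = list(range(10))
c = [0.0] * 10
d = c[:]
for t in range(2):
    for i in range(9):   # "parallel" phase: write only to the scratch d[]
        d[i] = c[i + 1] + a[i]
    c = d[:]             # barrier + copy-back phase
print(c)  # same result regardless of the order of the inner loop
```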

Using local memory (assuming the workgroup size is 256 and the total work size is a multiple of it):

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __local float d[256];
      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);
      d[Li]=c[i]; 
      barrier(CLK_LOCAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[Li] = c[i+1] + a[i];//it is guaranteed that no neighbour thread
                                    //can change this thread's d[Li] element
                                    //before/after execution

             barrier(CLK_LOCAL_MEM_FENCE);
             c[i]=d[Li]; //guaranteed they don't interfere with each other
             barrier(CLK_LOCAL_MEM_FENCE);
          }
      }

 }


Using private memory

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __private float f1;
      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              f1 = c[i+1] + a[i];

             barrier(CLK_GLOBAL_MEM_FENCE);
             c[i]=f1; //guaranteed they don't interfere with each other
             barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }

 }