OpenCL ND-Range boundaries?

Posted 2019-08-31 04:50

Consider a kernel which performs vector addition:

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID (size_t avoids a signed/unsigned comparison with n)
    size_t id = get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

Is it really necessary to pass the size n to the function and check the boundaries?

I have seen the same version without the check on n. Which one is correct?

More generally, I wonder what happens if the size of the data to process differs from the user-defined ND-Range.

Will the remaining, out-of-bounds, data be processed or not?

  • If so, how is it processed?
  • If not, does that mean that the user has to consider boundaries when programming a kernel?

Does OpenCL specify any of that?

Thanks

2 Answers
贪生不怕死
#2 · 2019-08-31 05:28

This check is typical when the host code specifies the work-group size explicitly, because in OpenCL 1.x the global size must be a multiple of the work-group size. So if your data size is 1000 and your work-group size is 128, the global size has to be rounded up to 1024, and the 24 extra work items must be kept from accessing memory past the end of the buffers. Hence the check. In OpenCL 2.0 this requirement has been removed.
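The rounding described above can be sketched on the host side as follows. This is only a minimal illustration in plain C; the helper name `round_up_global_size` is an assumption made for this example and is not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of local_size,
 * so the result can serve as a valid OpenCL 1.x global size. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);  /* a work-group size of zero makes no sense */
    return ((n + local_size - 1) / local_size) * local_size;
}
```

With n = 1000 and a work-group size of 128 this yields 1024. The result would typically be passed as the global work size to clEnqueueNDRangeKernel, and the work items with IDs 1000 to 1023 are exactly the ones that the `id < n` check in the kernel filters out.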

干净又极端
#3 · 2019-08-31 05:34

The check against n is a good idea if you can't be certain that the kernel is launched with exactly n work items. When you know you will only ever call the kernel with n work items, the check just takes up processing cycles, kernel size, and the instruction scheduler's attention.

Nothing will happen to the extra data you pass to the kernel: elements that no work item touches are simply left unprocessed. If you never use that data, though, you wasted time copying it to the device.

I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in 'n' when this is the case.

For example:

__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID and the global size
    size_t gid = get_global_id(0);
    size_t gsize = get_global_size(0);

    // Check against n in the for-loop condition; each work item strides by gsize
    for (size_t i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}

The example accepts an arbitrary value of n, as well as any global size. Each work item processes every gsize-th element, beginning at its own global ID. The same idea works well with work groups too, sometimes outperforming the global version listed above thanks to better memory locality.
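To see why the strided loop covers every element exactly once for any combination of n and global size, here is a small host-side emulation in plain C. It is only a sketch: the function names are made up for this illustration, and the work items run one after another here instead of in parallel on a device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Emulates the loop of one work item: counts[i] is incremented for
 * every element i that work item gid would process. */
void emulate_work_item(size_t gid, size_t gsize, size_t n, int *counts)
{
    assert(gsize > 0);  /* a stride of zero would never terminate */
    for (size_t i = gid; i < n; i += gsize)
        counts[i] += 1;
}

/* Runs all gsize work items sequentially and returns 1 if every
 * element in [0, n) was visited exactly once, 0 otherwise. */
int covers_exactly_once(size_t gsize, size_t n)
{
    if (n == 0)
        return 1;  /* nothing to cover */

    int *counts = calloc(n, sizeof *counts);
    if (counts == NULL)
        return 0;  /* allocation failed; report as not covered */

    for (size_t gid = 0; gid < gsize; ++gid)
        emulate_work_item(gid, gsize, n, counts);

    int ok = 1;
    for (size_t i = 0; i < n; ++i) {
        if (counts[i] != 1) {
            ok = 0;
            break;
        }
    }
    free(counts);
    return ok;
}
```

Calling covers_exactly_once(256, 1000) exercises the case where there are fewer work items than elements, and covers_exactly_once(4096, 1000) the case where there are more; in the latter, work items with gid >= n simply never enter the loop.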

If you know the value of n to be constant, it is often better to hard-code it (as a #define at the top of the kernel source). This lets the compiler optimize for that specific value and eliminates the extra parameter. Examples of such kernels include DFT/FFT processing, bitonic sorting at a given stage, and image processing with constant dimensions.
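A variation on the #define approach, shown here as an assumption and not as something the answer itself describes, is to leave the macro out of the kernel source and supply it when the program is built, since the options string of clBuildProgram accepts -D definitions. The sketch below only formats that string in plain C; the macro name N and the helper name `make_build_options` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "-D N=<n>" into buf.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small or formatting failed. */
int make_build_options(char *buf, size_t buf_size, unsigned int n)
{
    assert(buf != NULL);
    int written = snprintf(buf, buf_size, "-D N=%u", n);
    if (written < 0 || (size_t)written >= buf_size)
        return -1;
    return written;
}
```

The resulting string would be passed as the options argument of clBuildProgram, and the kernel would then use N in place of the n parameter. The trade-off is that the program has to be rebuilt whenever the size changes.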
