Optimal workgroup size for sum reduction in OpenCL


I am using the following kernel for sum reduction.

__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
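For concreteness, the host-side scheme I have in mind is a multi-pass loop that re-launches the same kernel on the previous pass's subresults until only a handful remain. This is just a sketch (buffer creation, error checking and the final read-back are trimmed, and input_buf, partial_buf and count are placeholders for my own handles):

// Each pass turns 'count' elements into count / (2 * local_size) partial sums,
// so 'count' should stay a multiple of 2 * local_size per pass (pad with zeros otherwise).
size_t local_size = 128;              // the value I don't know how to choose
cl_mem in_buf  = input_buf;           // holds 'count' floats
cl_mem out_buf = partial_buf;         // big enough for count / (2 * local_size) floats

while (count >= 2 * local_size) {
    size_t global_size = count / 2;   // each work-item pre-sums two input elements
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL);  // __local sdata
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    count = global_size / local_size;                      // one partial sum per work-group
    cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp;  // ping-pong (clobbers the old input)
}
// read the remaining 'count' partial sums back from in_buf and add them on the host

With 1048576 elements and a work-group size of 128 that is two kernel launches and then 16 values left to add on the host, so the work-group size also determines how many passes I end up with.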

I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?

Tags: opencl gpu gpgpu
1 Answer

This question is a possible duplicate of one I answered a while back: What is the algorithm to determine optimal work group size and number of workgroup.

Experimentation will be the best way to know for sure for any given device.

Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can experiment to find the best value.

For each device:

1) query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

2) loop over a few multiples and run the kernel with that group size. save the execution time for each test.

3) when you think you have an optimal value, hard code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.

//define your own context, kernel, queue here

cl_int err;
size_t global_size; //set this somewhere to match your test data size (must be a multiple of each group size you test)
size_t preferred_size;
size_t max_group_size;

//preferred work-group size multiple for this kernel on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
//largest work-group size this kernel can be launched with on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;

//your vars for hi-res timer go here

for (size_t i = preferred_size; i <= max_group_size; i += preferred_size){
    //reset timer
    test_size = i;
    //the __local sdata buffer has to match the group size being tested
    err = clSetKernelArg(kernel, 2, test_size * sizeof(float), NULL);
    //check err
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if(err != CL_SUCCESS){
        fail("Unable to enqueue kernel");  //implement your own fail function somewhere..
    }else{
        clFinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}
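
If you would rather not maintain your own hi-res timer, the device-side timestamps from event profiling work just as well for step 2; a sketch of what the loop body could use instead (it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

cl_event evt;
cl_ulong t_start, t_end;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);
//device-side timestamps, in nanoseconds
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &t_end, NULL);
printf("group size %zu: %llu ns\n", test_size, (unsigned long long)(t_end - t_start));
clReleaseEvent(evt);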

The device-specific kernel can look like this, except the first line should have your optimal value substituted:

#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;

    // each work-item pre-sums two adjacent input elements into local memory
    __local float sdata[LOCAL_SIZE];
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction in local memory; start at half the group size so every element is included
    for(unsigned int s = LOCAL_SIZE >> 1; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}
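
Launching the hard-coded kernel is also a little simpler on the host, since there is no longer a __local argument to set; roughly like this (n_elements, input_buffer and output_buffer are placeholders for your own handles):

size_t local_size  = 32;                 //must match the LOCAL_SIZE the kernel was compiled with
size_t global_size = n_elements / 2;     //each work-item pre-sums two elements; must be a multiple of local_size

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buffer);
//no third argument: sdata is now allocated statically inside the kernel
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
//output_buffer now holds global_size / local_size partial sums, one per work-group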