Optimal workgroup size for sum reduction in OpenCL


I am using the following kernel for sum reduction.

__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);
    // do reduction in shared mem
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
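For concreteness, the host-side scheme I have in mind is a multi-pass loop that re-launches the same kernel on the previous pass's subresults until only a handful remain. This is just a sketch (buffer creation, error checking and the final read-back are trimmed, and input_buf, partial_buf and count are placeholders for my own handles):

// Each pass turns 'count' elements into count / (2 * local_size) partial sums,
// so 'count' should stay a multiple of 2 * local_size per pass (pad with zeros otherwise).
size_t local_size = 128;              // the value I don't know how to choose
cl_mem in_buf  = input_buf;           // holds 'count' floats
cl_mem out_buf = partial_buf;         // big enough for count / (2 * local_size) floats

while (count >= 2 * local_size) {
    size_t global_size = count / 2;   // each work-item pre-sums two input elements
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL);  // __local sdata
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    count = global_size / local_size;                      // one partial sum per work-group
    cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp;  // ping-pong (clobbers the old input)
}
// read the remaining 'count' partial sums back from in_buf and add them on the host

With 1048576 elements and a work-group size of 128 that is two kernel launches and then 16 values left to add on the host, so the work-group size also determines how many passes I end up with.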

I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?

Tags: opencl gpu gpgpu
1 Answer

This question is a possible duplicate of one I answered a while back: What is the algorithm to determine optimal work group size and number of workgroup.

Experimentation will be the best way to know for sure for any given device.

Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can experiment to find the best value.

For each device:

1) query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

2) loop over a few multiples and run the kernel with that group size. save the execution time for each test.

3) when you think you have an optimal value, hard code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.

//define your own context, kernel, queue here

cl_int err;
size_t global_size; //set this somewhere to match your test data size (must be a multiple of each group size you test)
size_t preferred_size;
size_t max_group_size;

//preferred work-group size multiple for this kernel on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
//largest work-group size this kernel can be launched with on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;

//your vars for hi-res timer go here

for (size_t i = preferred_size; i <= max_group_size; i += preferred_size){
    //reset timer
    test_size = i;
    //the __local sdata buffer has to match the group size being tested
    err = clSetKernelArg(kernel, 2, test_size * sizeof(float), NULL);
    //check err
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if(err != CL_SUCCESS){
        fail("Unable to enqueue kernel");  //implement your own fail function somewhere..
    }else{
        clFinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}
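
If you would rather not maintain your own hi-res timer, the device-side timestamps from event profiling work just as well for step 2; a sketch of what the loop body could use instead (it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

cl_event evt;
cl_ulong t_start, t_end;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);
//device-side timestamps, in nanoseconds
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &t_end, NULL);
printf("group size %zu: %llu ns\n", test_size, (unsigned long long)(t_end - t_start));
clReleaseEvent(evt);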

The device-specific kernel can look like this, except the first line should have your optimal value substituted:

#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;

    // each work-item pre-sums two adjacent input elements into local memory
    __local float sdata[LOCAL_SIZE];
    sdata[tid] = input[stride] + input[stride + 1];

    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction in local memory; start at half the group size so every element is included
    for(unsigned int s = LOCAL_SIZE >> 1; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}
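
Launching the hard-coded kernel is also a little simpler on the host, since there is no longer a __local argument to set; roughly like this (n_elements, input_buffer and output_buffer are placeholders for your own handles):

size_t local_size  = 32;                 //must match the LOCAL_SIZE the kernel was compiled with
size_t global_size = n_elements / 2;     //each work-item pre-sums two elements; must be a multiple of local_size

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buffer);
//no third argument: sdata is now allocated statically inside the kernel
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
//output_buffer now holds global_size / local_size partial sums, one per work-group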