I am using the following kernel for sum reduciton.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
// load shared mem
unsigned int tid = get_local_id(0);
unsigned int bid = get_group_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
unsigned int stride = gid * 2;
sdata[tid] = input[stride] + input[stride + 1];
barrier(CLK_LOCAL_MEM_FENCE);
// do reduction in shared mem
for(unsigned int s = localSize >> 2; s > 0; s >>= 1)
{
if(tid < s)
{
sdata[tid] += sdata[tid + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// write result for this block to global mem
if(tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?
This question is a possible duplicate of one I answered a while back: What is the algorithm to determine optimal work group size and number of workgroup.
Experimentation will be the best way to know for sure for any given device.
Update: I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.
For each device:
1) query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
2) loop over a few multiples and run the kernel with that group size. save the execution time for each test.
3) when you think you have an optimal value, hard code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.
The device-specific kernel can look like this, except the first line should have your optimal value substituted: