Reduction of large arrays can be done by calling __reduce(); multiple times.
The following code however uses only two stages and is documented here:
However I am unable to understand the algorithm for this two stage reduction. can some give a simpler explanation?
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
It can also be well implemented using CUDA.
You create
N
threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduceslength
values into N values.Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (
barrier(CLK_LOCAL_MEM_FENCE)
.) Then you have standard reduction in shared/local memory. When you're done the thread with local id 0 saves its result in the output array.All in all, you have a reduction from
length
toN/get_local_size(0)
values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.Which parts do you not understand?