OpenCL/CUDA: Two-stage reduction Algorithm

Reduction of large arrays can be done by calling __reduce(); multiple times.

The following code however uses only two stages and is documented here:

However I am unable to understand the algorithm for this two stage reduction. can some give a simpler explanation?

__kernel
void reduce(__global float* buffer,
        __local float* scratch,
        __const int length,
        __global float* result) {

    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Perform parallel reduction
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}

It can also be well implemented using CUDA.

You create N threads. The first thread looks at values at positions 0, N, 2*N, ... The second thread looks at values 1, N+1, 2*N+1, ... That's the first loop. It reduces length values into N values.

Then each thread saves its smallest value in shared/local memory. Then you have a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE).) Then you have standard reduction in shared/local memory. When you're done the thread with local id 0 saves its result in the output array.

All in all, you have a reduction from length to N/get_local_size(0) values. You'd need to do one last pass after this code is done executing. However, this gets most of the job done, for example, you might have length ~ 10^8, N = 2^16, get_local_size(0) = 256 = 2^8, and this code reduces 10^8 elements into 256 elements.

Which parts do you not understand?