Let's say I'm trying to do a simple reduction over an array of size n, kept within one work-group... say, adding all the elements. The general strategy seems to be to spawn a number of work-items on each GPU, which reduce the elements in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps.
    for (int offset = get_local_size(0) / 2;
         offset > 0;
         offset >>= 1) {
        if (local_index < offset) {
            scratch[local_index] += scratch[local_index + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
So 32 items get added in parallel, and then that warp waits at the barrier. Another 32 go and wait at the barrier, and another 32 go and wait at the barrier, until all the threads have done the n/2 additions needed for the first level of the tree, and then we go around the loop. Cool.
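To make the barrier-separated passes concrete, here is a plain-C, single-threaded simulation of that tree reduction (the inner loop stands in for all the work-items running concurrently between two barriers; `LOCAL_SIZE` and `tree_reduce` are illustrative names, not part of the original kernel):

```c
#include <assert.h>

#define LOCAL_SIZE 256  /* assumed power-of-two work-group size */

/* Simulate the tree reduction: each outer iteration corresponds to one
   barrier-separated step of the kernel's for-loop. */
float tree_reduce(float *scratch) {
    for (int offset = LOCAL_SIZE / 2; offset > 0; offset >>= 1) {
        /* in the kernel, all work-items with local_index < offset
           execute this body in parallel */
        for (int local_index = 0; local_index < offset; local_index++) {
            scratch[local_index] += scratch[local_index + offset];
        }
        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here */
    }
    return scratch[0];
}
```

With LOCAL_SIZE = 256 there are log2(256) = 8 barrier-separated steps; the first step performs n/2 = 128 additions, the last performs one.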
This seems good, but perhaps complicated? I understand instruction-level parallelism is a big deal, so why not spawn ONE thread and do something like
    while (i + 16 < array_size) {
        scratch[0] += scratch[i + 16];
        scratch[1] += scratch[i + 17];
        scratch[2] += scratch[i + 18];
        ...
        i += 16;
    }
...
    float accum = 0;
    accum += scratch[0];
    accum += scratch[1];
    accum += scratch[2];
    accum += scratch[3];
such that all the adds happen within a single warp. Now you have ONE thread going, keeping the GPU as busy as you like.
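A CPU sketch of that single-thread idea, assuming the 16 accumulator chains are independent enough to overlap in flight (`N` and `ilp_reduce` are names I've made up for illustration):

```c
#include <assert.h>

#define N 256  /* assumed array size, a multiple of 16 */

/* One "thread" keeps 16 independent running sums so consecutive adds
   have no data dependency on each other, then folds them at the end. */
float ilp_reduce(float *scratch) {
    for (int i = 16; i < N; i += 16) {
        for (int lane = 0; lane < 16; lane++) {
            scratch[lane] += scratch[i + lane];  /* 16 independent chains */
        }
    }
    float accum = 0.0f;
    for (int lane = 0; lane < 16; lane++) {
        accum += scratch[lane];  /* final sequential fold */
    }
    return accum;
}
```

Note this is a sketch of the data flow only; whether a GPU actually overlaps those adds depends on its scheduler, and a single work-item cannot use more than one lane of a warp.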
Now assume instruction-level parallelism isn't really a thing. What about the following, with the work size set to 32 (the warp size)?
    for (int i = 32; i < array_size; i += 32) {
        scratch[get_local_id(0)] += scratch[i + get_local_id(0)];
    }
and then add the first 32 items together. I imagine that those 32 threads would keep firing again and again.
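The same idea as a CPU simulation, with the inner loop standing in for the 32 lanes of one warp executing in lockstep (`N`, `WARP`, and `warp_reduce` are illustrative names):

```c
#include <assert.h>

#define N 256   /* assumed array size, a multiple of 32 */
#define WARP 32 /* assumed warp size */

/* Simulate 32 work-items striding through the array: lane lid
   accumulates scratch[lid] += scratch[32+lid], scratch[64+lid], ...
   then the 32 partial sums are folded (sequentially here; on the GPU
   this last step could be the tree reduction's final five levels). */
float warp_reduce(float *scratch) {
    for (int i = WARP; i < N; i += WARP) {
        for (int lid = 0; lid < WARP; lid++) {
            scratch[lid] += scratch[i + lid];  /* one warp-wide step */
        }
    }
    float accum = 0.0f;
    for (int lid = 0; lid < WARP; lid++) {
        accum += scratch[lid];
    }
    return accum;
}
```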
If you're not averse to giving up the generality of OpenCL, why bother reducing in a tree when you KNOW how many adds will fire per cycle?