Let's say I'm trying to do a simple reduction over an array of size n, kept within one work-group... say, adding all the elements. The general strategy seems to be to spawn a number of work-items on each GPU, which reduce the elements in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps.
    for (int offset = get_local_size(0) / 2;
         offset > 0;
         offset >>= 1) {
        if (local_index < offset) {
            scratch[local_index] += scratch[local_index + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
So 32 items get added in parallel, and then that warp waits at the barrier. Another 32 go and wait at the barrier, and another 32 go and wait at the barrier, until all the threads have done the n/2 additions needed for the first level of the tree, and then we go around the loop. Cool.
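To make the barrier-separated passes concrete, here is a plain-C, single-threaded simulation of that tree reduction (the inner loop stands in for all the work-items running concurrently between two barriers; `LOCAL_SIZE` and `tree_reduce` are illustrative names, not part of the original kernel):

```c
#include <assert.h>

#define LOCAL_SIZE 256  /* assumed power-of-two work-group size */

/* Simulate the tree reduction: each outer iteration corresponds to one
   barrier-separated step of the kernel's for-loop. */
float tree_reduce(float *scratch) {
    for (int offset = LOCAL_SIZE / 2; offset > 0; offset >>= 1) {
        /* in the kernel, all work-items with local_index < offset
           execute this body in parallel */
        for (int local_index = 0; local_index < offset; local_index++) {
            scratch[local_index] += scratch[local_index + offset];
        }
        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here */
    }
    return scratch[0];
}
```

With LOCAL_SIZE = 256 there are log2(256) = 8 barrier-separated steps; the first step performs n/2 = 128 additions, the last performs one.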
This seems good, but perhaps complicated? I understand instruction-level parallelism is a big deal, so why not spawn ONE thread and do something like
    while (i + 16 < array_size) {
        scratch[0] += scratch[i + 16];
        scratch[1] += scratch[i + 17];
        scratch[2] += scratch[i + 18];
        ...
        i += 16;
    }
...
    float accum = 0;
    accum += scratch[0];
    accum += scratch[1];
    accum += scratch[2];
    accum += scratch[3];
such that all the adds happen within a single warp. Now you have ONE thread going, keeping the GPU as busy as you like.
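A CPU sketch of that single-thread idea, assuming the 16 accumulator chains are independent enough to overlap in flight (`N` and `ilp_reduce` are names I've made up for illustration):

```c
#include <assert.h>

#define N 256  /* assumed array size, a multiple of 16 */

/* One "thread" keeps 16 independent running sums so consecutive adds
   have no data dependency on each other, then folds them at the end. */
float ilp_reduce(float *scratch) {
    for (int i = 16; i < N; i += 16) {
        for (int lane = 0; lane < 16; lane++) {
            scratch[lane] += scratch[i + lane];  /* 16 independent chains */
        }
    }
    float accum = 0.0f;
    for (int lane = 0; lane < 16; lane++) {
        accum += scratch[lane];  /* final sequential fold */
    }
    return accum;
}
```

Note this is a sketch of the data flow only; whether a GPU actually overlaps those adds depends on its scheduler, and a single work-item cannot use more than one lane of a warp.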
Now assume instruction-level parallelism isn't really a thing. What about the following, with the work size set to 32 (the warp size)?
    for (int i = 32; i < array_size; i += 32) {
        scratch[get_local_id(0)] += scratch[i + get_local_id(0)];
    }
and then add the first 32 items together. I imagine that those 32 threads would keep firing again and again.
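The same idea as a CPU simulation, with the inner loop standing in for the 32 lanes of one warp executing in lockstep (`N`, `WARP`, and `warp_reduce` are illustrative names):

```c
#include <assert.h>

#define N 256   /* assumed array size, a multiple of 32 */
#define WARP 32 /* assumed warp size */

/* Simulate 32 work-items striding through the array: lane lid
   accumulates scratch[lid] += scratch[32+lid], scratch[64+lid], ...
   then the 32 partial sums are folded (sequentially here; on the GPU
   this last step could be the tree reduction's final five levels). */
float warp_reduce(float *scratch) {
    for (int i = WARP; i < N; i += WARP) {
        for (int lid = 0; lid < WARP; lid++) {
            scratch[lid] += scratch[i + lid];  /* one warp-wide step */
        }
    }
    float accum = 0.0f;
    for (int lid = 0; lid < WARP; lid++) {
        accum += scratch[lid];
    }
    return accum;
}
```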
If you're not averse to giving up the generality of OpenCL, why bother reducing in a tree when you KNOW how many adds will fire per cycle?