CUDA: Avoiding serial execution on branch divergence

Published 2019-04-02 04:52

Question:

Assume a CUDA kernel executed by a single warp (for simplicity) reaches an if-else statement, where 20 of the threads within the warp satisfy the condition and the remaining 32 - 20 = 12 threads do not:

if (condition) {
    statement1;     // executed by 20 threads
} else {
    statement2;     // executed by 12 threads
}

According to the CUDA C Programming Guide:

A warp executes one common instruction at a time [...] if threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.

And therefore the two statements would be executed sequentially in separate cycles.

The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and therefore has the ability to issue 2 independent instructions per warp at each cycle.

My question is: in this setting with only two branches, why could statement1 and statement2 not be issued by the two instruction dispatch units for simultaneous execution by the 32 threads within the warp, i.e. 20 threads execute statement1 while the 12 others simultaneously execute statement2? If the instruction scheduler is not the reason why a warp executes a single common instruction at a time, what is? Is it the instruction set that only provides 32-thread wide instructions? Or a hardware-related reason?

Answer 1:

Each and every kernel instruction is always issued for all of the threads within a warp. It is therefore logically impossible to carry out different instructions on different threads within the same warp at the same time; that would go against the SIMT execution model upon which GPUs are built. To your question:

The Kepler architecture contains 2 instruction dispatch units per warp scheduler, and therefore has the ability to issue 2 independent instructions per warp at each cycle.

...

why could statement1 and statement2 not be issued by the two instruction dispatch units for simultaneous execution by the 32 threads within the warp, i.e. 20 threads execute statement1 while the 12 others simultaneously execute statement2?

I am not sure whether you realize this, but if statement1 and statement2 are computationally independent, then they can be executed in one cycle:

  1. The instruction from statement1 will be carried out on all threads,
  2. The instruction from statement2 will be carried out within the same cycle as well, since it is dispatched by the second dispatch unit.

That's how branch divergence works in GPUs in general; some further reading can be found e.g. here. As a result, I believe you are already getting what you ask for for free: both statements are executed within the same cycle (or can be).

EDIT:

As talonmies stated in the comment, it may be worth mentioning conditional execution, as it sometimes helps to prevent the penalty from branch divergence. More on this topic can be found e.g. in this SO thread, quoting:

For simpler conditionals, NVIDIA GPUs support conditional evaluation at the ALU, which causes no divergence, and for conditionals where the whole warp follows the same path, there is also obviously no penalty.



Tags: c++ cuda simd