For simple intra-warp thread divergence in CUDA, what I know is that SM selects a re-convergence point (PC address), and executes instructions in both/multiple paths while disabling effects of execution for the threads that haven't taken the path.
For example, in below piece of code:
if( threadIdx.x < 16 ) {
A:
// do something.
} else {
B:
// do something else.
}
C:
// rest of code.
C
is the re-convergence point, warp scheduler schedules instructions at both A
and B
, while disabling instructions at A
for upper half-warp and disabling instructions at B
for lower half-warp. When it reaches C
, instructions will be enabled for all the threads inside the warp.
My question is will SM be able to handle the code including the goto
instruction properly like above? Or there's no guarantee that chosen re-convergence point is the optimum?
For instance, if I have below control flow in my CUDA code implemented using goto
A:
// some code here.
B:
// some code here too.
if( threadIdx.x < 16 ) {
C:
// do something.
goto A;
}
// do something else.
goto B;
will SM be smart enough to decide B
as the re-convergence point for intra-warp divergence caused by if
instruction?