The impact of goto instruction at intra-warp diver

2019-05-18 20:28发布

问题:

For simple intra-warp thread divergence in CUDA, what I know is that SM selects a re-convergence point (PC address), and executes instructions in both/multiple paths while disabling effects of execution for the threads that haven't taken the path.
For example, in below piece of code:

if( threadIdx.x < 16 ) {
    A:
    // do something.
} else {
    B:
    // do something else.
}
C:
// rest of code.

C is the re-convergence point, warp scheduler schedules instructions at both A and B, while disabling instructions at A for upper half-warp and disabling instructions at B for lower half-warp. When it reaches C, instructions will be enabled for all the threads inside the warp.

My question is will SM be able to handle the code including the goto instruction properly like above? Or there's no guarantee that chosen re-convergence point is the optimum?
For instance, if I have below control flow in my CUDA code implemented using goto

A:
// some code here.
B:
// some code here too.
if( threadIdx.x < 16 ) {
    C:
    // do something.
    goto A;
}
// do something else.
goto B;

will SM be smart enough to decide B as the re-convergence point for intra-warp divergence caused by if instruction?

回答1:

In general, goto is unstructured control flow that interferes with many compiler optimizations, regardless of platform. The CUDA C compiler should handle code with goto in a functionally correct way, but performance may be suboptimal.

Part of that suboptimal performance may be the compiler's placement of convergence points. You can examine the convergence points in the generated machine code (SASS) with cuobjdump --dump-sass. An SSY instruction records a convergence points, and a .S suffix on an instruction indicates that control is transferred to the last recorded convergence point.