When should I use CUDA's built-in warpSize, as opposed to my own constant?


Question:

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
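For example, something along these lines is rejected (a minimal sketch with a hypothetical kernel name; the exact diagnostic varies with the CUDA version):

__global__ void example_kernel()
{
    // The line below is what triggers the complaint: warpSize is not a
    // compile-time constant expression, so it cannot size a static array.
    // __shared__ int per_warp_counts[warpSize];

    __shared__ int per_warp_counts[32];   // hard-coding 32 compiles fine
    (void)per_warp_counts;
}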

So, at least for that purpose you are motivated to have something like (edit):

enum : unsigned int { warp_size  = 32 };

somewhere in your headers. But now, which should I prefer, and when: warpSize or warp_size?

Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.

Answer 1:

Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language; on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).

The day we get a different warp size (e.g. 64), many things will change:

  • warpSize will obviously have to be adjusted
  • Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot(), and while int does not need to be 32-bit, it most commonly is!
  • Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:

    for (int i = 0; i < log2(warpSize); ++i) ...
    

    that would be overly complex in something that is usually a time-critical piece of code.

  • warpIdx and laneIdx computation from threadIdx would need to be adjusted. Currently, the most typical code I see for it is:

    warpIdx = threadIdx.x/32;
    laneIdx = threadIdx.x%32;
    

    which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize, this suddenly becomes quite an expensive operation! (See the sketch right after this list.)
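To illustrate the last two points, here is a minimal sketch, assuming the warp_size constant from the question (and using the current __shfl_down_sync intrinsic, which postdates the original answers): the iteration count is known at compile time, and the division/modulo reduce to a shift and a mask.

enum : unsigned int { warp_size = 32 };

// Warp-level sum reduction: the loop bound is a compile-time constant,
// so the compiler can fully unroll the loop.
__device__ int warp_reduce_sum(int value)
{
    #pragma unroll
    for (unsigned int offset = warp_size / 2; offset > 0; offset /= 2)
        value += __shfl_down_sync(0xffffffffu, value, offset);
    return value;
}

// Lane/warp index computation: with a constant divisor the compiler
// emits a right-shift and a mask instead of integer division.
__device__ void lane_and_warp(unsigned int &warpIdx, unsigned int &laneIdx)
{
    warpIdx = threadIdx.x / warp_size;
    laneIdx = threadIdx.x % warp_size;
}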

At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time constant. Also, if the amount of shared memory depends on warpSize, this forces you to use dynamically allocated shared memory (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays: it forces you to do the pointer arithmetic yourself and to manually compute the sum of all memory usage.
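For instance, a minimal sketch (with hypothetical kernel and array names) of the pointer arithmetic that dynamic shared memory imposes when the sizes depend on warpSize:

extern __shared__ unsigned char smem[];   // one dynamically sized block for everything

__global__ void two_array_kernel()
{
    // Carve the single dynamic allocation into the individual arrays by hand.
    int   *warp_sums = reinterpret_cast<int *>(smem);
    float *warp_maxs = reinterpret_cast<float *>(smem + warpSize * sizeof(int));
    // ... kernel body elided ...
    (void)warp_sums; (void)warp_maxs;
}

// Host side: the total size has to be summed up manually and passed at launch, e.g.
//   two_array_kernel<<<blocks, threads, warp_sz * (sizeof(int) + sizeof(float))>>>();
// where warp_sz comes from cudaDeviceProp::warpSize.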

Making warp_size a template parameter is a partial solution, but it adds a layer of syntactic complexity at every function call:

deviceFunction<warp_size>(params)

This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.


My suggestion would be to have a single header that controls all the model-specific constants, e.g.

#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32; 
#endif

Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
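As a usage sketch (hypothetical header and kernel names, assuming the warp_size constant above):

#include "arch_constants.h"   // hypothetical header holding the warp_size constant above

__global__ void scan_kernel()
{
    __shared__ int partials[warp_size];              // statically sized, no template parameter
    unsigned int laneIdx = threadIdx.x % warp_size;  // compiles to a simple mask
    partials[laneIdx] = 0;
    // ... kernel body elided ...
}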



Answer 2:

Let's get a couple of points straight. The warp size isn't a compile-time constant and shouldn't be treated as one. It is an architecture-specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, but that changed at least six years ago, if my memory doesn't fail me.

The value is available:

  1. In CUDA C via warpSize, where it is not a compile-time constant (the compiler emits the PTX WARP_SZ variable in such cases).
  2. In PTX assembler via WARP_SZ, where it is a runtime immediate constant.
  3. From the runtime API as a device property.

Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
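A minimal sketch of the two approaches just described (hypothetical kernel and launcher names, not the answer's own code): read the warp size as a device property on the host, and dispatch to the matching template instantiation when a statically sized array is needed.

#include <cuda_runtime.h>

// Statically sized shared array: the warp size enters as a template parameter.
template <int WarpSize>
__global__ void reduce_kernel(const float *in, float *out)
{
    __shared__ float partials[WarpSize];
    // ... kernel body elided ...
    (void)partials; (void)in; (void)out;
}

void launch_reduce(const float *in, float *out, int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);      // warp size read as a device property

    switch (prop.warpSize) {                     // pick the matching instantiation
        case 32: reduce_kernel<32><<<1, 256>>>(in, out); break;
        default: break;                          // no instantiation built for this warp size
    }
}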