What is the mechanism of warps and banks?

Posted 2019-09-16 02:58

Question:

I'm a rookie learning CUDA parallel programming, and I'm confused about how the device's global memory is accessed. It's about the warp model and coalescing.

There are some points:

  1. It's said that the threads in one block are split into warps, and that each warp contains at most 32 threads, meaning all threads of the same warp execute simultaneously on the same processor. So what is the point of a half-warp?

  2. The shared memory of one block is said to be split into 16 banks. To avoid bank conflicts, multiple threads may READ from one bank at the same time, but must not WRITE to the same bank. Is this a correct interpretation?

Thanks in advance!

Answer 1:

  1. The term "half-warp" principally applied to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor, a hardware block inside the GPU) that had fewer than 32 thread processors. The definition of a warp was still the same, but the actual hardware execution took place one "half-warp" at a time. The granular details are actually more complicated than this, but suffice it to say that this execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp that hit a memory transaction would thus generate a total of 2 requests for that transaction.

    Fermi and newer GPUs have at least 32 thread processors per SM, so a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level rather than per half-warp. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces.

    My view is that, especially for a beginner, it's not necessary to have a detailed understanding of the half-warp. It's generally sufficient to understand that it refers to a group of 16 threads executing together, and that it has implications for how memory requests are issued (see the coalescing sketch after this answer).

  2. Shared memory on Fermi-class GPUs, for example, is broken into 32 banks; on previous GPUs it was broken into 16 banks. Bank conflicts occur any time an individual bank is accessed by more than one thread in the same memory request (i.e. originating from the same code instruction). The basic strategies for avoiding bank conflicts are very similar to the strategies for coalescing requests to global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict, but in general a bank conflict occurs when multiple threads of the same request access different addresses that fall in the same bank (see the bank-conflict sketch after this answer). For a further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.
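
As a concrete illustration of point 1, here is a minimal coalescing sketch. It is not taken from the answer above: the kernel names, the problem size, and the stride of 32 are illustrative choices of my own. The first kernel has each thread of a warp read a consecutive 4-byte word, so the warp's request maps to a handful of 128-byte transactions; the second strides by 32 elements, scattering the same warp's loads across many 128-byte segments.

    // A minimal sketch, not from the answer: kernel names, size, and stride
    // are illustrative choices.
    #include <cuda_runtime.h>

    // Coalesced: thread k of a warp reads element base+k, so the warp's 32
    // 4-byte loads fall into a small number of 128-byte transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: thread k of a warp reads element base + 32*k, so the warp's
    // loads are spread across many 128-byte segments and the single request
    // is broken into many transactions.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc((void **)&in,  n * sizeof(float));
        cudaMalloc((void **)&out, n * sizeof(float));

        const int threads = 256;
        copyCoalesced<<<(n + threads - 1) / threads, threads>>>(in, out, n);
        copyStrided<<<(n / 32 + threads - 1) / threads, threads>>>(in, out, n, 32);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Profiling the two kernels with a tool such as Nsight Compute should show the strided version issuing far more global-memory transactions per request than the coalesced one.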
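
And for point 2, a minimal bank-conflict sketch, assuming a Fermi-or-newer part with 32 banks of 4-byte words. Again the kernel names and the single-tile transpose setup are my own illustration, not code from the answer or the webinar. Reading a 32x32 tile column-wise makes every thread of a warp hit the same bank; padding each row by one element is the usual fix.

    // A sketch of my own, assuming 32 banks of 4-byte words (Fermi or newer);
    // kernel names and the single-tile setup are illustrative.
    #include <cuda_runtime.h>

    #define TILE 32

    // Transposed read tile[x][y]: the 32 threads of a warp (consecutive x,
    // fixed y) read addresses 32 floats apart, which all map to the same
    // bank: a 32-way bank conflict.
    __global__ void transposeConflict(const float *in, float *out)
    {
        __shared__ float tile[TILE][TILE];
        int x = threadIdx.x, y = threadIdx.y;

        tile[y][x] = in[y * TILE + x];   // conflict-free: a warp writes one row
        __syncthreads();
        out[y * TILE + x] = tile[x][y];  // 32-way conflict: a warp reads one column
    }

    // Padding each row by one float changes the column stride to 33, so the
    // 32 column elements read by a warp land in 32 different banks.
    __global__ void transposePadded(const float *in, float *out)
    {
        __shared__ float tile[TILE][TILE + 1];
        int x = threadIdx.x, y = threadIdx.y;

        tile[y][x] = in[y * TILE + x];
        __syncthreads();
        out[y * TILE + x] = tile[x][y];
    }

    int main()
    {
        float *in, *out;
        cudaMalloc((void **)&in,  TILE * TILE * sizeof(float));
        cudaMalloc((void **)&out, TILE * TILE * sizeof(float));

        dim3 block(TILE, TILE);          // one 32x32 block, i.e. 32 warps
        transposeConflict<<<1, block>>>(in, out);
        transposePadded<<<1, block>>>(in, out);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }

The padded tile trades 32 extra floats of shared memory per tile for conflict-free column reads; the same padding idea appears in NVIDIA's shared-memory transpose examples.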