How does a GPU group threads into warps/wavefronts

Posted 2019-07-20 19:46

Question:

My understanding is that a warp is a group of threads defined at runtime by the scheduler. One performance-critical aspect of CUDA is the divergence of threads within a warp. Is there a way to make a good guess about how the hardware will construct warps within a thread block?

For instance, if I launch a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?

Knowing this, one could minimize the divergence of threads within a given warp.
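
To make the divergence concern concrete, here is a minimal sketch (the kernel names and the even/odd split are my own illustration, not from the question) of a branch that splits every warp versus one that branches along warp boundaries:

    __global__ void divergent_branch(int *out)
    {
        int tid = threadIdx.x;
        // Branching on tid % 2 splits every warp: even and odd lanes take
        // different paths, which the hardware serializes.
        if (tid % 2 == 0)
            out[tid] = tid * 2;
        else
            out[tid] = tid * 3;
    }

    __global__ void uniform_branch(int *out)
    {
        int tid = threadIdx.x;
        // Branching on tid / 32 keeps all 32 threads of a warp on the same
        // path, so no divergence occurs (assuming a warp size of 32).
        if ((tid / 32) % 2 == 0)
            out[tid] = tid * 2;
        else
            out[tid] = tid * 3;
    }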

Answer 1:

The thread arrangement inside a warp is implementation dependent, but so far I have always observed the same behavior:

A warp is composed of 32 threads, but the warp scheduler issues 1 instruction for half a warp at a time (16 threads).

  • If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue 1 instruction for threadIdx.x = (0..15), (16..31), ... etc.

  • If you use 2D blocks (the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will issue instructions in this fashion:

threadIdx.y = 0, threadIdx.x = (0..15), (16..31), ... etc.

So threads with consecutive threadIdx.x components will execute the same instruction in groups of 16.
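
Assuming the layout described above (consecutive threadIdx.x values land in the same warp, with x varying fastest), a small device-side sketch like the following recovers each thread's warp and lane index within its block; the kernel name and block shape are my own:

    #include <cstdio>

    __global__ void show_warp_layout()
    {
        // Linearize the thread index: x varies fastest, then y, then z.
        unsigned linear = threadIdx.x
                        + threadIdx.y * blockDim.x
                        + threadIdx.z * blockDim.x * blockDim.y;

        unsigned warp = linear / warpSize;   // warp index within the block
        unsigned lane = linear % warpSize;   // position within that warp

        if (lane == 0)
            printf("linear thread %u is lane 0 of warp %u\n", linear, warp);
    }

    int main()
    {
        show_warp_layout<<<1, dim3(16, 8)>>>();   // a 2D block of 128 threads
        cudaDeviceSynchronize();
        return 0;
    }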



Answer 2:

A warp consists of 32 threads that are executed at the same time: at any given moment a batch of 32 threads is executing together on the GPU, and that batch is called a warp.

I haven't found anything stating that you can control which warp is going to execute next. The only things you know are that a warp consists of 32 threads and that a thread block size should always be a multiple of that number.

Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block.
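
As a sketch of that block-level cooperation (the kernel, block size, and buffer layout below are my own example), threads of one block can stage data in shared memory and synchronize before reading each other's values:

    #define TILE 256   // block size used in this example

    __global__ void reverse_within_block(const int *in, int *out)
    {
        __shared__ int tile[TILE];            // visible to all threads of this block
        int tid = threadIdx.x;

        tile[tid] = in[blockIdx.x * TILE + tid];
        __syncthreads();                      // wait until every thread has written

        // Each thread now reads a value written by another thread of the same block.
        out[blockIdx.x * TILE + tid] = tile[TILE - 1 - tid];
    }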

There is also this, regarding memory operations and latency:

When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.

source
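
The per-multiprocessor warp limits quoted above are architecture specific; a short host-side query against the runtime API (a sketch, device 0 assumed) reports the values for whatever GPU is installed:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // query device 0

        printf("warp size:                 %d\n", prop.warpSize);
        printf("max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
        printf("max resident warps per SM: %d\n",
               prop.maxThreadsPerMultiProcessor / prop.warpSize);
        return 0;
    }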

With regards to dividing up threadblocks into warps, I have found this:

if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32

source
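
That ordering can also be checked empirically. The sketch below (my own, assuming CUDA 9+ for __shfl_sync) linearizes the 3D thread index as described, then uses a warp shuffle to confirm that lane 0 of each warp holds the expected linear index, for a 1024-thread block like the one in the question (shaped 32 x 32 here as an assumption):

    #include <cstdio>

    __global__ void check_warp_grouping()
    {
        // Linearize as quoted: first dimension, then second, then third.
        unsigned linear = threadIdx.x
                        + threadIdx.y * blockDim.x
                        + threadIdx.z * blockDim.x * blockDim.y;

        // Broadcast lane 0's linear index to every thread of the warp.
        unsigned lane0 = __shfl_sync(0xffffffffu, linear, 0);

        // If warps are built from consecutive linear indices in groups of 32,
        // lane 0 of this thread's warp sits at (linear / 32) * 32.
        if (lane0 != (linear / 32) * 32)
            printf("thread %u: unexpected warp grouping\n", linear);
    }

    int main()
    {
        check_warp_grouping<<<1, dim3(32, 32)>>>();   // 1024 threads, as in the question
        cudaDeviceSynchronize();
        return 0;
    }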