CUDA __syncthreads() usage within a warp

Posted 2020-07-16 13:05

If it is absolutely required that all the threads in a block be at the same point in the code, do we still need the __syncthreads() function when the number of threads being launched equals the number of threads in a warp?

Note: No extra threads or blocks, just a single warp for the kernel.

Example code:

__shared__ volatile int sdata[16];

int tid = threadIdx.x;
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
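
For concreteness, here is a minimal sketch of how that snippet might sit inside a complete kernel launched as a single warp. The kernel name, the parameter list, and the stand-ins for the placeholder values are assumptions for illustration, not taken from the question:

#include <cuda_runtime.h>

// Hypothetical kernel wrapping the snippet above; the names and the stand-ins
// for the question's placeholder values are illustrative only.
__global__ void xor_kernel(int *output, int x, int y, int z)
{
    __shared__ volatile int sdata[16];

    int tid   = threadIdx.x;
    int index = tid & 15;              // stand-in for some_number_between_0_and_15
    if (tid < 16)
        sdata[tid] = tid;              // stand-in for some_number

    // The question: is a __syncthreads() needed here, between the store above
    // and the read below, when only a single warp is running?

    output[tid] = x ^ y ^ z ^ sdata[index];
}

// Launch matching the note above: one block of 32 threads, i.e. exactly one warp.
// xor_kernel<<<1, 32>>>(d_output, x, y, z);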

2 Answers
聊天终结者
#2 · 2020-07-16 13:48

You still need __syncthreads() even though the warp is executed in parallel. The actual execution in hardware may not be truly parallel, because the number of cores within an SM (Streaming Multiprocessor) can be less than 32. For example, the GT200 architecture has 8 cores in each SM, so you can never be sure that all threads are at the same point in the code.
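
In code, this answer amounts to placing a barrier between the shared-memory store and the read in the question's snippet (same placeholder names as above):

    sdata[tid] = some_number;                  // write to shared memory
    __syncthreads();                           // barrier: every store above is
                                               // complete and visible before
                                               // any thread reads below
    output[tid] = x ^ y ^ z ^ sdata[index];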

SAY GOODBYE
#3 · 2020-07-16 13:52

Updated with more information about using volatile

Presumably you want all threads to be at the same point because they are reading data written by other threads into shared memory. If you are launching a single warp (in each block), then you know that all threads are executing together, so on the face of it you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.

  • Remember that the compiler will assume it can optimise as long as the intra-thread semantics remain correct, including delaying stores to memory so that the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read it. Using volatile forces the compiler to perform the memory write rather than keep the value in registers, but this has some risks and is more of a hack (meaning I don't know how it will be affected in the future); see the warp-synchronous reduction sketch at the end of this answer
    • Technically, you should always use __syncthreads() to conform with the CUDA Programming Model
  • The warp size is, and always has been, 32, but you can query it either way (a short sketch follows this list):
    • At compile time use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version)
    • At run time use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual)
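
As a minimal sketch of both queries from the last two points, assuming device 0 and with error checking omitted for brevity:

#include <cstdio>
#include <cuda_runtime.h>

// Device side: warpSize is a built-in variable available in device code.
__global__ void show_warp_size(int *out)
{
    if (threadIdx.x == 0)
        *out = warpSize;
}

int main()
{
    // Host side: the warpSize field of cudaDeviceProp (device 0 assumed).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("host query:   warp size = %d\n", prop.warpSize);

    // Run the kernel and copy its answer back for comparison.
    int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    show_warp_size<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("device query: warp size = %d\n", h_out);
    return 0;
}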

Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
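
For reference, the warp-synchronous tail of that reduction looks roughly like the sketch below. The function name is mine, it assumes the caller guards with if (tid < 32) and a block size of at least 64, and, as cautioned above, it relies on implicit warp-synchronous execution:

// Called from the reduction kernel as:  if (tid < 32) warp_reduce(sdata, tid);
// The volatile qualifier forces each store/load to actually touch shared memory
// instead of being cached in registers; no __syncthreads() is used because all
// 32 remaining threads belong to one warp.
__device__ void warp_reduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}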
