CUDA: reduction or atomic operations?

2019-02-19 03:16发布

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is:

Forcing every thread to store a value in the shared memory and using a reduction algorithm after that to determine the maximum (pro: minimum divergence cons: shared memory is limited to 48Kb on 2.0 devices)

I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by synchthreads.

Any other idea come into your mind?

7条回答
该账号已被封号
2楼-- · 2019-02-19 03:47

NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.

查看更多
▲ chillily
3楼-- · 2019-02-19 03:47

Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.

Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.

Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.

查看更多
贼婆χ
4楼-- · 2019-02-19 03:54

This is the usual way to perform reductions in CUDA

Within each block,

1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n (I personally favor between 16 and 32), values from global memory and updates the reduced value from these

2) Perform the reduction algorithm within the block to get one final reduced value per block.

This way you will not need more shared memory than (number of threads) * sizeof (datatye) bytes.

Since each block a reduced value, you will need to perform a second reduction pass to get the final value.

For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.

So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.

You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.

You will have to take care that the global memory reads are coalesced. You can not coalesce global memory writes, but that is one performance hit you need to take.

查看更多
Root(大扎)
5楼-- · 2019-02-19 04:03

I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.

查看更多
forever°为你锁心
6楼-- · 2019-02-19 04:04

If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

查看更多
贼婆χ
7楼-- · 2019-02-19 04:06

The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/

查看更多
登录 后发表回答