
CUDA: reduction or atomic operations?

Published 2019-02-19 03:45

Question:

I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is:

Forcing every thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices).

I couldn't use atomic operations because there is both a read and a write involved, so the threads could not be synchronized by __syncthreads().

Does any other idea come to mind?

Answer 1:

This is the usual way to perform reductions in CUDA.

Within each block:

1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (I personally favor between 16 and 32) and updates the running value with each of them.

2) Perform the reduction algorithm within the block to get one final reduced value per block.

This way you will not need more than (number of threads) * sizeof(datatype) bytes of shared memory.

Since each block produces one reduced value, you will need a second reduction pass to get the final value.

For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.

So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.

You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.

You will have to take care that the global memory reads are coalesced. You cannot coalesce the global memory writes (each block writes a single result), but that is one performance hit you have to take.
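To make the scheme concrete, here is a minimal sketch of such a first-pass kernel (the name reduce_max_pass and the constants are illustrative, not from the original answer; error handling and further tuning are omitted):

    #include <cfloat>
    #include <cuda_runtime.h>

    #define THREADS 256          // threads per block
    #define VALS_PER_THREAD 16   // so each block reduces 256 * 16 = 4096 elements

    // First pass: each block writes one partial maximum to out[blockIdx.x].
    // Run the same kernel a second time with one block over the partial results.
    __global__ void reduce_max_pass(const float *in, float *out, int n)
    {
        __shared__ float smax[THREADS];

        // Each thread reads VALS_PER_THREAD values and keeps a running maximum.
        float m = -FLT_MAX;
        int base = blockIdx.x * THREADS * VALS_PER_THREAD + threadIdx.x;
        for (int i = 0; i < VALS_PER_THREAD; ++i) {
            int idx = base + i * THREADS;   // consecutive threads read consecutive addresses
            if (idx < n)
                m = fmaxf(m, in[idx]);
        }
        smax[threadIdx.x] = m;
        __syncthreads();

        // Tree reduction in shared memory: one value per block remains.
        for (int s = THREADS / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = smax[0];
    }

Launching this with 256 threads per block reduces 4096 elements per block; a second launch with a single block over the roughly 250 partial results then produces the final maximum for the 1-million-element example above.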



Answer 2:

You may also want to use the reduction routines that come with CUDA Thrust, which is part of CUDA 4.0 or available here.

The library is written by a pair of NVIDIA engineers and compares favorably with heavily hand-optimized code. I believe there is also some auto-tuning of grid/block size going on.

You can interface with your own kernel easily by wrapping your raw device pointers.
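As a rough sketch of what that looks like (assuming the data already sits in a raw device allocation of n floats; the function name is made up for this example):

    #include <cfloat>
    #include <thrust/device_ptr.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>

    float device_max(const float *d_data, int n)
    {
        // Wrap the raw device pointer so Thrust knows the data lives on the GPU.
        thrust::device_ptr<const float> p(d_data);

        // One-line max reduction; Thrust chooses the launch configuration itself.
        return thrust::reduce(p, p + n, -FLT_MAX, thrust::maximum<float>());
    }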

This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.



Answer 3:

NVIDIA has a CUDA demo that does reduction: here. There's an accompanying whitepaper that explains some of the motivation behind the design.



Answer 4:

I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.



Answer 5:

Actually, the problem you describe is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is laid out contiguously in memory). It's just a reduction over a sequence of values: all the matrix elements, in whatever order they appear in memory.

Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days, as far as I can tell, is the excellent CUB library (libcub) by NVIDIA's Duane Merrill. Here is the documentation on its device-wide maximum-calculating function, cub::DeviceReduce::Max.
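That call follows CUB's usual two-phase pattern: one call with a null workspace to query the temporary-storage size, then the real call. A minimal sketch (buffer names are placeholders; error checking omitted):

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // d_in: device array of num_items floats; d_out: device location for the result.
    void cub_device_max(const float *d_in, float *d_out, int num_items)
    {
        void *d_temp = nullptr;
        size_t temp_bytes = 0;

        // First call only reports how much scratch space the reduction needs.
        cub::DeviceReduce::Max(d_temp, temp_bytes, d_in, d_out, num_items);
        cudaMalloc(&d_temp, temp_bytes);

        // Second call performs the device-wide max reduction into d_out[0].
        cub::DeviceReduce::Max(d_temp, temp_bytes, d_in, d_out, num_items);

        cudaFree(d_temp);
    }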

Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.
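A sketch of that pattern, shown here for integer data because atomicMax() has no native float overload (float data would need a bit-pattern or compare-and-swap workaround); the kernel name and block size are illustrative:

    #include <climits>

    // Launch with 256 threads per block (a power of two, matching the shared array),
    // and with *result initialized to INT_MIN before the launch.
    __global__ void max_with_rare_atomics(const int *data, int *result, int n)
    {
        __shared__ int smax[256];

        // Each thread scans a strided swath of the data, keeping a private maximum.
        int local = INT_MIN;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            local = max(local, data[i]);

        smax[threadIdx.x] = local;
        __syncthreads();

        // Block-level reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
            __syncthreads();
        }

        // Only one atomic per block touches the global result.
        if (threadIdx.x == 0)
            atomicMax(result, smax[0]);
    }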



Answer 6:

Atomic functions (atomicMax(), in the case of a maximum) could also be used on their own, but doing an atomic operation per element is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
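For contrast, the atomic-only version would look roughly like this (integer data again; every element contends for the same location, which serializes badly):

    __global__ void max_all_atomic(const int *data, int *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicMax(result, data[i]);   // one contended atomic per element
    }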



Answer 7:

If you have a K20 or Titan, I suggest dynamic parallelism: launch a single-thread kernel, which launches #items worker threads to produce the data, then launches #items / first-round-reduction-factor threads for the first reduction round, and keeps launching until the result comes out.
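A rough sketch of that driver, under several assumptions: compute capability 3.5+, compilation with -rdc=true and linking against cudadevrt, a child kernel like the block-level reduction sketched earlier, and the classic device-side cudaDeviceSynchronize(), which newer CUDA 12 toolkits no longer permit:

    // Child kernel (e.g. the shared-memory block reduction from the first answer):
    // reduces n values from 'in' into one partial maximum per block in 'out'.
    __global__ void reduce_max_pass(const float *in, float *out, int n);

    // Parent "controller" kernel, launched from the host as reduce_driver<<<1, 1>>>(...).
    // buf_a and buf_b are scratch buffers of at least ceil(n / 4096) floats each.
    __global__ void reduce_driver(const float *data, float *buf_a, float *buf_b,
                                  float *result, int n)
    {
        const int threads = 256;
        const int per_block = threads * 16;        // elements reduced per block per pass

        const float *in = data;
        float *out = buf_a;
        while (n > 1) {
            int blocks = (n + per_block - 1) / per_block;
            reduce_max_pass<<<blocks, threads>>>(in, out, n);
            cudaDeviceSynchronize();               // wait for the child grid (classic CDP)
            in = out;
            out = (out == buf_a) ? buf_b : buf_a;  // ping-pong between scratch buffers
            n = blocks;
        }
        *result = in[0];                           // the final maximum
    }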