CUDA: reduction or atomic operations? (answers, page 2)

Posted 2019-02-19 03:16

I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is:

Forcing every thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices)
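The reduction described above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names blockMax, matrixMax, kBlockSize and kMaxBlocks are hypothetical, the matrix is assumed to be a contiguous array of n > 0 floats already on the device, and error checking is omitted. Note that in this sketch shared memory holds one slot per thread in the block (1 KB for 256 floats), not one slot per matrix element, because each thread first folds many elements into a register.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical tuning constants for this sketch.
constexpr int kBlockSize = 256;  // threads per block; must be a power of two
constexpr int kMaxBlocks = 64;   // cap on the number of partial results

// Each block reduces its share of `in` to a single maximum and
// writes it to out[blockIdx.x]. Must be launched with kBlockSize threads.
__global__ void blockMax(const float* in, float* out, int n)
{
    __shared__ float smem[kBlockSize];

    int tid = threadIdx.x;
    float best = -FLT_MAX;

    // Grid-stride loop: every thread folds many elements into a register.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        best = fmaxf(best, in[i]);

    smem[tid] = best;
    __syncthreads();

    // Tree reduction in shared memory; active threads stay contiguous,
    // which keeps warp divergence low.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] = fmaxf(smem[tid], smem[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = smem[0];
}

// Host helper: two passes of the same kernel.
// Assumes d_in is a device pointer and n > 0.
float matrixMax(const float* d_in, int n)
{
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    if (blocks > kMaxBlocks) blocks = kMaxBlocks;

    float* d_partial = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));

    blockMax<<<blocks, kBlockSize>>>(d_in, d_partial, n);       // pass 1: per-block maxima
    blockMax<<<1, kBlockSize>>>(d_partial, d_result, blocks);   // pass 2: reduce the partials

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_partial);
    cudaFree(d_result);
    return h_result;
}
```

For a rows x cols matrix stored contiguously, the call would be matrixMax(d_matrix, rows * cols). The second launch reuses the kernel with a single block, so no atomic operation is needed to combine the per-block results.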

I couldn't use atomic operations because both a read and a write are involved, so the threads could not be synchronized with __syncthreads().

Does any other idea come to mind?

Tags: algorithm matrix cuda reduction gpu-atomics

7 answers

干净又极端

Reply #2 · 2019-02-19 04:10

You may also want to use the reduction routines that come with Thrust, which is part of CUDA 4.0 or available here.

The library is written by a pair of NVIDIA engineers and compares favorably with heavily hand-optimized code. I believe there is also some auto-tuning of grid/block size going on.

You can interface with your own kernel easily by wrapping your raw device pointers.
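As a rough sketch of what wrapping a raw device pointer might look like: the helper name thrustMax is hypothetical, and the matrix is assumed to be a contiguous device array of n > 0 floats (for example, one allocated with cudaMalloc and filled by your own kernel). It uses thrust::max_element rather than a generic reduce, since the goal here is the maximum.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// Hypothetical helper: returns the largest of the n floats at d_matrix.
// Assumes d_matrix is a raw device pointer and n > 0.
float thrustMax(float* d_matrix, int n)
{
    // Wrap the raw pointer so Thrust knows the data lives on the device.
    thrust::device_ptr<float> first = thrust::device_pointer_cast(d_matrix);

    // max_element runs on the device and returns an iterator to the maximum.
    thrust::device_ptr<float> it = thrust::max_element(first, first + n);

    // Dereferencing a device_ptr copies that single value back to the host.
    return *it;
}
```

The offset of the maximum is also available as it - first, which can be converted to a row and column if the position in the matrix matters.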

This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
