I have written an application in CUDA which uses 1 KB of shared memory in each block. Since there is only 16 KB of shared memory per SM, only 16 blocks can be accommodated overall (am I understanding this correctly?), though only 8 can be scheduled at a time. Now, if some block is busy doing a memory operation, another block will be scheduled on the GPU, but all the shared memory is already used by the 16 blocks that have been scheduled there. So will CUDA not schedule more blocks on the same SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory to global memory and allocate another block there (and in that case, should we worry about global memory access latency)?
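For reference, a block using roughly 1 KB of shared memory might look like the sketch below. The kernel name and the computation are made up for illustration; only the size of the `__shared__` declaration matters for this question, and the code assumes a block size of 256 threads.

```
// Hypothetical illustration of the setup described above: a kernel whose only
// shared memory use is a statically allocated array of 256 floats (1 KB).
__global__ void my_kernel(const float *in, float *out)
{
    __shared__ float tile[256];            // 256 * 4 bytes = 1 KB per block

    // Assumes the kernel is launched with blockDim.x == 256.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];           // stage data through shared memory
    __syncthreads();

    out[idx] = tile[threadIdx.x] * 2.0f;   // placeholder computation
}
```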
It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

- the maximum number of resident blocks per SM (8 on your hardware);
- the maximum number of resident threads (or warps) per SM divided by the threads per block;
- the number of blocks whose combined register usage fits into the SM's register file;
- the number of blocks whose combined shared memory usage fits into the SM's shared memory.

In your case, 1 KB per block would let 16 blocks fit into 16 KB of shared memory, but the 8-block limit is the smaller number, so at most 8 blocks will ever be resident on one SM at a time.
That is all there is to it. There is no "paging" of shared memory to accommodate more blocks; a block's shared memory stays allocated on the SM until that block has finished executing, and only then can a new block be scheduled in its place. NVIDIA provides a spreadsheet for computing occupancy which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
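Besides the spreadsheet, newer toolkits (CUDA 6.5 and later) expose the same calculation programmatically through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. A minimal sketch, assuming a hypothetical kernel with 1 KB of static shared memory like the one described in the question:

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using 1 KB of static shared memory per block.
__global__ void my_kernel(const float *in, float *out)
{
    __shared__ float tile[256];            // 1 KB of shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = tile[threadIdx.x];
}

int main()
{
    int blockSize = 256;        // threads per block
    int maxBlocksPerSM = 0;

    // Ask the runtime how many blocks of my_kernel can be resident on one SM,
    // taking its shared memory, register and thread usage into account
    // (no dynamic shared memory is requested here, hence the trailing 0).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                  blockSize, 0);

    printf("Resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```

The reported number is exactly the minimum described above, computed for the actual device the program runs on.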