I have written an application in CUDA which uses 1 KB of shared memory in each block. Since there is only 16 KB of shared memory per SM, only 16 blocks can be accommodated overall (am I understanding this correctly?), though only 8 can be scheduled at a time. Now, if some block is busy doing a memory operation, another block will be scheduled on the GPU, but all the shared memory is already used by the 16 blocks that have been scheduled there. So will CUDA not schedule more blocks on the same SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory to global memory and allocate another block there (and in that case, should we worry about global memory access latency)?
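For reference, a block using roughly 1 KB of shared memory might look like the sketch below. The kernel name and the computation are made up for illustration; only the size of the `__shared__` declaration matters for this question, and the code assumes a block size of 256 threads.

```
// Hypothetical illustration of the setup described above: a kernel whose only
// shared memory use is a statically allocated array of 256 floats (1 KB).
__global__ void my_kernel(const float *in, float *out)
{
    __shared__ float tile[256];            // 256 * 4 bytes = 1 KB per block

    // Assumes the kernel is launched with blockDim.x == 256.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];           // stage data through shared memory
    __syncthreads();

    out[idx] = tile[threadIdx.x] * 2.0f;   // placeholder computation
}
```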
It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

- the maximum number of resident blocks per SM (8 on your hardware);
- the maximum number of resident threads (or warps) per SM divided by the threads per block;
- the number of blocks whose combined register usage fits into the SM's register file;
- the number of blocks whose combined shared memory usage fits into the SM's shared memory.

In your case, 1 KB per block would let 16 blocks fit into 16 KB of shared memory, but the 8-block limit is the smaller number, so at most 8 blocks will ever be resident on one SM at a time.
That is all there is to it. There is no "paging" of shared memory to accommodate more blocks; a block's shared memory stays allocated on the SM until that block has finished executing, and only then can a new block be scheduled in its place. NVIDIA provides a spreadsheet for computing occupancy which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
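Besides the spreadsheet, newer toolkits (CUDA 6.5 and later) expose the same calculation programmatically through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. A minimal sketch, assuming a hypothetical kernel with 1 KB of static shared memory like the one described in the question:

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using 1 KB of static shared memory per block.
__global__ void my_kernel(const float *in, float *out)
{
    __shared__ float tile[256];            // 1 KB of shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = tile[threadIdx.x];
}

int main()
{
    int blockSize = 256;        // threads per block
    int maxBlocksPerSM = 0;

    // Ask the runtime how many blocks of my_kernel can be resident on one SM,
    // taking its shared memory, register and thread usage into account
    // (no dynamic shared memory is requested here, hence the trailing 0).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                  blockSize, 0);

    printf("Resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```

The reported number is exactly the minimum described above, computed for the actual device the program runs on.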