I have been using CUDA for a month, and now I'm trying to work out how many warps/blocks are needed to hide the latency of memory accesses. I think it is related to the maximum number of resident warps on a multiprocessor.
According to Table 13 in the CUDA C Programming Guide (v7.5), the maximum number of resident warps per multiprocessor is 64. My question is: what is a resident warp? Does it refer to those warps that have their data read from GPU memory and are ready to be processed by the SPs? Or does it refer to warps that can either read memory or be processed by the SPs, which would mean that all warps beyond those 64 can neither read memory nor be processed by the SPs until some of the 64 resident warps are done?
The maximum number of resident warps is the maximum number of warps that can be processed in parallel on the multiprocessor. A warp is active once it has been scheduled by a warp scheduler and its registers have been allocated.
If you achieve this number of warps running in parallel, you reach the theoretical maximum occupancy (100%, or 1:1). If not, the occupancy ratio is lower.
Other warps will have to wait.
This might be related to this question on SO.
Edited answer for further questions:
About the maximum number of warps that can be processed: each SM (streaming multiprocessor) has a limited number of processing cores, and the GPU has a limited number of SMs. Even if this webinar is not up to date with newer architectures, it gives some good examples:
First, note that some of these terms are not always clearly official; see for example this topic from Nvidia DevTalk.
As explained in that topic, a given warp is active once it has been allocated on the SM along with its resources. Then it can be:
This is possible because the architecture is SIMT, meaning Single Instruction, Multiple Threads. You will find lots of reading on this topic, which can be very useful if you plan on tweaking occupancy.