My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
How many threads execute concurrently depends on your code and on your CUDA device. For example, each Fermi streaming multiprocessor has two warp schedulers, so on every clock cycle two half-warps can be issued, whether for arithmetic, a memory load, or a transcendental-function calculation. While one half-warp is waiting on a load or working through a transcendental function, the CUDA cores can execute something else. So you can keep 96 threads busy on the cores, but only if your code exposes enough of that work, and, of course, only if you have enough memory.
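To make that concrete, here is a minimal sketch (the saxpy kernel, sizes, and launch configuration are made up for illustration, not taken from the question) of a launch that creates far more threads than there are cores, which is what gives the schedulers waiting warps to swap in while others stall on memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One multiply-add per thread; trivial on purpose.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;                 // 1M elements, far more than 96 cores
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // 256 threads per block, enough blocks to cover n. Only a handful of these
    // blocks are resident on the 2 multiprocessors at any moment; the rest wait.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```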
No it doesn't.
Chapter 4 of the CUDA C Programming Guide covers the limits on how many blocks and warps can reside on a multiprocessor at once.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
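If you would rather query those limits programmatically than run the SDK sample, a minimal sketch using the CUDA runtime API could look like the following (field names are those of cudaDeviceProp; maxThreadsPerMultiProcessor may be missing in very old toolkits):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // The limits below decide how many blocks and threads can be resident,
    // similar to what the deviceQuery sample prints.
    printf("Name:                           %s\n",  prop.name);
    printf("Multiprocessors:                %d\n",  prop.multiProcessorCount);
    printf("Warp size:                      %d\n",  prop.warpSize);
    printf("Max threads per block:          %d\n",  prop.maxThreadsPerBlock);
    printf("Max threads per multiprocessor: %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Registers per block:            %d\n",  prop.regsPerBlock);
    printf("Shared memory per block:        %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```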
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 calculation results per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU): it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources to run, such as registers, shared memory and global memory. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores.

On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
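To put a number on "in flight", here is a minimal sketch assuming CUDA 6.5 or later, which added the occupancy query API (busyKernel is just an illustrative kernel, not anything from the question). It compares how many threads can be resident against the core count:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only so the occupancy calculator has something to inspect.
__global__ void busyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int blocksPerSM = 0;
    // How many blocks of busyKernel can be resident on one multiprocessor?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, busyKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int threadsInFlight = blocksPerSM * blockSize * prop.multiProcessorCount;
    printf("Resident blocks per multiprocessor: %d\n", blocksPerSM);
    printf("Threads in flight on the device:    %d\n", threadsInFlight);
    return 0;
}
```

On a device with 2 multiprocessors and 48 cores each, the "threads in flight" figure reported here will typically be far larger than 96, which is exactly the point: most of those threads are resident and waiting, not executing on a core at that instant.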