How to evaluate CUDA performance?

Posted 2020-03-03 06:08

I wrote my own CUDA kernel. Compared to my CPU code, the kernel is about 10 times faster.

But I have some questions about my experiments.

Is my program fully optimized: does it use all the GPU cores, use shared memory properly, keep the register count adequate, and achieve enough occupancy?

How can I evaluate my kernel code's performance?

How can I calculate CUDA's theoretical maximum throughput?

Am I right that comparing the CPU's GFLOPS with the GPU's GFLOPS gives a transparent picture of their theoretical performance?

Thanks in advance.

2 Answers
Aperson · 2020-03-03 06:50

The preferred measure of performance is elapsed time. GFLOPS can be used as a comparison, but it is often difficult to compare across compilers and architectures due to differences in instruction sets, compiler code generation, and the method of counting FLOPs.

The best method is to time the application. For the CUDA code, you should time everything that occurs per launch, including memory copies and synchronization.
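A minimal sketch of per-launch timing with CUDA events is below; the kernel myKernel, the data size, and the launch configuration are placeholders, not from the original post:

```cuda
// Minimal CUDA event timing sketch. myKernel, the data size, and the
// launch configuration are hypothetical placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void myKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;  // stand-in workload
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data = (float *)calloc(n, sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time everything that happens per launch: host-to-device copy,
    // kernel, and device-to-host copy.
    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```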

Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPS values for each device. In addition, the Achieved FLOPs experiment can be used to capture the FLOP count for both single and double precision.

家丑人穷心不美 · 2020-03-03 07:01

Is my program fully optimized: does it use all the GPU cores, use shared memory properly, keep the register count adequate, and achieve enough occupancy?

To find this out, use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
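As a lightweight complement to the profilers, the occupancy part of the question can also be checked at runtime with the CUDA occupancy API; in this sketch the kernel myKernel and the block size of 256 are hypothetical:

```cuda
// Sketch: query the theoretical occupancy of a kernel at a given block
// size. myKernel and blockSize are hypothetical placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

int main() {
    const int blockSize = 256;  // the launch configuration to evaluate
    int numBlocks = 0;          // max resident blocks per SM for this kernel

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0 /* dynamic shared memory */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy is resident warps divided by the SM's maximum warps.
    const int activeWarps = numBlocks * blockSize / prop.warpSize;
    const int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical occupancy: %.1f%%\n",
           100.0 * activeWarps / maxWarps);
    return 0;
}
```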

How can I calculate CUDA's theoretical maximum throughput?

That math is slightly involved, differs for each architecture, and is easy to get wrong. It is better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one for the GTX 500 series cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPS.
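If you do want to compute the numbers yourself, most of the inputs can be read from cudaDeviceProp; a sketch follows. The peak-GFLOPS line also needs the number of cores per SM, which varies by architecture, so the value of 32 below is an assumption that happens to match Fermi GF110 (the GTX 580):

```cuda
// Sketch: derive theoretical peak bandwidth and SP GFLOPS from device
// properties. coresPerSM is an assumption; check the value for your
// compute capability before trusting the GFLOPS line.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Bandwidth: memory clock (kHz) * 2 (double data rate) * bus bytes.
    double gbps = 2.0 * prop.memoryClockRate
                      * (prop.memoryBusWidth / 8) / 1.0e6;
    printf("theoretical peak bandwidth: %.1f GB/s\n", gbps);

    // SP GFLOPS: cores * clock (GHz) * 2 FLOPs per fused multiply-add.
    const int coresPerSM = 32;  // assumed: Fermi GF110 (GTX 580)
    double gflops = 2.0 * prop.multiProcessorCount * coresPerSM
                        * (prop.clockRate / 1.0e6);
    printf("theoretical peak SP compute: %.1f GFLOPS\n", gflops);
    return 0;
}
```

On a GTX 580 this reproduces the table values: 2 × 2004 MHz × 48 bytes ≈ 192.4 GB/s, and 2 × 16 SMs × 32 cores × 1.544 GHz ≈ 1581.1 GFLOPS.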

Am I right that comparing the CPU's GFLOPS with the GPU's GFLOPS gives a transparent picture of their theoretical performance?

If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:

  • Older GPUs did not support double precision (DP) floating point, only single precision (SP).

  • GPUs that do support DP do so with a significant performance penalty compared to SP. The GFLOPS number quoted above is for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between SP and DP performance on a CPU.

  • CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that approach the theoretical maximum (they may have to be written in assembly). Sometimes CPU quotes are for a combination of all the computing resources available through different types of instructions, and it is often virtually impossible to write a program that exploits them all simultaneously. See the worked example after this list.

  • The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
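To make the SIMD point concrete with hypothetical numbers: a 4-core 3.0 GHz CPU with two 8-wide single-precision FMA units per core peaks at 4 cores × 3.0 GHz × 2 units × 8 lanes × 2 FLOPs per FMA = 384 SP GFLOPS, but only fully vectorized fused multiply-add code approaches that; plain scalar code on the same chip is limited to roughly 4 × 3.0 × 1 = 12 GFLOPS. A GPU quote like the 1581.1 GFLOPS above is built the same way (cores × clock × 2 FLOPs per multiply-add), so both numbers are upper bounds that real code rarely reaches.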
