How to determine if my GPU does 16/32/64 bit arith

Posted 2019-08-02 01:17

Question:

I am trying to find the throughput of native arithmetic operations on my Nvidia card. On this page, Nvidia has documented the throughput values for various arithmetic operations. The problem is how do I determine if my card does 16 or 32 or 64 bit operations, since the values are different for each? Further, I also want to calculate the latency values of these instructions for my card. Is there some way to do it? As far as my research goes, they are not documented like throughput. Is there some benchmark suite for this purpose?

Thanks!

Answer 1:

how do I determine if my card does 16 or 32 or 64 bit operations, since the values are different for each?

The page you linked lists compute capabilities across the top of the table, one per column. Your GPU has a compute capability. You can use the deviceQuery CUDA sample app to figure out what it is, or look it up here.

For example, suppose I had a GTX 1060 GPU. Running deviceQuery on it will report a compute capability major version of 6 and a minor version of 1, so it is a compute capability 6.1 GPU. You can also see that here.
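
If you would rather query this from your own program than run the sample app, a minimal sketch along the following lines should work. It only uses the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties; the output format is my own choice and not what deviceQuery prints.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // prop.major and prop.minor together form the compute capability
        // (for example 6 and 1 for a compute capability 6.1 device).
        // The SM count is printed too, since the table values are per SM.
        std::printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
                    dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```

Build it with nvcc (for example `nvcc query.cu -o query`, where the file name is arbitrary) and read the major.minor pair off the output.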

Now, going back to the table you linked, that means the column labelled 6.1 is the one of interest. It looks like this:

                                            Compute Capability
                                                    6.1 
16-bit floating-point add, multiply, multiply-add   2     ops/SM/clock
32-bit floating-point add, multiply, multiply-add   128   ops/SM/clock
64-bit floating-point add, multiply, multiply-add   4     ops/SM/clock
...

This means a GTX 1060 is capable of all 3 types of operations (floating point multiply, or multiply-add, or add) at 3 different precisions (16-bit, 32-bit, 64-bit) at differing rates or throughputs for each precision. With respect to the table, these numbers are per clock and per SM.

In order to determine the aggregate peak theoretical throughput for the entire GPU, we must multiply the above numbers by the clock rate of the GPU and by the number of SMs (streaming multiprocessors) in the GPU. The CUDA deviceQuery app can also tell you this information, or you can look it up online.
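
As a rough sketch of that multiplication, here is a small host-side helper. It is plain C++ that also compiles unchanged in a .cu file. The function name peakOpsPerSecond and its parameter names are made up for illustration; the inputs are the table value for your compute capability, the SM count, and the clock rate in Hz.

```cpp
// Peak theoretical throughput in operations per second for the whole GPU.
// opsPerSmPerClock: value read from the throughput table (ops/SM/clock)
// smCount:          number of streaming multiprocessors on the GPU
// clockHz:          GPU clock rate in Hz
// All three inputs are supplied by the caller; nothing is queried from the device here.
double peakOpsPerSecond(double opsPerSmPerClock, int smCount, double clockHz) {
    return opsPerSmPerClock * smCount * clockHz;
}
```

For instance, assuming a card with 10 SMs running at 1.7 GHz (illustrative figures only, check your own card), the 32-bit row gives 128 x 10 x 1.7e9, or about 2.18e12 operations per second. If each multiply-add is counted as two floating-point operations, as is conventional for FLOP/s figures, that is roughly 4.35 TFLOP/s. The 64-bit row on the same assumed card gives only 4 x 10 x 1.7e9 = 6.8e10 operations per second.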

Further, I also want to calculate the latency values of these instructions for my card. Is there some way to do it? As far as my research goes, they are not documented like throughput.

As I already mentioned on your previous question, these latency values are not published or specified, and in fact they may (and do) change from GPU to GPU, from one instruction type to another (e.g. floating point multiply and floating point add may have different latencies), and may even change from CUDA version to CUDA version, for certain operation types which are emulated via a sequence of multiple SASS instructions.

In order to discover this latency data, then, it's necessary to do some form of micro-benchmarking. An early and oft-cited paper demonstrating how this may be done for CUDA GPUs is here. There is not one single canonical reference for latency micro-benchmark data for GPUs, nor is there a single canonical reference for the benchmark programs to do it. It is a fairly difficult undertaking.
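
To give a flavor of what such a micro-benchmark looks like, here is a minimal sketch. It is not taken from the paper above. It assumes the common approach of timing a chain of dependent instructions from a single thread with clock64(). The kernel name, the chain length, and the constants are arbitrary choices of mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kChainLength = 4096;  // arbitrary; longer chains amortize timer overhead

// One thread executes a chain of FMAs in which each one needs the previous
// result, so the elapsed cycles divided by the chain length approximates the
// latency of a dependent 32-bit FMA, plus some loop and timer overhead.
__global__ void fmaLatencyKernel(float seed, float *result, long long *cycles) {
    float x = seed;
    long long start = clock64();
    #pragma unroll 64
    for (int i = 0; i < kChainLength; ++i) {
        x = fmaf(x, 1.000001f, 0.5f);  // depends on the previous value of x
    }
    long long stop = clock64();
    *result = x;  // store the result so the compiler cannot discard the chain
    *cycles = stop - start;
}

// Report any CUDA runtime error and stop.
static bool ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    float *dResult = nullptr;
    long long *dCycles = nullptr;
    if (!ok(cudaMalloc(&dResult, sizeof(float)), "cudaMalloc")) return 1;
    if (!ok(cudaMalloc(&dCycles, sizeof(long long)), "cudaMalloc")) return 1;

    // A single thread in a single block: only latency is measured, not throughput.
    fmaLatencyKernel<<<1, 1>>>(1.0f, dResult, dCycles);
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;
    if (!ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

    long long cycles = 0;
    if (!ok(cudaMemcpy(&cycles, dCycles, sizeof(cycles), cudaMemcpyDeviceToHost), "cudaMemcpy")) return 1;

    std::printf("about %.2f cycles per dependent FMA\n",
                static_cast<double>(cycles) / kChainLength);

    cudaFree(dResult);
    cudaFree(dCycles);
    return 0;
}
```

Treat the number it prints as an estimate only. The compiler is free to schedule instructions around the clock64() reads, so it is worth inspecting the generated SASS (for example with cuobjdump) to confirm that the timed region really contains the dependent FMA chain and nothing else.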

Is there some benchmark suite for this purpose?

This sort of question is explicitly off-topic for SO. Please read here where it states:

"Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow..."