I am trying to find the throughput of native arithmetic operations on my Nvidia card. On this page, Nvidia has documented the throughput values for various arithmetic operations. The problem is: how do I determine whether my card does 16-, 32-, or 64-bit operations, since the values are different for each? Further, I also want to calculate the latency values of these instructions for my card. Is there some way to do it? As far as my research goes, they are not documented the way the throughputs are. Is there some benchmark suite for this purpose?
Thanks!
The page you linked lists compute capabilities across the top of the table (one per column). Your GPU has a compute capability. You can use the `deviceQuery` CUDA sample app to figure out what it is, or look it up here.

For example, suppose I had a GTX 1060 GPU. If you run `deviceQuery` on it, it will report a compute capability major version of 6 and a minor version of 1, so it is a compute capability 6.1 GPU. You can also see that here. Now, going back to the table you linked, that means the column labelled 6.1 is the one of interest.
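If you would rather query the compute capability in code than run `deviceQuery`, a minimal sketch using the CUDA runtime API (assuming device 0) looks like this:

```cpp
// Minimal sketch: query the compute capability via the CUDA runtime API.
// This is essentially one of the things deviceQuery prints.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```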
Reading that column, a GTX 1060 is capable of all 3 types of operations (floating-point multiply, multiply-add, or add) at 3 different precisions (16-bit, 32-bit, and 64-bit), at a different rate or throughput for each precision. With respect to the table, these numbers are per clock cycle and per SM.
In order to determine the aggregate peak theoretical throughput for the entire GPU, we must multiply the above numbers by the clock rate of the GPU and by the number of SMs (streaming multiprocessors) in the GPU. The CUDA `deviceQuery` app can also tell you this information, or you can look it up online.
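Here is a minimal sketch of that multiplication using the CUDA runtime API. The `opsPerClockPerSM` value is a placeholder of my choosing; replace it with the number read from the table for your compute capability and the operation you care about.

```cpp
// Sketch of the aggregate peak theoretical throughput calculation.
// opsPerClockPerSM must come from the table for your compute capability;
// 128 is just an illustrative placeholder.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // device 0 assumed
    double opsPerClockPerSM = 128.0;                   // value read from the table
    double clockHz = prop.clockRate * 1000.0;          // clockRate is reported in kHz
    double peakOpsPerSec = opsPerClockPerSM * prop.multiProcessorCount * clockHz;
    // Note: published FLOPS figures usually count a multiply-add as 2 FLOPs.
    printf("SMs: %d, clock: %.0f MHz, peak: %.3f Tops/s\n",
           prop.multiProcessorCount, clockHz / 1e6, peakOpsPerSec / 1e12);
    return 0;
}
```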
As I already mentioned on your previous question, these latency values are not published or specified. In fact, they may (and do) change from GPU to GPU, and from one instruction type to another (e.g. floating-point multiply and floating-point add may have different latencies), and they may even change from CUDA version to CUDA version, for those operation types which are emulated via a sequence of multiple SASS instructions.
In order to discover this latency data, then, it's necessary to do some form of micro-benchmarking. An early and oft-cited paper demonstrating how this may be done for CUDA GPUs is here. There is no single canonical reference for latency micro-benchmark data for GPUs, nor is there a single canonical reference for the benchmark programs to do it. It is a fairly difficult undertaking. The usual approach is to time a long chain of dependent instructions with the on-chip cycle counter, as sketched below.
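As an illustration only (not the methodology of that particular paper), here is a rough sketch of the dependent-chain idea for 32-bit floating-point multiply-add, timed with the `clock64()` device counter. The kernel name, chain length, and constants are arbitrary choices of mine, and loop overhead makes the result approximate; a serious micro-benchmark would fully unroll the chain and inspect the generated SASS.

```cpp
// Rough sketch of a dependent-chain latency microbenchmark for FP32 multiply-add.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float seed, long long *cycles, float *sink)
{
    const int N = 1024;                  // length of the dependent chain
    float x = seed;
    long long start = clock64();         // per-SM cycle counter
    #pragma unroll 16
    for (int i = 0; i < N; ++i)
        x = x * 0.999f + 0.001f;         // each iteration depends on the previous result
    long long stop = clock64();
    *cycles = stop - start;
    *sink = x;                           // keep the chain from being optimized away
}

int main()
{
    long long *d_cycles;
    float *d_sink;
    cudaMalloc((void **)&d_cycles, sizeof(long long));
    cudaMalloc((void **)&d_sink, sizeof(float));

    fma_latency<<<1, 1>>>(1.0f, d_cycles, d_sink);   // a single thread, so nothing hides the latency

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("approx. cycles per dependent multiply-add: %.2f\n", cycles / 1024.0);

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```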
This sort of question is explicitly off-topic for SO. Please read here where it states:
"Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow..."