I want a measure of how much of the peak memory bandwidth my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:
...
float result = 0.0f;
for (int k = 0; k < 4000; k++) {
    result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
    ....
}
out_data[index] = result;
out_data2[index] = sqrt(result);
...
I count 4000*2+2 accesses to global memory for each thread. With 1,000,000 threads and every access being a 4-byte float, that gives ~32 GB of global memory accesses (inbound and outbound added). As my kernel only takes 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculations or assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:
- What is my error?
- What accesses to global memory are cached and which are not?
- Is it correct that I don't count access to registers, local, shared and constant memory?
- Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
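For reference, the estimate above can be written out as a small C sketch. It makes the same assumption as my count, namely that every access actually reaches DRAM, and it takes 1 GB = 10^9 bytes. The function names are only illustrative.

```c
/* Naive traffic estimate: every global-memory access is assumed to
   reach DRAM (no caching, no reuse). Figures come from the question. */

/* Total bytes moved if each thread performs `accesses_per_thread`
   accesses of `bytes_per_access` bytes. */
double naive_traffic_bytes(double threads, double accesses_per_thread,
                           double bytes_per_access)
{
    return threads * accesses_per_thread * bytes_per_access;
}

/* Bandwidth in GB/s (1 GB = 1e9 bytes) for a runtime given in seconds. */
double bandwidth_gb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}
```

With 1,000,000 threads, 4000*2+2 accesses per thread and 4 bytes per access, naive_traffic_bytes returns about 3.2e10 bytes, and bandwidth_gb_per_s for 0.1 s gives about 320 GB/s, roughly three times the 102.4 GB/s peak.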
Profiler output:
method          gputime  cputime  occupancy  instruction  warp_serial  memtransfer
memcpyHtoD       10.944       17          -            -            -        16384
fill              64.32       93          1        14556            0            -
fill             64.224       83          1        14556            0            -
memcpyHtoD       10.656       11          -            -            -        16384
fill             64.064       82          1        14556            0            -
memcpyHtoD      1172.96     1309          -            -            -      4194304
memcpyHtoD       10.688       12          -            -            -        16384
cu_more_regT  93223.906    93241          1     40716656            0            -
memcpyDtoH     1276.672     1974          -            -            -      4194304
memcpyDtoH     1291.072     2019          -            -            -      4194304
memcpyDtoH      1278.72     2003          -            -            -      4194304
memcpyDtoH         1840     3172          -            -            -      4194304
New question:
- When 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB and gpu_time ~= 0.1 s, then I achieve a bandwidth of 10*40 MB/s = 400 MB/s. That seems very low. Where is the error?
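A sketch of how a single memcpy row translates into a transfer rate, assuming the gputime column is reported in microseconds (which would fit the 93223.906 entry for a kernel that takes about 0.1 s). Note that each memcpy row has its own gputime; the ~0.1 s figure belongs to the kernel row cu_more_regT. The function name is illustrative.

```c
/* Transfer rate in GB/s (1 GB = 1e9 bytes) for `bytes` moved in
   `gputime_us` microseconds.
   ASSUMPTION: the profiler's gputime column is in microseconds. */
double rate_gb_per_s(double bytes, double gputime_us)
{
    double seconds = gputime_us * 1e-6;
    return bytes / seconds / 1e9;
}
```

Under that assumption, the 4194304-byte memcpyHtoD row with gputime 1172.96 works out to roughly 3.6 GB/s, and the memcpyDtoH row with gputime 1276.672 to roughly 3.3 GB/s. These are host-device copy rates, not the kernel's global memory bandwidth.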
p.s. Tell me if you need other counters for your answer.
sister question: How to calculate Gflops of a kernel
The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed...).
Regarding your question, to calculate the achieved global memory throughput:
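As a rough sketch of that calculation for a compute capability 1.x device such as the C1060: the profiler exposes per-transaction-size counters (gld_32b, gld_64b, gld_128b for loads and gst_32b, gst_64b, gst_128b for stores). As far as I recall from the Compute Visual Profiler guide, these are collected on a single TPC, so the byte total is scaled by the number of TPCs before dividing by gputime. Treat that scaling, the microsecond unit of gputime, and the function name below as assumptions to check against the guide for your profiler version.

```c
/* Global memory throughput in GB/s (1 GB = 1e9 bytes) from
   compute capability 1.x profiler counters.
   ASSUMPTIONS: n32/n64/n128 are counts of 32-, 64- and 128-byte
   transactions observed on ONE TPC (e.g. gld_32b, gld_64b, gld_128b),
   num_tpc is the number of TPCs on the device, and gputime_us is the
   kernel's gputime in microseconds. */
double gmem_throughput_gb_per_s(double n32, double n64, double n128,
                                double num_tpc, double gputime_us)
{
    /* Bytes seen on one TPC, scaled to the whole device. */
    double bytes = (n32 * 32.0 + n64 * 64.0 + n128 * 128.0) * num_tpc;
    return bytes / (gputime_us * 1e-6) / 1e9;
}
```

Call it once with the load counters and once with the store counters, then add the two results to get the total throughput to compare against the 102.4 GB/s peak. For example, a hypothetical 1,000,000 128-byte load transactions on one of 10 TPCs over 1 s of gputime would correspond to 1.28 GB/s of read throughput.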
Hope this helps.