I'm using nvprof to count the global memory accesses in the CUDA code below, because I want to compute the Compute to Global Memory Access (CGMA) ratio. The kernel performs 36 loads (from the d_In array) and 36+36 stores (to the d_Out and d_rows arrays), so I expect 36 global memory loads and 72 global memory stores in total. However, when I profile the code with nvprof, it reports the following:
```
Invocations  Metric Name            Metric Description         Min  Max  Avg
1            gld_transactions       Global Load Transactions     6    6    6
1            gst_transactions       Global Store Transactions   11   11   11
1            l2_read_transactions   L2 Read Transactions       133  133  133
1            l2_write_transactions  L2 Write Transactions       24   24   24
```
```
#include <stdio.h>
#include "cuda_profiler_api.h"

__constant__ int crows;

__global__ void kernel(double *d_In, double *d_Out, int *d_rows){
    int tx=threadIdx.x;
    int bx=blockIdx.x;
    int n=bx*blockDim.x+tx;      // global thread index
    if(n < 36){                  // 40 threads are launched, 36 are active
        d_Out[n]=d_In[n]+1;      // one load (d_In) and one store (d_Out) per active thread
        d_rows[n]=crows;         // one store (d_rows) per active thread
    }
    return;
}

int main(int argc,char **argv){
    double I[36]={1,5,9,2,6,10,3,7,11,4,8,12,13,17,21,14,18,22,15,19,23,16,20,24,25,29,33,26,30,34,27,31,35,28,32,36};
    double *d_In;
    double *d_Out;
    int *d_rows;
    double Iout[36];
    int rows=5;
    int h_rows[36];

    // Copy the constant and the input array to the device
    cudaMemcpyToSymbol(crows,&rows,sizeof(int));
    cudaMalloc(&d_In,sizeof(double)*36);
    cudaMalloc(&d_Out,sizeof(double)*36);
    cudaMalloc(&d_rows,sizeof(int)*36);
    cudaMemcpy(d_In,I,sizeof(double)*36,cudaMemcpyHostToDevice);

    // 4 blocks of 10 threads each
    dim3 dimGrid(4,1,1);
    dim3 dimBlock(10,1,1);

    // Profile only the kernel launch
    cudaProfilerStart();
    kernel<<<dimGrid,dimBlock>>>(d_In,d_Out,d_rows);
    cudaProfilerStop();

    // Copy the results back and print them
    cudaMemcpy(Iout,d_Out,sizeof(double)*36,cudaMemcpyDeviceToHost);
    cudaMemcpy(h_rows,d_rows,sizeof(int)*36,cudaMemcpyDeviceToHost);
    int i;
    for(i=0;i<36;i++)
        printf("%f %d\n",Iout[i],h_rows[i]);
}
```
Can someone help me? Thank you
It's customary to ask a question, something more specific than "Can someone help me?" Your code as shown has no floating point operations (+, *, etc.) so there is no CGMA to compute (it is zero).
Regarding the memory transactions, your code has 4 threadblocks (`dim3 dimGrid(4,1,1);`). Each threadblock may run on a separate multiprocessor. You have 10 threads in each block (`dim3 dimBlock(10,1,1);`). The following line of code, `d_Out[n]=d_In[n]+1;`, will generate at least one global load transaction (`d_In`) and one global store transaction (`d_Out`) per threadblock to service the threads. In the fourth block, the active threads have global indices (`n`) 30-35. When this block executes the above line of code, it will generate two global load and two global store transactions, because the threads require two cachelines to service their requests. So this one line of code may generate 5 global load transactions and 5 global store transactions.

For similar reasons, the next line of code, `d_rows[n]=crows;`, may generate 5 additional global store transactions. So, of your profiler output (6 `gld_transactions` and 11 `gst_transactions`):
I believe I have explained 5 of the 6 global load transactions, and 10 of the 11 global store transactions. Hopefully that is enough to give you an idea of the origin of these numbers.