Using CUDA Profiler nvprof for memory accesses

I'm using nvprof to get the number of global memory accesses for the following CUDA code. The number of loads in the kernel is 36 (accessing d_In array) and the number of stores in the kernel is 36+36 (for accessing d_Out array and d_rows array). So, the total number of global memory loads is 36 and the number of global memory stores is 72. However, when I profile the code with nvprof CUDA profiler, it reports the following: (Basically I want to compute the Compute to Global Memory Access (CGMA) ratio)

      1                gld_transactions        Global Load Transactions           6           6           6
      1                gst_transactions       Global Store Transactions          11          11          11
      1            l2_read_transactions            L2 Read Transactions         133         133         133
      1           l2_write_transactions           L2 Write Transactions          24          24          24


#include <stdio.h>
#include "cuda_profiler_api.h"
__constant__ int crows;

__global__ void kernel(double *d_In, double *d_Out, int *d_rows){
        int tx=threadIdx.x;
        int bx=blockIdx.x;
        int n=bx*blockDim.x+tx;
        if(n < 36){
                d_Out[n]=d_In[n]+1;
                d_rows[n]=crows;
        }
        return;
}

int main(int argc,char **argv){

     double I[36]={1,5,9,2,6,10,3,7,11,4,8,12,13,17,21,14,18,22,15,19,23,16,20,24,25,29,33,26,30,34,27,31,35,28,32,36};

     double *d_In;
     double *d_Out;
     int *d_rows;

     double Iout[36];
     int rows=5;
     int h_rows[36];

     cudaMemcpyToSymbol(crows,&rows,sizeof(int));
     cudaMalloc(&d_In,sizeof(double)*36);
     cudaMalloc(&d_Out,sizeof(double)*36);
     cudaMalloc(&d_rows,sizeof(int)*36);

     cudaMemcpy(d_In,I,sizeof(double)*36,cudaMemcpyHostToDevice);

     dim3 dimGrid(4,1,1);
     dim3 dimBlock(10,1,1);

     cudaProfilerStart();
     kernel<<<dimGrid,dimBlock>>>(d_In,d_Out,d_rows);
     cudaProfilerStop();

     cudaMemcpy(Iout,d_Out,sizeof(double)*36,cudaMemcpyDeviceToHost);
      cudaMemcpy(h_rows,d_rows,sizeof(int)*36,cudaMemcpyDeviceToHost);


    int i;
     for(i=0;i<36;i++)
       printf("%f %d\n",Iout[i],h_rows[i]);


}

Can someone help me? Thank you

It's customary to ask a question, something more specific than "Can someone help me?" Your code as shown has no floating point operations (+, *, etc.) so there is no CGMA to compute (it is zero).

Regarding the memory transactions, your code has 4 threadblocks:

 dim3 dimGrid(4,1,1);

Each threadblock may run on a separate multiprocessor. You have 10 threads in each block. The following line of code:

            d_Out[n]=d_In[n]+1;

will generate at least one global load transaction (d_In) and one global store transaction (d_Out) to service the threads. The fourth block will have threads whose global indices (n) for the active threads will be 30-35. When this block executes the above line of code, it will generate two global load and two global store transactions, because the threads require two cachelines to service their requests. So this one line of code may generate 5 global load transactions and 5 global store transactions.

For similar reasons, the next line of code:

            d_rows[n]=crows;

may generate 5 additional global store transactions. So of your profiler output:

  1                gld_transactions        Global Load Transactions           6           6           6
  1                gst_transactions       Global Store Transactions          11

I believe I have explained 5 of the 6 global load transactions, and 10 of the 11 global store transactions. Hopefully that is enough to give you an idea of the origin of these numbers.