nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sec

I am trying to count the number of DRAM (global memory) accesses for a simple vector-addition kernel.

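// Each thread adds N elements; each block covers a contiguous chunk of N * blockDim.x floats.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;            // first element of this block's chunk
    int threadStartIndex = blockStartIndex + threadIdx.x;          // first element handled by this thread
    int threadEndIndex   = threadStartIndex + ( N * blockDim.x );  // one past this thread's last element
    int i;

    // Stride by blockDim.x so that neighbouring threads touch neighbouring floats.
    for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
        C[i] = A[i] + B[i];
    }
}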

Grid size = 180, block size = 128

Array size = 180 * 128 * N floats, where N is an input parameter (elements per thread).

When N = 1, array size = 180 * 128 * 1 floats = 90 KB.

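All three arrays should go through DRAM: A and B are read from it, and C is written to it.

So, in theory:

DRAM writes (C) = 2880 (32-byte accesses)
DRAM reads (A, B) = 2880 + 2880 = 5760 (32-byte accesses)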

But when I use nvprof, I get:

DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32-byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32-byte accesses)

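This is the problem. In theory there should be 5760 DRAM reads, but nvprof reports only 30, which looks impossible to me. Furthermore, if I double the size of the vectors (N = 2), the reported number of DRAM reads still stays at 30.

It would be great if someone could shed some light on this.

I have disabled the L1 cache with the compiler option "-Xptxas -dlcm=cg".

Thanks, Waruna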

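If you call cudaMemcpy before the kernel launch to copy the source buffers from host to device, the copy brings the source buffers into the L2 cache. The kernel's reads then hit in L2 instead of missing, so you get a lower count for fb_subp0_read_sectors + fb_subp1_read_sectors.

If you comment out the cudaMemcpy calls before the kernel launch, you will see that the event values of fb_subp0_read_sectors and fb_subp1_read_sectors include the values you expect.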
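A minimal sketch of how the two cases could be compared, assuming a host program along the lines below. The question does not show its host code, so the COPY_INPUTS macro, the file name vecadd.cu, and the fill values are all made up for illustration. With the copies skipped, the device inputs are uninitialized, which is fine for counting sectors but makes the sums meaningless.

```cuda
// Build:   nvcc -Xptxas -dlcm=cg -DCOPY_INPUTS=0 vecadd.cu -o vecadd
// Profile: nvprof --events fb_subp0_read_sectors,fb_subp1_read_sectors ./vecadd 1
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical switch: 1 copies A and B to the device before the launch,
// 0 skips the copies so the inputs are not already resident in L2.
#ifndef COPY_INPUTS
#define COPY_INPUTS 1
#endif

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + ( N * blockDim.x );

    for (int i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}

int main(int argc, char** argv)
{
    const int gridSize = 180, blockSize = 128;
    const int N = (argc > 1) ? std::atoi(argv[1]) : 1;   // elements per thread
    const size_t count = (size_t)gridSize * blockSize * N;
    const size_t bytes = count * sizeof(float);

    std::vector<float> hA(count, 1.0f), hB(count, 2.0f); // arbitrary fill values
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

#if COPY_INPUTS
    // According to the answer, these copies leave A and B in the L2 cache.
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
#endif

    AddVectors<<<gridSize, blockSize>>>(dA, dB, dC, N);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess ? 0 : 1;
}
```

Building once with -DCOPY_INPUTS=1 and once with -DCOPY_INPUTS=0, then profiling both binaries with the same nvprof command, should show the difference the answer describes. The exact counts will depend on the GPU and its L2 size.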