Why CUDA memory copy speed behaves like this, some

I consistently see a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2K to the constant memory of my device, process that data there, and read back only a single float value from the device.

My code follows the typical pattern for GPU computation:

    //allocate all the needed memory: pinned, device global
    for(int i = 0; i < 1000; i++)
    {
        //Do some heavy cpu logic (~0.005 ms long)
        cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
        my_kernel<<<128, 128>>>(output);
        //several other calls of different kernels
        cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
        // Do some logic with returned value
    }

I decided to measure the cost of the GPU memory operations alone with the following code (all kernel calls commented out, a cudaDeviceSynchronize call added):

    //allocate all the needed memory: pinned, device global
    for(int i = 0; i < 1000; i++)
    {
        //Do some heavy cpu logic (~0.001 ms long)
        cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
        cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
        cudaDeviceSynchronize();
        // Do some logic with returned value
    }

I measured the execution time of the whole loop and got ~0.05 s, i.e. 0.05 ms per iteration. The strange thing is that each additional memory operation (an extra cudaMemcpyToSymbolAsync or cudaMemcpyAsync call) adds less than 0.01 ms per call. That matches the measurements published here: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

That author also measured about 0.01 ms per transfer of a 1K block to the GPU. So where does the remaining 0.04 ms (0.05 - 0.01) of overhead come from? Any ideas? Maybe I should try this code on a newer card?

It seems to me that after cudaDeviceSynchronize and the CPU code, my GeForce drops into some power-saving mode or something like that.
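For anyone who wants to reproduce the per-iteration figure, here is a minimal sketch of a timing harness for the copy-only loop. It rests on several assumptions that are not in the post: wall-clock timing with std::chrono (the original timing method is not shown), a 2 KB constant buffer, float standing in for FLOAT_T, and a pinned buffer for the result (the post copies into a plain host variable). The function name `time_copy_loop` and the buffer names `pinned_in` and `pinned_out` are hypothetical.

```cuda
#include <chrono>
#include <cstring>
#include <cuda_runtime.h>

// Assumed size: 512 floats = 2 KB, within the ~1-2K range from the question.
__constant__ float const_dev_mem[512];

// Returns the average wall-clock time per iteration in milliseconds,
// or -1.0 if any CUDA call fails.
double time_copy_loop(int iterations)
{
    const size_t mem_size = sizeof(float) * 512;
    float *pinned_in = nullptr, *pinned_out = nullptr, *output = nullptr;

    bool ok = cudaMallocHost((void**)&pinned_in, mem_size) == cudaSuccess
           && cudaMallocHost((void**)&pinned_out, sizeof(float)) == cudaSuccess
           && cudaMalloc((void**)&output, sizeof(float)) == cudaSuccess
           && cudaMemset(output, 0, sizeof(float)) == cudaSuccess;

    double ms_per_iter = -1.0;
    if (ok) {
        std::memset(pinned_in, 0, mem_size);
        // Drain pending work so setup cost is not counted in the timed region.
        ok = cudaDeviceSynchronize() == cudaSuccess;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations && ok; ++i) {
            // Same two copies and synchronize as in the question's copy-only loop.
            ok = cudaMemcpyToSymbolAsync(const_dev_mem, pinned_in, mem_size, 0,
                                         cudaMemcpyHostToDevice) == cudaSuccess
              && cudaMemcpyAsync(pinned_out, output, sizeof(float),
                                 cudaMemcpyDeviceToHost) == cudaSuccess
              && cudaDeviceSynchronize() == cudaSuccess;
        }
        auto t1 = std::chrono::steady_clock::now();

        if (ok && iterations > 0) {
            ms_per_iter =
                std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
        }
    }

    if (pinned_in)  cudaFreeHost(pinned_in);
    if (pinned_out) cudaFreeHost(pinned_out);
    if (output)     cudaFree(output);
    return ms_per_iter;
}
```

Calling `time_copy_loop(1000)` mirrors the 1000-iteration loop above. Removing one of the two copies and timing again gives a rough split between per-call cost and fixed per-iteration cost.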

Tags: memory cuda overhead

1 answer

何必那么认真

#2 · 2019-07-04 00:39

I recommend increasing the number of threads you launch:

    //Use malloc() to allocate memory on CPU.
    //Change mem_size to the total memory to be transferred to GPU.
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128,2);
    dim3 dimGrid(64000,1);
    my_kernel<<<dimGrid, dimBlock>>>(output);
    //several other calls of different kernels
    //change size field to 1000*sizeof(FLOAT_T)
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value

If the code crashes (because of too many threads or too much GPU memory), fall back to a loop, but with fewer iterations.
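One reading of this suggestion is to batch many logical iterations into a single launch and fetch all results with one copy, so the fixed per-iteration cost is paid once per batch. Below is a minimal sketch under that assumption. The kernel `batched_kernel`, the helper `run_batched`, and the placeholder computation are all hypothetical and stand in for the real `my_kernel` work, which the question does not show. It only applies if an iteration does not depend on the value returned by the previous one.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block handles one logical iteration
// and thread 0 of the block writes that iteration's single result.
__global__ void batched_kernel(float *output, int batch)
{
    int item = blockIdx.x;
    if (item < batch && threadIdx.x == 0) {
        output[item] = 2.0f * item;  // placeholder for the real per-iteration work
    }
}

// Runs `batch` logical iterations in one launch and returns all results
// with a single device-to-host copy. Returns an empty vector on failure.
std::vector<float> run_batched(int batch)
{
    if (batch <= 0) return {};

    std::vector<float> host(batch);
    float *output = nullptr;
    if (cudaMalloc((void**)&output, batch * sizeof(float)) != cudaSuccess) return {};

    batched_kernel<<<batch, 128>>>(output, batch);
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err == cudaSuccess) {
        // cudaMemcpy blocks until the kernel has finished.
        err = cudaMemcpy(host.data(), output, batch * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(output);

    if (err != cudaSuccess) return {};
    return host;
}
```

If a batch is too large for the device, splitting it into a few smaller `run_batched` calls corresponds to the "loop, but with fewer iterations" fallback.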
