Nvidia offers an example of how to profile bandwidth between host and device; the code can be found here: https://developer.nvidia.com/opencl (search for "bandwidth"). The experiment is carried out on a 64-bit Ubuntu 12.04 machine. I am inspecting the pinned-memory, mapped-access mode, which can be tested by invoking: ./bandwidthtest --memory=pinned --access=mapped
The core test loop for host-to-device bandwidth is at around lines 736~748. I list it here with some comments and context code added:
//create a buffer cmPinnedData in host
cmPinnedData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, &ciErrNum);
....(initialize cmPinnedData with some data)....
//create a buffer in device
cmDevData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
// get pointer mapped to host buffer cmPinnedData
h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);
// get pointer mapped to device buffer cmDevData
void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
// copy data from host to device by memcpy
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    memcpy(dm_idata, h_data, memSize);
}
//unmap device buffer.
ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
The measured host-to-device bandwidth is 6430.0 MB/s with a 33.5 MB transfer size. When the transfer size is reduced to 1 MB via: ./bandwidthtest --memory=pinned --access=mapped --mode=range --start=1000000 --end=1000000 --increment=1000000 (with MEMCOPY_ITERATIONS raised from 100 to 10000 in case the timer is not precise enough), the reported bandwidth becomes 12540.5 MB/s.
We all know that the peak bandwidth of a PCIe x16 Gen2 interface is 8000 MB/s (16 lanes × 5 GT/s × 8b/10b encoding = 8000 MB/s), so I suspect there is a problem with the profiling method.
Let me restate the core profiling code:
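For reference, the reported number is essentially total bytes copied divided by elapsed time. Below is a minimal sketch of that calculation, assuming the sample's formula has this shape; the function name and the example figures are illustrative, not taken from the sample:

#include <stdio.h>

/* Bandwidth as I assume the sample computes it: bytes moved per second,
 * reported in MB/s with 1 MB = 2^20 bytes. Only the copy loop is timed. */
static double bandwidth_mb_per_s(size_t memSize, unsigned int iterations,
                                 double elapsedSeconds)
{
    return ((double)memSize * (double)iterations)
           / (elapsedSeconds * (double)(1 << 20));
}

int main(void)
{
    /* Illustrative numbers only: ~33.5 MB per copy, 100 iterations, 0.5 s elapsed. */
    printf("%.1f MB/s\n", bandwidth_mb_per_s(33500000, 100, 0.5));
    return 0;
}

This also explains why an imprecise timer can be compensated by raising MEMCOPY_ITERATIONS: the elapsed interval grows while the per-copy cost stays the same.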
// get pointer mapped to device buffer cmDevData
void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
// copy data from host to device by memcpy
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    memcpy(dm_idata, h_data, memSize);
    // can we launch a kernel here, right after the memcpy? I don't think so.
}
//unmap device buffer.
ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
I think the problem is that memcpy cannot guarantee that the data has actually been transferred to the device, because there is no explicit synchronization API inside the loop. So if we launch a kernel after the memcpy, the kernel may or may not see valid data.
If we do the map and unmap operations inside the profiling loop, I think we can launch a kernel safely after the unmap, because that operation guarantees the data has safely reached the device. The new code is given here:
// copy data from host to device by memcpy
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
    memcpy(dm_idata, h_data, memSize);
    // unmap device buffer
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
    // can we launch a kernel here safely?
}
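Regarding the comment inside the loop: the dependency could also be made explicit by taking an event from the unmap and having the kernel launch wait on it. A minimal sketch of what the loop body could look like, assuming an already-built kernel ckKernel and an illustrative work size (both hypothetical, not part of the sample):

// inside the loop, after the memcpy:
cl_event unmapEvt;
ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, &unmapEvt);
size_t gws = memSize; // illustrative global work size
// the (hypothetical) kernel launch waits explicitly for the unmap to complete
ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL, &gws, NULL, 1, &unmapEvt, NULL);
clReleaseEvent(unmapEvt);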
But with this new profiling method, the reported bandwidth becomes very low: 915.2 MB/s at a 33.5 MB block size and 881.9 MB/s at a 1 MB block size. The overhead of the map and unmap operations does not appear to be as negligible as "zero-copy" suggests.
This map-unmap approach is even much slower than the 2909.6 MB/s at a 33.5 MB block size obtained the normal way with clEnqueueWriteBuffer():
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_TRUE, 0, memSize, h_data, 0, NULL, NULL);
    clFinish(cqCommandQueue);
}
So my final question is: what is the correct and most efficient way to use the mapped (zero-copy) mechanism in Nvidia's OpenCL environment?
Following @DarkZeros's suggestion, I did more tests on the map-unmap method.
Method 1 is exactly @DarkZeros's method:
//create N buffers in the device
for(int i = 0; i < MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
// get pointers mapped to the device buffers cmDevData
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i = 0; i < MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
// measure the START TIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);
    // unmap device buffer
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
// measure the END TIME
The above method achieves 1900 MB/s, still significantly lower than the normal write-buffer method. More importantly, it is not representative of a realistic host-to-device transfer, because the map operation sits outside the profiling interval, so the profiled block cannot be run repeatedly on its own. If we want to reuse the profiled block as a sub-function that transfers data, we have to map again before every call (since the unmap is inside the sub-function), so the map operation should be counted in the profiling interval.
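To make this concrete, the kind of reusable transfer helper I have in mind would look roughly like the sketch below (transfer_block is a hypothetical name, not from the sample); every call has to pay for the map as well as the memcpy and the unmap:

#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper illustrating the argument above: each call must map,
 * copy and unmap, so the map cost is paid on every transfer. */
static cl_int transfer_block(cl_command_queue queue, cl_mem devBuf,
                             const unsigned char* src, size_t size)
{
    cl_int err;
    void* dst = clEnqueueMapBuffer(queue, devBuf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;
    memcpy(dst, src, size);
    return clEnqueueUnmapMemObject(queue, devBuf, dst, 0, NULL, NULL);
}

So I did a second test with the map operation inside the profiling interval: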
//create N buffers in the device
for(int i = 0; i < MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
void* dm_idata[MEMCOPY_ITERATIONS];
// measure the START TIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointer mapped to device buffer cmDevData[i]
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);
    // unmap device buffer
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
// measure the END TIME
This gives 980 MB/s, essentially the same result as before. It seems that Nvidia's OpenCL implementation can hardly achieve the same data-transfer performance as CUDA.