What's the correct and most efficient way to use mapped (zero-copy) memory in the Nvidia OpenCL environment?

Posted 2020-06-30 04:26

Question:

Nvidia offers an example of how to profile the bandwidth between host and device; you can find the code here: https://developer.nvidia.com/opencl (search for "bandwidth"). The experiment is carried out on a 64-bit Ubuntu 12.04 computer. I am inspecting the pinned-memory, mapped-access mode, which can be tested by invoking: ./bandwidthtest --memory=pinned --access=mapped

The core host-to-device bandwidth test loop is at around lines 736~748. I also list it here with some comments and context code:

    //create a buffer cmPinnedData in host
    cmPinnedData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, &ciErrNum);

    ....(initialize cmPinnedData with some data)....

    //create a buffer in device
    cmDevData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

    // get pointer mapped to host buffer cmPinnedData
    h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        memcpy(dm_idata, h_data, memSize);
    }
    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

The measured host-to-device bandwidth is 6430.0MB/s when the transfer size is 33.5MB. When the transfer size is reduced to 1MB by: ./bandwidthtest --memory=pinned --access=mapped --mode=range --start=1000000 --end=1000000 --increment=1000000 (MEMCOPY_ITERATIONS is changed from 100 to 10000 in case the timer is not precise enough), the reported bandwidth becomes 12540.5MB/s.

We all know that the peak bandwidth of a PCIe x16 Gen2 interface is 8000MB/s (16 lanes × 500MB/s per lane). So I suspect there is some problem with the profiling method.

Let's look at the core profiling code again:

    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        memcpy(dm_idata, h_data, memSize);
        //can we call kernel after memcpy? I don't think so.
    }
    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

I think the problem is that memcpy can't guarantee that the data has really been transferred to the device, because there isn't any explicit synchronization API inside the loop. So if we try to call a kernel after the memcpy, the kernel may or may not get valid data.

If we do the map and unmap operations inside the profiling loop, I think we can call a kernel safely after the unmap operation, because the unmap guarantees that the data has safely arrived in the device. The new code is given here:

// copy data from host to device by memcpy
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    memcpy(dm_idata, h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

    //we can call kernel here safely?
}
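
To make the comment inside the loop concrete, here is a rough sketch of how a kernel could be enqueued after the unmap, using the event returned by clEnqueueUnmapMemObject. This is not part of the bandwidth test; ckKernel and szGlobalWorkSize are hypothetical names for an already prepared kernel and its global work size (a size_t).

    cl_event unmap_done;
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, &unmap_done);

    // The kernel waits on the unmap event, so it only runs once the data is on the device.
    // (With an in-order queue this ordering is implicit, but the event makes it explicit.)
    ciErrNum = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &cmDevData);
    ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
                                      &szGlobalWorkSize, NULL, 1, &unmap_done, NULL);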

But if we use this new profiling method, the reported bandwidth becomes very low: 915.2MB/s at a 33.5MB block size and 881.9MB/s at a 1MB block size. The overhead of the map and unmap operations does not seem to be as small as "zero-copy" suggests.

This map-unmap approach is even much slower than the 2909.6MB/s (at a 33.5MB block size) obtained the normal way with clEnqueueWriteBuffer():

    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_TRUE, 0, memSize, h_data, 0, NULL, NULL);
        clFinish(cqCommandQueue);
    }

So, my final question is: what is the correct and most efficient way to use the mapped (zero-copy) mechanism in the Nvidia OpenCL environment?

Following @DarkZeros's suggestion, I did more tests on the map-unmap method.

Method 1 is exactly @DarkZeros's method:

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

// get pointers mapped to device buffers cmDevData
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
//Measure the ENDTIME

The above method got 1900MB/s. That is still significantly lower than the normal write-buffer method. More importantly, this method is not really close to a realistic host-to-device scenario, because the map operation sits outside the profiling interval, so the profiling interval cannot be run many times. If we want to run the profiling interval many times, the map operation has to go inside it: if the profiling interval/block is to be used as a sub-function that transfers data, we have to do a map operation before every call of that sub-function (because there is an unmap inside it), so the map operation should be counted in the profiling interval. (A sketch of such a sub-function is shown after the second test below.) So I did the second test:

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

void* dm_idata[MEMCOPY_ITERATIONS];

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointers mapped to device buffers cmDevData
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
//Measure the ENDTIME

And this generates 980MB/s, about the same as the previous result. It seems that Nvidia's OpenCL implementation can hardly achieve the same data-transfer performance as CUDA.
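
For clarity, this is roughly the kind of transfer sub-function I mean; it is only a sketch with a hypothetical name (transfer_to_device), not code from the NVIDIA sample:

    // Hypothetical helper: every call has to map the device buffer again,
    // because the previous call left it unmapped.
    void transfer_to_device(cl_command_queue q, cl_mem dev_buf,
                            const unsigned char* src, size_t size)
    {
        cl_int err;
        void* p = clEnqueueMapBuffer(q, dev_buf, CL_TRUE, CL_MAP_WRITE,
                                     0, size, 0, NULL, NULL, &err);
        memcpy(p, src, size);
        err = clEnqueueUnmapMemObject(q, dev_buf, p, 0, NULL, NULL);
    }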

Answer 1:

The first thing to note here is that OpenCL does not allow pinned zero-copy (it is available in 2.0, but not yet ready to use). This means you will have to perform a copy to GPU memory anyway.

There are 2 ways to perform the memory copy:

  1. clEnqueueWriteBuffer()/clEnqueueReadBuffer(): These perform a direct copy between a host-side pointer and an OpenCL memory object in the context (typically located on the device). The efficiency is high, but they may not be efficient for small amounts of bytes.

  2. clEnqueueMapBuffer()/clEnqueueUnmapMemObject(): These calls first map a device memory region into the host memory space; the map produces a 1:1 copy of the memory. After the map you can work on that memory with memcpy() or other approaches, and when you finish editing it you call the unmap, which transfers the memory back to the device. Typically this option is faster, since OpenCL gives you the pointer when you map and you are likely already writing into the host-side cache of the context. The drawback is that when you call map, the memory transfer occurs the other way around (GPU->host).

EDIT: In this last case, if you map with the CL_MAP_WRITE flag (write-only mapping), it probably does NOT trigger a device-to-host copy on the map operation. The same applies to a read-only mapping (CL_MAP_READ), which will NOT trigger a host-to-device copy on the unmap.
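
As a rough illustration of this flag behaviour (a sketch using placeholder names queue, buf, host_src, host_dst, size and err, not code from the sample):

    // Write-only mapping: the driver does not have to give you the buffer's
    // current contents, so ideally no device->host copy happens on the map.
    void* p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);
    memcpy(p, host_src, size);                              // fill it on the host
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);  // host->device copy happens here

    // Read-only mapping: the map triggers a device->host copy,
    // but the unmap should not copy anything back to the device.
    void* r = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                 0, size, 0, NULL, NULL, &err);
    memcpy(host_dst, r, size);                              // read the results on the host
    clEnqueueUnmapMemObject(queue, buf, r, 0, NULL, NULL);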

In your example, it is clear that the Map/Unmap approach should be faster. However, if you do the memcpy() inside a loop without calling unmap, you are effectively NOT copying anything to the device side. If you put the map/unmap pair inside the loop, the performance is going to decrease, and if the buffer size is small (1MB) the transfer rates will be very poor. The same happens in the Write/Read case if you perform the writes in a for loop with small sizes.

In general, you should not use 1MB sizes, since the per-call overhead will be very high in that case (unless you queue many write calls in a non-blocking mode).

PS: My personal recommendation is to simply use the normal Write/Read, since the difference is not noticeable for most common uses, especially with overlapped I/O and kernel executions. But if you really need the performance, use map/unmap, or pinned memory with Read/Write; it should give 10-30% better transfer rates.
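
For example, a minimal sketch (reusing the identifiers from the question) of pinned memory with Write in non-blocking mode: h_data is the pointer obtained by mapping the pinned buffer cmPinnedData, the writes are queued back to back, and everything is synchronized once at the end.

    // Queue many non-blocking writes from the pinned host pointer,
    // then wait once for all of them to finish.
    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
        clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_FALSE, 0, memSize,
                             h_data, 0, NULL, NULL);
    clFinish(cqCommandQueue);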


EDIT: Regarding the behaviour you are experiencing: after examining the nVIDIA code I can explain it. The problem you see is mainly caused by the blocking and non-blocking calls that "hide" the overheads of the OpenCL calls.

The first code: (nVIDIA)

  • Queues a BLOCKING map once.
  • Then performs many memcpys (but only the last one will actually end up on the GPU side).
  • Then unmaps it in a non-blocking manner.
  • Reads the result without a clFinish().

This code example is WRONG! It is not really measuring the HOST-GPU copy speed, because the memcpy() does not ensure a GPU copy and because a clFinish() is missing. That's why you even see speeds over the limit.

The second code: (yours)

  • Queues a BLOCKING map many times in a loop.
  • Then performs one memcpy() for each map.
  • Then unmaps it in a non-blocking manner.
  • Reads the result without a clFinish().

Your code only lacks the clFinish(). However, since the map in the loop is blocking, the results are almost correct. But the GPU is idle until the CPU gets around to the next iteration, so you are seeing unrealistically low performance.

The Write/Read code: (nVIDIA)

  • Queues a non-blocking write many times.
  • Reads the result with a clFinish().

This code does the copies properly, in parallel, and you are seeing the real bandwidth here.

In order to convert the map example into something comparable to the Write/Read case, you should do it like this (this is without pinned memory):

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

// get pointers mapped to device buffers cmDevData
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);

//Measure the ENDTIME

You can't reuse the same buffer in the mapped case, because otherwise you would block after each iteration, and the GPU would sit idle until the CPU requeues the next copy job.
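
If allocating MEMCOPY_ITERATIONS buffers is not acceptable, one possible compromise is to double-buffer: ping-pong between two buffers, each with its own in-order queue, so the CPU fills one buffer while the GPU transfers the other. This is only a sketch of the idea, not code from the sample; cdDevice stands for the device that cqCommandQueue was created with.

    cl_mem bufs[2];
    cl_command_queue queues[2];
    void* ptrs[2];
    for (int b = 0; b < 2; b++)
    {
        bufs[b]   = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);
        queues[b] = clCreateCommandQueue(cxGPUContext, cdDevice, 0, &ciErrNum);
    }

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        int b = i % 2;
        // The blocking map only waits for this buffer's previous transfer
        // (its own queue), so the other buffer's transfer overlaps with the memcpy.
        ptrs[b] = clEnqueueMapBuffer(queues[b], bufs[b], CL_TRUE, CL_MAP_WRITE,
                                     0, memSize, 0, NULL, NULL, &ciErrNum);
        memcpy(ptrs[b], h_data, memSize);
        ciErrNum = clEnqueueUnmapMemObject(queues[b], bufs[b], ptrs[b], 0, NULL, NULL);
        clFlush(queues[b]);   // make sure the transfer is actually submitted
    }
    clFinish(queues[0]);
    clFinish(queues[1]);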