Effect of using pageable memory for asynchronous memory copies

Posted 2020-05-23 15:38

Question:

In the CUDA C Best Practices Guide, Version 5.0, Section 6.1.2, it is written that:

In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID.
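(For reference, "pinned host memory" here means host memory allocated through the CUDA runtime rather than with new or malloc. A minimal sketch using the standard cudaMallocHost/cudaFreeHost calls, with bytes assumed to be defined as below:)

float* hPinned = nullptr;
cudaMallocHost(&hPinned, bytes);   // pinned (page-locked) allocation
// ... use hPinned as the host side of cudaMemcpyAsync calls ...
cudaFreeHost(hPinned);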

The quoted passage implies that cudaMemcpyAsync should fail if I pass it ordinary pageable memory allocated with new or malloc.

But this is not what happened.

Just for testing purposes, I tried the following program:

Kernel:

// Adds 1.0f to each of the n elements of src and stores the result in dst.
__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid<n)
        dst[tid] = src[tid] + 1.0f;
}

Main:

#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;

    const int n = 1000;

    size_t bytes = n * sizeof(float);

    cudaStream_t str1, str2;

    // Pageable host allocations: plain new[], not pinned with cudaMallocHost.
    hPtr1 = new float[n];
    hPtr2 = new float[n];

    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1,bytes);
    cudaMalloc<float>(&dPtr2,bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice,str1);
    kernel_increment<<<grid,block,0,str2>>>(dPtr1,dPtr2,n);
    cudaMemcpyAsync(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost,str1);

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);

    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for(int i=0; i<n; i++)
        std::cout<<hPtr2[i]<<std::endl;

    delete[] hPtr1;
    delete[] hPtr2;

    return 0;
}

The program gave the correct output: the array was incremented successfully.

How did cudaMemcpyAsync execute without page-locked memory? Am I missing something here?

Answer 1:

cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
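For example, the host-side difference looks like this (a minimal sketch; dst, src, bytes and stream are assumed to be set up elsewhere):

cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);              // blocks the host until the copy completes
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream); // returns immediately
// ... host is free to do other work here ...
cudaStreamSynchronize(stream);                                    // host blocks here instead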

Optionally, if the call is launched into a non-default stream, if the host memory is a pinned allocation, and if the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions is not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, i.e. it serialises operations on the GPU, and no simultaneous copy/kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
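As an illustration, here is a sketch that satisfies all of those conditions (pinned host memory, non-default streams; the buffer and chunk names are hypothetical), so that the copy issued in one stream may overlap the kernel running in the other:

const int CHUNK = n / 2;                      // split the work in two halves
float *hPinned, *dBuf;
cudaMallocHost(&hPinned, n * sizeof(float));  // pinned host allocation
cudaMalloc(&dBuf, n * sizeof(float));

cudaStream_t s[2];
for(int i=0; i<2; i++)
    cudaStreamCreate(&s[i]);

for(int i=0; i<2; i++)
{
    float* h = hPinned + i * CHUNK;
    float* d = dBuf    + i * CHUNK;
    cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    kernel_increment<<<(CHUNK + 255)/256, 256, 0, s[i]>>>(d, d, CHUNK);
}
cudaDeviceSynchronize();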

In your example code, the host source and destination memory are not pinned, so the memory transfers cannot overlap with kernel execution (i.e. they serialise operations on the GPU). The calls are still asynchronous on the host, so what you have is functionally equivalent to:

cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);

with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.

This is absolutely expected behaviour.
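One way to observe this is to time the issue of the calls separately from the synchronisation (a sketch reusing the variables from the question; requires #include <chrono> in addition to the question's includes). Which call the host thread actually blocks in can then be read off the two durations:

auto t0 = std::chrono::steady_clock::now();
cudaMemcpyAsync(dPtr1, hPtr1, bytes, cudaMemcpyHostToDevice, str1);
kernel_increment<<<grid, block, 0, str2>>>(dPtr1, dPtr2, n);
cudaMemcpyAsync(hPtr2, dPtr2, bytes, cudaMemcpyDeviceToHost, str1);
auto t1 = std::chrono::steady_clock::now();   // time spent issuing the calls
cudaDeviceSynchronize();                      // host blocks here until the GPU is done
auto t2 = std::chrono::steady_clock::now();

std::cout << "issue: " << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms, "
          << "sync: "  << std::chrono::duration<double, std::milli>(t2 - t1).count() << " ms\n";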