I have tried to overlap kernel execution with cudaMemcpyAsync transfers, but it doesn't work. I follow all the recommendations in the programming guide: pinned memory, different streams, etc. I see that kernel executions do overlap with each other, but they do not overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?
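To confirm what the runtime itself reports for the copy engine and concurrent kernel support mentioned above, a minimal sketch along these lines may help. It only reads the asyncEngineCount and concurrentKernels fields of cudaDeviceProp for every CUDA-capable device; it does not by itself prove that overlap will happen on a given driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only CUDA-capable devices are enumerated here.
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, d);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", d, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
        // Number of asynchronous copy engines: 0 means no copy/kernel overlap,
        // 1 means one transfer direction at a time can overlap with kernels.
        printf("  asyncEngineCount  = %d\n", prop.asyncEngineCount);
        // Non-zero if the device can run several kernels concurrently.
        printf("  concurrentKernels = %d\n", prop.concurrentKernels);
    }
    return 0;
}
```

On a machine with more than one GPU it is also worth checking that the device index printed here is the one the application actually selects.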
It seems the "copy engine" and "execution engine" always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, Kernel, DtoH]. If I issue the HtoD x2, Kernel, DtoH series stream by stream (depth first), the profiler shows that the first HtoD operation of stream 2 does not start until the first DtoH operation ends. If I instead issue the first HtoD on every stream, then the second HtoD, then the kernels and then the DtoH (breadth first), I see no overlap either, and the issue order is again enforced by the GPU.
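For reference, the two issue orders described above can be sketched roughly as follows. This is not the original application: addKernel, the buffer names, and the sizes are hypothetical stand-ins, and the timing relies on the legacy default stream synchronizing with streams created by cudaStreamCreate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Leave main() with a message on any CUDA runtime error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

const int kStreams = 4;   // four streams, as in the question
const int kN = 1 << 20;   // elements per stream (assumed size)

// Hypothetical stand-in for the real kernel: c = a + b.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Element offset of stream i's slice inside each buffer.
static size_t off(int i) { return (size_t)i * kN; }

int main() {
    const size_t bytes = kN * sizeof(float);
    const cudaMemcpyKind H2D = cudaMemcpyHostToDevice, D2H = cudaMemcpyDeviceToHost;
    float *hA, *hB, *hC, *dA, *dB, *dC;
    // Pinned host memory, so cudaMemcpyAsync can run asynchronously.
    CHECK(cudaMallocHost(&hA, kStreams * bytes));
    CHECK(cudaMallocHost(&hB, kStreams * bytes));
    CHECK(cudaMallocHost(&hC, kStreams * bytes));
    CHECK(cudaMalloc(&dA, kStreams * bytes));
    CHECK(cudaMalloc(&dB, kStreams * bytes));
    CHECK(cudaMalloc(&dC, kStreams * bytes));
    for (int i = 0; i < kStreams * kN; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    cudaStream_t s[kStreams];
    for (int i = 0; i < kStreams; ++i) CHECK(cudaStreamCreate(&s[i]));
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    const int threads = 256, blocks = (kN + threads - 1) / threads;

    for (int breadth = 0; breadth < 2; ++breadth) {
        CHECK(cudaEventRecord(start, 0));
        if (!breadth) {
            // Depth first: issue the whole chain for one stream, then the next.
            for (int i = 0; i < kStreams; ++i) {
                CHECK(cudaMemcpyAsync(dA + off(i), hA + off(i), bytes, H2D, s[i]));
                CHECK(cudaMemcpyAsync(dB + off(i), hB + off(i), bytes, H2D, s[i]));
                addKernel<<<blocks, threads, 0, s[i]>>>(dA + off(i), dB + off(i), dC + off(i), kN);
                CHECK(cudaMemcpyAsync(hC + off(i), dC + off(i), bytes, D2H, s[i]));
            }
        } else {
            // Breadth first: issue one stage on every stream before the next stage.
            for (int i = 0; i < kStreams; ++i)
                CHECK(cudaMemcpyAsync(dA + off(i), hA + off(i), bytes, H2D, s[i]));
            for (int i = 0; i < kStreams; ++i)
                CHECK(cudaMemcpyAsync(dB + off(i), hB + off(i), bytes, H2D, s[i]));
            for (int i = 0; i < kStreams; ++i)
                addKernel<<<blocks, threads, 0, s[i]>>>(dA + off(i), dB + off(i), dC + off(i), kN);
            for (int i = 0; i < kStreams; ++i)
                CHECK(cudaMemcpyAsync(hC + off(i), dC + off(i), bytes, D2H, s[i]));
        }
        CHECK(cudaGetLastError());  // catch kernel launch errors
        // Recorded on the legacy default stream, so it waits for the work
        // queued above in the (blocking) streams s[0..3].
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%s first: %.3f ms\n", breadth ? "breadth" : "depth", ms);
    }

    for (int i = 0; i < kStreams; ++i) CHECK(cudaStreamDestroy(s[i]));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dA)); CHECK(cudaFree(dB)); CHECK(cudaFree(dC));
    CHECK(cudaFreeHost(hA)); CHECK(cudaFreeHost(hB)); CHECK(cudaFreeHost(hC));
    return 0;
}
```

Running a small reproducer like this under the profiler, with nothing else in the process, makes it easier to tell whether the serialization comes from the issue order or from the platform.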
I have also tried the simpleStreams example from the CUDA SDK and I see the same behavior.
I attach some screen captures showing the issue in both the Visual Profiler and Nsight for VS2008.
P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.
Simple Streams Visual Profiler
MyApp Nsight timeline breadth first
MyApp Nsight timeline depth first
edit:
Adding 4 extra kernels (for a total of 2 HtoD, 5 kernels, 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see a performance improvement of 7.5% (measured from the command line)!
System specification:
- Windows 7
- NVIDIA 6800 VGA in first PCI-E slot
- GTX480 in second PCI-E slot
- NVIDIA Driver: 306.94
- Visual studio 2008
- CUDA v5.0
- Visual Profiler 5.0
- Nsight 3.0
Read below for the other (older) steps I followed until I came to the solution above, and for some other possible causes.
I was recently able to partially solve this problem! It is specific to Windows and Aero, I think. Please try these steps and post your results to help others! I have tried it on a GTX 650 and a GT 640.
This will disable Aero and almost all visual effects. If this configuration works, you can try re-enabling the visual-effect boxes one by one until you find the precise one that causes the problem!
This works like the above, but with more visual options enabled. For my two devices this setting also works, so I kept it.
For me, it solved the problem in most cases (a tiled dgemm I wrote), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency. I will try to find a less radical way of solving this problem; maybe restoring just the registry will be enough.
As I said in my comment, there is indeed a BUG in the CUDA drivers, and it keeps streaming from working with my setup. I have tested a compute capability 1.1 card (8800 GTS) and a compute capability 3.5 card (GTX Titan), and both cards work fine. It seems there is a problem with some Fermi cards (my GTX 480 does not work).
I just ran into the same problem. I agree with you that there is a BUG. I think the bug is either in the CUDA driver for Windows or in Windows itself. I have tested my code and it works well (with overlapping) on Linux.
In fact, you can test the "simpleStreams" example in the SDK. I found that "simpleStreams" running on Windows shows no overlap between kernel and memory copy at all, but on Linux it works perfectly.
I am using CUDA 5.0 and a Fermi GTX570. Given your tests on the 8800GT and GTX Titan, I would agree it is a bug in the CUDA driver for Windows. Hopefully it will be fixed soon.