Why am I failing to overlap data transfers and com

I am trying to overlap kernel execution with cudaMemcpyAsync transfers, but it doesn't work. I follow all the recommendations in the programming guide (pinned memory, different streams, etc.). I can see kernel executions overlapping each other, but they never overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?
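As a first sanity check on the hardware side, the device properties can be queried to see what the runtime itself reports. The following is only a minimal sketch: the helper name canOverlapCopyWithKernel is made up for illustration, while asyncEngineCount, concurrentKernels and tccDriver are fields of cudaDeviceProp. It says nothing about what the driver actually does at run time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the device reports at least one
// asynchronous copy engine, i.e. a copy may run alongside a kernel.
bool canOverlapCopyWithKernel(const cudaDeviceProp& prop) {
    return prop.asyncEngineCount >= 1;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("device %d: %s (compute %d.%d)\n", dev, prop.name, prop.major, prop.minor);
        printf("  asyncEngineCount  = %d\n", prop.asyncEngineCount);
        printf("  concurrentKernels = %d\n", prop.concurrentKernels);
        // 0 on a GeForce card under WDDM, 1 for a Tesla device using the TCC driver.
        printf("  tccDriver         = %d\n", prop.tccDriver);
        printf("  copy/kernel overlap reported: %s\n",
               canOverlapCopyWithKernel(prop) ? "yes" : "no");
    }
    return 0;
}
```

If asyncEngineCount is 1, the device claims it can run one copy concurrently with a kernel; a value of 2 means copies in both directions can run at once.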

It seems that the "copy engine" and the "execution engine" always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, Kernel, DtoH]. If I issue the HtoD x2, Kernel, DtoH series one stream at a time (depth first), the profiler shows that the first HtoD operation of stream 2 does not start until the first DtoH operation ends. If I issue the first HtoD on every stream, then the second HtoD, then the kernels, and then the DtoH copies (breadth first), I see no overlap, and the issue order is again enforced by the GPU.
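For readers who want to reproduce the two issue orders, here is a minimal sketch of what they might look like. It is not the original application: the kernel addKernel, the function runPipeline, and the sizes kStreams and kN are placeholders chosen for illustration, and error handling simply bails out without cleanup. Which order overlaps better is generally understood to depend on the number of copy engines, so it is worth profiling both.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

const int kStreams = 4;   // 4 streams, as in the question
const int kN = 1 << 20;   // elements per stream (arbitrary choice)

__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Issues [HtoD x2, kernel, DtoH] on each stream, depth first or breadth first.
// Returns the number of wrong output elements, or -1 on a CUDA error.
int runPipeline(bool breadthFirst) {
    const size_t bytes = kN * sizeof(float);
    float *hA, *hB, *hC, *dA, *dB, *dC;
    // Pinned host memory is required for truly asynchronous copies.
    CHECK(cudaMallocHost((void**)&hA, kStreams * bytes));
    CHECK(cudaMallocHost((void**)&hB, kStreams * bytes));
    CHECK(cudaMallocHost((void**)&hC, kStreams * bytes));
    CHECK(cudaMalloc((void**)&dA, kStreams * bytes));
    CHECK(cudaMalloc((void**)&dB, kStreams * bytes));
    CHECK(cudaMalloc((void**)&dC, kStreams * bytes));
    for (int i = 0; i < kStreams * kN; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; hC[i] = 0.0f; }

    cudaStream_t s[kStreams];
    for (int k = 0; k < kStreams; ++k) CHECK(cudaStreamCreate(&s[k]));
    const int threads = 256, blocks = (kN + threads - 1) / threads;

    if (!breadthFirst) {
        // Depth first: all operations of stream 0, then stream 1, and so on.
        for (int k = 0; k < kStreams; ++k) {
            size_t off = (size_t)k * kN;
            CHECK(cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, s[k]));
            CHECK(cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, s[k]));
            addKernel<<<blocks, threads, 0, s[k]>>>(dA + off, dB + off, dC + off, kN);
            CHECK(cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, s[k]));
        }
    } else {
        // Breadth first: the same operation on every stream before the next one.
        for (int k = 0; k < kStreams; ++k)
            CHECK(cudaMemcpyAsync(dA + (size_t)k * kN, hA + (size_t)k * kN, bytes, cudaMemcpyHostToDevice, s[k]));
        for (int k = 0; k < kStreams; ++k)
            CHECK(cudaMemcpyAsync(dB + (size_t)k * kN, hB + (size_t)k * kN, bytes, cudaMemcpyHostToDevice, s[k]));
        for (int k = 0; k < kStreams; ++k)
            addKernel<<<blocks, threads, 0, s[k]>>>(dA + (size_t)k * kN, dB + (size_t)k * kN, dC + (size_t)k * kN, kN);
        for (int k = 0; k < kStreams; ++k)
            CHECK(cudaMemcpyAsync(hC + (size_t)k * kN, dC + (size_t)k * kN, bytes, cudaMemcpyDeviceToHost, s[k]));
    }
    CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (int i = 0; i < kStreams * kN; ++i) if (hC[i] != 3.0f) ++wrong;

    for (int k = 0; k < kStreams; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return wrong;
}

int main() {
    printf("depth-first   wrong elements: %d\n", runPipeline(false));
    printf("breadth-first wrong elements: %d\n", runPipeline(true));
    return 0;
}
```

Running each variant under the profiler should show whether the copies and kernels of different streams actually overlap on a given setup. The sketch only checks that the results are correct; it does not measure overlap itself.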

I also tried the simpleStreams example from the CUDA SDK and I see the same behavior.

I attach some screen captures showing the issue in both the Visual Profiler and Nsight for VS2008.

P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.

[Screenshot: simpleStreams in the Visual Profiler]

[Screenshot: MyApp Nsight timeline, breadth first]

[Screenshot: MyApp Nsight timeline, depth first]

Edit:

Adding 4 extra kernels (for a total of 2 HtoD, 5 kernels and 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see a performance improvement (measured from the command line) of 7.5%!

System specification:

Windows 7
NVIDIA 6800 VGA in first PCI-E slot
GTX480 in second PCI-E slot
NVIDIA Driver: 306.94
Visual studio 2008
CUDA v5.0
Visual Profiler 5.0
Nsight 3.0

Tags: concurrency cuda overlapping nsight

3 Answers

萌系小妹纸

Reply #2 · 2019-02-22 12:22

TL;DR: The issue is caused by the WDDM TDR options in Nsight Monitor! When the "enabled" option is set to false, the issue appears. If you instead set "enabled" to true and the TDR delay to a very high number, the issue goes away. Please also try the steps described below, which address more common causes, because they are related to the problem too!

Read on for the other (older) steps I followed before arriving at the solution above, and for some other possible causes.

I was recently able to partially solve this problem! I think it is specific to Windows and Aero. Please try these steps and post your results to help others! I have tried them on a GTX 650 and a GT 640.

Before you do anything, consider using the onboard GPU for display and the discrete GPU for computation, because there are verified issues with the NVIDIA driver for Windows! When you use the onboard GPU for display, the driver is not fully loaded, so many bugs are avoided. The system also stays responsive while you work!

1. Make sure your concurrency problem is not caused by something else, such as old drivers (including the BIOS), wrong code, an incapable device, etc.
2. Go to Computer > Properties.
3. Select "Advanced system settings" on the left side.
4. Go to the Advanced tab.
5. Under Performance, click Settings.
6. In the Visual Effects tab, select the "Adjust for best performance" option.

This will disable Aero and almost all visual effects. If this configuration works, you can try re-enabling the visual effect checkboxes one by one until you find the exact one that causes the problem!

Alternatively, you can:

1. Right-click on the desktop and select Personalize.
2. Select one of the basic themes, which do not use Aero.

This works like the steps above, but leaves more visual options enabled. It also works for my two devices, so I kept this setting.

When you try these solutions, please come back here and post your findings!

For me, it solved the problem in most cases (a tiled DGEMM I wrote), but note that I still can't run "simpleStreams" properly and achieve concurrency...

UPDATE: The problem is fully solved with a fresh Windows installation! The previous steps improved the behavior in some cases, but a fresh install solved all the problems!

I will try to find a less radical way of solving this problem; maybe restoring just the registry will be enough.


成全新的幸福

Reply #3 · 2019-02-22 12:33

As I said in my comment, there is indeed a bug in the CUDA drivers that prevents streaming from working with my setup. I have tested a compute capability 1.1 card (8800 GTS) and a compute capability 3.5 card (GTX Titan), and both work fine. There seems to be a problem with some Fermi cards (my GTX 480 does not work).


小情绪 Triste *

Reply #4 · 2019-02-22 12:40

I just ran into the same problem. I agree with you that there is a bug. I think the bug is either in the CUDA driver for Windows or in Windows itself. I have tested my code on Linux and it works well there (with overlapping).

In fact, you can test this with the "simpleStreams" example in the SDK. I found that "simpleStreams" shows no overlap at all between kernels and memory copies when running on Windows, but it works perfectly on Linux.

I am using CUDA 5.0 and a Fermi GTX 570. Given your tests on the 8800GT and GTX Titan, I would agree that it is a bug in the CUDA driver for Windows. Hopefully it will be fixed soon.
