Stream scheduling order

Posted 2019-06-11 15:49

Question:

As I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

allOfData_A = data_A1 + data_A2
allOfData_B = data_B1 + data_B2
allOfData_C = data_C1 + data_C2
data_C is the output of a kernel that operates on data_A and data_B (for example, C = A + B).
The hardware supports one DeviceOverlap operation (one copy running concurrently with kernel execution).

Process One:

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream1
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Process Two: (Same operation, different order)

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_B1 stream1 H->D
sameKernel stream1
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Answer 1:

Using CUDA streams allows the programmer to express work dependencies by putting dependent operations in the same stream. Work in different streams is independent and can be executed concurrently.

On GPUs without HyperQ (compute capability 1.0 to 3.0) you can get false dependencies, because all the work for a DMA engine, or all the compute work, is funneled into a single hardware queue, so an operation can end up waiting behind unrelated work from another stream. Compute capability 3.5 introduces HyperQ, which provides multiple hardware queues, so you should not see these false dependencies there. The simpleHyperQ sample illustrates this, and the documentation has diagrams that show what is going on more clearly.
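To see which case applies to a particular card, the device properties can be queried at run time. The following is only a sketch: `hasHyperQ` and `reportDevice` are hypothetical helper names, and the 3.5 threshold is simply the one stated in the answer above, not something the runtime reports directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: applies the "compute capability 3.5 and later" rule
// quoted in the answer. This is an assumption, not a runtime-reported flag.
bool hasHyperQ(int major, int minor) {
    return major > 3 || (major == 3 && minor >= 5);
}

// Hypothetical helper: prints the scheduling-related properties of one device.
// Returns 0 on success, 1 if the properties could not be read.
int reportDevice(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;

    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // Number of copy engines that can run alongside kernel execution.
    std::printf("  asyncEngineCount:      %d\n", prop.asyncEngineCount);
    // Non-zero if kernels from different streams may run at the same time.
    std::printf("  concurrentKernels:     %d\n", prop.concurrentKernels);
    std::printf("  suggested issue order: %s\n",
                hasHyperQ(prop.major, prop.minor) ? "depth-first or breadth-first"
                                                  : "breadth-first");
    return 0;
}
```

Calling `reportDevice(0)` from `main` prints the values for the first device. An `asyncEngineCount` of 1 would correspond to the single overlap operation described in the question.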

Put simply, on devices without HyperQ you need to issue your work breadth-first to get maximum concurrency, whereas on devices with HyperQ you can issue it depth-first. In the question, Process One is the breadth-first ordering and Process Two is mostly depth-first, so on hardware without HyperQ the two need not take the same time. Avoiding the false dependencies is pretty easy, but not having to worry about them is easier!
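The two issue orders can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the asker's actual program: `addKernel` stands in for `sameKernel`, `runPipeline` and `CHECK` are hypothetical names, and the data is split into two equal halves. Note that the depth-first branch here also issues each D->H copy inside its own stream's chain, whereas Process Two in the question defers both D->H copies to the end.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Return 1 from the enclosing function on any CUDA error (sketch: no cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical stand-in for "sameKernel": element-wise C = A + B.
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Issues the two-stream pipeline in the chosen order and verifies the result.
// Returns 0 on success, 1 on a CUDA error or a wrong value.
int runPipeline(bool breadthFirst, int n) {
    assert(n > 0 && n % 2 == 0);  // two equal halves, one per stream
    const int half = n / 2;
    const size_t bytes = n * sizeof(float), halfBytes = half * sizeof(float);
    const int threads = 256, blocks = (half + threads - 1) / threads;

    float *hA, *hB, *hC, *dA, *dB, *dC;
    // Pinned host memory, so that cudaMemcpyAsync can overlap with other work.
    CHECK(cudaMallocHost(&hA, bytes));
    CHECK(cudaMallocHost(&hB, bytes));
    CHECK(cudaMallocHost(&hC, bytes));
    CHECK(cudaMalloc(&dA, bytes));
    CHECK(cudaMalloc(&dB, bytes));
    CHECK(cudaMalloc(&dC, bytes));
    for (int i = 0; i < n; ++i) {
        hA[i] = float(i % 1024);
        hB[i] = 2.0f * float(i % 1024);
        hC[i] = 0.0f;
    }

    cudaStream_t s[2];
    CHECK(cudaStreamCreate(&s[0]));
    CHECK(cudaStreamCreate(&s[1]));

    if (breadthFirst) {
        // Process One: issue each stage for both streams before the next stage.
        for (int k = 0; k < 2; ++k) { int o = k * half;
            CHECK(cudaMemcpyAsync(dA + o, hA + o, halfBytes, cudaMemcpyHostToDevice, s[k])); }
        for (int k = 0; k < 2; ++k) { int o = k * half;
            CHECK(cudaMemcpyAsync(dB + o, hB + o, halfBytes, cudaMemcpyHostToDevice, s[k])); }
        for (int k = 0; k < 2; ++k) { int o = k * half;
            addKernel<<<blocks, threads, 0, s[k]>>>(dA + o, dB + o, dC + o, half); }
        for (int k = 0; k < 2; ++k) { int o = k * half;
            CHECK(cudaMemcpyAsync(hC + o, dC + o, halfBytes, cudaMemcpyDeviceToHost, s[k])); }
    } else {
        // Depth-first: issue the whole chain for one stream, then the other.
        for (int k = 0; k < 2; ++k) { int o = k * half;
            CHECK(cudaMemcpyAsync(dA + o, hA + o, halfBytes, cudaMemcpyHostToDevice, s[k]));
            CHECK(cudaMemcpyAsync(dB + o, hB + o, halfBytes, cudaMemcpyHostToDevice, s[k]));
            addKernel<<<blocks, threads, 0, s[k]>>>(dA + o, dB + o, dC + o, half);
            CHECK(cudaMemcpyAsync(hC + o, dC + o, halfBytes, cudaMemcpyDeviceToHost, s[k])); }
    }
    CHECK(cudaGetLastError());       // catches kernel launch failures
    CHECK(cudaDeviceSynchronize());  // wait for both streams to finish

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hC[i] != 3.0f * float(i % 1024)) bad = 1;

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    return bad;
}
```

Both orders produce the same result; only the amount of overlap can differ. Calling `runPipeline(true, 1 << 20)` and `runPipeline(false, 1 << 20)` from `main` and timing each one, for example with `cudaEventRecord` and `cudaEventElapsedTime`, or inspecting the timeline in a profiler, is one way to check whether the order matters on a given device.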