As I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?
allOfData_A= data_A1 + data_A2
allOfData_B= data_B1 + data_B2
allOfData_C= data_C1 + data_C2
data_C is the output of a kernel that operates on both data_A and data_B (for example, C = A + B).
The hardware supports a single DeviceOverlap operation, i.e. one memory copy can run concurrently with kernel execution.
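For reference, this is roughly how the overlap capability can be queried. It is only a sketch: `copyEngineCount` is a hypothetical helper name, and I am assuming the device of interest is passed in by index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the overlap-related attributes of `device`
// and returns the number of asynchronous copy engines, or -1 on error.
int copyEngineCount(int device) {
    int overlap = 0;
    int engines = 0;
    // cudaDevAttrGpuOverlap: device can copy memory and run a kernel concurrently.
    if (cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, device) != cudaSuccess)
        return -1;
    // cudaDevAttrAsyncEngineCount: number of asynchronous copy engines.
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device) != cudaSuccess)
        return -1;
    std::printf("overlap=%d asyncEngineCount=%d\n", overlap, engines);
    return engines;
}
```

Calling `copyEngineCount(0)` should report an engine count of 1 on hardware like the one described here, which would mean H->D and D->H copies cannot overlap each other.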
Process One:
MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream1
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H
Process Two (same operations, different order):
MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_B1 stream1 H->D
sameKernel stream1
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H
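One way to check this instead of guessing would be to time both issue orders. The following is a minimal sketch under some assumptions, not my actual code: `addKernel` stands in for sameKernel, `timeProcess` and `TRY` are made-up names, the inputs are filled with 1.0 and 2.0 so the result is easy to verify, and the host buffers are pinned (which `cudaMemcpyAsync` needs in order to overlap with anything).

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Run a CUDA call only if nothing has failed so far; keep the first error.
#define TRY(call) do { if (err == cudaSuccess) err = (call); } while (0)

// Stand-in for "sameKernel": element-wise C = A + B.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Issues the work in Process One order (processOne == true) or Process Two
// order (false). Returns wall-clock milliseconds, or -1.0f on any CUDA error
// or wrong result. n is the element count per half and must be > 0.
float timeProcess(bool processOne, int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaError_t err = cudaSuccess;
    float *hA[2] = {}, *hB[2] = {}, *hC[2] = {};
    float *dA[2] = {}, *dB[2] = {}, *dC[2] = {};
    cudaStream_t s[2] = {};

    for (int i = 0; i < 2; ++i) {
        TRY(cudaStreamCreate(&s[i]));
        // Pinned host buffers for the two halves of A, B and C.
        TRY(cudaMallocHost((void**)&hA[i], bytes));
        TRY(cudaMallocHost((void**)&hB[i], bytes));
        TRY(cudaMallocHost((void**)&hC[i], bytes));
        TRY(cudaMalloc((void**)&dA[i], bytes));
        TRY(cudaMalloc((void**)&dB[i], bytes));
        TRY(cudaMalloc((void**)&dC[i], bytes));
        for (int k = 0; err == cudaSuccess && k < n; ++k) {
            hA[i][k] = 1.0f; hB[i][k] = 2.0f; hC[i][k] = 0.0f;
        }
    }

    auto copyIn = [&](float* d, const float* h, int i) {
        TRY(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s[i]));
    };
    auto launch = [&](int i) {
        if (err != cudaSuccess) return;
        addKernel<<<blocks, threads, 0, s[i]>>>(dA[i], dB[i], dC[i], n);
        err = cudaGetLastError();
    };

    TRY(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    if (processOne) {
        // A1, A2, B1, B2, kernel1, kernel2
        for (int i = 0; i < 2; ++i) copyIn(dA[i], hA[i], i);
        for (int i = 0; i < 2; ++i) copyIn(dB[i], hB[i], i);
        for (int i = 0; i < 2; ++i) launch(i);
    } else {
        // A1, B1, kernel1, A2, B2, kernel2
        for (int i = 0; i < 2; ++i) {
            copyIn(dA[i], hA[i], i);
            copyIn(dB[i], hB[i], i);
            launch(i);
        }
    }
    // Both processes end with the same two D->H copies.
    for (int i = 0; i < 2; ++i)
        TRY(cudaMemcpyAsync(hC[i], dC[i], bytes, cudaMemcpyDeviceToHost, s[i]));
    TRY(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    // Every element of C should be 1.0 + 2.0.
    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < 2; ++i)
        for (int k = 0; ok && k < n; ++k) ok = (hC[i][k] == 3.0f);

    for (int i = 0; i < 2; ++i) {
        if (s[i]) cudaStreamDestroy(s[i]);
        float* host[] = {hA[i], hB[i], hC[i]};
        float* dev[] = {dA[i], dB[i], dC[i]};
        for (float* p : host) if (p) cudaFreeHost(p);
        for (float* p : dev) cudaFree(p);
    }
    return ok ? std::chrono::duration<float, std::milli>(t1 - t0).count() : -1.0f;
}
```

Calling `timeProcess(true, n)` and `timeProcess(false, n)` a few times each, with n large enough that the copies take a measurable amount of time, should show whether the issue order matters on a given device. The timing here is host wall-clock time around the whole sequence, so single runs will be noisy.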