In my current project I use GPUs for signal processing and visualization. I'm already using streams to allow for asynchronous operation. The signal is processed in frames, and for each frame the processing steps in a stream are as follows (a rough code sketch follows the list):
1. memcpy to device
2. signal conditioning
3. image processing
4. visualization
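In code, the per-frame loop looks roughly like this sketch (the kernel names, per-stream buffer sets, and `NSTREAMS` are placeholders, not my actual code):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define NSTREAMS 2

// Placeholder kernels standing in for the real processing steps
__global__ void condition(const float *raw, float *sig) { /* step 2 body */ }
__global__ void process  (const float *sig, float *img) { /* step 3 body */ }
__global__ void visualize(const float *img, float *fb)  { /* step 4 body */ }

void runFrames(float *hFrame[], int nFrames, size_t frameBytes,
               float *dRaw[NSTREAMS], float *dSig[NSTREAMS],
               float *dImg[NSTREAMS], float *dFb[NSTREAMS],
               cudaStream_t stream[NSTREAMS], dim3 grid, dim3 block)
{
    for (int f = 0; f < nFrames; f++) {
        int b = f % NSTREAMS;  // per-stream buffer set, so in-flight frames don't race
        cudaMemcpyAsync(dRaw[b], hFrame[f], frameBytes,
                        cudaMemcpyHostToDevice, stream[b]);         // 1. memcpy to device
        condition<<<grid, block, 0, stream[b]>>>(dRaw[b], dSig[b]); // 2. signal conditioning
        process  <<<grid, block, 0, stream[b]>>>(dSig[b], dImg[b]); // 3. image processing
        visualize<<<grid, block, 0, stream[b]>>>(dImg[b], dFb[b]);  // 4. visualization
    }
}
```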
Right now the steps all run on a single GPU. However, my machine has a multi-GPU card (a GeForce GTX 690), and I'd like to distribute the work between its two devices. Basically, I'd like to perform steps 1 and 2 on device A and steps 3 and 4 on device B, while steps 1 through 4 still execute as a single asynchronous stream. The desired outcome is a streaming layout that looks like this:
```
Device A  Stream a  1 2     1 2     ...
          Stream b      1 2     1 2 ...
Device B  Stream a      3 4     3 4 ...
          Stream b          3 4     ...
```
How can I do this?
My previous attempt was not correct: a stream is associated with the device that was current when the stream was created. So I think the most direct answer to your question as posed in the title is "it can't be done"; you cannot create a single stream and issue commands to more than one GPU from it (this is spelled out in the multi-device section of the CUDA C Programming Guide).
However, while researching it, I noted that events are a suggested way to synchronize two streams on two different devices: `cudaStreamWaitEvent()` succeeds even when the input stream and input event are associated with different devices, and can therefore be used to synchronize devices with each other.
So in that vein, I created the following code to illustrate this:
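A minimal sketch along those lines (the kernels `k_set`/`k_add2` and the `cudaMemcpyPeerAsync()` transfer are illustrative stand-ins, and error checking is omitted for brevity):

```cuda
#include <cstdio>

#define N (1 << 20)
#define nTPB 256

// Illustrative kernels: device A writes 1s, device B adds 2, so we expect 3s
__global__ void k_set(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 1;
}

__global__ void k_add2(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 2;
}

int main()
{
    int *dA, *dB, *h;
    cudaStream_t streamA, streamB;
    cudaEvent_t ev;

    cudaSetDevice(0);                                // Device A
    cudaMalloc(&dA, N * sizeof(int));
    cudaStreamCreate(&streamA);
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    cudaSetDevice(1);                                // Device B
    cudaMalloc(&dB, N * sizeof(int));
    cudaStreamCreate(&streamB);
    cudaMallocHost(&h, N * sizeof(int));             // pinned result buffer

    cudaSetDevice(0);                                // issue work on Device A ...
    k_set<<<(N + nTPB - 1) / nTPB, nTPB, 0, streamA>>>(dA, N);
    cudaEventRecord(ev, streamA);                    // ... then record the event

    cudaSetDevice(1);
    cudaStreamWaitEvent(streamB, ev, 0);             // force stream b to wait on stream a
    cudaMemcpyPeerAsync(dB, 1, dA, 0, N * sizeof(int), streamB);
    k_add2<<<(N + nTPB - 1) / nTPB, nTPB, 0, streamB>>>(dB, N);
    cudaMemcpyAsync(h, dB, N * sizeof(int), cudaMemcpyDeviceToHost, streamB);
    cudaStreamSynchronize(streamB);

    for (int i = 0; i < N; i++)
        if (h[i] != 3) { printf("results error\n"); return 1; }
    printf("success\n");
    return 0;
}
```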
If you run the code as-is, you should get a `success` message. If you comment out the line that forces stream b (on Device B) to wait on stream a (on Device A), then you'll see a `results error` message. So this demonstrates how to sync a stream on one device to a stream on another. Hope it helps; sorry for the confusion on the first go-round.
`cudaStreamWaitEvent()` enables inter-GPU synchronization, since you can insert a wait on a CUDA event belonging to another device. So what you need for inter-GPU synchronization between the producer and the consumer is to allocate a few events (at least 2) for each of the 2 GPUs, then have the producer call `cudaEventRecord()` and the consumer call `cudaStreamWaitEvent()` on the same event. `cudaStreamWaitEvent()` inserts a command into the current device's command buffer that causes it to suspend execution until the given event has been recorded.

Below is a code fragment where a peer-to-peer memcpy is implemented using `cudaStreamWaitEvent()` in this way. Once the pump is primed, the producer and consumer should both be doing PCIe transfers concurrently, each to one of the two staging buffers (which are allocated in portable pinned memory). Full source code is at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/peer2peerMemcpy.cu
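The sketch below shows just the double-buffered handshake; the staging-buffer layout, chunk size, and the `chunkedPeerCopy()` wrapper are illustrative rather than the Handbook's exact code:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define STAGING_CHUNK (1 << 20)   // 1 MiB per staging buffer (illustrative)

// Device-to-device copy staged through portable pinned host memory,
// synchronized across the two GPUs with events.
static void chunkedPeerCopy(void *dDst, int devDst,
                            const void *dSrc, int devSrc, size_t bytes)
{
    char *staging[2];                    // two host staging buffers
    cudaStream_t sProd, sCons;           // one stream per GPU
    cudaEvent_t evFilled[2], evDrained[2];

    // Portable pinned memory is visible to both devices
    for (int i = 0; i < 2; i++)
        cudaHostAlloc((void **)&staging[i], STAGING_CHUNK, cudaHostAllocPortable);

    cudaSetDevice(devSrc);               // producer GPU
    cudaStreamCreate(&sProd);
    for (int i = 0; i < 2; i++)
        cudaEventCreateWithFlags(&evFilled[i], cudaEventDisableTiming);

    cudaSetDevice(devDst);               // consumer GPU
    cudaStreamCreate(&sCons);
    for (int i = 0; i < 2; i++) {
        cudaEventCreateWithFlags(&evDrained[i], cudaEventDisableTiming);
        cudaEventRecord(evDrained[i], sCons);  // prime the pump: buffers start "empty"
    }

    size_t done = 0;
    for (int buf = 0; done < bytes; buf ^= 1) {
        size_t n = (bytes - done < STAGING_CHUNK) ? bytes - done : STAGING_CHUNK;

        // Producer: wait until the consumer has drained this staging buffer,
        // refill it from the source device, then publish "filled".
        cudaSetDevice(devSrc);
        cudaStreamWaitEvent(sProd, evDrained[buf], 0);
        cudaMemcpyAsync(staging[buf], (const char *)dSrc + done, n,
                        cudaMemcpyDeviceToHost, sProd);
        cudaEventRecord(evFilled[buf], sProd);

        // Consumer: wait until the producer has filled the buffer, drain it
        // into the destination device, then publish "drained".
        cudaSetDevice(devDst);
        cudaStreamWaitEvent(sCons, evFilled[buf], 0);
        cudaMemcpyAsync((char *)dDst + done, staging[buf], n,
                        cudaMemcpyHostToDevice, sCons);
        cudaEventRecord(evDrained[buf], sCons);

        done += n;
    }
    cudaStreamSynchronize(sCons);        // all chunks have landed on the destination

    // (stream, event, and staging-buffer cleanup omitted for brevity)
}
```

Two staging buffers are the minimum for the pump: while the consumer drains one over PCIe, the producer can be filling the other, which is what keeps both transfers in flight once the pipeline is started.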