As the TensorFlow paper states, TensorFlow's cross-device communication is achieved by adding "receive" and "send" nodes to devices.
From my understanding, the device (please consider the case where only CPU devices are involved) is responsible for performing the computation of an operation. However, the data (e.g. a Tensor produced by an operation, or a Variable buffer) resides in memory. I don't know how data transfer from one device to another is achieved physically. My guess is that it is done through shared memory. Is that right?
I would appreciate any explanation, or pointers to the corresponding code, showing how the data transfer is achieved. PS: Figure 4 in the TensorFlow paper shows the cross-device communication mechanism.
In TensorFlow, cross-device communication is achieved using the Rendezvous interface, which has multiple different implementations, depending on the deployment. The comment on that interface describes the general idea.

As you noted in your question, TensorFlow represents communication in the dataflow graph using Send and Recv ops that are added to the graph automatically when the graph is partitioned across devices. For each edge that has a source and destination on different devices, the graph partitioner inserts a pair of Send and Recv ops that share the same "rendezvous key" (an automatically generated string name that is used as a key in the rendezvous's index of pending tensors to be communicated). The implementation of the Send op is simple: it calls Rendezvous::Send(), passing in its rendezvous key and single input tensor, then returns immediately without blocking. The implementation of the Recv op is slightly more complicated: it registers a callback to be called when the tensor with the given key becomes available. That callback is responsible for "producing" the output of the Recv op, and unblocking subsequent computation.
Rendezvous implementations perform the actual work of transferring the data:

IntraProcessRendezvous handles the transfer of data between devices in the same process. In the (unlikely) event that the transfer is between two CPU devices in the same process, the transfer can be achieved by a simple Tensor assignment. Otherwise, TensorFlow kicks off a device-specific DMA routine to transfer data between a CPU and GPU device.

The BaseRemoteRendezvous class and its subclasses handle cross-device communication in the case that the sender and receiver can be in different processes. The main implementation of this class is RpcRemoteRendezvous, which uses gRPC to handle the remote transfers.
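
To make the Send/Recv handshake a bit more concrete, here is a minimal Python sketch of the idea for the same-process case. It is not TensorFlow's code: ToyRendezvous, recv_async and the key strings "edge_0" and "edge_1" are names I made up for illustration, and a plain Python list stands in for a tensor. It only models the behaviour described above: the sender never blocks, the receiver registers a callback, and the two are matched by a shared key.

```python
import threading
from typing import Any, Callable, Dict


class ToyRendezvous:
    """Hypothetical in-process rendezvous table (not TensorFlow's real class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Values that were sent before any receiver asked for them.
        self._ready: Dict[str, Any] = {}
        # Callbacks registered by receivers that arrived before the sender.
        self._waiting: Dict[str, Callable[[Any], None]] = {}

    def send(self, key: str, value: Any) -> None:
        """Hand off `value` under `key`; never waits for a receiver."""
        with self._lock:
            callback = self._waiting.pop(key, None)
            if callback is None:
                # No receiver yet: park the value and return immediately.
                self._ready[key] = value
                return
        # A receiver was already waiting: run its callback outside the lock.
        callback(value)

    def recv_async(self, key: str, done: Callable[[Any], None]) -> None:
        """Call `done(value)` once a value for `key` is available."""
        with self._lock:
            if key not in self._ready:
                # Nothing sent yet: remember the callback for the sender.
                self._waiting[key] = done
                return
            value = self._ready.pop(key)
        done(value)


rendezvous = ToyRendezvous()
results = []

# Receiver arrives first: the callback fires later, inside send().
rendezvous.recv_async("edge_0", results.append)
tensor_a = [1.0, 2.0]
rendezvous.send("edge_0", tensor_a)

# Sender arrives first: the value is parked until recv_async() is called.
rendezvous.send("edge_1", [3.0])
rendezvous.recv_async("edge_1", results.append)
```

Note that in this toy version the receiver ends up holding a reference to the very same Python object the sender passed in, with no copy. That is only an analogy for the "simple Tensor assignment" in the CPU-to-CPU case; the GPU and cross-process paths involve DMA or gRPC and are not modelled here.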