I have been trying to understand the segmented ring allreduce in Open MPI (v2.0.2), but I cannot figure out how this pipelined ring allreduce works, especially how the phases are pipelined. For example, COMPUTATION PHASE 1 (b) seems to perform the two phases concurrently rather than in a pipelined fashion. Could the MPI experts explain the motivation behind the segmented ring allreduce and give some details about the pipeline?
Any help would be much appreciated.
Leo