MPI_SEND stops working after MPI_BARRIER

2020-03-26 12:13发布

问题:

I'm building a distributed web server in C/MPI and it seems like point-to-point communication completely stops working after the first MPI_BARRIER in my code. Standard C code works after the barrier, so I know that each of the threads makes it through the barrier. Point-to-point communication also works just fine before the barrier. However, when I copy-paste the same code that worked the line before the barrier into the line after the barrier it stops working entirely. The SEND will just wait forever. When I try using an ISEND instead, it makes it through the line, but the message is never received. I've been googling this problem a lot and everyone who has problems with MPI_BARRIER is told the barrier works correctly and their code is wrong, but I cannot for the life of me figure out why my code is wrong. What could be causing this behavior?

Here is a sample program that demonstrates this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int procID;
  int val;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &procID);
  MPI_Barrier(MPI_COMM_WORLD);

  if (procID == 0)
  {
    val = 4;
    printf("Before send\n");
    MPI_Send(&val, 1, MPI_INT, 1, 4, MPI_COMM_WORLD);
    printf("after send\n");
  }

  if (procID == 1)
  {
    val = 1;
    printf("before: val = %d\n", val);
    MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    printf("after: val = %d\n", val);
  }

  MPI_Finalize();
  return 0;
}

Moving the two if statements before the barrier causes this program to run correctly.

EDIT - It appears that the first communication, regardless of type, works, and all future communications fail. This is much more general that I thought at first. It doesn't matter if the first communication is a barrier or some other message, no future communications work properly.

回答1:

Open MPI has a know feature when it uses TCP/IP for communications: it tries to use all configured network interfaces that are in "UP" state. This presents as a problem if some of the other nodes are not reachable through all those interfaces. This is part of the greedy communication optimisation that Open MPI employs and sometimes, like in your case, leads to problems.

It seems that at least the second node has more than one interfaces that are up and that this fact was introduced to the first node during the negotiation phase:

  • one configured with 128.2.100.167
  • one configured with 192.168.109.1 (do you have a tunnel or Xen running on the machine?)

The barrier communication happens over the first network and then the next MPI_Send tries to send to the second address over the second network which obviously does not connect all nodes.

The easiest solution is to tell Open MPI only to use the nework that connects your nodes. You can tell it do so using the following MCA parameter:

--mca btl_tcp_if_include 128.2.100.0/24

(or whatever your communication network is)

You can also specify the list of network interfaces if it is the same on all machines, e.g.

--mca btl_tcp_if_include eth0

or you can tell Open MPI to specifically exclude certain interfaces (but you must always tell it to exclude the loopback "lo" if you do so):

--mca btl_tcp_if_exclude lo,virt0

Hope that helps you and many others that appears to have the same problems around here at SO. It looks like that recently almost all Linux distros has started bringing up various network interfaces by default and that is likely to cause problems with Open MPI.

P.S. Put those nodes behind a firewall, please!