I wrote the small C application below to help me understand MPI and why MPI_Barrier() isn't functioning in my huge C++ application. I was able to reproduce the problem from the huge application in this much smaller C program. Essentially, I call MPI_Barrier() inside a for loop, and MPI_Barrier() is visible to all nodes, yet after two iterations of the loop the program becomes deadlocked. Any thoughts?
#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int i = 0, numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("%s: Rank %d of %d\n", processor_name, rank, numprocs);
    for (i = 1; i <= 100; i++) {
        if (rank == 0) printf("Before barrier (%d:%s)\n", i, processor_name);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("After barrier (%d:%s)\n", i, processor_name);
    }
    MPI_Finalize();
    return 0;
}
The output:
alienone: Rank 1 of 4
alienfive: Rank 3 of 4
alienfour: Rank 2 of 4
alientwo: Rank 0 of 4
Before barrier (1:alientwo)
After barrier (1:alientwo)
Before barrier (2:alientwo)
After barrier (2:alientwo)
Before barrier (3:alientwo)
I am using GCC 4.4 and Open MPI 1.3 from the Ubuntu 10.10 repositories.
Also, in my huge C++ application, MPI broadcasts don't work: only half the nodes receive the broadcast, while the others are stuck waiting for it.
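For reference, the failing pattern looks roughly like the sketch below (my own minimal broadcast test in the spirit of the barrier test above, not the actual code from the application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 1; i <= 100; i++) {
        if (rank == 0) value = i;                          /* root chooses a value      */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank should receive */
        if (rank == 0) printf("Broadcast %d done\n", i);
    }
    MPI_Finalize();
    return 0;
}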
Thank you in advance for any help or insights!
Update: Upgraded to Open MPI 1.4.4, compiled from source into /usr/local/.
Update: Attaching GDB to the running process shows an interesting result. It looks to me that the MPI system died at the barrier, but MPI still thinks the program is running:
0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
83 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007fc236a45141 in poll_dispatch () from /usr/local/lib/libopen-pal.so.0
#2 0x00007fc236a43f89 in opal_event_base_loop () from /usr/local/lib/libopen-pal.so.0
#3 0x00007fc236a38119 in opal_progress () from /usr/local/lib/libopen-pal.so.0
#4 0x00007fc236eff525 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.0
#5 0x00007fc23141ad76 in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6 0x00007fc2314247ce in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7 0x00007fc236f15f12 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#8 0x0000000000400b32 in main (argc=1, argv=0x7fff5883da58) at barrier_test.c:14
(gdb)
Update: I also have this code:
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] ) {
    int n = 400, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("MPI Rank %i of %i.\n", myid, numprocs);
    while (1) {
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Despite the infinite loop, the printf() inside the loop produces output only once:
mpirun -n 24 --machinefile /etc/machines a.out
MPI Rank 0 of 24.
MPI Rank 3 of 24.
MPI Rank 1 of 24.
MPI Rank 4 of 24.
MPI Rank 17 of 24.
MPI Rank 15 of 24.
MPI Rank 5 of 24.
MPI Rank 7 of 24.
MPI Rank 16 of 24.
MPI Rank 2 of 24.
MPI Rank 11 of 24.
MPI Rank 9 of 24.
MPI Rank 8 of 24.
MPI Rank 20 of 24.
MPI Rank 23 of 24.
MPI Rank 19 of 24.
MPI Rank 12 of 24.
MPI Rank 13 of 24.
MPI Rank 21 of 24.
MPI Rank 6 of 24.
MPI Rank 10 of 24.
MPI Rank 18 of 24.
MPI Rank 22 of 24.
MPI Rank 14 of 24.
pi is approximately 3.1415931744231269, Error is 0.0000005208333338
Any thoughts?
What interconnect do you have? Is it a specialised one like InfiniBand or Myrinet, or are you just using plain TCP over Ethernet? Do you have more than one configured network interface if running with the TCP transport?
Besides, Open MPI is modular -- there are many modules that provide algorithms implementing the various collective operations. You can try to fiddle with them using MCA parameters, e.g. you can start debugging your application's behaviour by increasing the verbosity of the btl component, passing something like --mca btl_base_verbose 30 to mpirun.
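For example, reusing the mpirun command line from the question (only the verbosity parameter is added here):

mpirun --mca btl_base_verbose 30 -n 24 --machinefile /etc/machines a.out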
Then look in the verbose output for connections that fail or time out. That would indicate that some (or all) nodes have more than one configured network interface that is up, but not all nodes are reachable through all of the interfaces. This might happen, e.g. if the nodes run a recent Linux distro with per-default enabled Xen support (RHEL?) or have other virtualisation software installed on them that brings up virtual network interfaces.
By default Open MPI is lazy, that is, connections are opened on demand. The first send/receive communication may succeed if the right interface is picked up, but subsequent operations are likely to pick one of the alternate paths in order to maximise the bandwidth. If the other node is unreachable through the second interface, a timeout is likely to occur and the communication will fail, as Open MPI will consider the other node down or problematic.
The solution is to isolate the non-connecting networks or network interfaces using MCA parameters of the TCP btl module:

- restrict communication to a specific IP network: --mca btl_tcp_if_include 192.168.2.0/24
- restrict communication to specific network interfaces: --mca btl_tcp_if_include eth0,eth1
- or explicitly exclude certain interfaces (the loopback lo should always be excluded in that case): --mca btl_tcp_if_exclude lo,virt0
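For instance, assuming the nodes should talk over eth0 only (adjust the interface name or subnet to match your cluster):

mpirun --mca btl_tcp_if_include eth0 -n 24 --machinefile /etc/machines a.out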
Refer to the Open MPI run-time TCP tuning FAQ for more details.
MPI_Barrier() in Open MPI sometimes hangs when processes reach the barrier at different times after the previous barrier; however, that doesn't seem to be your case as far as I can see. Anyway, try using MPI_Reduce() instead of, or before, the real call to MPI_Barrier(). This is not a direct equivalent of a barrier, but any synchronous call with almost no payload involving all processes in a communicator should work like a barrier. I haven't seen such behaviour of MPI_Barrier() in LAM/MPI, MPICH2, or even WMPI, but it was a real issue with Open MPI.
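As a rough sketch of that idea (my own illustration, not code from the question), the barrier test from above could swap MPI_Barrier() for a zero-payload MPI_Reduce() like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    /* Sketch: use a zero-payload MPI_Reduce() where MPI_Barrier() would go.   */
    /* Every rank must contribute to the reduction, so it loosely synchronises */
    /* the communicator even though it is not an exact barrier.                */
    int i, rank, token = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 1; i <= 100; i++) {
        if (rank == 0) printf("Before reduce (%d)\n", i);
        MPI_Reduce(&token, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("After reduce (%d)\n", i);
    }
    MPI_Finalize();
    return 0;
}

Note that MPI_Reduce() is rooted at rank 0, so non-root ranks may leave the call before the root has received all contributions; it is a looser synchronisation than a true barrier, which is exactly the caveat mentioned above.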