Unable to run MPI when transfering large data

2020-03-31 03:25发布

问题:

I used MPI_Isend to transfer an array of chars to slave node. When the size of the array is small it worked, but when I enlarge the size of the array, it hanged there.

Code running on the master node (rank 0) :

MPI_Send(&text_length,1,MPI_INT,dest,MSG_TEXT_LENGTH,MPI_COMM_WORLD);
MPI_Isend(text->chars, 360358,MPI_CHAR,dest,MSG_SEND_STRING,MPI_COMM_WORLD,&request);
MPI_Wait(&request,&status);

Code running on slave node (rank 1):

MPI_Recv(&count,1,MPI_INT,0,MSG_TEXT_LENGTH,MPI_COMM_WORLD,&status);
MPI_Irecv(host_read_string,count,MPI_CHAR,0,MSG_SEND_STRING,MPI_COMM_WORLD,&request);
MPI_Wait(&request,&status);

You see the count param in MPI_Isend is 360358. It seemed too large for MPI. When I set the param 1024, it worked well.

Actually this problem has confused me a few days, I have known that there's limit on the size of data transferred by MPI. But as far as I know, the MPI_Send is used to send short messages, and the MPI_Isend can send larger messages. So I use MPI_Isend.

The network configure in rank 0 is:

  [12t2007@comp01-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A5  
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:393267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:396421 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:35556328 (33.9 MiB)  TX bytes:79580008 (75.8 MiB)

eth0.2002 Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A5  
          inet addr:10.111.2.36  Bcast:10.111.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:133577 errors:0 dropped:0 overruns:0 frame:0
          TX packets:127677 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:14182652 (13.5 MiB)  TX bytes:17504189 (16.6 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:A4  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:206981 errors:0 dropped:0 overruns:0 frame:0
          TX packets:303185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:168952610 (161.1 MiB)  TX bytes:271792020 (259.2 MiB)

eth2      Link encap:Ethernet  HWaddr 00:25:90:91:6B:56  
          inet addr:10.111.1.36  Bcast:10.111.1.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26459977 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15700862 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12533940345 (11.6 GiB)  TX bytes:2078001873 (1.9 GiB)
          Memory:fb120000-fb140000 

eth3      Link encap:Ethernet  HWaddr 00:25:90:91:6B:57  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fb100000-fb120000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1894012 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1894012 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:154962344 (147.7 MiB)  TX bytes:154962344 (147.7 MiB)

The network configure in rank 1 is:

[12t2007@comp02-mpi.gpu01.cis.k.hosei.ac.jp ~]$ ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5F  
          inet addr:192.168.0.102  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:328449 errors:0 dropped:0 overruns:0 frame:0
          TX packets:278631 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:47679329 (45.4 MiB)  TX bytes:39326294 (37.5 MiB)

eth0.2002 Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5F  
          inet addr:10.111.2.37  Bcast:10.111.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:94126 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53782 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:8313498 (7.9 MiB)  TX bytes:6929260 (6.6 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:D9:79:5E  
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:121527 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41865 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:158117588 (150.7 MiB)  TX bytes:5084830 (4.8 MiB)

eth2      Link encap:Ethernet  HWaddr 00:25:90:91:6B:50  
          inet addr:10.111.1.37  Bcast:10.111.1.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26337628 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15500750 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12526923258 (11.6 GiB)  TX bytes:2032767897 (1.8 GiB)
          Memory:fb120000-fb140000 

eth3      Link encap:Ethernet  HWaddr 00:25:90:91:6B:51  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fb100000-fb120000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1895944 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1895944 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:154969511 (147.7 MiB)  TX bytes:154969511 (147.7 MiB)

回答1:

The peculiarities of using TCP/IP with Open MPI are described in the FAQ. I'll try to give an executive summary here.

Open MPI uses a greedy approach when it comes to utilising network interfaces for data exchange. In particular, the TCP/IP BTL (Byte Transfer Layer) and OOB (Out-Of-Band) components tcp will try to use all configured network interfaces with matching address families. In your case each node has many interfaces with addresses from the IPv4 address family:

comp01-mpi                     comp02-mpi
----------------------------------------------------------
eth0       192.168.0.101/24    eth0       192.168.0.102/24
eth0.2002  10.111.2.36/24      eth0.2002  10.111.2.37/24
eth1       192.168.1.101/24    eth1       192.168.1.102/24
eth2       10.111.1.36/23      eth2       10.111.1.37/23
lo         127.0.0.1/8         lo         127.0.0.1/8

Open MPI assumes that each interface on comp02-mpi is reachable from any interface on comp01-mpi and vice versa. This is never the case with the loopback interface lo, therefore by default Open MPI excludes lo. Network sockets are then opened lazily (e.g. on demand) when information has to be transported.

What happens in your case is that when transporting messages, Open MPI chops them down into fragments and then tries to send the different segments over different connections in order to maximise the bandwidth. By default the fragments are of size 128 KiB, which only holds 32768 int elements, also the very first (eager) fragment is of size 64 KiB and holds twice as less elements. It might happen that the assumption that each interface on comp01-mpi is reachable from each interface on comp02-mpi (and vice versa) is wrong, e.g. if some of them are connected to separate isolated networks. In that case the library will be stuck in trying to make a connection that can never happen and the program will hang. This should usually happen for messages of more than 16384 int elements.

To prevent the above mentioned situation, one can restrict the interfaces or networks that Open MPI uses for TCP/IP communication. The btl_tcp_if_include MCA parameter can be used to provide the library with the list of interfaces that it should use. The btl_tcp_if_exclude can be used to instruct the library which interfaces to exclude. That one is set to lo by default and if one would like to exclude specific interface(s), then one should explicitly add lo to the list.

Everything from above also applies to the out-of-band communication used to transport special information. The parameters for selecting or deselecting interfaces for OOB are oob_tcp_if_include and oob_tcp_if_exclude conversely. Those are usually set together with the BTL parameters. Therefore you should try setting those to combinations that actually work. Start by narrowing the selection down the a single interface:

 mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...

If it doesn't work with eth0, try other interfaces.

The presence of the virtual interface eth0.2002 is going to further confuse Open MPI 1.6.2 and newer.



回答2:

I think instead of 360358 you must use text_length while sending the message (in function MPI_Isend) from rank=0.

I believe the reason why you are sending the length first is to eliminate the need to hard-code the number, then why are you putting the number of elements when sending to other nodes? That could be the reason for inconsistency at the receiving end.

The signature of MPI_Isend is:

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

where count is the number of elements in send buffer. It might lead to segmentation fault if count is more than the array length, or will lead to sending less elements across if the count is less.



标签: mpi openmpi