bash: /usr/bin/hydra_pmi_proxy: No such file or directory

Posted 2019-07-24 03:31

Question:

I am struggling to set up an MPI cluster, following the Setting Up an MPICH2 Cluster in Ubuntu tutorial. I have something running and my machine file is this:

pythagoras:2  # this will spawn 2 processes on pythagoras
geomcomp      # this will spawn 1 process on geomcomp

The tutorial states:

and run it (the parameter next to -n specifies the number of processes to spawn and distribute among nodes): mpiu@ub0:~$ mpiexec -n 8 -f machinefile ./mpi_hello

With -n 1 and -n 2 it runs fine, but with -n 3, it fails, as you can see below:

gsamaras@pythagoras:/mirror$ mpiexec -n 1 -f machinefile ./mpi_hello            
Hello from processor 0 of 1
gsamaras@pythagoras:/mirror$ mpiexec -n 2 -f machinefile ./mpi_hello
Hello from processor 0 of 2
Hello from processor 1 of 2
gsamaras@pythagoras:/mirror$ mpiexec -n 3 -f machinefile ./mpi_hello
bash: /usr/bin/hydra_pmi_proxy: No such file or directory
{hangs}

Maybe the parameter next to -n specifies the number of machines? I mean, the number of processes is stated in the machinefile, isn't it? Also, I have used 2 machines for the MPI cluster (I hope that is the case and that the output I am getting comes not only from the master node (i.e. pythagoras), but also from the slave one (i.e. geomcomp)).

Edit_1

Well, I think the parameter next to -n actually does specify the number of processes, since the tutorial I linked to uses 4 machines and a machinefile that implies 8 processes will run. Then why do we need that parameter next to -n at all? Whatever the reason is, I still can't see why my run fails with -n 3.
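If I'm reading the tutorial right, the interaction would look like this (my own sketch, assuming Hydra fills the machinefile's slots in order; the wrap-around behaviour for -n 8 is my assumption from the tutorial's 4-machine/8-process example):

```shell
# My understanding (assumption, not verified output):
# machinefile slot        rank(s) placed there
#   pythagoras:2     ->   0, 1
#   geomcomp         ->   2
#
# mpiexec -n 3 -f machinefile ./mpi_hello   # exactly fills the 3 listed slots
# mpiexec -n 8 -f machinefile ./mpi_hello   # more ranks than slots: the list
#                                           # is reused, oversubscribing hosts
```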

Edit_2

Following Edit_1, -n 3 is logical, since my machinefile implies 3 processes to be spawned.

Edit_3

I think the problem arises when it tries to spawn a process on the slave node (i.e. geomcomp).

Edit_4

pythagoras runs Debian 8, while geomcomp runs Debian 6. The machines are of the same architecture. The problem lies in geomcomp, since I tried mpiexec -n 1 ./mpi_hello there and it said that no daemon is running.

So, on pythagoras, I got:

gsamaras@pythagoras:~$ mpichversion
MPICH Version:      3.1
MPICH Release date: Thu Feb 20 11:41:13 CST 2014
MPICH Device:       ch3:nemesis
MPICH configure:    --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --enable-shared --prefix=/usr --enable-fc --disable-rpath --disable-wrapper-rpath --sysconfdir=/etc/mpich --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/mpich --docdir=/usr/share/doc/mpich --with-hwloc-prefix=system --enable-checkpointing --with-hydra-ckpointlib=blcr
MPICH CC:   gcc -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security  -O2
MPICH CXX:  g++ -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security
MPICH F77:  gfortran -g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong -O2
MPICH FC:   gfortran -g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong
gsamaras@pythagoras:~$ which mpiexec
/usr/bin/mpiexec
gsamaras@pythagoras:~$ which mpirun
/usr/bin/mpirun

whereas on geomcomp I got:

gsamaras@geomcomp:~$ mpichversion
-bash: mpichversion: command not found
gsamaras@geomcomp:~$ which mpiexec
/usr/bin/mpiexec
gsamaras@geomcomp:~$ which mpirun
/usr/bin/mpirun

I had installed MPICH2, as the tutorial instructed. What should I do? I am working in /mirror on the master node; it is mounted on the slave node.

1. This relevant question, mpiexec.hydra - how to run MPI process on machines where locations of hydra_pmi_proxy are different?, is different from mine, but the same cause might apply here too.
2. Damn it, the only Hydra I know is a Greek island, what am I missing? :/

Answer 1:

I'd say you've identified a genuine shortcoming of Hydra: there should be some way to tell it that the paths on the other nodes are different.

Where is mpich installed on pythagoras? Where is mpich installed on geomcomp?

In the simplest configuration, you would have, for example, a common home directory, and you would have installed mpich into ${HOME}/soft/mpich.
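A quick way to check (a sketch; the hostnames are the ones from your question, and ssh is how Hydra reaches the remote node) is to ask each node where its launcher pieces live:

```shell
# On the master: where are mpiexec and the Hydra proxy locally?
which mpiexec hydra_pmi_proxy
# And on the slave, via a non-interactive ssh
# (the same kind of environment Hydra itself sees):
ssh geomcomp 'which mpiexec hydra_pmi_proxy'
```

If the second command cannot find hydra_pmi_proxy, that matches the error you are seeing.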

Hydra might not be starting a "login shell" on the remote machine. If you add the MPICH installation path to your PATH environment variable, you'll have to do so in a file like .bashrc (or whatever the equivalent for your shell is).
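For example (a sketch, assuming the ${HOME}/soft/mpich prefix mentioned above; adjust to your actual install path), put this near the top of ~/.bashrc, before any early return for non-interactive shells:

```shell
# Make MPICH visible to non-login, non-interactive shells too,
# since that is how Hydra reaches the remote node over ssh.
export PATH="${HOME}/soft/mpich/bin:${PATH}"
```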

To test this, try 'ssh geomcomp mpichversion' and 'ssh pythagoras mpichversion' and plain old 'mpichversion'. That should tell you something about how your environment is set up.

In your case, your environment is really strange! Debian 8 and Debian 6, and, it looks like, not even the same version of MPICH. I think, thanks to the ABI initiative, that MPICH-3.1 and newer will work with MPICH-3.1, but if you have a version of MPICH that pre-dates the "MPICH2 to MPICH" renaming, there are no such guarantees.

And setting ABI aside, you've got an MPICH that expects the Hydra launcher (the Debian 8 version) and an MPICH that expects the MPD launcher (the Debian 6 version).

And even if you do have recent enough packages, the only way things can work is if you have the same architecture on all machines. ABI, as Ken points out, does not mean support for heterogeneous environments.

Remove the distro packages and build MPICH yourself on both machines.
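A sketch of what that looks like (the release number and URL are illustrative; pick the same recent release for both nodes, and run the identical steps on pythagoras and on geomcomp):

```shell
# Build the same MPICH release, with the same prefix, on both machines,
# so that mpiexec, hydra_pmi_proxy, and the libraries all match.
wget https://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz
tar xzf mpich-3.1.4.tar.gz
cd mpich-3.1.4
./configure --prefix="${HOME}/soft/mpich" 2>&1 | tee configure.log
make -j4 2>&1 | tee make.log
make install
# Verify on each node afterwards:
"${HOME}/soft/mpich/bin/mpichversion"
```

With both nodes on the same build, and that install's bin directory on PATH in .bashrc as described above, mpiexec -f machinefile should find hydra_pmi_proxy on the slave.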