I am struggling to set up an MPI cluster, following the Setting Up an MPICH2 Cluster in Ubuntu tutorial. I have something running and my machine file is this:
pythagoras:2 # this will spawn 2 processes on pythagoras
geomcomp # this will spawn 1 process on geomcomp
The tutorial states:
and run it (the parameter next to -n specifies the number of processes to spawn and distribute among nodes): mpiu@ub0:~$ mpiexec -n 8 -f machinefile ./mpi_hello
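For reference, this is roughly how I build and launch the hello-world from the shared directory (a sketch following the tutorial; my exact file names may differ slightly):

# compile with the MPICH wrapper compiler, then launch across the nodes listed in the machinefile
mpicc -o mpi_hello mpi_hello.c
mpiexec -n 3 -f machinefile ./mpi_hello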
With -n 1 and -n 2 it runs fine, but with -n 3, it fails, as you can see below:
gsamaras@pythagoras:/mirror$ mpiexec -n 1 -f machinefile ./mpi_hello
Hello from processor 0 of 1
gsamaras@pythagoras:/mirror$ mpiexec -n 2 -f machinefile ./mpi_hello
Hello from processor 0 of 2
Hello from processor 1 of 2
gsamaras@pythagoras:/mirror$ mpiexec -n 3 -f machinefile ./mpi_hello
bash: /usr/bin/hydra_pmi_proxy: No such file or directory
{it hangs here}
Maybe the parameter next to -n specifies the number of machines? I mean, the number of processes is stated in the machinefile, isn't it? Also, I have used 2 machines for the MPI cluster (I hope that is actually the case, and that the output I am getting comes not only from the master node (i.e. pythagoras) but also from the slave one (i.e. geomcomp)).
Edit_1
Well, I think the parameter next to -n really does specify the number of processes, since the tutorial I linked to uses 4 machines and its machinefile implies that 8 processes will run. But then why do we need the -n parameter at all? Whatever the reason, I still can't see why my run fails with -n 3.
Edit_2
Following Edit_1, -n 3 is logical, since my machinefile implies 3 processes to be spawned.
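In other words, as I understand it (my reading, not something I have verified), Hydra fills the machinefile slots in order, so:

# machinefile: pythagoras:2, geomcomp
# -n 1  ->  rank 0 on pythagoras
# -n 2  ->  ranks 0 and 1 on pythagoras
# -n 3  ->  ranks 0 and 1 on pythagoras, plus rank 2 on geomcomp   <- the failing case
mpiexec -n 3 -f machinefile ./mpi_hello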
Edit_3
I think the problem arises when it tries to spawn a process on the slave node (i.e. geomcomp).
Edit_4
pythagoras runs Debian 8, while geomcomp runs Debian 6. The machines are of the same architecture. The problem lies in geomcomp, since I tried mpiexec -n 1 ./mpi_hello there and it said that no daemon was running.
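Since Hydra starts a hydra_pmi_proxy on each remote host (over ssh by default), a sanity check from pythagoras would be something like this (a sketch; I'm assuming the passwordless ssh to geomcomp from the tutorial is in place):

# does the proxy exist at the path mpiexec tries to exec, on both nodes?
ls -l /usr/bin/hydra_pmi_proxy
ssh geomcomp 'ls -l /usr/bin/hydra_pmi_proxy'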
So, on pythagoras I got:
gsamaras@pythagoras:~$ mpichversion
MPICH Version: 3.1
MPICH Release date: Thu Feb 20 11:41:13 CST 2014
MPICH Device: ch3:nemesis
MPICH configure: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --enable-shared --prefix=/usr --enable-fc --disable-rpath --disable-wrapper-rpath --sysconfdir=/etc/mpich --libdir=/usr/lib/x86_64-linux-gnu --includedir=/usr/include/mpich --docdir=/usr/share/doc/mpich --with-hwloc-prefix=system --enable-checkpointing --with-hydra-ckpointlib=blcr
MPICH CC: gcc -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -O2
MPICH CXX: g++ -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security
MPICH F77: gfortran -g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong -O2
MPICH FC: gfortran -g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong
gsamaras@pythagoras:~$ which mpiexec
/usr/bin/mpiexec
gsamaras@pythagoras:~$ which mpirun
/usr/bin/mpirun
whereas on geomcomp I got:
gsamaras@geomcomp:~$ mpichversion
-bash: mpichversion: command not found
gsamaras@geomcomp:~$ which mpiexec
/usr/bin/mpiexec
gsamaras@geomcomp:~$ which mpirun
/usr/bin/mpirun
I had installed MPICH2, as the tutorial instructed. I am working in /mirror on the master node; it is mounted on the slave node. What should I do?
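My guess at a fix (an assumption on my part, not something the tutorial covers) is that geomcomp needs the same MPICH 3.1 / Hydra stack as pythagoras, installed under the same prefix, so that /usr/bin/hydra_pmi_proxy exists on both nodes. Roughly:

# on geomcomp: build and install MPICH 3.1 from source with the same prefix as pythagoras
# (version and prefix taken from pythagoras' mpichversion output above)
tar xzf mpich-3.1.tar.gz
cd mpich-3.1
./configure --prefix=/usr
make
sudo make install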
1. This related question, mpiexec.hydra - how to run MPI process on machines where locations of hydra_pmi_proxy are different?, is not quite the same as mine, but the same issue might apply here too (see the sketch below). 2. Damn it, the only Hydra I know is a Greek island, what am I missing? :/
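For point 1, if geomcomp did have a working Hydra proxy somewhere else on disk, I suppose one workaround would be to expose it at the path my mpiexec expects, e.g. (the source path here is purely hypothetical; in my case the proxy does not seem to exist on geomcomp at all):

ssh geomcomp 'sudo ln -s /path/to/hydra_pmi_proxy /usr/bin/hydra_pmi_proxy'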