Executing hybrid OpenMP/MPI jobs in MPICH

Posted 2020-03-31 06:42

Question:

I am struggling to find the proper way to execute a hybrid OpenMP/MPI job with MPICH (hydra).

I can easily launch the processes and they do spawn threads, but the threads stay bound to the same core as their master thread, no matter which -bind-to option I try.

If I explicitly set GOMP_CPU_AFFINITY to 0-15, all threads do get spread out, but only if I run one process per node. I don't want that; I want one process per socket.

Setting OMP_PROC_BIND=false does not have a noticeable effect.

An example of one of the many combinations I tried:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND="false"
mpiexec.hydra -n 2 -ppn 2 -envall -bind-to numa  ./a.out

What I get is both processes sitting on one of the cores 0-7 at 100%, and several threads on cores 8-15, but only one of them close to 100% (the others are waiting on the first process).
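One way to check where the threads actually land is to read the per-thread affinity masks from /proc (a minimal sketch, assuming the binary is named a.out as above):

# Print the allowed-CPU list of every thread of every running a.out process
for pid in $(pgrep a.out); do
    for task in /proc/$pid/task/*; do
        echo "pid $pid tid $(basename $task): $(grep Cpus_allowed_list $task/status)"
    done
done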

Answer 1:

Since libgomp lacks an equivalent of the respect clause of Intel's KMP_AFFINITY, you can work around it with a wrapper script that reads the list of allowed CPUs from /proc/PID/status (Linux-specific):

#!/bin/sh

# Read the list of CPUs the MPI launcher allowed this process to run on and
# hand it to libgomp as the thread affinity list.
GOMP_CPU_AFFINITY=$(grep ^Cpus_allowed_list /proc/self/status | grep -Eo '[0-9,-]+')
export GOMP_CPU_AFFINITY
# Run the actual command in place of the wrapper, preserving its arguments
exec "$@"

With this wrapper in place, -bind-to numa should then work as expected.
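For example, the original command line would become something like the following (the wrapper name gomp_wrapper.sh is just an illustration):

chmod +x gomp_wrapper.sh
export OMP_NUM_THREADS=8
mpiexec.hydra -n 2 -ppn 2 -envall -bind-to numa ./gomp_wrapper.sh ./a.out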



Answer 2:

I have a somewhat different solution for binding OpenMP threads to sockets / NUMA nodes when running a mixed MPI / OpenMP code, whenever the MPI library and the OpenMP runtime do not collaborate well by default. The idea is to use numactl and its binding properties. This even has the extra advantage of binding not only the threads but also the memory to the socket, enforcing good memory locality and maximising bandwidth.

To that end, I first disable any MPI and/or OpenMP binding (with the corresponding mpiexec option for the former, and by setting OMP_PROC_BIND to false for the latter). Then I use the following omp_bind.sh shell script:

#!/bin/bash

# Bind both the CPUs and the memory of this rank to one of the two NUMA nodes,
# alternating between the nodes from one rank to the next.
numactl --cpunodebind=$(( $PMI_ID % 2 )) --membind=$(( $PMI_ID % 2 )) "$@"

And I run my code this way:

OMP_PROC_BIND="false" OMP_NUM_THREADS=8 mpiexec -ppn 2 -bind-to none omp_bind.sh a.out args

Depending on the number of sockets on the machine, the 2 in the shell script would need to be adjusted. Likewise, the name of the rank variable (PMI_ID here) depends on the mpiexec used; I have sometimes seen MPI_RANK, PMI_RANK, etc. A variant that adjusts both automatically is sketched below.
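As a sketch of such a variant (assuming bash and numactl are available; the fallback order of the rank variables is only a guess), the script could detect both at run time:

#!/bin/bash

# Sketch of a more generic omp_bind.sh: detect the number of NUMA nodes and
# try a few common rank environment variables (names depend on the MPI used).
nnodes=$(numactl --hardware | awk '/^available:/ {print $2}')
rank=${PMI_RANK:-${PMI_ID:-${MPI_RANK:-0}}}
node=$(( rank % nnodes ))
exec numactl --cpunodebind=$node --membind=$node "$@"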

But anyway, I always found a way to get it to work, and the memory binding comes in very handy sometimes, especially to avoid the potential pitfall where IO buffers eat up all the memory on the first NUMA node, forcing the process running on the first socket to allocate its memory on the second NUMA node.
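To verify that the memory binding took effect, a quick check along these lines (a sketch assuming the numactl package, which also ships numastat, is installed, and that the binary is again named a.out) shows where the pages ended up and how much memory is left on each node:

# Per-NUMA-node memory usage of the running ranks
numastat -p a.out

# Free memory left on each NUMA node, e.g. to spot IO buffers filling node 0
numactl --hardware | grep free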