Introduction
I want to write a hybrid MPI/pthreads code. My goal is to start one MPI process on each node and have each of those processes spawn multiple threads that do the actual work, with MPI communication happening only between the separate processes.
There are quite a few tutorials describing this situation, called hybrid programming, but they typically assume a homogeneous cluster. The one I am using, however, has heterogeneous nodes: they have different processors and different numbers of cores, i.e. the nodes are a mix of 4/8/12/16-core machines.
I am aware that running an MPI job across this cluster will make my code slow down to the speed of the slowest CPU used; I accept that fact. With that said, I would like to get to my question.
Is there a way to start N MPI processes -- with one MPI process per node -- and let each know how many physical cores are available to it at that node?
The MPI implementation I have access to is Open MPI. The nodes are a mix of Intel and AMD CPUs. I thought of using a machinefile with each node specified as having one slot and then figuring out the number of cores locally, but there seem to be problems with that approach. Surely I am not the first person with this problem, yet searching the web hasn't pointed me in the right direction so far. Is there a standard way of solving it, other than finding oneself a homogeneous cluster?
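For concreteness, the kind of machinefile I had in mind uses Open MPI's hostfile syntax with one slot per node (the host names here are made up):

node01 slots=1
node02 slots=1
node03 slots=1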
Launching only one process per node is very simple with Open MPI:
mpiexec -pernode ./mympiprogram
The -pernode argument is equivalent to -npernode 1 and instructs the ORTE launcher to start one process per node present in the host list. This method has the advantage that it works regardless of how the actual host list is provided, i.e. both when it comes from tight coupling with a resource manager (e.g. Torque/PBS, SGE, LSF, SLURM, etc.) and when the hosts are specified manually. It also works even if the host list contains nodes with multiple slots.
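On more recent Open MPI releases the same mapping can also be requested explicitly; the exact spelling of the option depends on the version you have installed, but something along these lines should work:

mpiexec --map-by ppr:1:node ./mympiprogram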
Knowing the number of cores is a bit tricky and very OS-specific, but Open MPI ships with the hwloc library, which provides an abstract API to query the system components, including the number of cores:
#include <hwloc.h>

hwloc_topology_t topology;

/* Allocate and initialize the topology object. */
hwloc_topology_init(&topology);
/* Perform the topology detection. */
hwloc_topology_load(topology);
/* Get the number of physical cores. */
unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
/* Destroy the topology object. */
hwloc_topology_destroy(topology);
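Note that the hwloc copy bundled inside Open MPI is for internal use; to build the snippet above you would typically link against a standalone hwloc installation, e.g. something like:

mpicc -o mympiprogram mympiprogram.c -lhwloc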
If you want to make the number of cores across the cluster available to each MPI process in your job, a simple MPI_Allgather is what you need:
/* Obtain the number of MPI processes in the job. */
int nranks;
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

/* Gather each rank's core count into an array indexed by rank. */
unsigned cores[nranks];
MPI_Allgather(&nbcores, 1, MPI_UNSIGNED,
              cores, 1, MPI_UNSIGNED, MPI_COMM_WORLD);
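Putting the pieces together, here is a minimal sketch of a complete program. It assumes a standalone hwloc installation, and it requests MPI_THREAD_FUNNELED instead of using plain MPI_Init, since in your design only one thread per process will make MPI calls:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
    int provided;
    /* Only the main thread will make MPI calls, so FUNNELED is enough. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Query the number of physical cores on the local node. */
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    unsigned nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topology);

    /* Make every rank's core count known to all ranks. */
    unsigned *cores = malloc(nranks * sizeof(unsigned));
    MPI_Allgather(&nbcores, 1, MPI_UNSIGNED,
                  cores, 1, MPI_UNSIGNED, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < nranks; i++)
            printf("rank %d runs on a node with %u cores\n", i, cores[i]);

    free(cores);
    MPI_Finalize();
    return 0;
}

Each rank can then decide how many worker threads to spawn from cores[rank], or simply from its local nbcores.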