A custom interrupt handler for mpirun

Published 2019-02-20 12:40

Question:

Apparently, mpirun uses a SIGINT handler which "forwards" the SIGINT signal to each of the processes it spawned.

This means you can write an interrupt handler for your MPI-enabled code, run mpirun -np 3 my-mpi-enabled-executable, and SIGINT will be raised in each of the three processes. Shortly after that, mpirun exits. This works fine when the custom handler is small and only prints an error message before exiting. However, when the handler does non-trivial work (e.g. serious computation or persisting data), it does not run to completion. I'm assuming this is because mpirun exits too soon.

Here's the stderr upon pressing ctrl-c (i.e. causing SIGINT) after executing my-mpi-enabled-executable directly, without mpirun. This is the desired behavior:

interrupted by signal 2.
running viterbi... done.
persisting parameters... done.
the master process will now exit.

Here's the stderr upon pressing ctrl-c after executing mpirun -np 1 my-mpi-enabled-executable. This is the problematic behavior:

interrupted by signal 2.
running viterbi... mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8970 on node pharaoh exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
mpirun: clean termination accomplished

Answering any of the following questions will solve my problem:

  • How to override the mpirun SIGINT handler (if at all possible)?
  • How to avoid the termination of the processes mpirun spawned right after mpirun terminates?
  • Is there another signal which mpirun may be sending to the child processes before it terminates?
  • Is there a way to "capture" the so-called "signal 0 (Unknown signal 0)" (see the second stderr above)?

I'm running openmpi-1.6.3 on linux.

Answer 1:

As per the Open MPI man page, you can send SIGUSR1 or SIGUSR2 to mpirun, which forwards the signal to the processes it spawned and does not shut itself down.
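A minimal sketch of how a process could react to a forwarded SIGUSR1, written in Python with the standard signal module purely for illustration (the original program is presumably C or C++). The names checkpoint_requested, request_checkpoint and run are hypothetical, and the os.kill call inside the loop only stands in for a signal arriving from outside, e.g. from kill -USR1 sent to the mpirun PID.

```python
import os
import signal

# Hypothetical flag: set when a checkpoint has been requested via signal.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Record the request only; the real work happens in the main loop."""
    global checkpoint_requested
    checkpoint_requested = True


signal.signal(signal.SIGUSR1, request_checkpoint)


def run(steps):
    """Hypothetical work loop; returns the steps at which a checkpoint was taken."""
    global checkpoint_requested
    checkpoints = []
    for step in range(steps):
        # ... one unit of real work would go here ...
        if step == 2:
            # Stand-in for SIGUSR1 arriving from outside (forwarded by mpirun).
            os.kill(os.getpid(), signal.SIGUSR1)
        if checkpoint_requested:
            # e.g. persist parameters here, then carry on computing
            checkpoints.append(step)
            checkpoint_requested = False
    return checkpoints
```

Since mpirun keeps running after forwarding SIGUSR1, the loop can take as long as it needs to write the checkpoint and then continue.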



Answer 2:

While dealing with the same issue, I came across this question and the answer by @Zulan.

In particular, I wanted to catch a SIGINT (Ctrl+C) from the user, do some work, and then exit in an orderly fashion, so using SIGUSR1 was not an option. Reading the man page that @Zulan linked, however, shows that mpirun (at least the Open MPI version) catches SIGINT and then sends SIGTERM to the child processes. Catching SIGTERM in my code therefore allowed me to call the proper exit routines.
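Roughly, the idea looks like the sketch below (Python's standard signal module, for illustration only; stop_requested, request_stop and main_loop are made-up names, and the os.kill call merely simulates the SIGTERM that mpirun would send). The handler registers for both SIGINT and SIGTERM so that the same code path is taken whether the program is started directly or through mpirun. It only sets a flag, and the exit routines run from the main loop.

```python
import os
import signal

# Hypothetical flag; the handler only flips it and returns immediately.
stop_requested = False


def request_stop(signum, frame):
    global stop_requested
    stop_requested = True


# A direct run sees SIGINT on Ctrl+C; under mpirun the child sees SIGTERM.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, request_stop)


def main_loop(max_steps, stop_at=None):
    """Hypothetical work loop; returns (completed steps, whether cleanup ran)."""
    done = 0
    for step in range(max_steps):
        if stop_requested:
            break
        # ... one short unit of real work would go here ...
        done += 1
        if stop_at is not None and step == stop_at:
            # Stand-in for the SIGTERM that mpirun sends after Ctrl+C.
            os.kill(os.getpid(), signal.SIGTERM)
    # Orderly shutdown would go here (e.g. persist parameters); keep it short.
    cleaned_up = stop_requested
    return done, cleaned_up
```

As far as I read the same man page, mpirun follows the SIGTERM with a SIGKILL a few seconds later, so the cleanup has to be quick. A long computation in the exit path may still get cut off.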

Note that signal handling is not safe with MPI, as noted here.