How do I debug an MPI program?

Posted 2020-01-24 03:39

I have an MPI program which compiles and runs, but I would like to step through it to make sure nothing bizarre is happening. Ideally, I would like a simple way to attach GDB to any particular process, but I'm not really sure whether that's possible or how to do it. An alternative would be having each process write debug output to a separate log file, but this doesn't really give the same freedom as a debugger.

Are there better approaches? How do you debug MPI programs?

Tags: debugging mpi
16 answers
何必那么认真
#3 · 2020-01-24 03:43

Using screen together with gdb to debug MPI applications works nicely, especially if xterm is unavailable or you're dealing with more than a few processes. There were many pitfalls along the way, with accompanying Stack Overflow searches, so I'll reproduce my solution in full.

First, add code after MPI_Init to print out the PID and halt the program so it waits for you to attach. The standard solution seems to be an infinite loop; I eventually settled on raise(SIGSTOP), which requires an extra continue command within gdb to escape.

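/* Place after MPI_Init. Needs <stdio.h>, <signal.h>, <unistd.h> and <mpi.h>. */
{
    int i, id, nid;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nid);
    /* Print one rank at a time so the lines don't interleave. */
    for (i = 0; i < nid; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == id) {
            fprintf(stderr, "PID %d rank %d\n", (int)getpid(), id);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    /* Stop this process until a debugger attaches and continues it. */
    raise(SIGSTOP);
}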

After compiling, run the executable in the background, and catch the stderr. You can then grep the stderr file for some keyword (here literal PID) to get the PID and rank of each process.

MDRUN_EXE=../../Your/Path/To/bin/executable
MDRUN_ARG="-a arg1 -f file1 -e etc"

mpiexec -n 1 $MDRUN_EXE $MDRUN_ARG >> output 2>> error &

sleep 2

PIDFILE=pid.dat
grep PID error > $PIDFILE
PIDs=(`awk '{print $2}' $PIDFILE`)
RANKs=(`awk '{print $4}' $PIDFILE`)

A gdb session can be attached to each process with gdb $MDRUN_EXE $PID. Doing so within a screen session allows easy access to any gdb session. -d -m starts the screen in detached mode, -S "P$RANK" names the screen for easy access later, and the -l option makes bash start as a login shell, which in my experience keeps gdb from exiting immediately.

for i in `awk 'BEGIN {for (i=0;i<'${#PIDs[@]}';i++) {print i}}'`
do
    PID=${PIDs[$i]}
    RANK=${RANKs[$i]}
    screen -d -m -S "P$RANK" bash -l -c "gdb $MDRUN_EXE $PID"
done

Once gdb has started in the screens, you may script input to the screens (so that you don't have to enter every screen and type the same thing) using screen's -X stuff command. A newline is required at the end of the command. Here the screens are accessed by -S "P$i" using the names previously given. The -p 0 option is critical, otherwise the command intermittently fails (based on whether or not you have previously attached to the screen).

for i in `awk 'BEGIN {for (i=0;i<'${#PIDs[@]}';i++) {print i}}'`
do
    screen -S "P$i" -p 0 -X stuff "set logging file debug.$i.log
"
    screen -S "P$i" -p 0 -X stuff "set logging overwrite on
"
    screen -S "P$i" -p 0 -X stuff "set logging on
"
    screen -S "P$i" -p 0 -X stuff "source debug.init
"
done

At this point you can attach to any screen using screen -rS "P$i" and detach using Ctrl+A followed by D. Commands may be sent to all gdb sessions in the same way as in the previous section of code.

等我变得足够好
#4 · 2020-01-24 03:43

The "standard" way to debug MPI programs is by using a debugger which supports that execution model.

On UNIX, TotalView is said to have good support for MPI.

狗以群分
#5 · 2020-01-24 03:47

Here is quite a simple way to debug an MPI program.

In the main() function, add sleep(some_seconds).

Run the program as usual:

$ mpirun -np <num_of_proc> <prog> <prog_args>

The program will start and go to sleep.

That gives you a few seconds to find your processes with ps, run gdb, and attach to them.

If you use an IDE such as Qt Creator, you can use

Debug->Start debugging->Attach to running application

and find your processes there.

Juvenile、少年°
#6 · 2020-01-24 03:51

As others have mentioned, if you're only working with a handful of MPI processes you can try to use multiple gdb sessions, the redoubtable valgrind or roll your own printf / logging solution.

If you're using more processes than that, you really start needing a proper debugger. The OpenMPI FAQ recommends both Allinea DDT and TotalView.

I work on Allinea DDT. It's a full-featured, graphical source-code debugger so yes, you can:

  • Debug or attach to (over 200k) MPI processes
  • Step and pause them in groups or individually
  • Add breakpoints, watches and tracepoints
  • Catch memory errors and leaks

...and so on. If you've used Eclipse or Visual Studio then you'll be right at home.

We added some interesting features specifically for debugging parallel code (be it MPI, multi-threaded or CUDA):

  • Scalar variables are automatically compared across all processes, with sparklines showing the values across processes.

  • You can also trace and filter the values of variables and expressions over processes and time; tracepoints log the values over time.

It's widely used amongst Top500 HPC sites, such as ORNL, NCSA, LLNL, Jülich et al.

The interface is pretty snappy; we timed stepping and merging the stacks and variables of 220,000 processes at 0.1s as part of the acceptance testing on Oak Ridge's Jaguar cluster.

@tgamblin mentioned the excellent STAT, which integrates with Allinea DDT, as do several other popular open source projects.

Anthone
#7 · 2020-01-24 03:54

I have found gdb quite useful. I use it as

mpirun -np <NP> xterm -e gdb ./program 

This then launches xterm windows in which I can do

run <arg1> <arg2> ... <argN>

This usually works fine.

You can also package these commands together using:

mpirun -n <NP> xterm -hold -e gdb -ex run --args ./program [arg1] [arg2] [...]