Using GDB to debug an MPI program in Fortran

Posted 2019-07-26 16:58

Question:

I read this and arrived here, so now I think I should (please tell me if not) rewrite the code

{
    int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", getpid(), hostname);
    fflush(stdout);
    while (0 == i)
        sleep(5);
}

in Fortran. From this answer I understood that in Fortran I could simply use MPI_Get_processor_name in place of gethostname. Everything else is simple except the flush. What about it?

Where should I put it? In the main program after MPI_Init? And then? What should I do?

As for the compile options, I referred to this and passed -v -da -Q to the mpifort wrapper.

This solution doesn't fit my case, since I need to run the program on at least 27 processes, so I'd like to check one process only.

Answer 1:

Simplest approach:

What I actually often do is just run the MPI job locally and see what it does, without any of the above code. Then, if it hangs, I use top to find the PIDs of the processes; one can usually guess which rank is which from the PIDs (they tend to be consecutive, with the lowest one being rank 0). Below, rank 0 is process 1641, rank 1 is PID 1642, and so on...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1642 me        20   0  167328   7716   5816 R 100.0 0.047   0:25.02 a.out
 1644 me        20   0  167328   7656   5756 R 100.0 0.047   0:25.04 a.out
 1645 me        20   0  167328   7700   5792 R 100.0 0.047   0:24.97 a.out
 1646 me        20   0  167328   7736   5836 R 100.0 0.047   0:25.00 a.out
 1641 me        20   0  167328   7572   5668 R 99.67 0.046   0:24.95 a.out

Then I just run gdb -pid with the PID of interest and examine the stack and local variables in the process (use help stack in the GDB console).

The most important thing is to get a backtrace, so just type bt in the console.

This works well for examining deadlocks, less well when you have to stop at some specific place. In that case you have to attach the debugger early.


Your code:

I don't think the flush is necessary in Fortran. I think Fortran write and print flush as necessary, at least in the compilers I use.

But you definitely can use the flush statement

use iso_fortran_env

flush(output_unit)

Just put that flush after the write where you print the hostname and PID. But, as I said, I would start with printing alone.
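As a minimal sketch of the flush statement in context: the module and subroutine names below are made up for illustration, and the PID, host name and rank are passed in as plain arguments rather than obtained from MPI, so the fragment compiles without an MPI library.

```fortran
module attach_banner
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
contains
  ! Hypothetical helper: print the "ready for attach" line and push it
  ! out of the I/O buffer immediately.
  subroutine print_attach_banner(pid, hostname, rank)
    integer, intent(in) :: pid, rank
    character(len=*), intent(in) :: hostname

    write(output_unit, '(a,i0,a,a,a,i0)') "PID ", pid, " on ", trim(hostname), &
         " ready for attach is world rank ", rank
    ! Fortran 2003 flush statement on the standard output unit
    flush(output_unit)
  end subroutine print_attach_banner
end module attach_banner
```

In an MPI program this would be called after MPI_Comm_rank and MPI_Get_processor_name, with their results as arguments.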

What you then do is log in to that node and attach gdb to the right process with something like

gdb -pid 12345

For sleep you can use the non-standard sleep intrinsic subroutine available in many compilers or write your own.
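If the sleep extension is not available, one way to "write your own" with only standard intrinsics is a busy wait on system_clock. This is a rough sketch rather than a recommended implementation; the names portable_sleep and busy_sleep are illustrative, and clock wraparound is ignored.

```fortran
module portable_sleep
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Wait for roughly `seconds` seconds using only the standard
  ! system_clock intrinsic. This is a busy wait: the process keeps
  ! using CPU while it waits.
  subroutine busy_sleep(seconds)
    integer, intent(in) :: seconds
    integer(int64) :: start, now, rate

    call system_clock(start, rate)
    if (rate <= 0) return   ! no usable clock on this system
    do
      call system_clock(now)
      if (now - start >= int(seconds, int64) * rate) exit
    end do
  end subroutine busy_sleep
end module portable_sleep
```

Replacing call sleep(1) with call busy_sleep(1) keeps the waiting loop standard-conforming, at the cost of the waiting process occupying a full core.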

Whether before or after MPI_Init? If you want to print the rank, it must be after. Also for using MPI_Get_processor_name it must be after. It is normally recommended to call MPI_Init as early as possible in your program.

The code is then something like

  use mpi

  implicit none

  character(MPI_MAX_PROCESSOR_NAME) :: hostname

  integer :: rank, ie, pid, hostname_len

  integer, volatile :: i

  call MPI_Init(ie)

  call MPI_Get_processor_name(hostname, hostname_len, ie)

  !non-standard extension
  pid = getpid()

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ie)

  write(*,*) "PID ", pid,  " on ",  trim(hostname), " ready for attach is world rank ", rank

  !this serves to block the execution at a specific place until you unblock it in GDB by setting i=0
  i = 1
  do
    !non-standard extension
    call sleep(1)
    if (i==0) exit
  end do

end

Important note: if you compile with optimizations, then the compiler can see that i==0 is never true and will remove the check completely. You must lower your optimization level or declare i as volatile. Volatile means that the value can change at any time, so the compiler must reload it from memory for each check. The volatile attribute requires Fortran 2003.
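Since the question asks to hold only one process out of many, the same loop can be wrapped so that it blocks on a single chosen rank. The following is a sketch under some assumptions: the module and procedure names are invented for illustration, getpid and sleep are reached through C interoperability instead of compiler extensions, and pid_t is assumed to have the size of a C int (which holds on Linux with glibc, but is not guaranteed everywhere).

```fortran
module debug_attach
  use, intrinsic :: iso_c_binding, only: c_int
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  interface
    ! C getpid(); assumes pid_t is the same size as C int
    function c_getpid() bind(C, name="getpid") result(pid)
      import :: c_int
      integer(c_int) :: pid
    end function c_getpid
    ! C sleep(); its unsigned int argument is passed as a C int
    function c_sleep(seconds) bind(C, name="sleep") result(left)
      import :: c_int
      integer(c_int), value :: seconds
      integer(c_int) :: left
    end function c_sleep
  end interface
contains
  ! Block only on target_rank until a debugger sets i = 0.
  ! Every other rank returns immediately.
  subroutine wait_for_attach(rank, target_rank, hostname)
    integer, intent(in) :: rank, target_rank
    character(len=*), intent(in) :: hostname
    integer, volatile :: i     ! volatile: re-read from memory on every test
    integer(c_int) :: pid, left

    if (rank /= target_rank) return

    pid = c_getpid()
    write(output_unit, '(a,i0,a,a,a,i0)') "PID ", pid, " on ", trim(hostname), &
         " ready for attach is world rank ", rank
    flush(output_unit)

    i = 1
    do
      left = c_sleep(1_c_int)
      if (i == 0) exit
    end do
  end subroutine wait_for_attach
end module debug_attach
```

A call such as call wait_for_attach(rank, 0, hostname(1:hostname_len)) would go after MPI_Comm_rank and MPI_Get_processor_name. The other ranks carry on until their next communication with the waiting rank. After attaching, the debugger will most likely be stopped inside sleep, so the wait_for_attach frame has to be selected (for example with up) before set var i = 0.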

Attaching the right process:

The above code will print, for example,

> mpif90 -ggdb mpi_gdb.f90 
> mpirun -n 4 ./a.out

 PID         2356  on linux.site ready for attach is world rank            1
 PID         2357  on linux.site ready for attach is world rank            2
 PID         2358  on linux.site ready for attach is world rank            3
 PID         2355  on linux.site ready for attach is world rank            0

In top they look like

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2355 me        20   0  167328   7452   5564 R 100.0 0.045   1:42.55 a.out
 2356 me        20   0  167328   7428   5548 R 100.0 0.045   1:42.54 a.out
 2357 me        20   0  167328   7384   5500 R 100.0 0.045   1:42.54 a.out
 2358 me        20   0  167328   7388   5512 R 100.0 0.045   1:42.51 a.out

and you just select which rank you want and execute

gdb -pid 2355

to attach to rank 0 and so on. In a different terminal window, of course.

Then you get something like

MAIN__ () at mpi_gdb.f90:26
26          if (i==0) exit

(gdb) info locals
hostname = 'linux.site', ' ' <repeats 246 times>
hostname_len = 10
i = 1
ie = 0
pid = 2457
rank = 0

(gdb) set var i = 0

(gdb) cont
Continuing.
[Inferior 1 (process 2355) exited normally]


Answer 2:

The posted code is basically just an infinite loop designed to "pause" execution whilst you attach the debugger. You can then use the debugger controls to exit this loop and the program will continue. You can write an equivalent loop in Fortran. Provided you're happy to get the hostname and PID by another method (see mpi_get_processor_name, as mentioned by VladimirF in his answer; and, if you are happy to use compiler extensions, both the GNU and Intel compilers provide a getpid extension), you could use something like the following (thanks to this answer for the sleep example).

module fortran_sleep
  !See https://stackoverflow.com/a/6932232
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  interface
     !  should be unsigned int ... not available in Fortran
     !  OK until highest bit gets set.
     function FortSleep (seconds)  bind ( C, name="sleep" )
       import
       integer (c_int) :: FortSleep
       integer (c_int), intent (in), VALUE :: seconds
     end function FortSleep
  end interface
end module fortran_sleep

program mpitest
  use mpi
  use fortran_sleep
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  integer :: rank,num_process,ierr, tmp
  integer, volatile :: i  ! volatile: the loop test must re-read i from memory
  integer (c_int) :: wait_sec, how_long
  wait_sec = 5

  call mpi_init (ierr)
  call mpi_comm_rank (MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size (MPI_COMM_WORLD, num_process, ierr)
  call mpi_barrier (MPI_COMM_WORLD, ierr)
  print *, 'rank = ', rank
  call mpi_barrier (MPI_COMM_WORLD, ierr)

  i=0
  do while (i.eq.0)
     how_long = FortSleep(wait_sec)
  end do

  print*,"Rank ",rank," has escaped!"
  call mpi_barrier(MPI_COMM_WORLD, ierr)
  call mpi_finalize (ierr)
end program mpitest

Now compile with something like

> mpif90 prog.f90 -O0 -g -o prog.exe

If I now launch this on two cores of my local machine using

> mpirun -np 2 ./prog.exe

On screen I see just

 rank =            0
 rank =            1

Now in another terminal I connect to the relevant machine and find the relevant process id using

ps -ef | grep prog.exe

This gives me several process id values corresponding to the different ranks. I can then attach to one of these using

gdb --pid <pidFromPSCmd> ./prog.exe

Now that we're in gdb, we can see where we are in the program using bt (backtrace); it's likely we're in sleep. We then step through the program using s(tep) until we reach our main program. Now we set i to something non-zero and then c(ontinue) execution, which allows this rank's process to continue, and we see the "has escaped" message etc. The gdb section will look something like:

(gdb) bt
#0  0x00007f01354a1d70 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007f01354a1c24 in sleep () from /lib64/libc.so.6
#2  0x0000000000400ef9 in mpitest () at prog.f90:35
#3  0x0000000000400fe5 in main (argc=1, argv=0x7ffecdc8d0ae) at prog.f90:17
#4  0x00007f013540cb05 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000400d39 in _start () at ../sysdeps/x86_64/start.S:122
(gdb) s
Single stepping until exit from function __nanosleep_nocancel,
which has no line number information.
0x00007f01354a1c24 in sleep () from /lib64/libc.so.6
(gdb) s
Single stepping until exit from function sleep,
which has no line number information.
mpitest () at prog.f90:34
34    do while (i.eq.0)
(gdb) bt
#0  mpitest () at prog.f90:34
#1  0x0000000000400fe5 in main (argc=1, argv=0x7ffecdc8d0ae) at prog.f90:17
#2  0x00007f013540cb05 in __libc_start_main () from /lib64/libc.so.6
#3  0x0000000000400d39 in _start () at ../sysdeps/x86_64/start.S:122
(gdb) set var i = 1
(gdb) c
Continuing.

and in our original terminal we will see something like

 Rank            0  has escaped!
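A possible variation on the waiting loops in both answers is to pick the rank to hold at run time, so the program does not have to be recompiled to debug a different rank. The sketch below reads the rank from an environment variable with the Fortran 2003 get_environment_variable intrinsic. The variable name DEBUG_RANK and the module and function names are assumptions made for this example, not an MPI or compiler convention.

```fortran
module debug_rank_env
  implicit none
contains
  ! Return the rank named in the (hypothetical) DEBUG_RANK environment
  ! variable, or -1 if it is unset, empty, too long, or not an integer.
  function debug_rank_from_env() result(target_rank)
    integer :: target_rank
    character(len=32) :: value
    integer :: length, status, ios

    target_rank = -1
    call get_environment_variable("DEBUG_RANK", value, length, status)
    if (status /= 0 .or. length == 0) return

    read(value(1:length), *, iostat=ios) target_rank
    if (ios /= 0) target_rank = -1
  end function debug_rank_from_env
end module debug_rank_env
```

A rank would then enter its waiting loop only if rank == debug_rank_from_env(), so with the variable unset no rank waits. Depending on the MPI launcher, the variable may have to be forwarded to the ranks explicitly.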