Interpreting segfault messages

2019-01-16 01:11发布

问题:

What is the correct interpretation of the following segfault messages?

segfault at 10 ip 00007f9bebcca90d sp 00007fffb62705f0 error 4 in libQtWebKit.so.4.5.2[7f9beb83a000+f6f000]
segfault at 10 ip 00007fa44d78890d sp 00007fff43f6b720 error 4 in libQtWebKit.so.4.5.2[7fa44d2f8000+f6f000]
segfault at 11 ip 00007f2b0022acee sp 00007fff368ea610 error 4 in libQtWebKit.so.4.5.2[7f2aff9f7000+f6f000]
segfault at 11 ip 00007f24b21adcee sp 00007fff7379ded0 error 4 in libQtWebKit.so.4.5.2[7f24b197a000+f6f000]

回答1:

This is a segfault due to following a null pointer trying to find code to run (that is, during an instruction fetch).

If this were a program, not a shared library

Run addr2line -e yourSegfaultingProgram 00007f9bebcca90d (and repeat for the other instruction pointer values given) to see where the error is happening. Better, get a debug-instrumented build, and reproduce the problem under a debugger such as gdb.

Since it's a shared library

You're hosed, unfortunately; it's not possible to know where the libraries were placed in memory by the dynamic linker after-the-fact. Reproduce the problem under gdb.

What the error means

Here's the breakdown of the fields:

  • address (after the at) - the location in memory the code is trying to access (it's likely that 10 and 11 are offsets from a pointer we expect to be set to a valid value but which is instead pointing to 0)
  • ip - instruction pointer, ie. where the code which is trying to do this lives
  • sp - stack pointer
  • error - An error code for page faults; see below for what this means on x86.

    /*
     * Page fault error code bits:
     *
     *   bit 0 ==    0: no page found       1: protection fault
     *   bit 1 ==    0: read access         1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==                           1: use of reserved bit detected
     *   bit 4 ==                           1: fault was an instruction fetch
     */
    


回答2:

Error 4 means "The cause was a user-mode read resulting in no page being found.". There's a tool that decodes it here.

Here's the definition from the kernel. Keep in mind that 4 means that bit 2 is set and no other bits are set. If you convert it to binary that becomes clear.

/*
 * Page fault error code bits
 *      bit 0 == 0 means no page found, 1 means protection fault
 *      bit 1 == 0 means read, 1 means write
 *      bit 2 == 0 means kernel, 1 means user-mode
 *      bit 3 == 1 means use of reserved bit detected
 *      bit 4 == 1 means fault was an instruction fetch
 */
#define PF_PROT         (1<<0)
#define PF_WRITE        (1<<1)
#define PF_USER         (1<<2)
#define PF_RSVD         (1<<3)
#define PF_INSTR        (1<<4)

Now then, "ip 00007f9bebcca90d" means the instruction pointer was at 0x00007f9bebcca90d when the segfault happened.

"libQtWebKit.so.4.5.2[7f9beb83a000+f6f000]" tells you:

  • The object the crash was in: "libQtWebKit.so.4.5.2"
  • The base address of that object "7f9beb83a000"
  • How big that object is: "f6f000"

If you take the base address and subtract it from the ip, you get the offset into that object:

0x00007f9bebcca90d - 0x7f9beb83a000 = 0x49090D

Then you can run addr2line on it:

addr2line -e /usr/lib64/qt45/lib/libQtWebKit.so.4.5.2 -fCi 0x49090D
??
??:0

In my case it wasn't successful, either the copy I installed isn't identical to yours, or it's stripped.



回答3:

Let's go to the source -- 2.6.32, for example. The message is printed by show_signal_msg() function in arch/x86/mm/fault.c if the show_unhandled_signals sysctl is set.

"error" is not an errno nor a signal number, it's a "page fault error code" -- see definition of enum x86_pf_error_code.

"[7fa44d2f8000+f6f000]" is starting address and size of virtual memory area where offending object was mapped at the time of crash. Value of "ip" should fit in this region. With this info in hand, it should be easy to find offending code in gdb.