Segfault on declaring a variable of type vector

Posted 2019-03-17 04:42

Question:

Code

Here is the program that gives the segfault.

#include <iostream>
#include <vector>
#include <memory>

int main() 
{
    std::cout << "Hello World" << std::endl;

    std::vector<std::shared_ptr<int>> y {};  

    std::cout << "Hello World" << std::endl;
}

Of course, there is absolutely nothing wrong with the program itself. The root cause of the segfault depends on the environment in which it is built and run.


Background

We, at Amazon, use a build system which builds and deploys the binaries (lib and bin) in an almost machine-independent way. For our case, that basically means it deploys the executable (built from the above program) into $project_dir/build/bin/ and almost all its dependencies (i.e., the shared libraries) into $project_dir/build/lib/. I say "almost" because for shared libraries such as libc.so, libm.so, ld-linux-x86-64.so.2, and possibly a few others, the executable picks them up from the system (i.e., from /lib64). Note that it is supposed to pick up libstdc++ from $project_dir/build/lib, though.

Now I run it as follows:

$ LD_LIBRARY_PATH=$project_dir/build/lib ./build/bin/run

segmentation fault

However, if I run it without setting LD_LIBRARY_PATH, it runs fine.


Diagnostics

1. ldd

Here is the ldd output for both cases (please note that I've edited the output to show the full version of the libraries wherever there is a difference):

$ LD_LIBRARY_PATH=$project_dir/build/lib ldd ./build/bin/run

linux-vdso.so.1 =>  (0x00007ffce19ca000)
libstdc++.so.6 => $project_dir/build/lib/libstdc++.so.6.0.20 
libgcc_s.so.1 =>  $project_dir/build/lib/libgcc_s.so.1 
libc.so.6 => /lib64/libc.so.6 
libm.so.6 => /lib64/libm.so.6 
/lib64/ld-linux-x86-64.so.2 (0x0000562ec51bc000)

and without LD_LIBRARY_PATH:

$ ldd ./build/bin/run

linux-vdso.so.1 =>  (0x00007fffcedde000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6.0.16 
libgcc_s.so.1 => /lib64/libgcc_s-4.4.6-20110824.so.1
libc.so.6 => /lib64/libc.so.6 
libm.so.6 => /lib64/libm.so.6 
/lib64/ld-linux-x86-64.so.2 (0x0000560caff38000)

2. gdb when it segfaults

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7dea45c in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.62.al12.x86_64
(gdb) bt
#0  0x00007ffff7dea45c in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7df0c55 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7b1dc41 in std::locale::_S_initialize() () from $project_dir/build/lib/libstdc++.so.6
#3  0x00007ffff7b1dc85 in std::locale::locale() () from $project_dir/build/lib/libstdc++.so.6
#4  0x00007ffff7b1a574 in std::ios_base::Init::Init() () from $project_dir/build/lib/libstdc++.so.6
#5  0x0000000000400fde in _GLOBAL__sub_I_main () at $project_dir/build/gcc-4.9.4/include/c++/4.9.4/iostream:74
#6  0x00000000004012ed in __libc_csu_init ()
#7  0x00007ffff7518cb0 in __libc_start_main () from /lib64/libc.so.6
#8  0x0000000000401021 in _start ()
(gdb)

3. LD_DEBUG=all

I also tried to see the linker information by enabling LD_DEBUG=all for the segfault case. I found something suspicious: it searches for the pthread_once symbol, and when it is unable to find it, it segfaults (that is my interpretation of the following output snippet, BTW):

initialize program: $project_dir/build/bin/run

symbol=_ZNSt8ios_base4InitC1Ev;  lookup in file=$project_dir/build/bin/run [0]
symbol=_ZNSt8ios_base4InitC1Ev;  lookup in file=$project_dir/build/lib/libstdc++.so.6 [0]
binding file $project_dir/build/bin/run [0] to $project_dir/build/lib/libstdc++.so.6 [0]: normal symbol `_ZNSt8ios_base4InitC1Ev' [GLIBCXX_3.4]
symbol=_ZNSt6localeC1Ev;  lookup in file=$project_dir/build/bin/run [0]
symbol=_ZNSt6localeC1Ev;  lookup in file=$project_dir/build/lib/libstdc++.so.6 [0]
binding file $project_dir/build/lib/libstdc++.so.6 [0] to $project_dir/build/lib/libstdc++.so.6 [0]: normal symbol `_ZNSt6localeC1Ev' [GLIBCXX_3.4]
symbol=pthread_once;  lookup in file=$project_dir/build/bin/run [0]
symbol=pthread_once;  lookup in file=$project_dir/build/lib/libstdc++.so.6 [0]
symbol=pthread_once;  lookup in file=$project_dir/build/lib/libgcc_s.so.1 [0]
symbol=pthread_once;  lookup in file=/lib64/libc.so.6 [0]
symbol=pthread_once;  lookup in file=/lib64/libm.so.6 [0]
symbol=pthread_once;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]

But I don't see any pthread_once lookup in the case where it runs successfully!


Questions

I know that it's very difficult to debug like this, and I've probably not given much information about the environment. But still, my question is: what could be the possible root cause of this segfault? How can I debug further and find it? Once I find the issue, the fix should be easy.


Compiler and Platform

I'm using GCC 4.9 on RHEL5.


Experiments

E#1

If I comment out the following line:

std::vector<std::shared_ptr<int>> y {}; 

It compiles and runs fine!

E#2

I just added the following header to my program:

#include <boost/filesystem.hpp>

and linked accordingly. Now it works without any segfault. So it seems that by having a dependency on libboost_system.so.1.53.0, some requirements are met, or the problem is circumvented!

E#3

Since I saw it working when the executable was linked against libboost_system.so.1.53.0, I did the following things step by step.

Instead of using #include <boost/filesystem.hpp> in the code itself, I used the original code and ran it by preloading libboost_system.so using LD_PRELOAD as follows:

$ LD_PRELOAD=$project_dir/build/lib/libboost_system.so $project_dir/build/bin/run

and it ran successfully!

Next I ran ldd on libboost_system.so, which gave a list of libs, two of which were:

  /lib64/librt.so.1
  /lib64/libpthread.so.0

So instead of preloading libboost_system, I preloaded librt and libpthread separately:

$ LD_PRELOAD=/lib64/librt.so.1 $project_dir/build/bin/run

$ LD_PRELOAD=/lib64/libpthread.so.0 $project_dir/build/bin/run

In both cases, it ran successfully.

Now my conclusion is that by loading either librt or libpthread (or both), some requirements are met, or the problem is circumvented! I still don't know the root cause of the issue, though.


Compilation and Linking Options

The build system is complex and passes lots of options by default, so I tried explicitly adding -lpthread using CMake's set command. Then it worked, which is consistent with what we have already seen: preloading libpthread makes it work!

In order to see the build difference between these two cases (when-it-works and when-it-gives-segfault), I built it in verbose mode by passing -v to GCC, to see the compilation stages and the options it actually passes to cc1plus (compiler) and collect2 (linker).

(Note that paths have been edited for brevity, using dollar-sign and dummy paths.)

$/gcc-4.9.4/cc1plus -quiet -v -I /a/include -I /b/include -iprefix $/gcc-4.9.4/ -MMD main.cpp.d -MF main.cpp.o.d -MT main.cpp.o -D_GNU_SOURCE -D_REENTRANT -D __USE_XOPEN2K8 -D _LARGEFILE_SOURCE -D _FILE_OFFSET_BITS=64 -D __STDC_FORMAT_MACROS -D __STDC_LIMIT_MACROS -D NDEBUG $/lab/main.cpp -quiet -dumpbase main.cpp -msse -mfpmath=sse -march=core2 -auxbase-strip main.cpp.o -g -O3 -Wall -Wextra -std=gnu++1y -version -fdiagnostics-color=auto -ftemplate-depth=128 -fno-operator-names -o /tmp/ccxfkRyd.s

Irrespective of whether it works or not, the command-line arguments to cc1plus are exactly the same. No difference at all. That does not seem to be very helpful.

The difference, however, is at the linking time. Here is what I see, for the case when it works:

$/gcc-4.9.4/collect2 -plugin $/gcc-4.9.4/liblto_plugin.so
-plugin-opt=$/gcc-4.9.4/lto-wrapper -plugin-opt=-fresolution=/tmp/cchl8RtI.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lpthread -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --eh-frame-hdr -m elf_x86_64 -export-dynamic -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o run /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o $/gcc-4.9.4/crtbegin.o -L/a/lib -L/b/lib -L/c/lib -lpthread --as-needed main.cpp.o -lboost_timer -lboost_wave -lboost_chrono -lboost_filesystem -lboost_graph -lboost_locale -lboost_thread -lboost_wserialization -lboost_atomic -lboost_context -lboost_date_time -lboost_iostreams -lboost_math_c99 -lboost_math_c99f -lboost_math_c99l -lboost_math_tr1 -lboost_math_tr1f -lboost_math_tr1l -lboost_mpi -lboost_prg_exec_monitor -lboost_program_options -lboost_random -lboost_regex -lboost_serialization -lboost_signals -lboost_system -lboost_unit_test_framework -lboost_exception -lboost_test_exec_monitor -lbz2 -licui18n -licuuc -licudata -lz -rpath /a/lib:/b/lib:/c/lib: -lstdc++ -lm -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc $/gcc-4.9.4/crtend.o /usr/lib/../lib64/crtn.o

As you can see, -lpthread is mentioned twice! The first -lpthread (which is followed by --as-needed) is missing for the case when it gives segfault. That is the only difference between these two cases.


Output of nm -C in both cases

Interestingly, the output of nm -C in both cases is identical (if you ignore the integer values in the first columns).

0000000000402580 d _DYNAMIC
0000000000402798 d _GLOBAL_OFFSET_TABLE_
0000000000401000 t _GLOBAL__sub_I_main
0000000000401358 R _IO_stdin_used
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 w _Jv_RegisterClasses
                 U _Unwind_Resume
0000000000401150 W std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_destroy()
0000000000401170 W std::vector<std::shared_ptr<int>, std::allocator<std::shared_ptr<int> > >::~vector()
0000000000401170 W std::vector<std::shared_ptr<int>, std::allocator<std::shared_ptr<int> > >::~vector()
0000000000401250 W std::vector<std::unique_ptr<int, std::default_delete<int> >, std::allocator<std::unique_ptr<int, std::default_delete<int> > > >::~vector()
0000000000401250 W std::vector<std::unique_ptr<int, std::default_delete<int> >, std::allocator<std::unique_ptr<int, std::default_delete<int> > > >::~vector()
                 U std::ios_base::Init::Init()
                 U std::ios_base::Init::~Init()
0000000000402880 B std::cout
                 U std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)
0000000000402841 b std::__ioinit
                 U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
                 U operator delete(void*)
                 U operator new(unsigned long)
0000000000401510 r __FRAME_END__
0000000000402818 d __JCR_END__
0000000000402818 d __JCR_LIST__
0000000000402820 d __TMC_END__
0000000000402820 d __TMC_LIST__
0000000000402838 A __bss_start
                 U __cxa_atexit
0000000000402808 D __data_start
0000000000401100 t __do_global_dtors_aux
0000000000402820 t __do_global_dtors_aux_fini_array_entry
0000000000402810 d __dso_handle
0000000000402828 t __frame_dummy_init_array_entry
                 w __gmon_start__
                 U __gxx_personality_v0
0000000000402838 t __init_array_end
0000000000402828 t __init_array_start
00000000004012b0 T __libc_csu_fini
00000000004012c0 T __libc_csu_init
                 U __libc_start_main
                 w __pthread_key_create
0000000000402838 A _edata
0000000000402990 A _end
000000000040134c T _fini
0000000000400e68 T _init
0000000000401028 T _start
0000000000401054 t call_gmon_start
0000000000402840 b completed.6661
0000000000402808 W data_start
0000000000401080 t deregister_tm_clones
0000000000401120 t frame_dummy
0000000000400f40 T main
00000000004010c0 t register_tm_clones

Answer 1:

Given the point of crash, and the fact that preloading libpthread seems to fix it, I believe that the execution of the two cases diverges at locale_init.cc:315. Here is an extract of the code:

  void
  locale::_S_initialize()
  {
#ifdef __GTHREADS
    if (__gthread_active_p())
      __gthread_once(&_S_once, _S_initialize_once);
#endif
    if (!_S_classic)
      _S_initialize_once();
  }

__gthread_active_p() returns true if your program is linked against pthread; specifically, it checks whether pthread_key_create is available. On my system, this symbol is defined in "/usr/include/c++/7.2.0/x86_64-pc-linux-gnu/bits/gthr-default.h" as static inline, hence it is a potential source of ODR violations.
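To make the probing idea concrete, here is a minimal sketch of testing a weak symbol's address. It is not libstdc++'s code: both symbol names are hypothetical, and it assumes GCC or Clang on an ELF platform, where an unresolved weak reference ends up with a null address instead of causing a link error.

```cpp
#include <cassert>

// Hypothetical symbols, invented for this illustration.

// Weak declaration that nothing defines: the reference stays unresolved and
// its address is null at run time (assumption: GCC/Clang on ELF).
extern "C" void hypothetical_missing_symbol() __attribute__((weak));

// Weak definition: the symbol exists, so its address is non-null.
extern "C" __attribute__((weak)) void hypothetical_present_symbol() {}

// The kind of probe discussed above: "is the library there?" is answered by
// testing the address of one of its symbols.
bool is_bound(void (*fn)()) {
    return fn != nullptr;
}

// Calling through an unbound weak reference would jump to address 0,
// so the call is guarded by the same probe.
void call_bound(void (*fn)()) {
    assert(is_bound(fn) && "unbound weak symbol: a call would jump to address 0");
    fn();
}
```

With these definitions, is_bound(&hypothetical_missing_symbol) is false and is_bound(&hypothetical_present_symbol) is true. The probe only tells you about the symbol it tests, which matters when the symbol that is later called is a different one.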

Notice that LD_PRELOAD=libpthread.so will always cause __gthread_active_p() to return true.

__gthread_once is another inlined symbol that should always forward to pthread_once.

It's hard to guess what's going on without debugging, but I suspect that you are hitting the true branch of __gthread_active_p() even when it shouldn't, and the program then crashes because there is no pthread_once to call.

EDIT: I did some experiments. The only way I see to get a crash in std::locale::_S_initialize is if __gthread_active_p returns true but pthread_once is not linked in.

libstdc++ does not link directly against pthread, but it imports half of pthread_xx as weak objects, which means they can be undefined and not cause a linker error.

Obviously linking pthread will make the crash disappear, but if I am right, the main issue is that your libstdc++ thinks that it is inside a multi-threaded executable even if we did not link pthread in.
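The suspected failure mode can be sketched as a toy model of the locale::_S_initialize() extract. Everything here is a stand-in and an assumption on my part: plain pointers play the role of the weak symbols (probe for __pthread_key_create, once for pthread_once), fake_once does no real synchronisation, and the WouldCrash outcome marks the spot where the real code would call through a null pointer.

```cpp
#include <cassert>

using InitFn = void (*)();
using OnceFn = int (*)(InitFn);   // stand-in for pthread_once

enum class Outcome { ThreadedPath, DirectPath, WouldCrash };

bool g_initialized = false;       // stand-in for _S_classic being set
void initialize_once() { g_initialized = true; }

// Stand-in for a working pthread_once (no real synchronisation here).
int fake_once(InitFn init) { init(); return 0; }

// Toy model: the probe decides the branch, but a different symbol is called.
Outcome model_initialize(const void* probe, OnceFn once) {
    const bool threads_active = (probe != nullptr);   // models __gthread_active_p()
    if (threads_active) {
        if (once == nullptr)
            return Outcome::WouldCrash;   // the real code would call through null here
        once(initialize_once);
    }
    if (!g_initialized)
        initialize_once();
    assert(g_initialized);
    return threads_active ? Outcome::ThreadedPath : Outcome::DirectPath;
}
```

The interesting case is a non-null probe combined with a null once: the probe says "threads are active", yet the function that is about to be called was never bound.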

Now, __gthread_active_p uses __pthread_key_create to decide whether we have threads or not. This appears in your executable as a weak object (it can be nullptr and still be fine). I am 99% sure that the symbol is there because of shared_ptr (remove it and check nm again to be sure). So, somehow __pthread_key_create gets bound to a valid address, maybe because of that last -lpthread in your linker flags. You can verify this theory by putting a breakpoint at locale_init.cc:315 and checking which branch you take.
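If a debugger is inconvenient, a rough alternative is to ask the probe directly from a small test program. The sketch below assumes libstdc++, whose extension header <ext/atomicity.h> pulls in the gthreads header that declares __gthread_active_p(); the function name libstdcxx_threads_active and the -1 "unknown" value are my own. It reports what the executable's copy of the inline probe sees, which may not match the copy inside libstdc++.so in every setup, and on recent glibc versions (where pthread lives in libc) it is expected to report 1 regardless.

```cpp
#include <cassert>
#include <memory>   // any libstdc++ header defines __GLIBCXX__

#if defined(__GLIBCXX__)
#include <ext/atomicity.h>   // libstdc++ extension; includes the gthreads header
#endif

// Returns 1 if the probe says threads are active, 0 if not,
// and -1 if the probe is not available (not libstdc++, or no gthreads).
int libstdcxx_threads_active() {
#if defined(__GLIBCXX__) && defined(__GTHREADS)
    return __gthread_active_p() ? 1 : 0;
#else
    return -1;
#endif
}

// Human-readable form of the value above.
const char* describe(int state) {
    assert(state >= -1 && state <= 1);
    if (state == 1) return "threads active";
    if (state == 0) return "threads not active";
    return "unknown";
}
```

Building this once with and once without the early -lpthread, and printing describe(libstdcxx_threads_active()) from main, would show whether the two link lines lead to different answers.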

EDIT2:

To summarize the comments, the issue is only reproducible if all of the following hold:

  1. ld.gold is used instead of ld.bfd.
  2. --as-needed is used.
  3. A weak reference to __pthread_key_create is forced, in this case via instantiation of std::shared_ptr.
  4. pthread is not linked, or it is linked after --as-needed.

To answer the questions in the comments:

Why does it use gold by default?

By default it uses /usr/bin/ld, which on most distros is a symlink to either /usr/bin/ld.bfd or /usr/bin/ld.gold. This default can be changed using update-alternatives. I am not sure why in your case it is ld.gold; as far as I understand, RHEL5 ships with ld.bfd as the default.

And why does gold not add pthread.so dependency to the binary if it is needed?

Because the definition of what is needed is somewhat murky. man ld says (emphasis mine):

--as-needed

--no-as-needed

This option affects ELF DT_NEEDED tags for dynamic libraries mentioned on the command line after the --as-needed option. Normally the linker will add a DT_NEEDED tag for each dynamic library mentioned on the command line, regardless of whether the library is actually needed or not. --as-needed causes a DT_NEEDED tag to only be emitted for a library that at that point in the link satisfies a non-weak undefined symbol reference from a regular object file or, if the library is not found in the DT_NEEDED lists of other needed libraries, a non-weak undefined symbol reference from another needed dynamic library. Object files or libraries appearing on the command line after the library in question do not affect whether the library is seen as needed. This is similar to the rules for extraction of object files from archives. --no-as-needed restores the default behaviour.

Now, according to this bug report, gold is honoring the "non weak undefined symbol" part, while ld.bfd sees weak symbols as needed. TBH I do not have a full understanding on this, and there is some discussion on that link as to whether this is to be considered a ld.gold bug, or a libstdc++ bug.
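The difference between the two readings can be written down as a toy decision rule. This is only a model of the wording quoted above, not how either linker is implemented: the type and function names are hypothetical, NonWeakOnly follows the man page text (the behaviour attributed to gold here), and WeakCounts follows the behaviour attributed to ld.bfd.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One undefined symbol reference that a given library could satisfy.
struct Reference {
    std::string symbol;
    bool weak;
};

enum class LinkerModel { NonWeakOnly, WeakCounts };

// Toy rule: does the library get a DT_NEEDED entry?
// as_needed_active: the library appears after --as-needed on the command line.
// satisfied: the references this library would satisfy at that point in the link.
bool gets_dt_needed(bool as_needed_active,
                    const std::vector<Reference>& satisfied,
                    LinkerModel model) {
    if (!as_needed_active)
        return true;   // default: every library on the command line is recorded
    for (const Reference& ref : satisfied) {
        if (!ref.weak || model == LinkerModel::WeakCounts)
            return true;
    }
    return false;
}
```

For a library that only satisfies a weak reference such as __pthread_key_create, the model drops it under NonWeakOnly when --as-needed is active, and keeps it under WeakCounts or when it is listed before --as-needed.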

Why do I need to mention both -pthread and -lpthread? (-pthread is passed by default by our build system, and I've passed -lpthread to make it work when gold is used.)

-pthread and -lpthread do different things (see pthread vs lpthread). It is my understanding that the former should imply the latter.

Regardless, you can probably pass -lpthread only once, but you need to do it before --as-needed, or use --no-as-needed after the last library and before -lpthread.

It is also worth mentioning that I was not able to reproduce this issue on my system (GCC 7.2), even using the gold linker. So I suspect that it has been fixed in a more recent version of libstdc++, which might also explain why it does not segfault if you use the system standard library.



Answer 2:

This is likely a problem caused by subtle mismatches between libstdc++ ABIs. GCC 4.9 is not the system compiler on Red Hat Enterprise Linux 5, so it's not quite clear what you are using there (DTS 3?).

The locale implementation is known to be quite sensitive to ABI mismatches. See this thread on the gcc-help list:

  • Binary compatibility between an old static libstdc++ and a new dynamic one
  • plus follow-ups in the next month

Your best bet is to figure out which bits of libstdc++ were linked where, and somehow achieve consistency (either by hiding symbols, or recompiling things so that they are compatible).

It may also be useful to investigate the hybrid linkage model used for libstdc++ in Red Hat's Developer Toolset (where newer bits are linked statically, but the bulk of the C++ standard library uses the existing system DSO), but the system libstdc++ in Red Hat Enterprise Linux 5 might be too old for that if you need support for current language features.