Program segfaults on alpine linux. How do I resolv

2019-05-17 18:34发布

问题:

I've been working on a webrtc datachannel library in C/C++ and wrote a program in C to:

  1. Create two peers from the same process.
  2. Establish a connection between them.
  3. Close the connection if it's successful.

Everything runs fine on a debian docker container and on my host opensuse tumbleweed (all x86_64 and 64bit), but on alpine linux container (64bit x86_64), I'm getting a SEGFAULT inside the child processes:

The function above is from the program's dependency "libnice". It seems like *agent == NULL and there is no way that is made null in the caller's scope. I even inserted a printf("Argument is %p", agent); right before the function call and it prints out its memory and I can verify it's not null. From the disassembly, it looks like the line where copying the agent's contents (0x557a1d20) as the local variable in the callee's stack results in a segfault. The segfault always occurs at this point even after a make clean and recompilation. Fail at activation record? Stack corruption?

UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place in that same priv_conn_keepalive_tick_unlocked. The argument seems to be set though (Notice the 0x7ffff7f9ad08):

Since I thought I might be hitting the libmusl's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size and it looks like it's already 8 MB and not 80k. Increasing this limit further does not seem to make any difference except that if I assign more than 8 MB, my program crashes early inside the Gdb. Gdb says it got an unknown signal "? ?"; outside the gdb, it crashes at the normal point where it normally crashes without the altered stack size.

I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.

Here's my program's flow:

For every peer that is created, a child process is created with a fork(). Parent <--> child communication is done by ZeroMQ and I use protocol buffers to forward any callbacks (and its arguments) that are triggered inside the child onto an event loop running in the parent process.

So for the above program, there are 2 child processes and 1 parent process.

Steps to reproduce:

  • Source file: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/examples/websocket_client/2in1.c
  • Alpine docker container: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/Dockerfile.amd64
  • Run the container and binary is located at /psl-librtcdcpp/examples/websocket_client/2in1
  • 2in1 will spawn two child processes both of which will crash.

回答1:

On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.

The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
pthread_setattr_default_np(&attr);


回答2:

Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately have the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you failed to declare a function that returns a pointer and the compiler (by an awful backwards default) assumes it returns int rather than producing an error, then subsequently allows the implicit conversion from int to pointer despite the C language disallowing it.