I have a large program that needs to be made as resilient as possible, and has a large number of threads.
I need to catch all SIGBUS and SIGSEGV signals, and re-initialize the problem thread if necessary, or disable the thread to continue with reduced functionality.
My first thought is to do a setjmp, and then set signal handlers that can log the problem and then do a longjmp back to a recovery point in the thread. The issue is that the signal handler would need to determine which thread the signal came from in order to use the appropriate jump buffer, as jumping back to the wrong thread would be useless.
Does anyone have any idea how to determine the offending thread in the signal handler?
Using syscall(SYS_gettid) works for me on my Linux box. Compiled with:
gcc pt.c -lpthread -Wall -Wextra
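A minimal sketch of such a check (not the original pt.c; the names on_segv, worker and handler_runs_in_faulting_thread are made up for illustration). It assumes Linux with glibc, and it uses raise(SIGSEGV) to simulate the fault so that the handler can return normally; after a real bad memory access, returning from the handler would simply re-execute the faulting instruction.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID seen inside the handler (hypothetical global). */
static volatile pid_t handler_tid;

static void on_segv(int sig)
{
    (void)sig;
    /* Raw system call: reports the ID of the thread running the handler. */
    handler_tid = (pid_t)syscall(SYS_gettid);
}

static void *worker(void *arg)
{
    pid_t *own_tid = arg;
    *own_tid = (pid_t)syscall(SYS_gettid);
    /* Simulated fault: in a threaded program raise() signals the caller. */
    raise(SIGSEGV);
    return NULL;
}

/* Returns 1 if the handler ran in the worker thread, 0 if not, -1 on error. */
int handler_runs_in_faulting_thread(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    pid_t worker_tid = 0;
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &worker_tid) != 0)
        return -1;
    pthread_join(t, NULL);

    pid_t main_tid = (pid_t)syscall(SYS_gettid);
    return handler_tid == worker_tid && handler_tid != main_tid;
}
```

Calling handler_runs_in_faulting_thread() from main() and checking for 1 is enough to confirm the behaviour on a given system.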
To be more portable, pthread_self can be used. It is async-signal-safe.
But the thread which got the SIGSEGV should start a new thread by async-signal-safe means and should not do a siglongjmp, as it could result in the invocation of non-async-signal-safe functions.
I'm going to assume you've already thought this through and have an extremely good reason to believe that your program will be more resilient by attempting to retry after a SIGSEGV - bearing in mind segfaults highlight issues with dangling pointers and other abuses that might also be corrupting unpredictable locations in your process address space without segfaulting.
Since you've thought this through extremely carefully, and you've determined (somehow) that the particular way your application segfaults cannot possibly disguise the corruption of the accounting data used for canceling and restarting threads, and that you have perfect cancellation logic for those threads (also extraordinarily rare), let's go ahead and tackle the problem.
The SIGSEGV handler on Linux is executed in the thread of the failing instruction (man 7 signal). We can't call pthread_self() as it's not async signal safe, but the internet widely seems to agree that syscall (man 2 syscall) is safe, so we can get the thread ID via syscall(SYS_gettid). So we'll need to maintain a mapping of pthread_t values (pthread_self()) to kernel thread IDs (gettid()). Since write() is also safe, we can trap SEGV, write the current thread ID down a pipe, and then pause until pthread_cancel terminates us.
We also need a monitor thread to keep an eye on when things go pear-shaped. The monitor thread monitors the read end of the pipe for information on the terminated thread, and may restart it.
Because I think pretending to handle SIGSEGV is daft, I'm going to call the structures here which do so daft_thread_t, etc. someone_please_fix_me represents your broken code. The monitor thread is main(). When a thread segfaults, the signal handler traps the fault and writes the thread's ID down a pipe; the monitor reads the pipe, cancels the thread with pthread_cancel, reaps it with pthread_join, and restarts it.
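A compact sketch of that design follows. It keeps the names daft_thread_t and someone_please_fix_me from the description; everything else (struct report, send_report, monitor_until_clean_exit, the single-entry "table") is invented for illustration. It assumes Linux with glibc, simulates the fault with raise(SIGSEGV) on the first start only, and puts the monitor loop in a function rather than in main() so it can be called from a test.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct report { pid_t tid; int faulted; };   /* message sent down the pipe */

typedef struct {
    pthread_t thread;      /* handle used for cancel/join */
    volatile pid_t tid;    /* kernel thread ID, recorded by the thread */
    int starts;            /* how many times the thread was started */
} daft_thread_t;

static int report_pipe[2];
static daft_thread_t daft;   /* a real program would keep a table of these */

static void send_report(int faulted)
{
    struct report r = { (pid_t)syscall(SYS_gettid), faulted };
    ssize_t n = write(report_pipe[1], &r, sizeof r);
    (void)n;
}

static void on_segv(int sig)
{
    (void)sig;
    send_report(1);
    for (;;)
        pause();           /* cancellation point: wait for pthread_cancel */
}

static void *someone_please_fix_me(void *arg)
{
    daft_thread_t *self = arg;
    self->tid = (pid_t)syscall(SYS_gettid);
    if (self->starts == 1)
        raise(SIGSEGV);    /* simulated fault on the first start only */
    send_report(0);        /* clean completion */
    return NULL;
}

/* Monitor loop: returns the number of restarts, or -1 on error. */
int monitor_until_clean_exit(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (pipe(report_pipe) != 0 || sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    int restarts = 0;
    for (;;) {
        daft.starts++;
        if (pthread_create(&daft.thread, NULL, someone_please_fix_me, &daft) != 0)
            return -1;

        struct report r;
        if (read(report_pipe[0], &r, sizeof r) != (ssize_t)sizeof r)
            return -1;
        if (!r.faulted) {
            pthread_join(daft.thread, NULL);
            return restarts;
        }
        if (r.tid != daft.tid)       /* tid -> pthread_t lookup, one entry */
            return -1;
        pthread_cancel(daft.thread); /* thread is parked in pause() */
        pthread_join(daft.thread, NULL);
        restarts++;
    }
}
```

With these assumptions the worker faults once, is cancelled and restarted, and then exits cleanly, so the function reports one restart. Note that the sketch does nothing about whatever memory the faulting thread may have damaged before it was caught.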
If you haven't thought about it: Attempting to recover from SIGSEGV is extraordinarily risky - I strongly advise against it. Threads share an address space. The thread that segfaulted might also have corrupted other thread data or global accounting data, such as malloc()'s accounting. A far safer approach - assuming the failing code is irreparably broken but must be used - is to quarantine the failing code behind a process boundary, for instance by fork()ing before invoking the broken code. You then must trap SIGCLD and deal with the process crashing or terminating normally, alongside a number of other pitfalls, but at least you don't have to worry about random corruption. Of course, the best option is to fix the bloody code so you're not observing segfaults.
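As a rough illustration of the process-boundary idea, the sketch below runs a function in a fork()ed child and reports how the child ended. The names run_quarantined, broken_code and working_code are hypothetical, and for brevity the parent blocks in waitpid() instead of trapping the child-termination signal. It also assumes the caller is aware that fork() in a multithreaded process duplicates only the calling thread.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-ins for the code being quarantined (hypothetical). */
void broken_code(void)  { raise(SIGSEGV); }  /* simulated crash */
void working_code(void) { }

/*
 * Runs fn in a child process.
 * Returns 0 if the child exited normally, the terminating signal number
 * if it was killed by a signal, or -1 if fork/waitpid failed.
 */
int run_quarantined(void (*fn)(void))
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: let a fault kill this process rather than run a handler. */
        signal(SIGSEGV, SIG_DFL);
        signal(SIGBUS, SIG_DFL);
        fn();
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

Here run_quarantined(broken_code) returns SIGSEGV while the parent's address space is untouched, and run_quarantined(working_code) returns 0. Getting results back from the child (a pipe, shared memory, an exit code) is left out.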
In my experience, when a threaded program receives a synchronous signal - i.e. one generated by something the program did, such as dereferencing a bad pointer - the thread that caused the problem receives the signal.
I've used one system that explicitly guaranteed this behaviour, but I don't know whether it's general. Also, of course, if the offending thread has blocked the signal, as in a paradigm where one thread handles all signals, presumably it will go to the signal-handling thread.