I have a large program that needs to be made as resilient as possible, and has a large number of threads.
I need to catch the signals SIGBUS and SIGSEGV, and re-initialize the problem thread if necessary, or disable the thread to continue with reduced functionality.
My first thought is to do a setjmp, then set signal handlers that log the problem and do a longjmp back to a recovery point in the thread. The issue is that the signal handler would need to determine which thread the signal came from in order to use the appropriate jump buffer, as jumping back to the wrong thread would be useless.
Does anyone have any idea how to determine the offending thread in the signal handler?
Using syscall(SYS_gettid) works for me on my Linux box. Compile with: gcc pt.c -lpthread -Wall -Wextra
//pt.c
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <ucontext.h>
#include <stdlib.h>
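// One jump buffer per kernel thread ID. This assumes every tid is below 65536;
// check /proc/sys/kernel/pid_max, which can be configured much higher.
static sigjmp_buf jmpbuf[65536];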
static void handler(int sig, siginfo_t *siginfo, void *context)
{
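//ucontext_t *ucontext = context;
pid_t tid = syscall(SYS_gettid);
(void)siginfo; // unused; silences -Wextra
(void)context; // unused; silences -Wextra
// Note: printf is not async-signal-safe; it is used here for demonstration only.
printf("Thread %d in handler, signal %d\n", tid, sig);
siglongjmp(jmpbuf[tid], 1);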
}
static void *threadfunc(void *data)
{
int index, segvindex = *(int *)data;
pid_t tid = syscall(SYS_gettid);
for(index = 0; index < 500; index++) {
if (sigsetjmp(jmpbuf[tid], 1) == 1) {
printf("Recovery of thread %d\n", tid);
continue;
}
printf("Thread %d, index %d\n", tid, index);
if (index % 5 == segvindex) {
printf("%zu\n", strlen((char *)2)); // SIGSEGV
}
sched_yield(); // pthread_yield() is non-standard and deprecated in recent glibc
}
return NULL;
}
int main(void)
{
pthread_t thread1, thread2, thread3;
int segvindex1 = rand() % 5;
int segvindex2 = rand() % 5;
int segvindex3 = rand() % 5;
struct sigaction sact;
memset(&sact, 0, sizeof sact);
sact.sa_sigaction = handler;
sact.sa_flags = SA_SIGINFO;
if (sigaction(SIGSEGV, &sact, NULL) < 0) {
perror("sigaction");
return 1;
}
pthread_create(&thread1, NULL, &threadfunc, (void *) &segvindex1);
pthread_create(&thread2, NULL, &threadfunc, (void *) &segvindex2);
pthread_create(&thread3, NULL, &threadfunc, (void *) &segvindex3);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
return 0;
}
To be more portable, pthread_self can be used. It is async-signal-safe (POSIX.1-2008 Technical Corrigendum 1 added it to the list of async-signal-safe functions).
But the thread which got the SIGSEGV should start a new thread by async-signal-safe means and should not do a siglongjmp, as that could result in the invocation of non-async-signal-safe functions.
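As a rough sketch of the pthread_self approach: each thread claims a slot in a fixed table before it can fault, and the handler finds its own slot by comparing pthread_t values. The names register_slot, find_my_slot, record_slot_handler, install_record_slot_handler and MAX_SLOTS are made up for illustration. The sketch also assumes that pthread_equal is a plain comparison that is harmless inside a handler (as it is in glibc), and that each thread registers a distinct slot index.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

#define MAX_SLOTS 64 /* hypothetical upper bound on registered threads */

static pthread_t slot_owner[MAX_SLOTS];
static volatile sig_atomic_t slot_used[MAX_SLOTS];
static volatile sig_atomic_t last_signal_slot = -1;

/* Claim slot i for the calling thread. Call this before the thread can fault.
 * Assumption: every thread is given its own distinct index. */
static void register_slot(int i)
{
    assert(i >= 0 && i < MAX_SLOTS && !slot_used[i]);
    slot_owner[i] = pthread_self();
    slot_used[i] = 1; /* publish only after the owner has been stored */
}

/* Linear scan that uses nothing but pthread_self() and pthread_equal().
 * Returns the caller's slot index, or -1 if the caller never registered. */
static int find_my_slot(void)
{
    pthread_t me = pthread_self();
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (slot_used[i] && pthread_equal(slot_owner[i], me))
            return i;
    }
    return -1;
}

/* The handler only records which slot the signalled thread owns. */
static void record_slot_handler(int sig)
{
    (void)sig;
    last_signal_slot = find_my_slot();
}

/* Install the handler for the given signal; returns sigaction()'s result. */
static int install_record_slot_handler(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = record_slot_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

A real handler would use the slot index to pick per-thread recovery state (or to tell a monitor which thread to replace) instead of just storing it in last_signal_slot.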
I'm going to assume you've already thought this through and have an extremely good reason to believe that your program will be more resilient by attempting to retry after a SIGSEGV - bearing in mind segfaults highlight issues with dangling pointers and other abuses that might also be corrupting unpredictable locations in your process address space without segfaulting.
Since you've thought this through extremely carefully, and you've determined (somehow) that the particular way your application segfaults cannot possibly disguise the corruption of the accounting data used for canceling and restarting threads, and that you have perfect cancellation logic for those threads (also extraordinarily rare), let's go ahead and tackle the problem.
The SIGSEGV handler on Linux is executed in the thread of the failing instruction (man 7 signal). We can't call pthread_self() as it's not async-signal-safe, but the internet widely seems to agree that syscall (man 2 syscall) is safe, so we can get the thread ID via syscall SYS_gettid. So we'll need to maintain a mapping from pthread_t values (pthread_self) to kernel thread IDs (gettid()). Since write() is also safe, we can trap SEGV, write the current thread ID down a pipe, and then pause until pthread_cancel terminates us.
We also need a monitor thread to keep an eye on when things go pear-shaped. The monitor thread monitors the read end of the pipe for information on the terminated thread, and may restart it.
Because I think pretending to handle SIGSEGV is daft, I'm going to call the structures here which do so daft_thread_t, etc. someone_please_fix_me represents your broken code. The monitor thread is main(). When a thread segfaults, the signal handler traps it and writes the thread's ID down a pipe; the monitor reads the pipe, cancels the thread with pthread_cancel, reaps it with pthread_join, and restarts it.
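#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h> // pipe, read, write, pause, sleep, syscall
pid_t gettid(void); // defined further down; declared here because it is used before its definition
#define MAX_DAFT_THREADS (1024) // arbitrary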
#define CHECK_OSCALL(call, onfail) { \
if ((call) == -1) { \
char buf[512]; \
strerror_r(errno, buf, sizeof(buf)); \
fprintf(stderr, "%s@%d failed: %s\n", __FILE__, __LINE__, buf); \
onfail; \
} \
}
/*********************** daft thread accounting *****************/
typedef void* (*threadproc_t)(void* arg);
struct daft_thread_t {
threadproc_t start_routine;
void* start_routine_arg;
pthread_t pthread;
pid_t tid;
};
struct daft_thread_accounting_info_t {
int monitor_pipe[2];
pthread_mutex_t info_lock;
size_t daft_thread_count;
struct daft_thread_t daft_threads[MAX_DAFT_THREADS];
};
static struct daft_thread_accounting_info_t g_thread_accounting;
void daft_thread_accounting_info_init(struct daft_thread_accounting_info_t* inf)
{
memset(inf, 0, sizeof(*inf));
pthread_mutex_init(&inf->info_lock, NULL);
CHECK_OSCALL(pipe(inf->monitor_pipe), abort());
}
struct daft_thread_wrapper_data_t {
struct daft_thread_t* thread_info;
};
static void* daft_thread_wrapper(void* arg)
{
struct daft_thread_t* wrapper = arg;
wrapper->tid = gettid();
return (*wrapper->start_routine)(wrapper->start_routine_arg);
}
static void start_daft_thread(threadproc_t proc, void* arg)
{
struct daft_thread_t* info;
pthread_mutex_lock(&g_thread_accounting.info_lock);
assert (g_thread_accounting.daft_thread_count < MAX_DAFT_THREADS);
info = &g_thread_accounting.daft_threads[g_thread_accounting.daft_thread_count++];
pthread_mutex_unlock(&g_thread_accounting.info_lock);
info->start_routine = proc;
info->start_routine_arg = arg;
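// pthread_create returns an error number on failure rather than -1, so CHECK_OSCALL would never fire here.
if (pthread_create(&info->pthread, NULL, daft_thread_wrapper, info) != 0) abort();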
}
static struct daft_thread_t* find_thread_by_tid(pid_t thread_id)
{
int k;
struct daft_thread_t* info = NULL;
pthread_mutex_lock(&g_thread_accounting.info_lock);
for (k = 0; k < g_thread_accounting.daft_thread_count; ++k) {
if (g_thread_accounting.daft_threads[k].tid == thread_id) {
info = &g_thread_accounting.daft_threads[k];
break;
}
}
pthread_mutex_unlock(&g_thread_accounting.info_lock);
return info;
}
static void restart_daft_thread(struct daft_thread_t* info)
{
void* unused;
// pthread_* functions return an error number on failure rather than -1, so test against 0 directly.
if (pthread_cancel(info->pthread) != 0) abort();
if (pthread_join(info->pthread, &unused) != 0) abort();
info->tid = 0;
if (pthread_create(&info->pthread, NULL, daft_thread_wrapper, info) != 0) abort();
}
/************* signal handling stuff **************/
struct sigdeath_notify_info {
int signum;
pid_t tid;
};
static void sigdeath_handler(int signum, siginfo_t* info, void* ctx)
{
int z;
struct sigdeath_notify_info inf = {
.signum = signum,
.tid = gettid()
};
z = write(g_thread_accounting.monitor_pipe[1], &inf, sizeof(inf));
assert (z == sizeof(inf)); // or else SIGABRT. Are we handling that too? Hope not.
pause(); // returning doesn't do us any good.
}
static void register_signal_handlers()
{
struct sigaction sa = {};
sa.sa_sigaction = sigdeath_handler;
sa.sa_flags = SA_SIGINFO;
CHECK_OSCALL(sigaction(SIGSEGV, &sa, NULL), abort());
CHECK_OSCALL(sigaction(SIGBUS, &sa, NULL), abort());
}
pid_t gettid() { return (pid_t) syscall(SYS_gettid); }
/** This is the code that segfaults randomly. Kwality with a 'k'. */
static void* someone_please_fix_me(void* arg)
{
char* i_think_this_address_looks_nice = (char*) 42;
sleep(1 + rand() % 200);
i_think_this_address_looks_nice[0] = 'q'; // ugh
return NULL;
}
// main() will serve as the monitor thread here
int main()
{
int k;
struct sigdeath_notify_info death;
daft_thread_accounting_info_init(&g_thread_accounting);
register_signal_handlers();
for (k = 0; k < 200; ++k) {
start_daft_thread(someone_please_fix_me, (void*)(long) k); // go through long to avoid an int/pointer size mismatch
}
while (read(g_thread_accounting.monitor_pipe[0], &death, sizeof(death)) == sizeof(death)) {
struct daft_thread_t* info = find_thread_by_tid(death.tid);
if (info == NULL) {
fprintf(stderr, "*** thread_id %d not found\n", (int) death.tid);
continue;
}
fprintf(stderr, "Thread %d (%d) died of %d, restarting.\n",
(int) death.tid, (int)(long) info->start_routine_arg, death.signum);
restart_daft_thread(info);
}
fprintf(stderr, "Shouldn't get here.\n");
return 0;
}
If you haven't thought about it: Attempting to recover from SIGSEGV is extraordinarily risky - I strongly advise against it. Threads share an address space. The thread that segfaulted might also have corrupted other thread data or global accounting data, such as malloc()'s accounting. A far safer approach - assuming the failing code is irreparably broken but must be used - is to quarantine the failing code behind a process boundary, for instance by fork()ing before invoking the broken code. You then must trap SIGCHLD (or wait for the child) and deal with the process crashing or terminating normally, alongside a number of other pitfalls, but at least you don't have to worry about random corruption. Of course, the best option is to fix the bloody code so you're not observing segfaults.
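To make the process-boundary idea a bit more concrete, here is a minimal sketch. run_quarantined, risky_fn, QUARANTINE_ERROR and the two sample callbacks are names invented for this illustration. It waits synchronously with waitpid instead of trapping SIGCHLD, purely to keep the example short. It also assumes the result of the risky code fits in an exit status (0 to 255); anything richer would need a pipe or shared memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { QUARANTINE_ERROR = -1000 }; /* hypothetical code for fork/wait failures */

typedef int (*risky_fn)(void *arg);

/* Run fn(arg) in a child process.
 * Returns the child's exit status (0..255) if it exited normally,
 * -signum if it was killed by a signal, or QUARANTINE_ERROR otherwise. */
static int run_quarantined(risky_fn fn, void *arg)
{
    pid_t pid = fork();
    if (pid < 0)
        return QUARANTINE_ERROR;
    if (pid == 0) {
        /* Child: drop any handlers inherited from the parent so that a
         * fault really terminates this process. */
        signal(SIGSEGV, SIG_DFL);
        signal(SIGBUS, SIG_DFL);
        _exit(fn(arg) & 0xff);
    }
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return QUARANTINE_ERROR;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return QUARANTINE_ERROR;
}

/* Sample callback that behaves. */
static int well_behaved(void *arg)
{
    (void)arg;
    return 7;
}

/* Sample callback that faults, like someone_please_fix_me above. */
static int writes_to_bad_address(void *arg)
{
    (void)arg;
    volatile char *p = (volatile char *)42;
    p[0] = 'q';
    return 0;
}
```

One of the pitfalls mentioned above: if the parent is multi-threaded, the child contains only the thread that called fork(), and it should restrict itself to async-signal-safe calls (or exec a helper program) because locks held by other threads at fork time stay locked forever in the child.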
In my experience, when a threaded program receives a synchronous signal - i.e. one generated by something the program did, such as dereferencing a bad pointer - the thread that caused the problem receives the signal.
I've used one system that explicitly guaranteed this behaviour, but I don't know whether it's general. Also, if the offending thread has blocked the signal, as in a paradigm where one thread handles all signals, don't count on the signal going to the signal-handling thread instead: POSIX leaves the result undefined when SIGSEGV or SIGBUS is generated by a fault while the signal is blocked.