How to trigger spurious wake-up within a Linux app

Posted 2020-06-11 14:15

Question:

Some background:

I have an application that relies on third party hardware and a closed source driver. The driver currently has a bug in it that causes the device to stop responding after a random period of time. This is caused by an apparent deadlock within the driver and interrupts proper functioning of my application, which is in an always-on 24/7 highly visible environment.

What I have found is that attaching GDB to the process, and immediately detaching GDB from the process results in the device resuming functionality. This was my first indication that there was a thread locking issue within the driver itself. There is some kind of race condition that leads to a deadlock. Attaching GDB was obviously causing some reshuffling of threads and probably pushing them out of their wait state, causing them to re-evaluate their conditions and thus breaking the deadlock.

The question:

My question is simply this: is there a clean way for an application to make all threads within the program interrupt their wait state? One thing that definitely works (at least on my implementation) is to send a SIGSTOP followed immediately by a SIGCONT from another process (e.g. from bash):

kill -s STOP "$(cat /var/run/mypidfile)" ; kill -s CONT "$(cat /var/run/mypidfile)"

This triggers a spurious wake-up within the process and everything comes back to life.

I'm hoping there is an intelligent method to trigger a spurious wake-up of all threads within my process. Think pthread_cond_broadcast(...) but without having access to the actual condition variable being waited on.

Is this possible, or is relying on a program like kill my only approach?

Answer 1:

The way you're doing it right now is probably the most correct and simplest. There is no "wake all waiting futexes in a given process" operation in the kernel, which is what you would need to achieve this more directly.
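If the SIGSTOP/SIGCONT workaround from the question ends up in a watchdog script, it may be worth wrapping it in a small function that validates the PID file first. The following is only a sketch: the function name kick_process is made up for illustration, and signal names are used because signal numbers differ across architectures.

```shell
#!/bin/sh
# kick_process PIDFILE  (hypothetical helper, not part of any tool)
# Sends SIGSTOP followed by SIGCONT to the PID stored in PIDFILE.
# Returns 0 if both signals were sent, non-zero otherwise.
kick_process() {
    pidfile=$1

    if [ ! -r "$pidfile" ]; then
        echo "cannot read $pidfile" >&2
        return 1
    fi

    pid=$(cat "$pidfile")

    # Reject empty or non-numeric contents before passing them to kill.
    case $pid in
        ''|*[!0-9]*)
            echo "invalid PID in $pidfile" >&2
            return 1
            ;;
    esac

    # Stop every thread in the process, then resume it immediately.
    kill -s STOP "$pid" || return 1
    kill -s CONT "$pid"
}
```

It would be called as kick_process /var/run/mypidfile, for example from a cron job or a supervising script that detects the device has stopped responding.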

Note that if the failure-to-wake "deadlock" is in pthread_cond_wait but interrupting it with a signal breaks out of the deadlock, the bug cannot be in the application; it must actually be in the implementation of pthread condition variables. glibc has known unfixed bugs in its condition variable implementation; see http://sourceware.org/bugzilla/show_bug.cgi?id=13165 and related bug reports. However, you might have found a new one, since I don't think the existing known ones can be fixed by breaking out of the futex wait with a signal. If you can report this bug to the glibc bug tracker, it would be very helpful.
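When putting together such a report, it probably helps to record the glibc version and the state of each thread while the process is hung. The sketch below assumes a Linux system with /proc mounted; the function names are hypothetical, and getconf GNU_LIBC_VERSION is only available on glibc-based systems.

```shell
#!/bin/sh
# list_thread_states PID  (hypothetical helper)
# Prints "TID STATE" for every thread of PID, read from /proc.
list_thread_states() {
    pid=$1
    for status in /proc/"$pid"/task/*/status; do
        # Skip threads that exited meanwhile, or an unmatched glob.
        [ -r "$status" ] || continue

        # Extract the thread ID from /proc/PID/task/TID/status.
        tid=${status%/status}
        tid=${tid##*/}

        # The line of interest looks like "State:  S (sleeping)".
        state=$(sed -n 's/^State:[[:space:]]*//p' "$status")
        echo "$tid $state"
    done
}

# report_glibc_version  (hypothetical helper)
# Prints the glibc version, or a placeholder on non-glibc systems.
report_glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null || echo "glibc version unavailable"
}
```

Running list_thread_states "$(cat /var/run/mypidfile)" before and after the SIGSTOP/SIGCONT pair shows which threads were sleeping while the process was hung.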