helping a client out with an issue that they are having. I'm more of a sysadmin/DBA guy so I'm struggling with helping them out. They are saying it is a bug in the kernel/environment, I'm trying to either prove or disprove that before I insist that it is in their code or seek vendor support for the OS.
Happens on Red Hat and Oracle Enterprise Linux 5.7 (and 5.8), application is written in C++
The problem they are experiencing is that the main thread starts a separate thread to do a potentially long-running TCP connect() [client connecting to server].
If the 'long-running' aspect takes too long, they cancel the thread and start another one.
This is done because we don't know the state of the server program:
- server program up and running --> connection immediately accepted
- server program not running, machine and network OK --> connection
immediately failed with error 'connection refused'
- machine or network crashed or down --> connection takes a long time
to fail with error 'no route to host'
The problem is that cancelling the thread that has the mutex locked
(with cleanup handlers set up to unlock the mutex) sometimes does NOT unlock the mutex.
That leaves the main thread hung on trying to lock the mutex.
Detailed environment info:
- glibc-2.5-65
- glibc-2.5-65
- libcap-1.10-26
- kernel-debug-2.6.18-274.el5
- glibc-headers-2.5-65
- glibc-common-2.5-65
- libcap-1.10-26
- kernel-doc-2.6.18-274.el5
- kernel-2.6.18-274.el5
- kernel-headers-2.6.18-274.el5
- glibc-devel-2.5-65
Code was built with:
c++ -g3 tst2.C -lpthread -o tst2
Any advice and guidance is greatly appreciated
It's correct that cancelled threads do not unlock mutexes they hold, you need to arrange for that to happen manually, which can be tricky as you need to be very careful to use the right cleanup handlers around every possible cancellation point. Assuming you're using pthread_cancel
to cancel the thread and setting cleanup handlers with pthread_cleanup_push
to unlock the mutexes, there are a couple of alternatives you could try which might be simpler to get right and so may be more reliable.
Using RAII to unlock the mutex will be more reliable. On GNU/Linux pthread_cancel
is implemented with a special exception of type __cxxabi::__forced_unwind
, so when a thread is cancelled an exception is thrown and the stack is unwound. If a mutex is locked by an RAII type then its destructor will be guaranteed to run if the stack is unwound by a __forced_unwind
exception. Boost Thread
provides a portable C++ library that wraps Pthreads and is much easier to use. It provides an RAII type boost::mutex
and other useful abstractions. Boost Thread also provides its own "thread interruption" mechanism which is similar to Pthread cancellation but not the same, and Pthread cancellation points (such as connect
) are not Boost Thread interruption points, which can be helpful for some applications. However in your client's case since the point of cancellation is to interrupt the connect
call they probably do want to stick with Pthread cancellation. The (non-portable) way GNU/Linux implements cancellation as an exception means it will work well with boost::mutex
.
There is really no excuse for explicitly locking and unlocking mutexes when you're writing in C++, IMHO the most important and most useful feature of C++ is destructors which are ideal for automatically releasing resources such as mutex locks.
Another option would be to use a robust mutex, which is created by calling pthread_mutexattr_setrobust
on a pthread_mutexattr_t
before initializing the mutex. If a thread dies while holding a robust mutex the kernel will make a note of it so that the next thread which tries to lock the mutex gets the special error code EOWNERDEAD
. If possible, the new thread can make the data protected by the thread consistent again and take ownership of the mutex. This is much harder to use correctly than simply using an RAII type to lock and unlock the mutex.
A completely different approach would be to decide if you really need to hold the mutex lock while calling connect
. Holding mutexes during slow operations is not a good idea. Can't you call connect
then if successful lock the mutex and update whatever shared data is being protected by the mutex?
My preference would be to both use Boost Thread and avoid holding the mutex for long periods.
The problem they are experiencing is that the main thread starts a separate thread to do a potentially long-running TCP connect() [client connecting to server]. If the 'long-running' aspect takes too long, they cancel the thread and start another one.
Trivial fix -- don't cancel the thread. Is it doing any harm? If necessary, have the thread check (when the connect
finally does complete) whether the connection is still needed and, if not, close it, release the mutex, and terminate. You can do this with a boolean variable protected by a mutex.
Also, a thread should not hold a mutex while waiting for network I/O. Mutexes should be used only for things that are fast and primarily CPU-limited or perhaps limited by local disk.
Finally, if you feel you need to reach in from the outside and force a thread to do something, step back. You wrote the code for that thread. If you feel that need, it means you didn't code that thread to do what you really wanted it to do. The fix is to modify the thread to do what, and only what, you actually want. Then you won't have to "push it around" from the outside.