Overhead of pthread mutexes?

Posted 2019-01-22 06:19

三岁会撩人

Question:

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.

So, I'd like to ask:

Do you have any experience with the performance of single-threaded apps that use locking versus those that don't?
How expensive are these lock/unlock calls compared to, e.g., a simple "return this->isActive" access for a bool member variable?
Do you know better ways to protect such variable accesses?

Answer 1:

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions). The library has to call into the kernel only when there is contention.

Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all). The full pthread functions are used only if the application is multi-threaded (and links to the pthread library).

And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).
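
The race described above can be sketched as follows. The `Session` class and the `deactivateIfActive` method are hypothetical names chosen for illustration and are not part of the asker's API. The only point is that the check and the action happen under one lock, rather than in a locked getter followed by a separate call.

```cpp
#include <pthread.h>

// Hypothetical class for illustration; none of these names come from the asker's API.
class Session {
public:
    Session() : active_(true) { pthread_mutex_init(&mutex_, nullptr); }
    ~Session() { pthread_mutex_destroy(&mutex_); }
    Session(const Session&) = delete;
    Session& operator=(const Session&) = delete;

    // Locked getter: the answer may be stale the moment the mutex is released.
    bool isActive() {
        pthread_mutex_lock(&mutex_);
        bool result = active_;
        pthread_mutex_unlock(&mutex_);
        return result;
    }

    // Check and act under one lock, so no other thread can change active_
    // between the test and the update. Returns true only for the caller
    // that actually performed the transition from active to inactive.
    bool deactivateIfActive() {
        pthread_mutex_lock(&mutex_);
        bool wasActive = active_;
        active_ = false;
        pthread_mutex_unlock(&mutex_);
        return wasActive;
    }

private:
    pthread_mutex_t mutex_;
    bool active_;
};
```

A caller that tests `isActive()` and then acts on the result can race with another thread doing the same. With `deactivateIfActive()`, only one caller ever sees `true`.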

Answer 2:

"A mutex requires an OS context switch. That is fairly expensive. "

This is not true on Linux, where mutexes are implemented using something called futexes. Acquiring an uncontended (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and typically takes around 25 nanoseconds with current hardware.

For more info: Futex

Numbers everybody should know
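
To check the uncontended cost on your own machine, a rough micro-benchmark along these lines may help. The function name `averageLockUnlockNanos` is made up for this sketch, and the result will vary with hardware, libc version and compiler flags, so treat it as an order-of-magnitude estimate only.

```cpp
#include <pthread.h>
#include <chrono>

// Hypothetical helper: returns the average time in nanoseconds for one
// uncontended lock/unlock pair, or -1.0 if the arguments or setup are invalid.
double averageLockUnlockNanos(long iterations) {
    if (iterations <= 0) return -1.0;

    pthread_mutex_t mutex;
    if (pthread_mutex_init(&mutex, nullptr) != 0) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        // No other thread exists here, so the lock is never contended.
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
    }
    auto stop = std::chrono::steady_clock::now();

    pthread_mutex_destroy(&mutex);

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `averageLockUnlockNanos(1000000)` and printing the result gives a per-pair figure that also includes a small amount of loop overhead.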

Answer 3:

This is a bit off-topic, but you seem to be new to threading. For one thing, only lock where threads can overlap, and then try to minimize those places. Also, instead of trying to lock every method, think about what the thread is doing overall with an object, make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may help to avoid deadlocking). But locks don't 'compose'; you have to at least mentally cross-organize your code by where the threads are and where they overlap.

Answer 4:

I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)

I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.

An alternative for single-threaded clients would be to use the preprocessor to build non-locked and locked versions of your library. E.g.:

#ifdef BUILD_SINGLE_THREAD
    inline void lock () {}
    inline void unlock () {}
#else
    inline void lock () { doSomethingReal(); }
    inline void unlock () { doSomethingElseReal(); }
#endif

Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.
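
A slightly fuller sketch of the same idea follows. It assumes that `doSomethingReal()` and `doSomethingElseReal()` stand for `pthread_mutex_lock` and `pthread_mutex_unlock` on a single library-wide mutex. The `libraryMutex`, `ScopedLock`, `isActive` and `setActive` names are hypothetical.

```cpp
#include <pthread.h>

#ifdef BUILD_SINGLE_THREAD
// Single-threaded build: locking compiles away to nothing.
inline void lock() {}
inline void unlock() {}
#else
// Assumption: one library-wide mutex stands in for doSomethingReal().
inline pthread_mutex_t& libraryMutex() {
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    return mutex;
}
inline void lock() { pthread_mutex_lock(&libraryMutex()); }
inline void unlock() { pthread_mutex_unlock(&libraryMutex()); }
#endif

// Hypothetical RAII helper so that every lock() is paired with an unlock(),
// even on early return.
class ScopedLock {
public:
    ScopedLock() { lock(); }
    ~ScopedLock() { unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
};

// Example API functions; the calling code is identical in both builds.
static bool g_isActive = false;

bool isActive() {
    ScopedLock guard;
    return g_isActive;
}

void setActive(bool value) {
    ScopedLock guard;
    g_isActive = value;
}
```

Compiling with `-DBUILD_SINGLE_THREAD` selects the empty versions, so the API code itself never needs to know which build it is in.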

Answer 5:

I can tell you from Windows that a mutex there is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better-performing lock, when all you need is one that works between threads, use a critical section. This would not work across processes, just the threads in a single process.

However, Linux is quite a different beast when it comes to multi-process locking. I know that a mutex there is implemented using atomic CPU instructions and only applies to a single process, so it would have the same performance as a Win32 critical section, i.e. be very fast.

Of course, the fastest locking is not to have any at all, or to use locks as little as possible. But if your lib is to be used in a heavily threaded environment, you will want to lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task. The cost of locking isn't in the time taken to lock, but in the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!
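
One way to keep the locked region short is to do the slow work without the lock and hold the mutex only to publish the result. The sketch below illustrates that pattern. `ReportCache`, `Guard` and `buildReport` are invented names, and reversing a string merely stands in for an expensive computation.

```cpp
#include <pthread.h>
#include <string>

// Minimal RAII wrapper around a pthread mutex (hypothetical helper).
struct Guard {
    explicit Guard(pthread_mutex_t& m) : mutex_(m) { pthread_mutex_lock(&mutex_); }
    ~Guard() { pthread_mutex_unlock(&mutex_); }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;
    pthread_mutex_t& mutex_;
};

class ReportCache {
public:
    ReportCache() { pthread_mutex_init(&mutex_, nullptr); }
    ~ReportCache() { pthread_mutex_destroy(&mutex_); }
    ReportCache(const ReportCache&) = delete;
    ReportCache& operator=(const ReportCache&) = delete;

    void refresh(const std::string& input) {
        // Slow part: runs without the lock, so readers are not blocked.
        std::string built = buildReport(input);
        {
            // Fast part: only the swap happens while the mutex is held.
            Guard guard(mutex_);
            report_.swap(built);
        }
        // 'built' now holds the old report and is destroyed after unlocking.
    }

    std::string get() {
        Guard guard(mutex_);
        return report_;  // copied while the lock is held
    }

private:
    // Stand-in for an expensive computation: simply reverses the input.
    static std::string buildReport(const std::string& input) {
        return std::string(input.rbegin(), input.rend());
    }

    pthread_mutex_t mutex_;
    std::string report_;
};
```

Readers calling `get()` wait only for the duration of a swap or a copy, never for `buildReport`.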

Answer 6:

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.

It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.

do you know better ways to protect such variable accesses?

Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those where it is actually necessary. Typically these are not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (Perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)
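
As an illustration of locking a task rather than a variable, the sketch below updates the flag together with the fields that belong to it, and lets readers take one consistent snapshot. All names here (`Connection`, `ConnectionState`, `activate`, `snapshot`) are hypothetical and only assume that the flag travels with related state.

```cpp
#include <pthread.h>
#include <string>

// Hypothetical state that must always change together.
struct ConnectionState {
    bool isActive;
    std::string peer;
    int generation;
};

class Connection {
public:
    Connection() : state_{false, std::string(), 0} {
        pthread_mutex_init(&mutex_, nullptr);
    }
    ~Connection() { pthread_mutex_destroy(&mutex_); }
    Connection(const Connection&) = delete;
    Connection& operator=(const Connection&) = delete;

    // One lock covers the whole task, so no reader can observe the flag
    // set while peer and generation still hold their old values.
    void activate(const std::string& peer) {
        std::string copy(peer);      // allocate before taking the lock
        pthread_mutex_lock(&mutex_);
        state_.isActive = true;
        state_.peer.swap(copy);      // swap does not throw
        ++state_.generation;
        pthread_mutex_unlock(&mutex_);
    }

    // Readers get all related fields in one consistent copy instead of
    // calling three separately locked getters.
    ConnectionState snapshot() {
        ConnectionState result{false, std::string(), 0};
        pthread_mutex_lock(&mutex_);
        try {
            result = state_;
        } catch (...) {
            pthread_mutex_unlock(&mutex_);
            throw;
        }
        pthread_mutex_unlock(&mutex_);
        return result;
    }

private:
    pthread_mutex_t mutex_;
    ConnectionState state_;
};
```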

Answer 7:

I was curious about the expense of using pthread_mutex_lock/unlock. I had a scenario where I needed either to copy anywhere from 1500 to 65K bytes without using a mutex, or to use a mutex and do a single write of a pointer to the data needed.

I wrote a short loop to test each:

gettimeofday(&starttime, NULL);
COPY DATA
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
print out timediff data

gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
print out timediff data

If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.

The timing on the mutex lock/unlock ran between 3 and 5 usec, including the gettimeofday call, which itself took about 2 usec.

Answer 8:

For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
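
A minimal sketch of a getter/setter pair guarded by a POSIX read/write lock is given here. The `SharedFlag` class is a hypothetical example. Whether this is actually cheaper than a plain mutex depends on the implementation and on the read/write mix, so it is worth measuring before committing to it.

```cpp
#include <pthread.h>

// Hypothetical example class; only the pthread_rwlock_* calls are real API.
class SharedFlag {
public:
    SharedFlag() : active_(false) { pthread_rwlock_init(&lock_, nullptr); }
    ~SharedFlag() { pthread_rwlock_destroy(&lock_); }
    SharedFlag(const SharedFlag&) = delete;
    SharedFlag& operator=(const SharedFlag&) = delete;

    // Readers take the shared lock: several threads may hold it at once.
    bool isActive() {
        pthread_rwlock_rdlock(&lock_);
        bool result = active_;
        pthread_rwlock_unlock(&lock_);
        return result;
    }

    // Writers take the exclusive lock: they wait until all readers and
    // any other writer have released it.
    void setActive(bool value) {
        pthread_rwlock_wrlock(&lock_);
        active_ = value;
        pthread_rwlock_unlock(&lock_);
    }

private:
    pthread_rwlock_t lock_;
    bool active_;
};
```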

In many cases you can use atomic builtins, if your compiler provides them (if you are using gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.

If you can guarantee that the access is atomic (for example, on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.
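
Where a C++11 compiler is available, `std::atomic` is a portable alternative to both the compiler builtins and the volatile trick. Note that this is a substitute technique, not what the answer itself describes. The `Widget` class and the `countHitsFromThreads` helper are hypothetical names for this sketch.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical class: a lock-free flag and counter using std::atomic.
class Widget {
public:
    bool isActive() const { return active_.load(); }
    void setActive(bool value) { active_.store(value); }

    // fetch_add is an atomic read-modify-write, unlike a plain ++ on a long.
    void recordHit() { hits_.fetch_add(1); }
    long hits() const { return hits_.load(); }

private:
    std::atomic<bool> active_{false};
    std::atomic<long> hits_{0};
};

// Runs several threads that all increment the same counter and returns
// the final count. With a non-atomic counter, updates could be lost.
long countHitsFromThreads(int threadCount, int hitsPerThread) {
    Widget widget;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&widget, hitsPerThread] {
            for (int i = 0; i < hitsPerThread; ++i) {
                widget.recordHit();
            }
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return widget.hits();
}
```

This only protects each variable on its own. Updates that must keep several fields consistent still need a lock.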

Answer 9:

Well, a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.

Example:

#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... // actual mutex call
#else
#define pthread_mutex_lock(x) ... // do nothing
#endif

Then, when compiling, pass -DTHREAD_ENABLED to gcc to enable threading.

Again, I would NOT use this method in any large project, only if you want something fairly simple.