Does anyone know of published benchmarks of the overhead of locking instead of relying on certainly atomic operations/intrinsics (on a multiprocessor system) only?
I’m particularly interested in general conclusions, e.g. something like “regardless of the platform, locking is at least a factor X slower than intrinsics.” (That’s why I can’t just benchmark myself.)
I’m interested in direct comparisons, e.g. how much faster is using
#pragma omp atomic
++x;
instead of
#pragma omp critical
++x;
(assuming that every other update of x
is also critical).
Basically, I need this to justify a complex lock-free implementation instead of a straightforward locking one where starvation isn’t an issue. Conventional wisdom is that while locking is simpler, non-locking implementations have tons of advantages. But I’m hard pressed to find reliable data.
I don't know where specific studies are, but you are unlikely to find a definitive locks-are-better answer anywhere. It depends very much on how you use them, how much contention they are subject to, and what you are using the primitives for. If all you want to do is increment numbers, then yes, you'll probably find atomic primitives to be faster than locks, but if you want to perform multi-word compare and swap, or complex updates to tree structures, etc., you'll find that the lock-free code is not only much more complex and difficult to debug, but that the performance advantages over a well-designed lock-based implementation are inconclusive at best, and hardly worth the substantial increase in complexity. TANSTAAFL.
I’m particularly interested in general
conclusions, e.g. something like
“regardless of the platform, locking
is at least a factor X slower than
intrinsics.”
I'a afraid there are no general conclusion, because this issue is related to a architect's atomic instruction design, cache layout, and memory bus. These may very different between x86 and MIPS. You can make benchmark on the architects which you may use, and compare them. The lock's design may impact the benchmark a lot, so even you find a report, you shouldn't believe that simply.
MS did some benchmarks for the new concurrent collection classes in .NET 4.
http://blogs.msdn.com/b/pfxteam/archive/2010/04/26/9997562.aspx
Not C/C++, but the underlying principles are the same: use platform CAS/interlocked operations instead of locking where you can.
Cliff Click also has some benchmarks on his lock-free hash table:
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
Nevertheless, the amount of performance gain - or loss, for that matter - of using lock-free methods vs locking depends highly on the algorithm and the exact use case.
Answers to this sort of question are highly complex: lock-free code frequently spins rather than waiting under contention, and so under high contention may run significantly slower than it would if the threads just blocked themselves in a queue behind a lock. In addition, lock-free code can still spend a lot of time issuing memory barriers, and so might have unexpected performance characteristics on different hardware and on different platforms.
So, the old standby answer to performance questions rears its head again: measure both of them in your situation, and pick the faster one.
Basically, I need this to justify a complex lock-free implementation instead of a straightforward locking one where starvation isn’t an issue.
Well, all that said, unless this data structure is at the absolute center of your massive-scale multithreaded application, I would go with a lock-based solution, with as few locks as possible. The bugs will be so much easier to find that it will be worth it. Where threading is involved, I personally find it very difficult to justify any kind of complex implementation, lock-free or otherwise.