Are lock-free atomics address-free in practice?

2019-04-25 11:37发布

问题:

Boost.Interprocess is a wonderful library that simplifies the usage of shared memory amongst different processes. It provides mutexes, condition variables, and semaphores, which allow for synchronization when writing and reading from the shared memory.

However, in some situations these (relatively) performance-intensive synchronization mechanisms are not necessary - atomic operations suffice for my use case, and will likely give much better performance.

Unfortunately, Boost.Interprocess does not seem to come with atomics.


The C++ Standard Library provides the std::atomic class template, which encapsulates objects whose operations need to be atomic, and also has functions to test if atomic operations are lock-free. But it does not require lock-free atomics to be address-free as well: [atomics.lockfree]/4 merely encourages that lock-free operations be address-free, and this is in agreement with cppreference.

I cannot think of any reason why one would implement lock-free atomics in a non-address-free manner. It even appears to me to be considerably easier to implement lock-free atomics in an address-free manner.

Since I would gain significant performance benefits when using atomics instead of mutexes (from Boost.Interprocess), it seems tempting to discount standard-compliance here and store std::atomic objects in the shared memory.


There are two parts to this question:

  1. Do CPUs implement lock-free atomics in an address-free manner in practice? (I only care about CPUs that are used to run modern desktop and mobile operating systems (e.g. Windows, MacOS, Linux, Android, iOS), but not embedded systems)
  2. Why on earth would an implementation use non-address-free atomics that are lock-free?

回答1:

Yes, lock-free atomics are address-free on all C++ implementations on all normal CPUs, and can safely be used on shared-memory between processes. Non-lock-free atomics1 won't be safe between processes, though. Each process will have its own hash table of locks (Where is the lock for a std::atomic?).

The C++ standard intends lock-free atomics to work in shared memory between processes, but it can only go as far as "should" without defining terms and so on.

C++draft 29.5 Lock-free property

[ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. — end note ]

This is a quality-of-implementation recommendation that is very easy to implement on current hardware, and in fact you'd have to try hard to design a deathstation9000 C++ implementation that violates it on x86 / ARM / PowerPC / other mainstream CPU while actually being lock-free.


The mechanism hardware exposes for atomic read-modify-write operations is based on MESI cache coherency which only cares about physical addresses. x86 lock cmpxchg / lock add / etc. makes a core hang on to a cache line in Modified state so no other core can read/write it in the middle of the atomic operation. (Can num++ be atomic for 'int num'?).

Most non-x86 architectures use LL/SC, which lets you write a retry loop that only does the store if it will be atomic. LL/SC can emulate CAS with O(1) overhead in a wait-free manner without introducing addresses.

C++ lock-free atomics compile to use LL/SC instructions directly. See my answer on the num++ question for x86 examples. See Atomically clearing lowest non-zero bit of an unsigned integer for some examples of AArch64 code-gen for compare_exchange_weak vs fetch_add using LL/SC instructions.

Atomic pure-load or pure-store are easier and happen for free with aligned data. On x86, see Why is integer assignment on a naturally aligned variable atomic on x86? Other ISAs have similar rules.


Related: I included some comments about address-free in my answer on Genuinely test std::atomic is lock-free or not. I'm not sure whether they're helpful or correct. :/


Footnote 1:

All mainstream CPUs have lock-free atomics for objects up to the width of a pointer. Some have wider atomics, like x86 has lock cmpxchg16b, but not all implementations choose to use it for double-width lock-free atomic objects. Check C++17 std::atomic::is_always_lock_free, or ATOMIC_xxx_LOCK_FREE if defined, for compile-time detection.

(Some microcontrollers can't hold a pointer in a single register (or copy it around with a single operation), but there aren't usually multi-core implementations of such ISAs.)


Why on earth would an implementation use non-address-free atomics that are lock-free?

I don't know any plausible reason on hardware that works anything like normal modern CPUs. You could may imagine some architecture where you do atomic operations by submitting the address to some

I think the C++ standard wants to avoid constraining non-mainstream implementations as much as possible. e.g. C++ on top of some kind of interpreter, rather than compiled do machine code for a "normal" CPU architecture.

IDK if you could usefully implement C++ on a loosely-coupled shared memory system like a cluster with ethernet links instead of shared memory, or non-coherent shared memory (that has to be flushed explicitly for other threads to see your stores).

I think it's mostly that the C++ committee can't say much about how atomics must be implemented without assuming that implementations will run programs under an OS where multiple processes can set up shared memory.

They might be imagining some future ISA where address-free atomics aren't possible, but I think more likely they don't want to talk about shared-memory between multiple C++ programs. The standard only requires that an implementation run one program.

Apparently std::atomic_flag is actually guaranteed to be address-free Why only std::atomic_flag is guaranteed to be lock-free?, so IDK why they don't make the same requirement for any atomic<T> that the implementation chooses to implement as lock-free.