Genuinely test std::atomic is lock-free or not

2019-02-23 10:31发布

问题:

Since std::atomic::is_lock_free() may not genuinely reflect the reality [ref], I'm considering writing a genuine runtime test instead. However, when I got down to it, I found that it's not a trivial task I thought it to be. I'm wondering whether there is some clever idea that could do it.

回答1:

Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.

If you are willing to introduce some platform-specific UB, you could do something like cast a atomic<int64_t> * to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. But that fails on 32-bit x86 where lock-free int64_t is efficient with only small overhead (using SSE2 or x87), but volatile int64_t* will produce tearing using two separate 4-byte stores the way most compilers compile it.

If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much. A lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all.

On any specific platform / target architecture, you can single-step your code in a debugger and see what asm instructions run. (Including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way.

(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it, with the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)


You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale1. Multiple readers contend with each other, unlike 8 byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side-effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load which compiles to just a regular load instruction.

That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.

Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)


std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. It would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, but weird hardware might need extra bytes in the object even when lock-free. Anyway sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger it's most likely that your implementation added a mutex, but you can't be sure without checking the asm.

The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock is extra contention, but not a correctness problem. These locks are only taken/released from library functions, not while holding other such locks, so it can't create a deadlock.

You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks). Is C++11 atomic<T> usable with mmap?

  • check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).

  • Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.

  • Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.

  • Also test atomic increment: have each thread do 1G increments and check that the result is 2G every time. Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support: Can num++ be atomic for 'int num'?

From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state.

If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs.)

I'm not sure why address-free matters if the processes don't have to share any address-space other than 1 page containing the atomic object2. (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In which case using a hash function that depended on address bits above the page offset would be totally silly.)

Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not something that's very practical to do automatically at compile time, but it would be possible to automate a test like this and put it into a build script, unlike reading the asm.


Footnote 1: 16-byte atomics on x86

No x86 hardware guarantees support for 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And no way to detect which CPUs do/don't have atomic 128-bit ops (other than lock cmpxchg16b), so compiler writers can't safely use them.

See SSE instructions: which CPUs can do atomic 16B memory operations? for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8 byte objects. (IDK if lock cmpxchg16b is more expensive on a system like that.)

Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle a different way than normal.


A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object. How can I implement ABA counter with c++11 CAS?. For an ABA counter, you'd normally update it with a CAS anyway, so lack of a 16-byte atomic pure store is not a problem.

An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer+integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer+counter atomic object than in full 64-bit mode where a pointer is 8 bytes.


Footnote 2: address-free

I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (IDK why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)

It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process is possible but weird. That might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset. (The low bits of an address, that represent the offset within a page, are the same for every mapping. i.e. virtual to physical translation for those bits is a no-op, which is why VIPT caches are usually designed to take advantage of that to get speed without aliasing.)

So a non-lock-free object might be address-free within a single process, even if it uses a separate global hash table instead of adding a mutex to the atomic object. But this would be a very unusual situation; it's extremely rare to use virtual memory tricks to create two addresses for the same variable within the same process that shares all of its address-space between threads. Much more common would be atomic objects in shared memory between processes. (I may be misunderstanding the meaning of "address-free"; possibly it means "address-space free", i.e. lack of dependency on other addresses being shared.)



回答2:

I think you are really just trying to detect this special case specific to gcc where is_lock_free reports false, but the underlying implementation (hidden behind a libatomic function call) is still using cmpxchg16b. You want to know about this, since you consider such an implementation genuinely lock free.

In that case, as a practical matter, I would just write your detection function to hardcode the gcc version range you know operates in this manner. Currently, all versions after the one in which the change to stop inlining cmpxchg16b apparently still use a lock-free implementation under the covers, so a check today would be "open ended" (i.e., all versions after X). Prior to this point is_lock_free returns true (which you consider correct). After some hypothetical future change to gcc which makes the library call use locks, the is_lock_free() == false answer will become genuinely true, and you'll close your check by recording the version in which it occurred.

So something like this should be a good start:

template <typename T>
bool is_genuinely_lock_free(std::atomic<T>& t) {
#if     __GNUC__ >= LF16_MAJOR_FIRST && __GNUC_MINOR__ >= LF16_MINOR_FIRST && \
        __GNUC__ <= LF16_MAJOR_LAST  && __GNUC_MINOR__ <= LF16_MINOR_LAST
    return sizeof(T) == 16 || t.is_lock_free();
#else
    return t.is_lock_free();
#endif
}

Here the LF16 macros define the version range where gcc returns the "wrong" answer for is_lock_free for 16-byte objects. Note that since second half of this change (to make __atomic_load_16 and friends use locks) you'll only need the first half of the check today. You need to determine the exact version when is_lock_free() started returning false for 16-byte objects: the links Peter provides discussing this issue are a good start, and you can do some checking in godbolt - although the latter doesn't provide everything you need since it doesn't decompile library functions like __atomic_load16: you may need to dig into the libatomic source for that. It's also possible that the macro check should be tied to the libstdc++ or libatomic version instead of the compiler version (although AFAIK in typical installs the versions of all of those are bound together). You'll probably want to add a few more checks to the #if to limit it to 64-bit x86 platforms as well.

I think this approach is valid since the concept of genuinely lock-free isn't really well-defined: you have decided in this case you want to consider the cmpxchg16b implementation in gcc lock-free, but if other grey areas occur in other future implementations you'll want to make another judgment call about whether you consider it lock-free. So the hardcoding approach seems approximately as robust for the non-gcc cases as some type of detection since in either case unknown future implementations may trigger the wrong answer. For the gcc case it seems more robust and definitely more simple.

The basis for this idea is that getting the answer wrong is not going to be a world-destroying functional problem, but rather a performance issue: I'm guessing you are trying to do this detection to select between alternate implementations one of which is faster on a "genuinely" lock-free system, and other being more suitable when std::atomic is lock-based.

If your requirements are stronger, and you really want to be more robust, why not combine approaches: use this simple version detection approach and combine it with a runtime/compile-time detection approach which examines tearing behavior or decompilation as suggested in Peter's answer. If both approaches agree, use it as your answer; if they disagree, however, surface the error and do further investigation. This will also help you catch the point, if ever, at which gcc changes the implementation to make 16-byte objects lock-full.