Here (and in a few SO questions) I see that C++ doesn't support something like lock-free `std::atomic<double>` and can't yet support something like an atomic AVX/SSE vector, because that's CPU-dependent (though nowadays the CPUs I know of, ARM, AArch64 and x86_64, all have vectors).

But is there assembly-level support for atomic operations on `double`s or vectors in x86_64? If so, which operations are supported (like load, store, add, subtract, maybe multiply)? Which operations does MSVC++2017 implement lock-free in `atomic<double>`?
Actually, C++11 `std::atomic<double>` is lock-free on typical C++ implementations, and does expose nearly everything you can do in asm for lock-free programming with `float`/`double` on x86 (e.g. load, store, and CAS are enough to implement anything: Why isn't atomic double fully implemented). Current compilers don't always compile `atomic<double>` efficiently, though.

C++11 `std::atomic` doesn't have an API for Intel's transactional-memory extensions (TSX) (for FP or integer). TSX could be a game-changer, especially for FP / SIMD, since it would remove all overhead of bouncing data between xmm and integer registers. If the transaction doesn't abort, whatever you just did with double or vector loads/stores happens atomically.
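For example, load plus CAS is enough to build any FP read-modify-write operation yourself. Here's a minimal sketch of `fetch_add` for `double` built on C++11 `compare_exchange_weak`; the helper name `atomic_fetch_add_double` is mine, not standard:

```cpp
#include <atomic>

// Sketch: fetch_add for double built from load + CAS, which is all C++11
// guarantees you for atomic<double>. Helper name is hypothetical.
double atomic_fetch_add_double(std::atomic<double>& target, double operand) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current value into
    // `expected`, so each retry sees fresh data.
    while (!target.compare_exchange_weak(expected, expected + operand)) {
    }
    return expected;  // like fetch_add: returns the value before the add
}
```

(C++20 later adopted the p0020 direction mentioned below, so `std::atomic<double>::fetch_add` exists directly there.)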
Some non-x86 hardware supports atomic add for float/double, and C++ p0020 is a proposal to add `fetch_add` and `operator+=` / `-=` template specializations to C++'s `std::atomic<float>` / `<double>`.

Hardware with LL/SC atomics instead of x86-style memory-destination instructions, such as ARM and most other RISC CPUs, can do atomic RMW operations on `double` and `float` without a CAS, but you still have to get the data from FP to integer registers, because LL/SC is usually only available for integer regs, like x86's `cmpxchg`. However, if the hardware arbitrates LL/SC pairs to avoid/reduce livelock, it would be significantly more efficient than a CAS loop in very-high-contention situations. If you've designed your algorithms so contention is rare, there's maybe only a small code-size difference between an LL/add/SC retry loop for `fetch_add` and a load + add + LL/SC CAS retry loop.

x86 naturally-aligned loads and stores are atomic up to 8 bytes, even x87 or SSE. (For example, `movsd xmm0, [some_variable]` is atomic, even in 32-bit mode.) In fact, gcc uses x87 `fild`/`fistp` or SSE 8B loads/stores to implement `std::atomic<int64_t>` load and store in 32-bit code.

Ironically, compilers (gcc7.1, clang4.0, ICC17, MSVC CL19) do a bad job in 64-bit code (or 32-bit with SSE2 available), and bounce data through integer registers instead of just doing `movsd` loads/stores directly to/from xmm regs (see it on Godbolt).

Without `-mtune=intel`, gcc likes to store/reload for integer->xmm. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and related bugs I reported. This is a poor choice even for `-mtune=generic`. AMD has high latency for `movq` between integer and vector regs, but it also has high latency for a store/reload. With the default `-mtune=generic`, `load()` bounces through an integer register instead of compiling to a single `movsd` load.

Moving data between xmm and integer registers brings us to the next topic:
Atomic read-modify-write (like `fetch_add`) is another story: there is direct support for integers with stuff like `lock xadd [mem], eax` (see Can num++ be atomic for 'int num'? for more details). For other things, like `atomic<struct>` or `atomic<double>`, the only option on x86 is a retry loop with `cmpxchg` (or TSX).

Atomic compare-and-swap (CAS) is usable as a lock-free building block for any atomic RMW operation, up to the max hardware-supported CAS width. On x86-64, that's 16 bytes with `cmpxchg16b` (not available on some first-gen AMD K8 CPUs, so for gcc you have to use `-mcx16` or `-march=whatever` to enable it).

gcc makes the best asm possible for `exchange()`.

`compare_exchange` always does a bitwise comparison, so you don't need to worry about the fact that negative zero (`-0.0`) compares equal to `+0.0` in IEEE semantics, or that NaN is unordered. This could be an issue if you try to check that `desired == expected` and skip the CAS operation, though. For new enough compilers, `memcmp(&expected, &desired, sizeof(double)) == 0` might be a good way to express a bitwise comparison of FP values in C++. Just make sure you avoid false positives; false negatives will just lead to an unneeded CAS.

Hardware-arbitrated
`lock or [mem], 1` is definitely better than having multiple threads spinning on `lock cmpxchg` retry loops. Every time a core gets access to the cache line but fails its `cmpxchg`, throughput is wasted compared to integer memory-destination operations that always succeed once they get their hands on a cache line.

Some special cases for IEEE floats can be implemented with integer operations. E.g. absolute value of an `atomic<double>` could be done with `lock and [mem], rax` (where RAX has all bits except the sign bit set). Or force a float / double to be negative by ORing a 1 into the sign bit. Or toggle its sign with XOR. You could even atomically increase its magnitude by 1 ulp with `lock add [mem], 1`. (But only if you can be sure it wasn't infinity to start with... `nextafter()` is an interesting function, thanks to the very cool design of IEEE754 with biased exponents that makes carry from the mantissa into the exponent actually work.)

There's probably no way to express this in C++ that will let compilers do it for you on targets that use IEEE FP. So if you want it, you might have to do it yourself with type-punning to `atomic<uint64_t>` or something, and check that FP endianness matches integer endianness, etc. etc. (Or just do it only for x86. Most other targets have LL/SC instead of memory-destination locked operations anyway.)

Correct. There's no way to detect when a 128b or 256b store or load is atomic all the way through the cache-coherency system (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490). Even a system with atomic transfers between L1D and execution units can get tearing between 8B chunks when transferring cache lines between caches over a narrow protocol. Real example: a multi-socket Opteron K10 with HyperTransport interconnects appears to have atomic 16B loads/stores within a single socket, but threads on different sockets can observe tearing.
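The type-punning approach mentioned above might look like the following sketch. It assumes IEEE754 binary64 and matching FP/integer endianness (true on x86-64); all the helper names are mine, not from any library:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Keep the double's bits in an atomic<uint64_t> so integer bitwise RMW
// operations (fetch_and / fetch_or / fetch_xor) can manipulate it.
std::atomic<std::uint64_t> bits{0};

void store_double(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);  // bit-cast double -> uint64_t
    bits.store(u, std::memory_order_relaxed);
}

double load_double() {
    std::uint64_t u = bits.load(std::memory_order_relaxed);
    double d;
    std::memcpy(&d, &u, sizeof d);  // bit-cast back
    return d;
}

// fabs: clear the sign bit; on x86-64 this can compile to a `lock and`.
void atomic_fabs() {
    bits.fetch_and(~(std::uint64_t{1} << 63), std::memory_order_relaxed);
}

// negate: flip the sign bit; can compile to a `lock xor`.
void atomic_negate() {
    bits.fetch_xor(std::uint64_t{1} << 63, std::memory_order_relaxed);
}
```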
But if you have a shared array of aligned `double`s, you should be able to use vector loads/stores on them without risk of "tearing" inside any given `double`.

Per-element atomicity of vector load/store and gather/scatter?
I think it's safe to assume that an aligned 32B load/store is done with non-overlapping 8B or wider loads/stores, although Intel doesn't guarantee that. For unaligned ops, it's probably not safe to assume anything.
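Since Intel doesn't guarantee this, if you do rely on the aligned-32B assumption it's cheap to at least assert the alignment you depend on. A trivial sketch (names are mine):

```cpp
#include <cstdint>

// True if p is 32-byte aligned. The per-element-atomicity assumption above
// is only plausible for aligned vector ops, so check before relying on it.
bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

alignas(32) double shared_doubles[8];  // hypothetical shared array
```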
If you need a 16B atomic load, your only option is `lock cmpxchg16b` with `desired=expected`. If it succeeds, it replaces the existing value with itself. If it fails, you get the old contents. (Corner case: this "load" faults on read-only memory, so be careful what pointers you pass to a function that does this.) Also, the performance is of course horrible compared to actual read-only loads that can leave the cache line in Shared state, and that aren't full memory barriers.

16B atomic store and RMW can both use `lock cmpxchg16b` the obvious way. This makes pure stores much more expensive than regular vector stores, especially if the `cmpxchg16b` has to retry multiple times, but atomic RMW is already expensive.

The extra instructions to move vector data to/from integer regs are not free, but also not expensive compared to `lock cmpxchg16b`.

In C++11 terms:

- `atomic<__m128d>` would be slow even for read-only or write-only operations (using `cmpxchg16b`), even if implemented optimally.
- `atomic<__m256d>` can't even be lock-free.
- `alignas(64) atomic<double> shared_buffer[1024];` would in theory still allow auto-vectorization for code that reads or writes it, only needing a `movq rax, xmm0` and then `xchg` or `cmpxchg` for atomic RMW on a `double`. (In 32-bit mode, `cmpxchg8b` would work.) You would almost certainly not get good asm from a compiler for this, though!

You can atomically update a 16B object, but atomically read the 8B halves separately. (I think this is safe with respect to memory ordering on x86: see my reasoning at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835.)
However, compilers don't provide any clean way to express this. I hacked up a union type-punning thing that works for gcc/clang: How can I implement ABA counter with c++11 CAS?. But gcc7 and later won't inline `cmpxchg16b`, because they're re-considering whether 16B objects should really present themselves as "lock-free" (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).

On x86-64, atomic read-modify-write operations are implemented via the `LOCK` prefix. The Intel Software Developer's Manual (Volume 2, Instruction Set Reference) states that the `LOCK` prefix can only be prepended to a small set of integer instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B/CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD and XCHG). None of those instructions operates on floating-point registers (like the XMM, YMM or FPU registers).

This means that there is no natural way to implement atomic float/double operations on x86-64. While most of those operations could be implemented by loading the bit representation of the floating-point value into a general-purpose (i.e. integer) register, doing so would severely degrade performance, so the compiler authors opted not to implement it.
As pointed out by Peter Cordes in the comments, the `LOCK` prefix is not required for loads and stores, as those are atomic on their own on x86-64. However, the Intel SDM (Volume 3, System Programming Guide) only guarantees atomicity for loads/stores that are at most 8 bytes and naturally aligned (plus, on P6 and later, unaligned 16-, 32- and 64-bit accesses that fit within a single cache line). In particular, atomicity of loads/stores from/to the larger XMM and YMM vector registers is not guaranteed.
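Given those guarantees, you can quickly check what your implementation actually reports. A small sketch; the expected results below are for mainstream x86-64 compilers:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<double> atomic_d;
std::atomic<std::uint64_t> atomic_u64;

// Both are expected to be lock-free on x86-64, because naturally aligned
// 8-byte loads/stores are atomic in hardware; no LOCK prefix is needed.
bool double_is_lock_free() { return atomic_d.is_lock_free(); }
bool u64_is_lock_free()    { return atomic_u64.is_lock_free(); }
```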