C++20 includes specializations for `atomic<float>` and `atomic<double>`. Can anyone here explain what practical purpose this is good for? The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read these values asynchronously (but a volatile double or float should in fact do the same on most platforms). The need for this should be extremely rare, though, and I don't think this rare case could justify inclusion in the C++20 standard.
`atomic<float>` and `atomic<double>` have existed since C++11. The `atomic<T>` template works for arbitrary trivially-copyable `T`. Everything you could hack up with legacy pre-C++11 use of `volatile` for shared variables can be done with C++11 `atomic<double>` with `std::memory_order_relaxed`.

What doesn't exist until C++20 are atomic RMW operations like `x.fetch_add(3.14);` or, for short, `x += 3.14`. (The question Why isn't atomic double fully implemented wonders why not.) Those member functions were only available in the `atomic` integer specializations, so on `float` and `double` you could only load, store, exchange, and CAS, like for arbitrary `T` such as class types.
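As a minimal sketch (my own example, not code from the standard), here is what was already possible in C++11 and what only becomes well-formed in C++20:

```cpp
#include <atomic>

std::atomic<double> x{0.0};

void cxx11_ops(double v) {
    x.store(v, std::memory_order_relaxed);    // plain store: fine since C++11
    double old = x.exchange(v);               // exchange: fine since C++11
    x.compare_exchange_weak(old, old + 1.0);  // CAS: fine since C++11
}

void cxx20_ops() {
    x.fetch_add(3.14);   // atomic RMW: only well-formed since C++20
    x += 3.14;           // likewise: the += overload is defined in terms of fetch_add
}
```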
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for details on how to roll your own with `compare_exchange_weak`, and how that (and pure load, pure store, and exchange) compiles in practice with GCC and clang for x86. (Not always optimal; gcc bounces to integer regs unnecessarily.) Also for details on the lack of `atomic<__m128i>` load/store, because vendors won't publish real guarantees to let us take advantage (in a future-proof way) of what current HW does.
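A sketch of the usual roll-your-own CAS loop that the linked answer describes, assuming nothing beyond C++11:

```cpp
#include <atomic>

// C++11-compatible emulation of fetch_add for atomic<double>:
// retry the CAS until no other thread modified the value in between.
double fetch_add_cas(std::atomic<double>& a, double operand) {
    double expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected + operand)) {
        // on failure, expected is reloaded with the current value; just retry
    }
    return expected;   // old value, like fetch_add
}
```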
These new specializations provide maybe some efficiency (on non-x86) and convenience with `fetch_add` and `fetch_sub` (and the equivalent `+=` and `-=` overloads). Only those 2 operations are supported, not `fetch_mul` or anything else. See the current draft's 31.8.3 Specializations for floating-point types, and cppreference's `std::atomic` page.
It's not like the committee went out of their way to introduce new FP-relevant atomic RMW member functions like `fetch_mul`, min, max, or even absolute value or negation, which is ironically easier in asm: just a bitwise AND or XOR to clear or flip the sign bit, which can be done with x86 `lock and` if the old value isn't needed. Actually, since carry-out from the MSB doesn't matter, 64-bit `lock xadd` can implement `fetch_xor` with `1ULL<<63` (adding 2^63 flips just the sign bit, because the carry out the top is discarded). Assuming of course IEEE754-style sign/magnitude FP. It's similarly easy on LL/SC machines that can do a 4-byte or 8-byte fetch_xor, and they can easily keep the old value in a register.

So the one thing that could be done significantly more efficiently in x86 asm than in portable C++ without union hacks (atomic bitwise ops on FP bit patterns) still isn't exposed by ISO C++.
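For concreteness, a sketch of the kind of bit-pattern workaround being alluded to (my own illustration, assuming IEEE754 binary64 and C++20 `std::bit_cast`): negating a `double` by atomically XORing the sign bit of its representation held in an integer atomic.

```cpp
#include <atomic>
#include <bit>
#include <cstdint>

// Flip the sign of a double by atomically XORing the sign bit of its bit
// pattern; fetch_xor on the integer atomic maps to a single lock xor on x86.
std::atomic<std::uint64_t> bits{std::bit_cast<std::uint64_t>(1.5)};

double atomic_negate() {
    std::uint64_t old = bits.fetch_xor(std::uint64_t{1} << 63);
    return std::bit_cast<double>(old);   // returns the old value, viewed as a double
}
```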
It makes sense that the integer specializations don't have `fetch_mul`: integer add is much cheaper, typically 1 cycle latency, the same level of complexity as atomic CAS. But for floating point, multiply and add are both quite complex and typically have similar latency. Moreover, if atomic RMW `fetch_add` is useful for anything, I'd assume `fetch_mul` would be, too. Again unlike integer, where lockless algorithms commonly add/sub but very rarely need to build an atomic shift or mul out of a CAS. x86 doesn't have memory-destination multiply, so it has no direct HW support for `lock imul`.
It seems like this is more a matter of bringing `atomic<double>` up to the level you might naively expect (supporting `.fetch_add` and sub like integers), not of providing a serious library of atomic RMW FP operations. Perhaps that makes it easier to write templates that don't have to check for integral, just numeric, types?

For pure store / pure load, maybe some global scale factor that you want to be able to publish to all threads with a simple store? And readers load it before every work unit or something. Or just as part of a lockless queue or stack of `double`.
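For instance, a sketch of that publish-a-scale-factor idea (hypothetical names, pure load/store only, no RMW needed):

```cpp
#include <atomic>

// One writer occasionally publishes a new value; many readers pick it up.
std::atomic<double> scale{1.0};

void publish_scale(double s) {
    scale.store(s, std::memory_order_relaxed);          // simple atomic store
}

double apply(double x) {
    return x * scale.load(std::memory_order_relaxed);   // re-read before each work unit
}
```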
It's not a coincidence that it took until C++20 for anyone to say "we should provide `fetch_add` for `atomic<double>` in case anyone wants it."

Plausible use-case: manually multi-threading the sum of an array (instead of using `#pragma omp parallel for simd reduction(+:my_sum_variable)` or a standard algorithm like `std::reduce` (the parallelizable counterpart of `std::accumulate`) with a C++17 parallel execution policy).

The parent thread might start with `atomic<double> total = 0;` and pass it by reference to each thread. Then threads do `*totalptr += sum_region(array + TID*size, size)` to accumulate the results, instead of having a separate output variable for each thread and collecting the results in the caller. It's not bad for contention unless all threads finish at nearly the same time. (Which is not unlikely, but it's at least a plausible scenario.)
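A minimal sketch of that pattern (my own illustration; `sum_region`, `parallel_sum`, and the exact thread layout are assumptions, not code from the answer):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread helper: plain serial sum of one region.
double sum_region(const double* p, std::size_t n) {
    double s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

double parallel_sum(const double* array, std::size_t n, unsigned nthreads) {
    std::atomic<double> total{0.0};
    std::size_t chunk = n / nthreads;          // assumes nthreads divides n, for brevity
    std::vector<std::thread> workers;
    for (unsigned tid = 0; tid < nthreads; ++tid) {
        workers.emplace_back([&, tid] {
            // C++20: atomic += on double is a single fetch_add RMW.
            total += sum_region(array + tid * chunk, chunk);
        });
    }
    for (auto& t : workers) t.join();
    return total.load();   // note: FP addition order differs from a serial sum
}
```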
If you just want separate load and separate store atomicity like you're hoping for from `volatile`, you already have that with C++11.

**Don't use `volatile` for threading: use `atomic<T>` with `mo_relaxed`**
See When to use volatile with multi threading? for details on `mo_relaxed` atomic vs. legacy `volatile` for multithreading. `volatile` data races are UB, but it does work in practice as part of roll-your-own atomics on compilers that support it, with inline asm needed if you want any ordering wrt. other operations, or if you want RMW atomicity instead of separate load / ALU / separate store. All mainstream CPUs have coherent cache/shared memory. But with C++11 there's no reason to do that: `std::atomic<>` obsoleted hand-rolled `volatile` shared variables.
At least in theory. In practice, some compilers (like GCC) still have missed-optimizations for `atomic<double>` / `atomic<float>`, even for just simple load and store. (And the new C++20 overloads aren't implemented yet on Godbolt.) `atomic<integer>` is fine, though, and optimizes as well as volatile or plain integer + memory barriers.
In some ABIs (like 32-bit x86), `alignof(double)` is only 4. Compilers normally align it by 8, but inside structs they have to follow the ABI's struct-packing rules, so an under-aligned `volatile double` is possible. Tearing will be possible in practice if it splits a cache-line boundary, or on some AMD CPUs an 8-byte boundary. `atomic<double>` instead of `volatile` can plausibly matter for correctness on some real platforms, even when you don't need atomic RMW, e.g. this G++ bug, which was fixed by increasing the `alignas()` used in the `std::atomic<>` implementation for objects small enough to be lock_free.

(And of course there are platforms where an 8-byte store isn't naturally atomic, so to avoid tearing you need a fallback to a lock. If you care about such platforms, a publish-occasionally model should use a hand-rolled SeqLock, or `atomic<float>` if `atomic<double>` isn't `always_lock_free`.)
You can get the same efficient code-gen (without extra barrier instructions) from `atomic<T>` using `mo_relaxed` as you can with `volatile`. Unfortunately, in practice not all compilers have efficient `atomic<double>`. For example, GCC9 for x86-64 copies from XMM to general-purpose integer registers: see Godbolt, GCC9 for x86-64 with gcc -O3 (an integer version is also included).
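What follows is a sketch of the kind of test case being compared; it is reconstructed from the variable names `vx`, `ax`, and `px` used below and is not the exact Godbolt source:

```cpp
#include <atomic>

volatile double vx;        // legacy roll-your-own shared variable
std::atomic<double> ax;    // what you should use instead
double px;                 // plain double, for comparison

void store_all(double v) {
    vx = v;                                   // volatile store
    ax.store(v, std::memory_order_relaxed);   // relaxed atomic store
    px = v;                                   // plain store
}

double load_all() {
    return vx + ax.load(std::memory_order_relaxed) + px;
}
```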
clang compiles it efficiently, with the same move-scalar-double load and store for `ax` as for `vx` and `px`.
Fun fact: C++20 apparently deprecates `vx += 1.0` (compound assignment on `volatile`). Perhaps this is to help avoid confusion between separate load and store, like `vx = vx + 1.0`, vs. atomic RMW? To make it clear there are 2 separate volatile accesses in that statement?
Note that `x = x + 1` is not the same thing as `x += 1` for `atomic<T> x`: the former loads into a temporary, adds, and then stores (with sequential consistency for both accesses).
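A small sketch (my own, not from the answer) making that difference concrete:

```cpp
#include <atomic>

std::atomic<double> x{0.0};

void lossy_increment() {
    // seq_cst load into a temporary, plain add, then a separate seq_cst store:
    // an update made by another thread in between is silently overwritten.
    x = x + 1.0;
}

void rmw_increment() {
    // C++20: one atomic read-modify-write (fetch_add); no lost updates.
    x += 1.0;
}
```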
Yes, this is the only purpose of an atomic, regardless of the actual type, be it an atomic `bool`, `char`, `int`, `long` or whatever.

Whatever usage you have for `type`, `std::atomic<type>` is a thread-safe version of it. Whatever usage you have for a `float` or a `double`, `std::atomic<float/double>` can be written, read or compared in a thread-safe manner.

Saying that `std::atomic<float/double>` has only rare usages is practically saying that `float/double` have rare usages.

EDIT: Adding Ulrich Eckhardt's comment to clarify: 'Let me try to rephrase that: Even if volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that.'
Volatile sometimes has the below 2 effects: it can prevent the compiler from caching the value in a register, and it prevents accesses to the value from being optimized away even when they look unnecessary from your program's point of view. See also Understanding volatile keyword in c++.

TL;DR: be explicit about what you want. Don't rely on `volatile` for atomicity or ordering; see the notes on the relationship with `volatile` under `std::memory_order`.
As a final rant: In practice, the only feasible languages for building an OS kernel are usually C and C++. Given that, I would like provisions in the 2 standards for 'telling the compiler to butt out', i.e. to be able to explicitly tell the compiler to not change the 'intent' of the code. The purpose would be to use C or C++ as a portable assembler, to an even greater degree than today.
A somewhat silly code example is worth compiling on e.g. godbolt.org for ARM and x86_64, both with gcc, to see that in the ARM case the compiler generates two `__sync_synchronize` (HW CPU barrier) operations for the atomic, but not for the volatile variant of the code (uncomment the one you want). The point is that using atomic gives predictable, portable behavior.
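A hedged reconstruction of what such a toy example might look like (variable names and structure are my own, not the original Godbolt source):

```cpp
#include <atomic>

std::atomic<int> shared_atomic{0};
volatile int shared_volatile = 0;

int bump() {
    // Uncomment exactly one of the two bodies and compare the generated asm
    // for ARM vs. x86_64: the atomic version emits the barrier / RMW
    // instructions its guarantees require, the volatile version does not.
    return shared_atomic.fetch_add(1) + 1;   // atomic RMW, ordered

    // int v = shared_volatile;              // plain load ...
    // shared_volatile = v + 1;              // ... and plain store, no ordering
    // return v + 1;
}
```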
Godbolt output for ARM gcc 8.3.1:
For those who want an X86 example, a colleague of mine, Angus Lepper, graciously contributed this example: godbolt example of bad volatile use on x86_64