Question:
I have a question regarding the order of operations in the following code:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
y.exchange(1, std::memory_order_acq_rel);
r1 = x.load(std::memory_order_relaxed);
}
void thread2() {
x.exchange(1, std::memory_order_acq_rel);
r2 = y.load(std::memory_order_relaxed);
}
Given the description of std::memory_order_acquire on the cppreference page (https://en.cppreference.com/w/cpp/atomic/memory_order), that
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load.
it seems obvious that there can never be an outcome that r1 == 0 && r2 == 0 after running thread1 and thread2 concurrently.
However, I cannot find any wording in the C++ standard (looking at the C++14 draft right now), which establishes guarantees that two relaxed loads cannot be reordered with acquire-release exchanges. What am I missing?
EDIT: As has been suggested in the comments, it is actually possible to get both r1 and r2 equal to zero. I've updated the program to use load-acquire as follows:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
y.exchange(1, std::memory_order_acq_rel);
r1 = x.load(std::memory_order_acquire);
}
void thread2() {
x.exchange(1, std::memory_order_acq_rel);
r2 = y.load(std::memory_order_acquire);
}
Now, is it possible to get both r1 and r2 equal to 0 after concurrently executing thread1 and thread2? If not, which C++ rules prevent this?
Answer 1:
The standard does not define the C++ memory model in terms of how operations are ordered around atomic operations with a specific ordering parameter.
Instead, for the acquire/release ordering model, it defines formal relationships such as "synchronizes-with" and "happens-before" that specify how data is synchronized between threads.
N4762, §29.4.2 - [atomics.order]
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
In §6.8.2.1-9, the standard also states that if a store A synchronizes with a load B, anything sequenced before A inter-thread "happens-before" anything sequenced after B.
No "synchronizes-with" (and hence inter-thread happens-before) relationship is established in your second example (the first is even weaker) because the runtime relationships (that check the return values from the loads) are missing.
But even if you did check the return value, it would not be helpful since the exchange operations do not actually 'release' anything (i.e. no memory operations are sequenced before those operations).
Neither do the atomic load operations 'acquire' anything, since no operations are sequenced after the loads.
Therefore, according to the standard, each of the four possible outcomes for the loads in both examples (including 0 0) is valid.
In fact, the guarantees given by the standard are no stronger than memory_order_relaxed on all operations.
If you want to exclude the 0 0 result in your code, all 4 operations must use std::memory_order_seq_cst. That guarantees a single total order of the involved operations.
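As a rough illustration of that last point, here is a minimal sketch of the question's program with all four operations changed to std::memory_order_seq_cst. The function names both_zero_seq_cst and count_both_zero are hypothetical helpers, not part of the question, and a test run like this can only fail to observe 0 0; the guarantee itself comes from the standard, not from the experiment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one round of the question's program with every operation upgraded
// to seq_cst. Returns true if the round ended with r1 == 0 && r2 == 0.
// (Hypothetical helper name, chosen for this sketch.)
bool both_zero_seq_cst() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        y.exchange(1, std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        x.exchange(1, std::memory_order_seq_cst);
        r2 = y.load(std::memory_order_seq_cst);
    });

    // join() synchronizes with thread completion, so reading the plain
    // ints r1 and r2 afterwards is race-free.
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: repeats the round and counts 0 0 outcomes.
int count_both_zero(int rounds) {
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_zero_seq_cst()) {
            ++hits;
        }
    }
    return hits;
}
```

Calling count_both_zero(1000) should return 0: in the single total order of the four seq_cst operations, whichever load comes last is preceded by the other thread's exchange, so it must read 1.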
Answer 2:
In the original version, it is possible to see r1 == 0 && r2 == 0 because there is no requirement that the stores propagate to the other thread before that thread performs its load. This is not a re-ordering of either thread's operations, but e.g. a read of stale cache.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
The release on Thread 1 is ignored by Thread 2, and vice-versa. In the abstract machine there is no consistency between the values of x and y as seen by the two threads.
Thread 1's cache | Thread 2's cache
x == 0; // stale | x == 1;
y == 1; | y == 0; // stale
r1 = x.load(std::memory_order_relaxed); // Thread 1
r2 = y.load(std::memory_order_relaxed); // Thread 2
You need more threads to get "violations of causality" with acquire / release pairs, as the normal ordering rules, combined with the "becomes visible side effect in" rules, force at least one of the loads to see 1.
Without loss of generality, let's assume that Thread 1 executes first.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 1; | y == 1; // sync
The release on Thread 1 forms a pair with the acquire on Thread 2, and the abstract machine describes a consistent y on both threads.
r1 = x.load(std::memory_order_relaxed); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
r2 = y.load(std::memory_order_relaxed); // Thread 2
Answer 3:
In release-acquire ordering, creating a synchronization point between two threads requires some atomic object M that is the same in both operations:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
Or, in more detail:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.
The synchronization is established only between the threads releasing and acquiring the same atomic variable.
N = u | if (M.load(acquire) == v) :[B]
[A]: M.store(v, release) | assert(N == u)
Here the synchronization point is on M: the store-release and the load-acquire (which takes its value from the store-release!). As a result, the store N = u in thread A (before the store-release on M) is visible in B (N == u) after the load-acquire on the same M.
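The diagram above can be sketched as a small C++ function. This is only an illustration under a few assumptions: the names message_passing_once, n, m and seen are made up here, the value 42 stands in for u, and thread B spins on the load-acquire instead of using a single if, so that the function always reaches the read of n.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of store-release / load-acquire message passing.
// Returns the value of the plain variable n as observed by thread B.
// (Hypothetical helper name, chosen for this sketch.)
int message_passing_once() {
    int n = 0;               // plain, non-atomic payload (N in the diagram)
    std::atomic<int> m{0};   // the common atomic object (M in the diagram)
    int seen = -1;

    std::thread a([&] {
        n = 42;                                  // N = u
        m.store(1, std::memory_order_release);   // [A]
    });
    std::thread b([&] {
        // [B]: spin until the load-acquire takes its value from [A].
        while (m.load(std::memory_order_acquire) != 1) {
            std::this_thread::yield();
        }
        // [A] synchronizes with [B], so the write to n happens-before
        // this read; reading n here is not a data race.
        seen = n;
    });

    a.join();
    b.join();
    return seen;
}
```

message_passing_once() returns 42 on every run, because the read of n only happens after the load-acquire has observed the value written by the store-release.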
Now take the example from the question:
atomic<int> x, y;
int r1, r2;
void thread_A() {
y.exchange(1, memory_order_acq_rel);
r1 = x.load(memory_order_acquire);
}
void thread_B() {
x.exchange(1, memory_order_acq_rel);
r2 = y.load(memory_order_acquire);
}
What can we choose as the common atomic object M? Say x. Then x.load(memory_order_acquire) is a synchronization point with x.exchange(1, memory_order_acq_rel) (memory_order_acq_rel includes memory_order_release and is stronger, and exchange includes a store), provided that x.load reads the value written by x.exchange. What gets synchronized is the loads after the acquire with the stores before the release. But in this code nothing exists after the acquire, and nothing exists before the exchange either.
A correct solution (see the almost identical question) could be the following:
atomic<int> x, y;
int r1, r2;
void thread_A()
{
x.exchange(1, memory_order_acq_rel); // [Ax]
r1 = y.exchange(1, memory_order_acq_rel); // [Ay]
}
void thread_B()
{
y.exchange(1, memory_order_acq_rel); // [By]
r2 = x.exchange(1, memory_order_acq_rel); // [Bx]
}
Assume that r1 == 0.
All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.
We have 2 modifications of y: [Ay] and [By]. Because r1 == 0, [Ay] comes before [By] in the total modification order of y. From this it follows that [By] reads the value stored by [Ay]. So we have the following:
- A writes to x: [Ax]
- A then does a store-release [Ay] to y (acq_rel includes release, exchange includes a store)
- B does a load-acquire from y ([By] reads the value stored by [Ay])
- once the atomic load-acquire (on y) is completed, thread B is guaranteed to see everything thread A wrote to memory before the store-release (on y). So it sees the side effect of [Ax], and r2 == 1.
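The argument above can be exercised with a small harness. This is a sketch only: both_zero_exchange and count_both_zero_exchange are hypothetical helper names, and repeated runs merely fail to find a counterexample; the actual guarantee is the reasoning about the modification order of y given above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of the two-exchange version. Returns true if the round
// ended with r1 == 0 && r2 == 0.
// (Hypothetical helper name, chosen for this sketch.)
bool both_zero_exchange() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread a([&] {
        x.exchange(1, std::memory_order_acq_rel);        // [Ax]
        r1 = y.exchange(1, std::memory_order_acq_rel);   // [Ay]
    });
    std::thread b([&] {
        y.exchange(1, std::memory_order_acq_rel);        // [By]
        r2 = x.exchange(1, std::memory_order_acq_rel);   // [Bx]
    });

    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: repeats the round and counts 0 0 outcomes.
int count_both_zero_exchange(int rounds) {
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_zero_exchange()) {
            ++hits;
        }
    }
    return hits;
}
```

count_both_zero_exchange(1000) should return 0. The difference from the question's code is that each result now comes from a read-modify-write, which always reads the latest value in the modification order of its variable, rather than from a plain load.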
Another possible solution uses atomic_thread_fence:
atomic<int> x, y;
int r1, r2;
void thread_A()
{
x.store(1, memory_order_relaxed); // [A1]
atomic_thread_fence(memory_order_acq_rel); // [A2]
r1 = y.exchange(1, memory_order_relaxed); // [A3]
}
void thread_B()
{
y.store(1, memory_order_relaxed); // [B1]
atomic_thread_fence(memory_order_acq_rel); // [B2]
r2 = x.exchange(1, memory_order_relaxed); // [B3]
}
Again, all modifications of the atomic variable y occur in a total order, so [A3] will be before [B1] or vice versa.
If [B1] is before [A3], then [A3] reads the value stored by [B1] => r1 == 1.
If [A3] is before [B1], then [B1] reads the value stored by [A3], and from fence-fence synchronization:
A release fence [A2] in thread A synchronizes-with an acquire fence [B2] in thread B, if:
- There exists an atomic object y,
- There exists an atomic write [A3] (with any memory order) that modifies y in thread A,
- [A2] is sequenced-before [A3] in thread A,
- There exists an atomic read [B1] (with any memory order) in thread B,
- [B1] reads the value written by [A3],
- [B1] is sequenced-before [B2] in thread B.
In this case, all stores ([A1]) that are sequenced-before [A2] in thread A will happen-before all loads ([B3]) from the same locations (x) made in thread B after [B2].
So [A1] (store 1 to x) will be before, and have a visible effect for, [B3] (load from x and save the result to r2). So 1 will be loaded from x and r2 == 1.
[A1]: x = 1 | if (y.load(relaxed) == 1) :[B1]
[A2]: ### release ### | ### acquire ### :[B2]
[A3]: y.store(1, relaxed) | assert(x == 1) :[B3]
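The three-line diagram above (relaxed store, release fence, relaxed store on one side; relaxed load, acquire fence, read on the other) can be written out as follows. This is a sketch under assumptions: fence_message_passing_once and seen are hypothetical names, and thread B spins on the relaxed load instead of using a single if, so that the fence and the read of x are always reached.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of fence-fence synchronization, following the diagram.
// Returns the value of x observed by thread B at [B3].
// (Hypothetical helper name, chosen for this sketch.)
int fence_message_passing_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int seen = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);                 // [A1]
        std::atomic_thread_fence(std::memory_order_release);   // [A2]
        y.store(1, std::memory_order_relaxed);                 // [A3]
    });
    std::thread b([&] {
        // [B1]: relaxed read of y, repeated until it sees [A3]'s value.
        while (y.load(std::memory_order_relaxed) != 1) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);   // [B2]
        seen = x.load(std::memory_order_relaxed);              // [B3]
    });

    a.join();
    b.join();
    return seen;
}
```

fence_message_passing_once() returns 1: [B1] reads the value written by [A3], so the release fence [A2] synchronizes with the acquire fence [B2], and [A1] happens-before [B3].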
Answer 4:
You already have an answer to the language-lawyer part of this. But I want to answer the related question of how to understand why this can be possible in asm on a possible CPU architecture that uses LL/SC for RMW atomics.
It doesn't make sense for C++11 to forbid this reordering: it would require a store-load barrier in this case where some CPU architectures could avoid one.
It might actually be possible with real compilers on PowerPC, given the way they map C++11 memory-orders to asm instructions.
On PowerPC64, a function with an acq_rel exchange and an acquire load (using pointer args instead of static variables) compiles as follows with gcc6.3 -O3 -mregnames. This is from a C11 version because I wanted to look at clang output for MIPS and SPARC, and Godbolt's clang setup works for C11 <stdatomic.h> but fails for C++11 <atomic> when you use -target sparc64.
(source + asm on Godbolt for MIPS32R6, SPARC64, ARM 32, and PowerPC64.)
foo:
lwsync # with seq_cst exchange this is full sync, not just lwsync
# gone if we use exchange with mo_acquire or relaxed
# so this barrier is providing release-store ordering
li %r9,1
.L2:
lwarx %r10,0,%r4 # load-linked from 0(%r4)
stwcx. %r9,0,%r4 # store-conditional 0(%r4)
bne %cr0,.L2 # retry if SC failed
isync # missing if we use exchange(1, mo_release) or relaxed
ld %r3,0(%r3) # 64-bit load double-word of *a
cmpw %cr7,%r3,%r3
bne- %cr7,$+4 # skip over the isync if something about the load? PowerPC is weird
isync # make the *a load a load-acquire
blr
isync is not a store-load barrier; it only requires the preceding instructions to complete locally (retire from the out-of-order part of the core). It doesn't wait for the store buffer to be flushed so other threads can see the earlier stores.
Thus the SC (stwcx.) store that's part of the exchange can sit in the store buffer and become globally visible after the pure acquire-load that follows it. In fact, another Q&A already asked this, and the answer is that we think this reordering is possible: "Does `isync` prevent Store-Load reordering on CPU PowerPC?"
If the pure load is seq_cst, PowerPC64 gcc puts a sync before the ld. Making the exchange seq_cst does not prevent the reordering. Remember that C++11 only guarantees a single total order for SC operations, so the exchange and the load both need to be SC for C++11 to guarantee it.
So PowerPC has a bit of an unusual mapping from C++11 to asm for atomics. Most systems put the heavier barriers on stores, allowing seq-cst loads to be cheaper or only have a barrier on one side. I'm not sure if this was required for PowerPC's famously-weak memory ordering, or if another choice was possible.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows some possible implementations on various architectures. It mentions multiple alternatives for ARM.
On AArch64, we get this for the original C++ version of thread1:
thread1():
adrp x0, .LANCHOR0
mov w1, 1
add x0, x0, :lo12:.LANCHOR0
.L2:
ldaxr w2, [x0] @ load-linked with acquire semantics
stlxr w3, w1, [x0] @ store-conditional with sc-release semantics
cbnz w3, .L2 @ retry until exchange succeeds
add x1, x0, 8 @ the compiler noticed the variables were next to each other
ldar w1, [x1] @ load-acquire
str w1, [x0, 12] @ r1 = load result
ret
The reordering can't happen there because AArch64 release-stores are sequential-release, not plain release. This means they can't reorder with later loads.
But on a hypothetical machine that also or instead had plain-release LL/SC atomics, it's easy to see that an acq_rel doesn't stop later loads to different cache lines from becoming globally visible after the LL but before the SC of the exchange.
If exchange is implemented with a single transaction like on x86, so the load and store are adjacent in the global order of memory operations, then certainly no later operations can be reordered with an acq_rel exchange and it's basically equivalent to seq_cst.
But LL/SC doesn't have to be a true atomic transaction to give RMW atomicity for that location.
In fact, a single asm swap instruction could have relaxed or acq_rel semantics. SPARC64 needs membar instructions around its swap instruction, so unlike x86's xchg it's not seq-cst on its own. (SPARC has really nice / human readable instruction mnemonics, especially compared to PowerPC. Well, basically anything is more readable than PowerPC.)
Thus it doesn't make sense for C++11 to require an acq_rel exchange to block reordering with later loads: it would hurt an implementation on a CPU that didn't otherwise need a store-load barrier.