Does standard C++11 guarantee that memory_order_seq_cst prevents StoreLoad reordering around an atomic operation for non-atomic memory accesses?
As is known, there are 6 std::memory_order values in C++11, and they specify how regular, non-atomic memory accesses are to be ordered around an atomic operation - Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
§ 29.3 Order and consistency
§ 29.3 / 1
The enumeration memory_order specifies the detailed regular
(non-atomic) memory synchronization order as defined in 1.10 and may
provide for operation ordering. Its enumerated values and their
meanings are as follows:
It is also known that these 6 memory orders each prevent some of the possible reorderings (LoadLoad, LoadStore, StoreStore, StoreLoad).
But does memory_order_seq_cst prevent StoreLoad reordering around an atomic operation for regular, non-atomic memory accesses, or only for other atomic operations with the same memory_order_seq_cst?
I.e., to prevent this StoreLoad reordering, should we use std::memory_order_seq_cst for both the STORE and the LOAD, or only for one of them?
std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // Sequential Consistency
a.load(std::memory_order_seq_cst); // Sequential Consistency
Acquire-Release semantics are clear: they specify exactly how non-atomic memory accesses may be reordered across atomic operations: http://en.cppreference.com/w/cpp/atomic/memory_order
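As a reminder of what Acquire-Release gives for non-atomic data, here is a minimal message-passing sketch. The names payload, ready and message_passing_once are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: publishes a non-atomic value through an atomic flag.
int message_passing_once() {
    int payload = 0;                 // regular, non-atomic data
    std::atomic<bool> ready{false};  // atomic flag that publishes the data
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // non-atomic store
        ready.store(true, std::memory_order_release);  // earlier writes cannot move below this store
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // later reads cannot move above this load
            std::this_thread::yield();
        }
        observed = payload;  // no data race: the release store synchronizes with the acquire load
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling message_passing_once() should return the value written by the producer, because the acquire load that reads true happens after the non-atomic write. Note that this pattern only needs StoreStore and LoadLoad/LoadStore ordering, not StoreLoad.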
To prevent StoreLoad reordering we should use std::memory_order_seq_cst.
Two examples:
1. std::memory_order_seq_cst for both STORE and LOAD: there is an MFENCE.
StoreLoad can't be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/mVZJs0
std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // can't be executed after LOAD
a.load(std::memory_order_seq_cst); // can't be executed before STORE
2. std::memory_order_seq_cst for LOAD only: there isn't an MFENCE.
StoreLoad can be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/2NLy12
std::atomic<int> a, b;
b.store(1, std::memory_order_release); // can be executed after LOAD
a.load(std::memory_order_seq_cst); // can be executed before STORE
Also, if the C/C++ compiler used the alternative mapping of C/C++11 to x86, which flushes the Store Buffer before the LOAD (MFENCE, MOV (from memory)), then we must use std::memory_order_seq_cst for the LOAD too: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html This example is discussed in another question as approach (3): Does it make any sense instruction LFENCE in processors x86/x86_64?
I.e. we should use std::memory_order_seq_cst for both STORE and LOAD to guarantee that an MFENCE is generated, which prevents StoreLoad reordering.
Is it true that memory_order_seq_cst for an atomic Load or Store:
- specifies Acquire-Release semantics, i.e. prevents LoadLoad, LoadStore and StoreStore reordering around an atomic operation for regular, non-atomic memory accesses,
- but prevents StoreLoad reordering around an atomic operation only for other atomic operations with the same memory_order_seq_cst?
No, standard C++11 doesn't guarantee that memory_order_seq_cst prevents StoreLoad reordering of non-atomic accesses around an atomic(seq_cst) operation.
Standard C++11 doesn't even guarantee that memory_order_seq_cst prevents StoreLoad reordering of atomic(non-seq_cst) operations around an atomic(seq_cst) operation.
Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
- There shall be a single total order S on all memory_order_seq_cst operations - C++11 Standard:
§ 29.3 / 3
There shall be a single total order S on all memory_order_seq_cst
operations, consistent with the “happens before” order and
modification orders for all affected locations, such that each
memory_order_seq_cst operation B that loads a value from an atomic
object M observes one of the following values: ...
- But any atomic operation with an ordering weaker than memory_order_seq_cst has no sequential consistency and is not part of the single total order, i.e. non-memory_order_seq_cst operations can be reordered with memory_order_seq_cst operations in the allowed directions - C++11 Standard:
§ 29.3
8 [ Note: memory_order_seq_cst ensures sequential consistency
only for a program that is free of data races and uses exclusively
memory_order_seq_cst operations. Any use of weaker ordering will
invalidate this guarantee unless extreme care is used. In particular,
memory_order_seq_cst fences ensure a total order only for the fences
themselves. Fences cannot, in general, be used to restore sequential
consistency for atomic operations with weaker ordering specifications.
— end note ]
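To make the first quoted rule concrete, here is a small store-buffering sketch in which every operation is seq_cst. Because all four operations take part in the single total order S, at least one of the two loads must observe the other thread's store. The function name count_both_zero_seq_cst and the variables x, y, r1, r2 are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the store-buffering pattern `iterations` times
// and counts the runs in which both loads read 0.
int count_both_zero_seq_cst(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // STORE
            r1 = y.load(std::memory_order_seq_cst);  // LOAD, cannot precede the STORE in S
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // would mean a StoreLoad reordering happened in both threads
        }
    }
    return both_zero;
}
```

Under the standard's rules the returned count is 0 for any number of iterations. The guarantee comes from all operations being seq_cst, which is exactly the condition the quoted note describes.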
C++ compilers also allow such reorderings:
- On x86_64
Usually, if the compiler implements seq_cst as a barrier after the store, then:
STORE-C(relaxed);
LOAD-B(seq_cst);
can be reordered to:
LOAD-B(seq_cst);
STORE-C(relaxed);
Asm generated by GCC 7.0 x86_64: https://godbolt.org/g/4yyeby
Also theoretically possible, if the compiler implements seq_cst as a barrier before the load, then:
STORE-A(seq_cst);
LOAD-C(acquire);
can be reordered to:
LOAD-C(acquire);
STORE-A(seq_cst);
- On PowerPC
STORE-A(seq_cst);
LOAD-C(relaxed);
can be reordered to:
LOAD-C(relaxed);
STORE-A(seq_cst);
Also, on PowerPC the following reordering is possible:
STORE-A(seq_cst);
STORE-C(relaxed);
can be reordered to:
STORE-C(relaxed);
STORE-A(seq_cst);
If even atomic variables are allowed to be reordered across atomic(seq_cst), then non-atomic variables can also be reordered across atomic(seq_cst).
Asm generated by GCC 4.8 PowerPC: https://godbolt.org/g/BTQBr8
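The reorderings summarized above correspond to the store-buffering pattern with a weaker store. A minimal sketch, assuming hypothetical names (count_both_zero_release_store, x, y, r1, r2): the stores are only release, so they are not part of the total order S, and the outcome where both loads read 0 is permitted.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: counts the runs in which both loads read 0.
// With release stores this outcome is allowed by the standard, so the
// result may be non-zero; how often it shows up depends on the hardware,
// the compiler and timing.
int count_both_zero_release_store(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_release);   // not in the total order S
            r1 = y.load(std::memory_order_seq_cst);  // may take effect before the store above
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_release);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

A result of 0 on a particular run proves nothing; the point is only that the standard does not forbid a non-zero result here, whereas it does when the stores are seq_cst too.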
More details:
- On x86_64
STORE-C(release);
LOAD-B(seq_cst);
can be reordered to:
LOAD-B(seq_cst);
STORE-C(release);
Intel® 64 and IA-32 Architectures
8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
I.e. x86_64 code:
STORE-A(seq_cst);
STORE-C(release);
LOAD-B(seq_cst);
Can be reordered to:
STORE-A(seq_cst);
LOAD-B(seq_cst);
STORE-C(release);
This can happen because there is no mfence between c.store and b.load:
x86_64 - GCC 7.0: https://godbolt.org/g/dRGTaO
C++ & asm - code:
#include <atomic>
// Atomic load-store
void test() {
    std::atomic<int> a, b, c;
    a.store(2, std::memory_order_seq_cst);        // movl 2,[a]; mfence;
    c.store(4, std::memory_order_release);        // movl 4,[c];
    int tmp = b.load(std::memory_order_seq_cst);  // movl [b],[tmp];
}
It can be reordered to:
#include <atomic>
// Atomic load-store
void test() {
    std::atomic<int> a, b, c;
    a.store(2, std::memory_order_seq_cst);        // movl 2,[a]; mfence;
    int tmp = b.load(std::memory_order_seq_cst);  // movl [b],[tmp];
    c.store(4, std::memory_order_release);        // movl 4,[c];
}
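If the release store has to stay ordered before the following load, one option is an explicit std::atomic_thread_fence(std::memory_order_seq_cst) between them. The sketch below is an illustration under that assumption, with hypothetical names (count_both_zero_with_fence, x, y, r1, r2). It relies on the standard's rule for two seq_cst fences: whichever fence comes first in S, the load sequenced after the other fence must observe the store sequenced before the first one. As the quoted note warns, fences do not restore sequential consistency in general; this only covers the specific store-then-load pattern.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: store-buffering pattern with a seq_cst fence between
// the weaker store and the load. Counts the runs in which both loads read 0.
int count_both_zero_with_fence(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_release);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_release);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // forbidden by the fence-to-fence rule
        }
    }
    return both_zero;
}
```

On x86_64 a compiler would typically implement this fence with mfence or a locked instruction, but that is an implementation detail rather than something the standard promises.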
Also, Sequential Consistency in x86/x86_64 can be implemented in four ways: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
1. LOAD (without fence) and STORE + MFENCE
2. LOAD (without fence) and LOCK XCHG
3. MFENCE + LOAD and STORE (without fence)
4. LOCK XADD(0) and STORE (without fence)
- Ways 1 and 2: LOAD and (STORE + MFENCE)/(LOCK XCHG) - we reviewed above
- Ways 3 and 4: (MFENCE + LOAD)/LOCK XADD and STORE - allow the following reordering:
STORE-A(seq_cst);
LOAD-C(acquire);
can be reordered to:
LOAD-C(acquire);
STORE-A(seq_cst);
- On PowerPC
STORE-A(seq_cst);
LOAD-C(relaxed);
can be reordered to:
LOAD-C(relaxed);
STORE-A(seq_cst);
PowerPC allows Store-Load reordering (Table 5 - PowerPC, "Stores Reordered After Loads"): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
I.e. PowerPC code:
STORE-A(seq_cst);
STORE-C(relaxed);
LOAD-C(relaxed);
LOAD-B(seq_cst);
Can be reordered to:
LOAD-C(relaxed);
STORE-A(seq_cst);
STORE-C(relaxed);
LOAD-B(seq_cst);
PowerPC - GCC 4.8 : https://godbolt.org/g/xowFD3
C++ & asm - code:
#include <atomic>
// Atomic load-store
void test() {
    std::atomic<int> a, b, c;                     // addr: 20, 24, 28
    a.store(2, std::memory_order_seq_cst);        // li r9<-2; sync; stw r9->[a];
    c.store(4, std::memory_order_relaxed);        // li r9<-4; stw r9->[c];
    c.load(std::memory_order_relaxed);            // lwz r9<-[c];
    int tmp = b.load(std::memory_order_seq_cst);  // sync; lwz r9<-[b]; ... isync;
}
If a.store is split into two parts, the code can be reordered to:
#include <atomic>
// Atomic load-store
void test() {
    std::atomic<int> a, b, c;                     // addr: 20, 24, 28
    //a.store(2, std::memory_order_seq_cst);      // part-1: li r9<-2; sync;
    c.load(std::memory_order_relaxed);            // lwz r9<-[c];
    a.store(2, std::memory_order_seq_cst);        // part-2: stw r9->[a];
    c.store(4, std::memory_order_relaxed);        // li r9<-4; stw r9->[c];
    int tmp = b.load(std::memory_order_seq_cst);  // sync; lwz r9<-[b]; ... isync;
}
Here the load from memory lwz r9<-[c]; is executed earlier than the store to memory stw r9->[a];.
Also, on PowerPC the following reordering is possible:
STORE-A(seq_cst);
STORE-C(relaxed);
can be reordered to:
STORE-C(relaxed);
STORE-A(seq_cst);
This is because PowerPC has a weak memory ordering model that allows Store-Store reordering (Table 5 - PowerPC, "Stores Reordered After Stores"): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
I.e. on PowerPC a Store can be reordered with another Store, so the previous example can be reordered as follows:
#include <atomic>
// Atomic load-store
void test() {
    std::atomic<int> a, b, c;                     // addr: 20, 24, 28
    //a.store(2, std::memory_order_seq_cst);      // part-1: li r9<-2; sync;
    c.load(std::memory_order_relaxed);            // lwz r9<-[c];
    c.store(4, std::memory_order_relaxed);        // li r9<-4; stw r9->[c];
    a.store(2, std::memory_order_seq_cst);        // part-2: stw r9->[a];
    int tmp = b.load(std::memory_order_seq_cst);  // sync; lwz r9<-[b]; ... isync;
}
Here the store to memory stw r9->[c]; is executed earlier than the store to memory stw r9->[a];.
std::memory_order_seq_cst guarantees that there is no reordering by either the compiler or the CPU. In this case the memory order is the same as if only one instruction were executed at a time.
But the compiler optimization confuses the issue: if you turn off -O3, then the fence is there.
With -O3 the compiler can see that the mfence has no consequence in your test program, because the program is too simple.
If you ran it on ARM, on the other hand, like this, you can see the barriers dmb ish.
So if your program is more complex you might see the mfence in this part of the code, but not if the compiler can analyse the code and reason that it is not needed.