My test code is below. I found that only memory_order_seq_cst prevented compiler reordering.
#include <atomic>
using namespace std;
int A, B = 1;
void func(void) {
    A = B + 1;
    atomic_thread_fence(memory_order_seq_cst);
    B = 0;
}
Other choices, such as memory_order_release and memory_order_acq_rel, did not generate any compiler barrier at all.
I think they must work with an atomic variable, as below.
#include <atomic>
using namespace std;
atomic<int> A(0);
int B = 1;
void func(void) {
    A.store(B+1, memory_order_release);
    B = 0;
}
But I do not want to use an atomic variable. At the same time, I think asm("":::"memory") is too low-level.
Is there a better choice?
re: your edit:
But I do not want to use atomic variable.
Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead, other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that it does order non-atomic<> loads and/or stores, so it might even help avoid data-race Undefined Behaviour in C++.
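As a minimal sketch of the relaxed-atomics suggestion, here is the question's function rewritten so that both variables are atomic<int>, every access uses memory_order_relaxed, and the only ordering comes from atomic_signal_fence. Making B atomic as well is an assumption on my part; the fence strength shown is just one choice.

```cpp
#include <atomic>

std::atomic<int> A{0};
std::atomic<int> B{1};

void func() {
    // Relaxed accesses: atomicity only, no ordering implied by the operations.
    A.store(B.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);

    // Compile-time ordering only: the compiler may not move the store to B
    // above this point or the store to A below it.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    B.store(0, std::memory_order_relaxed);
}
```

Compare the asm for this against the plain-int version on your own compiler and target to see whether the atomics cost anything there.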
Sufficient for what?
Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.
That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time
This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.
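A rough sketch of the signal-handler case, assuming a hypothetical payload/ready pair and a handler installed for SIGINT (none of these names come from the question). The flag is a relaxed atomic<int>, assumed lock-free on the target, and the fences only constrain the compiler. std::raise is used here so the example is deterministic; a real asynchronous signal is the case where the fences actually matter.

```cpp
#include <atomic>
#include <csignal>

int payload = 0;              // plain, non-atomic data
std::atomic<int> ready{0};    // flag: payload has been written
std::atomic<int> seen{-1};    // what the handler observed

extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Keep the read of payload after the read of the flag (compile time).
        std::atomic_signal_fence(std::memory_order_acquire);
        seen.store(payload, std::memory_order_relaxed);
    }
}

int publish_and_raise(int value) {
    auto previous = std::signal(SIGINT, on_signal);
    if (previous == SIG_ERR) return -1;   // could not install the handler

    payload = value;
    // Keep the write of payload before the write of the flag (compile time).
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);

    std::raise(SIGINT);                   // handler runs in this thread
    std::signal(SIGINT, previous);        // restore the old disposition
    return seen.load(std::memory_order_relaxed);
}
```

Release on the writer side and acquire in the handler are the weakest orders that express this pattern; seq_cst would also work but blocks more compile-time reordering.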
... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!
Totally wrong, in several ways.
atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict reordering in the order our loads/stores become visible to other threads.
I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)
On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).
A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.
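To look at the difference yourself, something like the following pair of functions can be pasted into a compiler explorer and built for both x86 and ARM. The function names are made up for illustration, and what the thread fence compiles to depends on the compiler version and target.

```cpp
#include <atomic>

int A = 0;
int B = 0;

// Compile-time barrier only: should add no instructions on any target.
void with_signal_fence() {
    A = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    B = 1;
}

// Compiler barrier plus whatever run-time barrier the target needs.
// On a weakly-ordered ISA this is expected to emit a barrier instruction.
void with_thread_fence() {
    A = 2;
    std::atomic_thread_fence(std::memory_order_acq_rel);
    B = 2;
}
```

Both functions behave identically when run in a single thread; the only difference is in the generated asm.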
A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).
So this example doesn't really test whether something is a compiler barrier or not.
Strange compiler behaviour from gcc for an example that is different with a compiler barrier:
See this source+asm on Godbolt.
#include <atomic>
using namespace std;
int A,B;
void foo() {
    A = 0;
    atomic_thread_fence(memory_order_release);
    B = 1;
    //asm volatile(""::: "memory");
    //atomic_signal_fence(memory_order_release);
    atomic_thread_fence(memory_order_release);
    A = 2;
}
This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.
# clang3.9 -O3
mov dword ptr [rip + A], 0
mov dword ptr [rip + B], 1
mov dword ptr [rip + A], 2
ret
But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.
# gcc6.2 -O3
mov DWORD PTR B[rip], 1
mov DWORD PTR A[rip], 2
ret
But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.
One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.
BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected asm:
# gcc6.2 -O3, with a mo_seq_cst barrier
mov DWORD PTR A[rip], 0
mov DWORD PTR B[rip], 1
mfence
mov DWORD PTR A[rip], 2
ret
We get this even with only one barrier, which would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens). Current compilers don't usually do this kind of optimization, though. See discussion at the end of my answer on Can num++ be atomic for 'int num'?.