As we know from a previous answer to "Does it make any sense instruction LFENCE in processors x86/x86_64?", we cannot use SFENCE instead of MFENCE for sequential consistency.

An answer there suggests that MFENCE = SFENCE + LFENCE, i.e. that LFENCE does something without which we cannot provide sequential consistency.
LFENCE makes the following reordering impossible:

SFENCE
LFENCE
MOV reg, [addr]

-- To -->

MOV reg, [addr]
SFENCE
LFENCE
For example, the reordering of

MOV [addr], reg
LFENCE

-- To -->

LFENCE
MOV [addr], reg

is provided by the store-buffer mechanism, which reorders stores past later loads for a performance increase, and LFENCE does not prevent it. SFENCE, on the other hand, disables this mechanism.
What mechanism does LFENCE disable to make such reordering impossible (x86 has no invalidate-queue mechanism)?
And is the reordering of

SFENCE
MOV reg, [addr]

-- To -->

MOV reg, [addr]
SFENCE

possible only in theory, or also in reality? If it is possible in reality, what mechanisms allow it, and how does it work?
From the Intel manuals, volume 2A, page 3-464, the documentation for the LFENCE instruction:

    Performs a serializing operation on all load-from-memory instructions that were issued prior to the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

So yes, your example reordering is explicitly prevented by the LFENCE instruction. Your second example, involving only SFENCE instructions, IS a valid reordering, since SFENCE has no impact on load operations.

In general MFENCE != SFENCE + LFENCE. For example, the code below, when compiled with -DBROKEN, fails on some Westmere and Sandy Bridge systems but appears to work on Ryzen. In fact, on AMD systems just an SFENCE seems to be sufficient.

SFENCE + LFENCE doesn't block StoreLoad reordering, so it's not sufficient for sequential consistency. Only mfence (or a locked operation, or a truly serializing instruction like cpuid) will do that. See Jeff Preshing's "Memory Reordering Caught in the Act" for a case where only a full barrier is sufficient.

From Intel's instruction-set reference manual entry for sfence:

    The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.

but
LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that just means putting data or a marker in the memory-order buffer, not flushing it so the store becomes globally visible. i.e. SFENCE "completion" (retirement from the ROB) doesn't include flushing the store buffer.
This is like what Preshing describes in "Memory Barriers Are Like Source Control Operations", where StoreStore barriers aren't "instant". Later in that article, he explains why a #StoreStore + #LoadLoad + #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer, the reasoning still holds.)
LFENCE is not fully serializing like cpuid (which is as strong a memory barrier as mfence or a locked instruction). It's just a LoadLoad + LoadStore barrier, plus some execution-serialization stuff which maybe started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It's useful with rdtsc, and for avoiding branch speculation to mitigate Spectre.

BTW, SFENCE is a no-op except for NT stores; it orders them with respect to normal (release) stores, but not with respect to loads or LFENCE. Only on a CPU that's normally weakly ordered does a store-store barrier do anything.
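That NT-store case can be sketched as follows (assuming an x86 target with SSE2; the function and variable names are illustrative, not from the answer):

```c
/* Sketch, assuming x86 with SSE2: a non-temporal store is weakly ordered,
 * so SFENCE is needed to make it globally visible before the flag store.
 * payload/ready/publish are illustrative names. */
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#include <stdatomic.h>

static int payload;
static atomic_int ready;

void publish(int v) {
    _mm_stream_si32(&payload, v);  /* NT store: bypasses normal store ordering */
    _mm_sfence();                  /* order the NT store before the flag below */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consume(void) {
    /* acquire-load the flag, then do an ordinary read of the payload */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until published */
    return payload;
}
```

Without the _mm_sfence(), the NT store could still be sitting in a write-combining buffer when ready becomes 1, so a reader on another core could observe the flag before the data. For ordinary (non-NT) stores, x86's normal ordering already gives the release semantics and the SFENCE would be redundant.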
The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load.
For example, a store, then the two barriers, then a load:

MOV [addr1], reg1
SFENCE
LFENCE
MOV reg2, [addr2]

can become globally visible (i.e. commit to L1d cache) in this order: