I have read different things about how a memory barrier works.
For example, Johan's answer to this question says that a memory barrier is an instruction that the CPU executes.
Peter Cordes's comment on this question, on the other hand, says the following about how the CPU reorders instructions:
It reads faster than it can execute, so it can see a window of
upcoming instructions. For details, see some of the links in the x86
tag wiki, like Agner Fog's microarch pdf, and also David Kanter's
writeup of Intel's Haswell design. Of course, if you had simply
googled "out of order execution", you'd find the wikipedia article,
which you should read.
Based on that comment, I'm guessing that if a memory barrier sits between instructions, the CPU sees it and therefore doesn't reorder those instructions. That would make a memory barrier a "marker" for the CPU to see rather than something to execute.
Now my guess is that a memory barrier acts both as a marker and as an instruction for the CPU to execute.
For the marker part, the CPU sees the memory barrier between the instructions, which causes the CPU not to reorder the instructions.
As for the instruction part, the CPU will execute the memory barrier instruction, which causes the CPU to do things like flushing the store buffer, and then the CPU will continue to execute the instructions after the memory barrier.
Am I correct?
No, mfence is not serializing on the instruction stream, and lfence (which is) doesn't flush the store buffer.
(In practice on Skylake, mfence does block out-of-order execution of later ALU instructions, not just loads. (Proof: experiment details at the bottom of this answer). So it's implemented as an execution barrier, even though on paper it's not required to be one. But lock xchg doesn't, and is also a full barrier.)
I'd suggest reading Jeff Preshing's Memory Barriers Are Like Source Control Operations article, to get a better understanding of what memory barriers need to do, and what they don't need to do. They don't (need to) block out-of-order execution in general.
A memory barrier restricts the order that memory operations can become globally visible, not (necessarily) the order in which instructions execute. Go read @BeeOnRope's updated answer to your previous question again: Does an x86 CPU reorder instructions? to learn more about how memory reordering can happen without OoO exec, and how OoO exec can happen without memory reordering.
Stalling the pipeline and flushing buffers is one (low-performance) way to implement barriers, used on some ARM chips, but higher-performance CPUs with more tracking of memory ordering can have cheaper memory barriers that only restrict ordering of memory operations, not all instructions. And for memory ops, they control order of access to L1d cache (at the other end of the store buffer), not necessarily the order that stores write their data into the store buffer.
x86 already needs lots of memory-order tracking for normal loads/stores for high performance while maintaining its strongly-ordered memory model where only StoreLoad reordering is allowed to be visible to observers outside the core (i.e. stores can be buffered until after later loads). (Intel's optimization manual uses the term Memory Order Buffer, or MOB, instead of store buffer, because it has to track load ordering as well. It has to do a memory-ordering machine clear if it turns out that a speculative load took data too early.) Modern x86 CPUs preserve the illusion of respecting the memory model while actually executing loads and stores aggressively out of order.
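The StoreLoad reordering described above can be observed with the classic "store buffer" litmus test. The following is a minimal sketch in portable C++, not something taken from the tests in this answer: the function name count_both_zero and its parameters are made up for illustration. It assumes the compiler turns std::atomic_thread_fence(std::memory_order_seq_cst) into a full barrier (on x86, typically mfence or a locked instruction).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iterations` times
// and returns how many runs ended with BOTH threads loading 0, i.e. how often
// StoreLoad reordering was visible. With use_fence = true, a full barrier
// separates each thread's store from its load.
int count_both_zero(int iterations, bool use_fence) {
    assert(iterations > 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0}, ready{0};
        int r1 = -1, r2 = -1;

        // Each thread stores 1 to its own variable, then loads the other one.
        auto run = [&](std::atomic<int>& mine, std::atomic<int>& other, int& result) {
            ready.fetch_add(1);
            while (ready.load() < 2)            // crude start gate so the threads overlap
                std::this_thread::yield();
            mine.store(1, std::memory_order_relaxed);
            if (use_fence)
                std::atomic_thread_fence(std::memory_order_seq_cst);
            result = other.load(std::memory_order_relaxed);
        };

        std::thread t1([&] { run(x, y, r1); });
        std::thread t2([&] { run(y, x, r2); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

With use_fence = true the C++ memory model forbids the "both zero" outcome, so the count is always 0. With use_fence = false the outcome is allowed (the store can still be sitting in the store buffer when the load executes), but how often it actually shows up depends on the machine and on timing.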
mfence can do its job just by writing a marker into the memory-order buffer, without being a barrier for out-of-order execution of later ALU instructions. This marker must at least prevent later loads from executing until the mfence marker reaches the end of the store buffer. (As well as ordering NT stores and operations on weakly-ordered WC memory).
(But again, simpler behaviour is a valid implementation choice, for example not letting any stores after an mfence write data to the store buffer until all earlier loads have retired and earlier stores have committed to L1d cache. i.e. fully drain the MOB / store buffer. I don't know exactly what current Intel or AMD CPUs do.)
On Skylake specifically, my testing shows mfence is 4 uops for the front-end (fused domain), and 2 uops that actually execute on execution ports (one for port2/3 (load/store-address), and one for port4 (store-data)). Presumably it's a special kind of uop that writes a marker into the memory-order buffer. The 2 uops that don't need an execution unit might be similar to lfence. I'm not sure if they block the front-end from even issuing a later load, but hopefully not because that would stop later independent ALU operations from being executed.
lfence is an interesting case: as well as being a LoadLoad + LoadStore barrier (even for weakly-ordered loads; normal loads/stores are already ordered), lfence is also a weak execution barrier (note that mfence isn't, just lfence). It can't execute until all earlier instructions have "completed locally". Presumably that means "retired" from the out-of-order core.
But a store can't commit to L1d cache until after it retires anyway (i.e. after it's known to be non-speculative), so waiting for stores to retire from the ROB (ReOrder Buffer for uops) isn't the same thing as waiting for the store buffer to empty. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.
So yes, the CPU pipeline does have to "notice" lfence before it executes, presumably in the issue/rename stage. My understanding is that lfence can't issue until the ROB is empty. (On Intel CPUs, lfence is 2 uops for the front-end, but neither of them need execution units, according to Agner Fog's testing. http://agner.org/optimize/.)
lfence is even cheaper on AMD Bulldozer-family: 1 uop with 4-per-clock throughput. IIRC, it's not partially-serializing on those CPUs, so you can only use lfence; rdtsc to stop rdtsc from sampling the clock early on Intel CPUs.
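For reference, the lfence; rdtsc idiom looks roughly like this with compiler intrinsics. This is only a sketch: fenced_timestamp is a made-up name, the x86 path assumes GCC or Clang on x86-64 (where <x86intrin.h> provides _mm_lfence and __rdtsc), and the std::chrono fallback is there only so the code still builds on other targets.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__x86_64__)
#include <x86intrin.h>   // GCC/Clang: _mm_lfence, __rdtsc
#endif

// Hypothetical helper: read a timestamp that is not sampled before earlier
// instructions have completed (on CPUs where lfence is partially serializing).
inline std::uint64_t fenced_timestamp() {
#if defined(__x86_64__)
    _mm_lfence();        // earlier instructions must complete locally first
    return __rdtsc();    // then sample the time-stamp counter
#else
    // Assumption: no TSC available here, so fall back to a monotonic clock.
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
#endif
}
```

A typical use is to call fenced_timestamp() before and after the code being timed and subtract the two values. The result is in TSC reference ticks on the x86 path, not core clock cycles.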
For fully serializing instructions like cpuid or iret, it would also wait until the store buffer has drained. (They're full memory barriers, as strong as mfence). Or something like that; they're multiple uops so maybe only the last one does the serializing, I'm not sure which side of the barrier the actual work of cpuid happens on (or if it can't overlap with either earlier or later instructions). Anyway, the pipeline itself has to notice serializing instructions, but the full memory-barrier effect might be from uops that do what mfence does.
Bonus reading:
On AMD Bulldozer-family, sfence is as expensive as mfence, and may be as strong a barrier. (The x86 docs set a minimum for how strong each kind of barrier is; they don't prevent them from being stronger because that's not a correctness problem). Ryzen is different: sfence has one per 20c throughput, while mfence is 1 per 70c.
sfence is very cheap on Intel (a uop for port2/port3, and a uop for port4), and just orders NT stores wrt. normal stores, not flushing the store buffer or serializing execution. It can execute at one per 6 cycles.
sfence doesn't drain the store buffer before retiring. It doesn't become globally visible itself until all preceding stores have become globally visible first, but this is decoupled from the execution pipeline by the store buffer. The store buffer is always trying to drain itself (i.e. commit stores to L1d) so sfence doesn't have to do anything special, except for putting a special kind of mark in the MOB that stops NT stores from reordering past it, unlike the marks that regular stores put which only order wrt. regular stores and later loads.
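As an illustration of where sfence matters in practice, here is a minimal sketch of publishing data that was written with NT stores. The function name publish_with_nt_stores and the buffer size are made up for this example. _mm_stream_si32 and _mm_sfence are the real SSE2/SSE intrinsics, used only on x86-64; on other targets the sketch falls back to plain stores, for which the release store alone is enough.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__)
#include <immintrin.h>   // _mm_stream_si32, _mm_sfence
#endif

// Hypothetical demo: a producer fills a buffer with NT stores, then sets a
// flag; a consumer waits for the flag and sums the buffer.
// Returns the sum the consumer saw.
int publish_with_nt_stores() {
    constexpr int kCount = 64;
    alignas(64) int payload[kCount] = {};
    std::atomic<int> ready{0};
    int sum = 0;

    std::thread consumer([&] {
        while (ready.load(std::memory_order_acquire) == 0)
            std::this_thread::yield();
        for (int i = 0; i < kCount; ++i)
            sum += payload[i];
    });

    std::thread producer([&] {
        for (int i = 0; i < kCount; ++i) {
#if defined(__x86_64__)
            _mm_stream_si32(&payload[i], i + 1);   // weakly-ordered NT store
#else
            payload[i] = i + 1;                    // assumption: plain store elsewhere
#endif
        }
#if defined(__x86_64__)
        // A release store on x86 is just a plain mov, which does not order
        // earlier NT stores, so the NT stores need an explicit sfence.
        _mm_sfence();
#endif
        ready.store(1, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    assert(ready.load() == 1);
    return sum;
}
```

Without the _mm_sfence() call, the flag store could become globally visible before some of the NT stores on x86, and the consumer could read stale zeros. With it, the consumer always sees the full payload (1 + 2 + ... + 64).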
It reads faster than it can execute, so it can see a window of upcoming instructions.
See this answer I wrote which is a more detailed version of my comment. It goes over some basics of how a modern x86 CPU finds and exploits instruction-level parallelism by looking at instructions that haven't executed yet.
In code with high ILP, recent Intel CPUs can actually bottleneck on the front-end fairly easily; the back-end has so many execution units that it's rarely a bottleneck unless there are data dependencies or cache misses, or you use a lot of a single instruction that can only run on limited ports. (e.g. vector shuffles). But any time the back-end doesn't keep up with the front-end, the out-of-order window starts to fill with instructions to find parallelism in.