What I'm wondering is whether lock xchg will behave similarly to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee that I get the most up-to-date value, including for the memory read/write instructions that follow it?
The reason for my confusion is:
8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”
-Intel 64 Developers Manual Vol. 3
Does this apply across threads?
The documentation for mfence states:
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).
-Intel 64 Developers Manual Vol 3A
This sounds like a stronger guarantee, as it sounds like mfence is almost flushing the write buffer, or at least reaching out to the write buffer and other cores to ensure my future loads/stores are up to date.
When benchmarked, both instructions take on the order of 100 cycles to complete, so I can't see much of a difference either way.
Primarily I am just confused. I see instructions based around lock used in mutexes, but these contain no memory fences. Then I see lock-free programming that uses memory fences, but no locks. I understand AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't have the same behavior as mfence, then how do mutexes help you see the most recent value?
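To make the mutex case concrete, here is a minimal sketch of a lock built on an atomic exchange. It is an illustration under assumptions, not the code of any real mutex implementation: `SpinLock` and `run_counter_demo` are hypothetical names, and the claim that `exchange` becomes an (implicitly locked) `xchg` describes what x86 compilers typically emit rather than something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spinlock: no explicit fence instruction appears anywhere.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Atomic read-modify-write with acquire ordering. On x86 this is
        // typically compiled to xchg with a memory operand.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait with plain loads until the lock looks free, then retry.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // Release store: writes made inside the critical section become
        // visible to the next thread whose acquire exchange sees 'false'.
        locked_.store(false, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads under the lock
// and returns the final value.
long run_counter_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;  // deliberately not atomic; protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `run_counter_demo(4, 50000)` should return 200000: every increment happens after the previous holder's release, so no update is lost even though the counter itself is an ordinary variable.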
I believe your question is the same as asking whether mfence has the same barrier semantics as the lock-prefixed instructions on x86, or whether it provides fewer [1] or additional guarantees in some cases.
My current best answer is that it was Intel's intent, and the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.
We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory:
From this, we can determine that (1) Intel probably intended locked instructions to fence NT loads from WC-type memory, or else this wouldn't be an erratum [0.5], and (2) locked instructions don't actually do that, Intel wasn't able to or chose not to fix this with a microcode update, and mfence is recommended instead.
In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. This has pretty much the same text as the lock-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".
This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with an apparently less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.
That doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum, however.
One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instruction for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.
Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.
Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.
The remaining part of this answer is my original answer, written before I knew about the errata, and is left mostly for historical interest.
Old Answer
My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).
That said, this is just an informed guess, and you'll find details of my investigation below.
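As an illustration of the weakly-ordered case, the sketch below publishes data written with non-temporal stores and fences them before setting a flag. It is a hedged example, not a statement of what any particular library does: `payload`, `ready`, `producer`, `consumer`, `run_nt_publish` and the `DEMO_HAVE_NT` macro are hypothetical names, the payload size is arbitrary, and the intrinsics (`_mm_stream_si32` for movnti, `_mm_sfence` for sfence) are only used when compiling for x86-64, with a plain-store fallback elsewhere.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_stream_si32 (movnti), _mm_sfence (sfence)
#define DEMO_HAVE_NT 1
#else
#define DEMO_HAVE_NT 0  // assumption: fall back to ordinary stores off x86-64
#endif

constexpr int kWords = 1024;             // arbitrary payload size
alignas(64) static int payload[kWords];  // ordinary write-back memory
static std::atomic<bool> ready{false};

static void producer() {
    for (int i = 0; i < kWords; ++i) {
#if DEMO_HAVE_NT
        _mm_stream_si32(&payload[i], i);  // weakly-ordered (NT) store
#else
        payload[i] = i;
#endif
    }
#if DEMO_HAVE_NT
    // Fence the NT stores before publishing. _mm_mfence() would also do;
    // whether a locked instruction could stand in here is the open question.
    _mm_sfence();
#endif
    ready.store(true, std::memory_order_release);
}

static long consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    long sum = 0;
    for (int i = 0; i < kWords; ++i) sum += payload[i];
    return sum;
}

// Runs one producer and one consumer and returns the sum the consumer saw.
long run_nt_publish() {
    ready.store(false);
    long sum = -1;
    std::thread c([&] { sum = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(ready.load());
    return sum;
}
```

With the fence in place, `run_nt_publish()` should return 523776, the sum of 0 through 1023. Intel's documentation for the NT store instructions recommends sfence or mfence for this purpose, which is why the sketch uses one of them rather than a locked instruction.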
Details
Documentation
It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).
I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.
What is open for debate is whether mfence differs at all from lock-prefixed instructions in scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.
For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.
For example, quoting Dr. McCalpin in this thread (emphasis added):
Let's check out the referenced section 8.2.5 of the Intel SDM:
Contrary to Dr. McCalpin's interpretation [2], I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to IO, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They don't make any exception for weakly ordered memory, and in the case of the IO instructions, one would also assume they need to work in a consistent way with weakly ordered memory regions, since such regions are often used for IO.
Then, in the section for the FENCE instructions, it explicitly mentions weak memory regions: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."
Do we read between the lines and take this to mean that these are the only instructions that accomplish this, and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced [3] at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:
Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.
We also find support for the idea that locked instructions might not provide a barrier between weakly ordered accesses in the last sentence already quoted above:
Here it basically implies that the FENCE instructions essentially replace functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, that would likely have been the previously suggested way, since these are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly ordered scenarios) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.
However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient barriers.
Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions, which are generally 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions [4], at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.
So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.
We can further look at the documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:
The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this rather to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints" or something like that. Indeed, the actual memory type (e.g., as defined by the MTRR) probably doesn't even come into play here: the ordering issues can arise solely in WB-memory when using weakly ordered instructions.
Performance
The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner Fog's instruction timing, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.
If mfence provided barrier semantics no stronger than lock cmpxchg, the latter is doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the lock instructions, mfence would simply use the same one, as that's the simplest and easiest to validate.
So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.
[0.5] This isn't a water-tight argument. Some things may appear in errata that are apparently "by design" and not a bug, such as the popcnt false dependency on the destination register, so some errata can be considered a form of documentation to update expectations rather than always implying a hardware bug.
[1] Evidently, lock-prefixed instructions also perform an atomic operation, which isn't possible to achieve solely with mfence, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.
[2] It is also entirely possible that he was reading a different version of the manual where the prose was different.
[3] SFENCE in SSE, lfence and mfence in SSE2.
[4] And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.