I have been trying to Google my question but I honestly don't know how to succinctly state the question.
Suppose I have two threads in a multi-core Intel system. These threads are running on the same NUMA node. Suppose thread 1 writes to X once, then only reads it occasionally moving forward. Suppose further that, among other things, thread 2 reads X continuously. If I don't use a memory fence, how long could it be between thread 1 writing X and thread 2 seeing the updated value?
I understand that the write of X will go to the store buffer and from there to the cache, at which point MESIF will kick in and thread 2 will see the updated value via QPI. (Or at least this is what I've gleaned.) I presume that the store buffer would get written to the cache either on a store fence or if that store buffer entry needs to be reused, but I don't know how store buffer entries get allocated to writes.
Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds in a fairly complicated application that is doing other work.
Memory barriers don't make other threads see your stores any faster. (Except that blocking later loads could slightly reduce contention for committing buffered stores.)
The store buffer always tries to commit retired (known non-speculative) stores to L1d cache as fast as possible. That makes them globally visible because of MESI/MESIF/MOESI. The store buffer is not designed as a proper cache or write-combining buffer (although it can combine back-to-back stores to the same cache line), so it needs to empty itself to make room for new stores. Unlike a cache, it wants to keep itself empty, not full.
Fences/barriers work by making the current thread wait, not by speeding up store visibility.
A simple implementation of a full barrier (`mfence` or a `lock`ed operation) is to stall the pipeline until the store buffer drains, but high-performance implementations can do better and allow out-of-order execution separately from the memory-order restriction.
(Unfortunately Skylake's `mfence` does fully block out-of-order execution, to fix the obscure SKL079 erratum involving NT loads from WC memory. But `lock add` or `xchg` or whatever only block later loads from reading L1d or the store buffer until the barrier reaches the end of the store buffer. And `mfence` on earlier CPUs presumably also doesn't have that problem.)
In general on non-x86 architectures (which have explicit asm instructions for weaker memory barriers, like only StoreStore fences without caring about loads), the principle is the same: block whichever operations it needs to block until this core has completed earlier operations of whatever type.
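To make "a barrier makes the current thread wait" concrete, here is a minimal C++ sketch of the classic store-buffering test. The names `both_read_zero` and `count_both_zero` are hypothetical helpers, not anything from the question. Each thread stores to its own flag and then loads the other thread's flag. Without a barrier, both loads may return 0, because each store can still be sitting in its core's store buffer when the load executes. With a full barrier in both threads, that outcome is forbidden, and the reason is that the later load is held back, not that the store is pushed out any sooner.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// One store-buffering trial: each thread stores to its own flag, then loads
// the other thread's flag. Returns true if both loads returned 0.
bool both_read_zero(bool use_fence) {
    std::atomic<int> x{0}, y{0};
    std::atomic<int> arrived{0};
    int r1 = -1, r2 = -1;

    auto body = [&arrived, use_fence](std::atomic<int>& mine,
                                      std::atomic<int>& other, int& result) {
        // Crude start line so both threads reach the store at about the same time.
        arrived.fetch_add(1);
        while (arrived.load() < 2) std::this_thread::yield();

        mine.store(1, std::memory_order_relaxed);
        if (use_fence) {
            // Full barrier: the load below may not be satisfied until the
            // store above is globally visible. On x86 compilers typically
            // emit mfence or a dummy locked instruction for this.
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
        result = other.load(std::memory_order_relaxed);
    };

    std::thread t1(body, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(body, std::ref(y), std::ref(x), std::ref(r2));
    t1.join();  // join() synchronizes, so r1 and r2 are safe to read afterwards
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeat the trial and count how often both threads read 0.
int count_both_zero(bool use_fence, int trials) {
    int count = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_read_zero(use_fence)) ++count;
    }
    return count;
}
```

`count_both_zero(true, n)` should always be 0. `count_both_zero(false, n)` is allowed to be nonzero, but spawning threads per trial is a poor way to hit the race window, so seeing 0 there proves nothing.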
Related:
Globally Invisible load instructions talks about what it means for a load to become globally visible.
Does a memory barrier ensure that the cache coherence has been completed?
Does a memory barrier acts both as a marker and as an instruction?
> Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds
No, the worst-case latency is maybe something like store-buffer length (56 entries on Skylake, up from 42 in BDW) times cache-miss latency, because x86's strong memory model (no StoreStore reordering) requires stores to commit in-order. But RFOs for multiple cache lines can be in flight at once, so the max delay is maybe 1/5th of that (conservative estimate: there are 10 Line Fill Buffers). There can also be contention from loads in flight at the same time, but we just want an order of magnitude back-of-the-envelope number.
Let's say RFO latency (DRAM or from another core) is 300 clock cycles (basically made up) on a 3GHz CPU. So a worst-case delay for a store to become globally visible is maybe something like 300 * 56 / 5 = 3360 core clock cycles. So within an order of magnitude, the worst case is about 1 microsecond on the 3GHz CPU we're assuming. (CPU frequency cancels out, so an estimate of RFO latency in nanoseconds would have been more useful.)
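The same back-of-the-envelope arithmetic as a small C++ sketch, in case you want to plug in other guesses. The function names are hypothetical, and every input (56 entries, 300 cycles, 5 overlapping RFOs, 3GHz) is the rough assumption from the estimate above, not a measurement.

```cpp
#include <cassert>

// Rough upper bound, in core clock cycles, on how long a store can sit in the
// store buffer before committing. All inputs are guesses, not measurements.
double worst_case_drain_cycles(int store_buffer_entries,
                               double rfo_latency_cycles,
                               int rfos_in_flight) {
    assert(store_buffer_entries > 0 && rfos_in_flight > 0);
    // In-order commit: every older store may need its own RFO, but several
    // RFOs can overlap, which divides the fully serial cost.
    return store_buffer_entries * rfo_latency_cycles / rfos_in_flight;
}

// Convert core clock cycles to microseconds at a given core frequency.
double cycles_to_microseconds(double cycles, double core_ghz) {
    assert(core_ghz > 0.0);
    return cycles / (core_ghz * 1000.0);  // 1 GHz is 1000 cycles per microsecond
}
```

With the numbers above, `worst_case_drain_cycles(56, 300.0, 5)` gives 3360 cycles, and `cycles_to_microseconds(3360.0, 3.0)` gives about 1.12 microseconds.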
That's when all your stores need to wait a long time for RFOs, because they're all to locations that are uncached or owned by other cores. And none of them are to the same cache line back-to-back so none can merge in the store buffer. So normally you'd expect it to be significantly faster.
I don't think there's any plausible mechanism for it to take even a hundred microseconds, let alone a whole second.
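If you want to sanity-check this on your own machine, a rough C++ sketch of the scenario in the question looks like this. The helper name `measure_visibility_ns` is hypothetical. One thread does a single relaxed store with no barrier, the other polls with relaxed loads, and the function returns the gap between the two timestamps. The result includes the cost of the clock calls and any OS scheduling delay, so treat it as an upper bound on visibility latency, not a precise measurement.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Returns nanoseconds between a timestamp taken just before the writer's
// relaxed store and one taken just after the reader first observes it.
std::int64_t measure_visibility_ns() {
    using clock = std::chrono::steady_clock;
    std::atomic<int> x{0};
    std::atomic<bool> reader_ready{false};
    clock::time_point seen;

    std::thread reader([&] {
        reader_ready.store(true, std::memory_order_release);
        // Thread 2 in the question: read X continuously, no barriers.
        while (x.load(std::memory_order_relaxed) == 0) {
        }
        seen = clock::now();
    });

    // Wait until the reader thread is actually running before writing.
    while (!reader_ready.load(std::memory_order_acquire)) {
    }

    // Thread 1 in the question: write X once, with no fence afterwards.
    const clock::time_point written = clock::now();
    x.store(1, std::memory_order_relaxed);

    reader.join();  // join() synchronizes, so reading `seen` here is safe
    return std::chrono::duration_cast<std::chrono::nanoseconds>(seen - written)
        .count();
}
```

On a multi-core machine with both threads running, I'd expect this to report well under a microsecond plus timer overhead. Anything much larger is far more likely to be the scheduler than the store buffer.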