The ordering of L1 cache controller to process mem

2019-07-15 01:07发布

问题:

Under the total store order(TSO) memory consistency model, a x86 cpu will have a write buffer to buffer write requests and can serve reordered read requests from the write buffer. And it says that the write requests in the write buffer will exit and be issued toward cache hierarchy in FIFO order, which is the same as program order.

I am curious about:

To serve the write requests issued from the write buffer, does L1 cache controller handle the write requests, finish the cache coherence of the write requests and insert data into L1 cache in the same order as the issue order?

回答1:

I think you're confused because your terminology (and probably also mental model) doesn't seem to match reality. You don't "finish the cache coherence" for a store, you get ownership of the cache line before you're allowed to modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.

(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the loads/stores from two other cores in the same order, but never mind that.)


I think you know some of this, but let me start from the basics.

L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).

Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.

Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:

  • It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
  • It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.

    Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.

  • All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
  • The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.

See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.

Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.

Also note that modern Intel designs (before Skylake-AVX512) implement use a large shared inclusive L3 cache as a backstop for cache-coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what.
Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does).

Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.


Back to the actual question:

Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.

Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".

The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.

The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.

The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)

On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)

The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.



回答2:

Yes in a model like x86-TSO stores are likely committed to the L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores if they are all going to the same cache line) to L1 before moving on.1

However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:

And why I am confused about this problem is that cache controller could handle the requests in a non-blocking way. But, to conform to the TSO and make sure data globally visible on a multi-core system, should cache controller follow the store ordering? Because if there are two variable A and B being updated sequentially on core 1 and core 2 get the updated B from core 1, then core 2 must also can see the updated A. And to achieve this, I think the private cache hierarchy on core 1 have to finishes the cache coherence of the variable A and B in order and make them globally visible. Am I right?

The good news is that even though the store buffer might commit in a ordered way only the oldest store to L1, it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line in the E state in the local core even before the store first in line to commit to L1.

This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is L1 store misses that really matter anyways: stores hits in L1 can commit rapidly, at least 1 per cycle, so committing a bunch of hits doens't help much: but getting MLP on store misses is very important, especially for scattered stores the prefetcher can't deal with.

Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.

Still, nothing is perfect. There is some evidence that ordering concerns causes weird performance hiccups when stores are missing L1, even if they hit in L2.


1 It is certainly possible that it can commit stores out of order if in maintains the illusion of in-order commit, e.g., by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.