What specifically marks an x86 cache line as dirty

Posted 2019-01-15 13:56

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures, this would then trigger a flush of the cache line; under write-back, the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). I also know how this interacts with other cores accessing the same cache line of data (cache snooping etc.).

My question is, if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, and thereby possibly save the line from being marked as exclusive, and the writeback memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operations compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (eg the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (eg repeated zero-ing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but to force a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation "it must be hard because it isn't done" and saying how doing this in the main CPU core would be expensive (but I still wonder why it can't be a part of the actual cache logic itself).

2 Answers
对你真心纯属浪费
#2 · 2019-01-15 14:24

It's possible to implement in hardware, but I don't think anybody does. Doing it for every store would either cost cache-read bandwidth or require an extra read port and make pipelining harder.

You'd build a cache that did a read/compare/write cycle instead of just write, and could conditionally leave the line in Exclusive state instead of Modified (of MESI). Doing it this way (instead of checking while it was still Shared) would still invalidate other copies of the line, but that means there's no interaction with memory-ordering. The (silent) store becomes globally visible while the core has Exclusive ownership of the cache line, same as if it had flipped to Modified and then back to Exclusive by doing a write-back to DRAM.

The read/compare/write has to be done atomically (you can't lose the cache line between the read and the write; if that happened the compare result would be stale). This makes it harder to pipeline data committing to L1D from the store queue.
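To make the read/compare/write idea concrete, here is a minimal toy model, not a description of any real hardware. It assumes a single cache line holding one 64-bit word, and the names `ToySystem`, `acquire_exclusive`, `store` and `evict` are made up for illustration. The store always gains exclusive ownership first (so other copies are invalidated either way), then optionally compares before deciding whether the line becomes Modified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every core caches the same single line, one word wide.
enum class Mesi { Invalid, Shared, Exclusive, Modified };

struct Line {
    Mesi state = Mesi::Invalid;
    std::uint64_t data = 0;
};

struct ToySystem {
    std::vector<Line> cores;   // each core's private copy of the line
    std::uint64_t memory = 0;  // backing DRAM value
    int writebacks = 0;        // number of dirty write-backs to DRAM

    explicit ToySystem(std::size_t n) : cores(n) {}

    // Invalidate every other copy (writing back a dirty one), then make
    // sure core c holds valid data with at least Exclusive ownership.
    void acquire_exclusive(std::size_t c) {
        for (std::size_t i = 0; i < cores.size(); ++i) {
            if (i == c) continue;
            if (cores[i].state == Mesi::Modified) {
                memory = cores[i].data;
                ++writebacks;
            }
            cores[i].state = Mesi::Invalid;
        }
        if (cores[c].state == Mesi::Invalid) cores[c].data = memory;  // read for ownership
        if (cores[c].state != Mesi::Modified) cores[c].state = Mesi::Exclusive;
    }

    // With detect_silent set, this models the hypothetical
    // read/compare/write cycle: an unchanged value leaves the state alone.
    void store(std::size_t c, std::uint64_t value, bool detect_silent) {
        assert(c < cores.size());
        acquire_exclusive(c);
        Line& line = cores[c];
        if (detect_silent && line.data == value) return;  // silent store: E stays E
        line.data = value;
        line.state = Mesi::Modified;
    }

    // Evicting a Modified line costs a write-back; a clean line is just dropped.
    void evict(std::size_t c) {
        if (cores[c].state == Mesi::Modified) {
            memory = cores[c].data;
            ++writebacks;
        }
        cores[c].state = Mesi::Invalid;
    }
};
```

In this sketch, storing 0 over a line that already holds 0 and then evicting it costs one write-back without the compare and none with it, while the other cores' copies are invalidated in both cases.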


In a multi-threaded program, it can be worth doing this as an optimization in software for shared variables only.

Avoiding invalidating everyone else's cache can make it worth converting

shared = x;

into

if(shared != x)
    shared = x;

I'm not sure if there are memory-ordering implications here. Obviously if the shared = x never happens, there's no release-sequence, so you only have acquire semantics instead of release. But if the value you're storing is often what's already there, any use of it for ordering other things will have ABA problems.
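A hedged sketch of the same check-before-store idea using `std::atomic`; the helper name `store_if_changed` and the choice of memory orders are my assumptions, not something from the talk or from any library. Note that the load and the store are two separate operations, so this is not an atomic compare-and-store, and when the store is skipped no release operation happens at all, which is exactly the ordering caveat above.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: store x only if it differs from the current value.
// Returns true if a store was actually issued.
template <typename T>
bool store_if_changed(std::atomic<T>& shared, T x) {
    // A load only needs the line in Shared state, so when the value
    // already matches, other cores keep their copies of the line.
    if (shared.load(std::memory_order_relaxed) == x)
        return false;
    // Only now do we need exclusive ownership of the line.
    shared.store(x, std::memory_order_release);
    return true;
}

// Usage sketch: a flag that is usually already set.
inline void example() {
    std::atomic<int> ready{1};
    bool wrote = store_if_changed(ready, 1);  // value unchanged, so no store is issued
    assert(!wrote);
}
```

Whether this pays off depends on how often the value is already what you are about to write; if it rarely matches, the extra load and branch are pure overhead.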

IIRC, Herb Sutter mentions this potential optimization in part 1 or 2 of his atomic<> Weapons: The C++ Memory Model and Modern Hardware talk. (A couple of hours of video.)

This is of course too expensive to do in software for anything other than shared variables where the cost of writing them is many cycles of delay in other threads (cache misses and memory-order mis-speculation machine clears: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)


Related: See this answer for more about x86 memory bandwidth in general, especially the NT vs. non-NT store stuff, and "latency bound platforms" for why single-threaded memory bandwidth on many-core Xeons is lower than on a quad-core, even though aggregate bandwidth from multiple cores is higher.

甜甜的少女心
#3 · 2019-01-15 14:38

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this, and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)

Similarly, such could interfere with some clever uses of transactional memory, for example if the memory location is used as a guard to avoid explicitly loading other memory locations or, in an architecture that supports such (as AMD's Advanced Synchronization Facility did), to drop the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)

There are also implementation issues in terms of performance impact/design complexity. Such would prohibit avoiding read-for-ownership (unless the silent store elimination was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.

Special handling for silent stores would also complicate implementation of a memory consistency model (probably especially x86's relatively strong model). Such might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.
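The gap between the two granularities can be measured for a given workload with something like the following sketch. It assumes 64-byte lines, 8-byte stores, and a buffer whose first element sits at the start of a cache line; `SilentCounts` and `count_silent` are illustrative names, and `before`/`after` stand for the old contents and the values about to be stored over them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SilentCounts {
    std::size_t silent_stores = 0;  // stores that write the value already there
    std::size_t total_stores = 0;
    std::size_t silent_lines = 0;   // lines in which every store was silent
    std::size_t total_lines = 0;
};

constexpr std::size_t kLineBytes = 64;  // cache line size on current x86-64
constexpr std::size_t kWordsPerLine = kLineBytes / sizeof(std::uint64_t);

// Count silence per store and per cache line.
// Assumes index 0 of both buffers is line-aligned.
SilentCounts count_silent(const std::vector<std::uint64_t>& before,
                          const std::vector<std::uint64_t>& after) {
    assert(before.size() == after.size());
    SilentCounts c;
    for (std::size_t base = 0; base < before.size(); base += kWordsPerLine) {
        const std::size_t end = std::min(base + kWordsPerLine, before.size());
        bool line_silent = true;
        for (std::size_t i = base; i < end; ++i) {
            ++c.total_stores;
            if (before[i] == after[i]) {
                ++c.silent_stores;
            } else {
                line_silent = false;  // one changed word dirties the whole line
            }
        }
        ++c.total_lines;
        if (line_silent) ++c.silent_lines;
    }
    return c;
}
```

For example, re-zeroing 16 words where a single word was non-zero gives 15 of 16 silent stores but only 1 of 2 silent lines, since one changed word is enough to make its whole line dirty.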

The additional cache bandwidth would also be an issue. Currently Intel uses only parity (rather than ECC) on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to have a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and be performed opportunistically, exploiting cycles without full cache access utilization, but that would still have a power cost.) This also means that this cost would largely disappear if read-modify-write support were already present for L1 ECC (a feature that would please some users).

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication and inter-core communication will increase as more cores are utilized to work on a single task, the value of such seems likely to increase.
