CPU cache inhibition

Published 2020-03-30 02:51

聊天终结者

Question:

Say I have the de facto standard x86 CPU with three levels of cache: L1/L2 private, and L3 shared among cores. Is there a way to allocate shared memory whose data will not be cached in the private L1/L2 caches, but only in L3? I don't want to fetch data from memory (that's too costly), but I'd like to experiment with performance with and without bringing the shared data into private caches.

The assumption is that L3 is shared among the cores (presumably a physically indexed cache) and thus will not incur any false sharing or cache line invalidation for heavily used shared data.

Any solution (if it exists) would have to be done programmatically, using C and/or assembly, for Intel-based CPUs (relatively modern Xeon architectures such as Skylake and Broadwell) running a Linux-based OS.

Edit:

I have latency-sensitive code which uses a form of shared memory for synchronization. The data will be in L3, but when it is read or written it will go into L1/L2, depending on the cache inclusivity policy. By implication of the problem, the data will then have to be invalidated, adding an unnecessary (I think) performance hit. I'd like to see if it's possible to store the data only in L3, either through some page policy or through special instructions.

I know it's possible to use the special memory register to inhibit caching for security reasons, but that requires CPL0 privilege.

Edit2:

I'm dealing with parallel codes that run on high-performance systems for months at a time. The systems have high core counts (e.g. 40-160+ cores) and periodically perform synchronization which needs to execute in microseconds.

Answer 1:

x86 has no way to do a store that bypasses or writes through L1D/L2 but not L3. There are NT stores, which bypass all levels of cache. Anything that forces a write-back to L3 (e.g. a clwb instruction) also forces write-back all the way to memory. Those instructions are designed for non-volatile RAM use cases, or for non-coherent DMA, where it's important to get data committed to actual RAM.
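To make the NT-store option concrete, here is a minimal sketch of what such a store looks like from C. The function name `store_nt` is made up for illustration, and the plain-store fallback exists only so the snippet still compiles when SSE2 is unavailable. Note that this sends the data past every cache level toward memory, so it is not the "L3 only" store the question asks for.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Hypothetical helper: write v to *dst with a non-temporal (NT) store.
 * On x86 with SSE2 this is a movnti followed by sfence; the data bypasses
 * the cache hierarchy instead of stopping at L3. */
static inline void store_nt(int *dst, int v)
{
    assert(dst != NULL);
#ifdef __SSE2__
    _mm_stream_si32(dst, v);  /* NT store: not allocated in L1D/L2/L3 */
    _mm_sfence();             /* order the NT store before later stores */
#else
    *dst = v;                 /* fallback: ordinary cached store */
#endif
}
```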

There's also no way to do a load that bypasses L1D (except from USWC memory with SSE4.1 movntdqa, but it's not "special" on other memory types). prefetchNTA can bypass L2, according to Intel's optimization manual.

Prefetch on the core doing the read should be useful to trigger write-back from the other core into L3, and transfer into your own L1D. But that's only useful if you have the address ready well before you want to do the load (it needs dozens of cycles of lead time to help).
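A rough sketch of that prefetch idea, using the GCC/Clang builtin `__builtin_prefetch`. The `sync_slot` type and the function names are hypothetical, and whether the prefetch actually hides the coherence miss depends on how much independent work sits between the two calls, which you would have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared synchronization slot, padded to its own cache line. */
struct sync_slot {
    volatile uint64_t value;
} __attribute__((aligned(64)));

/* Ask the hardware to start pulling the line toward this core's L1D.
 * Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
static inline void slot_prefetch(const struct sync_slot *s)
{
    assert(s != NULL);
    __builtin_prefetch((const void *)s, 0, 3);
}

/* The actual load; ideally issued dozens of cycles after slot_prefetch(). */
static inline uint64_t slot_read(const struct sync_slot *s)
{
    return s->value;
}
```

The intended pattern is `slot_prefetch(&slot);`, then some unrelated work, then `slot_read(&slot)`.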

Intel CPUs use a shared inclusive L3 cache as a backstop for on-chip cache coherency. 2-socket has to snoop the other socket, but Xeons that support more than 2P have snoop filters to track cache lines that move around.

When you read a line that was recently written by another core, it's always Invalid in your L1D. L3 is tag-inclusive, and its tags have extra info to track which core has the line. (This is true even if the line is in M state in an L1D somewhere, which requires it to be Invalid in L3, according to normal MESI.) Thus, after your cache miss checks the L3 tags, it triggers a request to the L1 that has the line to write it back to L3 cache (and maybe to send it directly to the core that wants it).

Skylake-X (Skylake-AVX512) doesn't have an inclusive L3 (it has a bigger private L2 and a smaller L3), but it still has a tag-inclusive structure to track which core has a line. It also uses a mesh instead of a ring, and L3 latency seems to be significantly worse than on Broadwell.

Possibly useful: map the latency-critical part of your shared memory region with a write-through cache policy. IDK if this patch ever made it into the mainline Linux kernel, but see this patch from HP: Support Write-Through mapping on x86. (The normal policy is WB.)

Also related: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, an in-depth look at latency and bandwidth on 2-socket SnB, for cache lines in different starting states.

For more about memory bandwidth on Intel CPUs, see Enhanced REP MOVSB for memcpy, especially the Latency Bound Platforms section. (Having only 10 LFBs limits single-core bandwidth).

Related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? has some experimental results for having one thread spam writes to a location while another thread reads it.

Note that the cache miss itself isn't the only effect. You also get a lot of machine_clears.memory_ordering from mis-speculation in the core doing the load. (x86's memory model is strongly ordered, but real CPUs speculatively load early and abort in the rare case where the cache line becomes invalid before the load was supposed to have "happened".)

Answer 2:

You won't find good ways to disable use of L1 or L2 for Intel CPUs: indeed, outside of a few specific scenarios such as UC memory areas covered in Peter's answer (which will kill your performance since they don't use L3 either), the L1 in particular is fundamentally involved in reads and writes.

What you can do, however, is to use the fairly well-defined cache behavior of L1 and L2 to force evictions of data you only want to live in L3. On recent Intel architectures, both the L1 and L2 behave as pseudo-LRU "standard associative" caches. By "standard associative" I mean the cache structure you'd read about on Wikipedia or in your hardware 101 course, where a cache is divided into 2^N sets which have M entries each (for an M-way associative cache) and N consecutive bits from the address are used to look up the set.

This means you can predict exactly which cache lines will end up in the same set. For example, Skylake has an 8-way 32K L1D and a 4-way 256K L2. This means cache lines 64K apart will fall into the same set on the L1 and L2. Normally having heavily used values fall into the same cache set is a problem (cache set contention may make your cache appear much smaller than it actually is) - but here you can use it to your advantage!
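The arithmetic behind the 64K figure can be sketched as follows, assuming 64-byte lines and the simple "consecutive address bits" set indexing described above. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* cache line size in bytes (assumed) */

/* Number of sets = capacity / (ways * line size). */
static inline uint32_t cache_sets(uint32_t capacity, uint32_t ways)
{
    assert(ways != 0 && capacity % (ways * LINE_SIZE) == 0);
    return capacity / (ways * LINE_SIZE);
}

/* Set index of a physical address in a "standard associative" cache:
 * drop the line-offset bits, then take the line number modulo the set count. */
static inline uint32_t set_index(uint64_t paddr, uint32_t capacity, uint32_t ways)
{
    return (uint32_t)((paddr / LINE_SIZE) % cache_sets(capacity, ways));
}
```

With the Skylake numbers this gives 32K / (8 * 64) = 64 sets for L1D and 256K / (4 * 64) = 1024 sets for L2. Two addresses map to the same set when they differ by a multiple of 64 * 64 = 4K (L1D) or 1024 * 64 = 64K (L2), so a 64K stride aliases in both caches.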

When you want to evict a line from the L1 and L2, just read or write 8 or more values to other lines spaced 64K away from your target line. Depending on the structure of your benchmark (or underlying application) you may not even need the dummy writes: in your inner loop you could simply use, say, 16 values all spaced out by 64K and not return to the first value until you've visited the other 15. In this way each line would "naturally" be evicted before you use it.

Note that the dummy writes don't have to be the same on each core: each core can write to "private" dummy lines so you don't add contention for the dummy writes.

Some complications:

The addresses we discuss here (when we say things like "64K away from the target address") are physical addresses. If you're using 4K pages, you can evict from the L1 by writing at offsets of 4K, but to make it work for L2 you need 64K physical offsets - but you can't get that reliably since every time you cross a 4K page boundary you are writing to some arbitrary physical page. You can solve this by ensuring you are using 2MB huge pages for the involved cache lines.
I said "8 or more" cache lines need to be read/written. That's because the caches are likely to use some kind of pseudo-LRU rather than exact LRU. You'll have to test: you might find that the pseudo-LRU works just like exact LRU for the pattern you are using, or you might find that you need more than 8 writes to evict reliably.
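The dummy-write eviction described above might look roughly like this. The names, the 2MB region size and the helper functions are assumptions for illustration. `aligned_alloc` only guarantees 2MB alignment in virtual memory, so the region still has to be backed by a 2MB huge page (e.g. via `MAP_HUGETLB` or transparent huge pages, not shown here) for the 64K stride to hold in physical addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EVICT_STRIDE (64u * 1024u)         /* aliases in both L1D and L2 (see above) */
#define REGION_SIZE  (2u * 1024u * 1024u)  /* intended to sit in one 2MB huge page */

/* Allocate a zeroed, 2MB-aligned region. Huge-page backing is NOT guaranteed. */
static uint8_t *alloc_region(void)
{
    uint8_t *p = aligned_alloc(REGION_SIZE, REGION_SIZE);
    if (p != NULL)
        memset(p, 0, REGION_SIZE);
    return p;
}

/* Write to n_dummy lines that share the target's L1D/L2 set, so that the
 * target line gets pushed out of the private caches. The target itself is
 * never touched. Whether 8 writes suffice depends on the pseudo-LRU policy. */
static void evict_from_private(volatile uint8_t *region, size_t target_off,
                               unsigned n_dummy)
{
    assert(target_off < REGION_SIZE);
    assert(n_dummy < REGION_SIZE / EVICT_STRIDE);  /* never wrap onto the target */

    size_t off = target_off;
    for (unsigned i = 0; i < n_dummy; i++) {
        off = (off + EVICT_STRIDE) % REGION_SIZE;
        region[off]++;                             /* dummy write */
    }
}
```

Each core could call this on its own private region so the dummy writes do not add contention, and the perf counters would tell you whether the target really ends up hitting in L3.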

Some other notes:

You can use performance counters exposed by perf to determine how often you are actually hitting in L1 vs L2 vs L3 to ensure your trick is working.
The L3 is usually not a "standard associative cache": rather, the set is looked up by hashing more bits of the address than a typical cache uses. The hashing means that you won't end up using only a few lines in L3: your target and dummy lines should be spread nicely around L3. If you find you are using an unhashed L3, it should still work (because the L3 is larger, you'll still be spreading out among cache sets) - but you'll have to be more careful about possible evictions from L3 as well.

Answer 3:

Intel has recently announced a new instruction that seems to be relevant to this question. The instruction is called CLDEMOTE. It moves data from higher level caches to a lower level cache. (Probably from L1 or L2 to L3, although the spec isn't precise on the details.) "This may accelerate subsequent accesses to the line by other cores ...."

https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
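As a hedged sketch of how CLDEMOTE could be used from C: GCC and Clang expose it as the `_mm_cldemote` intrinsic when compiling with `-mcldemote`. The function name `publish_and_demote` is invented for this example, and the snippet degrades to a plain store when the feature macro is not defined. The instruction is only a hint, so the line is not guaranteed to leave L1/L2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __CLDEMOTE__
#include <immintrin.h>   /* _mm_cldemote, available with -mcldemote */
#endif

/* Hypothetical helper: store a value, then hint that its cache line should
 * be moved to a cache level farther from this core (probably L3). */
static inline void publish_and_demote(volatile uint64_t *slot, uint64_t v)
{
    assert(slot != NULL);
    *slot = v;
#ifdef __CLDEMOTE__
    _mm_cldemote((void *)slot);  /* hint only; hardware may ignore it */
#endif
}
```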

Answer 4:

I believe you should not (and probably cannot) care, and hope that the shared memory is in L3. BTW, user-space C code runs in virtual address space and your other cores might (and often do) run some other unrelated process.

The hardware and the MMU (which is configured by the kernel) will ensure that L3 is properly shared.

"... but I'd like to experiment with performance with and without bringing the shared data into private caches."

As far as I understand (quite poorly) recent Intel hardware, this is not possible (at least not in user-land).

Maybe you might consider the PREFETCH machine instruction and the __builtin_prefetch GCC builtin (which does the opposite of what you want, it brings data to closer caches). See this and that.

BTW, the kernel does preemptive scheduling, so context switches can happen at any moment (often several hundred times each second). When (at context switch time) another process is scheduled on the same core, the MMU needs to be reconfigured (because each process has its own virtual address space, and the caches are "cold" again).

You might be interested in processor affinity. See sched_setaffinity(2). Read about Real-Time Linux. See sched(7). And see numa(7).

I am not at all sure that the performance hit you are afraid of is noticeable (and I believe it is not avoidable in user space).

Perhaps you might consider moving your sensitive code into kernel space (so with CPL0 privilege), but that probably requires months of work and is probably not worth the effort. I won't even try.

Have you considered other, completely different approaches to your latency-sensitive code (e.g. rewriting it in OpenCL for your GPGPU)?