Scenarios when software prefetching manual instructions are reasonable

Posted 2020-07-22 19:27

Question:

I have read that for x86 and x86-64 Intel processors gcc provides special prefetching instructions:

#include <xmmintrin.h>

enum _mm_hint
{
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
};

void _mm_prefetch(void *p, enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program. The different hints to be used with the _mm_prefetch intrinsic are implementation defined, but generally speaking each hint has its own meaning.

_MM_HINT_T0 fetches data to all levels of the cache for inclusive caches, and to the lowest level cache for exclusive caches.

_MM_HINT_T1 pulls the data into L2 and not into L1d. If there is an L3 cache, the _MM_HINT_T2 hint can do something similar for it.

_MM_HINT_NTA allows telling the processor to treat the prefetched cache line specially.

So can someone describe examples of when this instruction is used?

And how do I properly choose the hint?

Answer 1:

The idea of prefetching is based upon these facts:

  • Accessing memory is very expensive the first time.
    The first time a memory address1 is accessed, it must be fetched from memory; it is then stored in the cache hierarchy2.
  • Accessing memory is inherently asynchronous.
    The CPU doesn't need any resources from the core to perform the lengthiest part of a load/store3, so it can easily be done in parallel with other tasks4.

Thanks to the above, it makes sense to start a load before the data is actually needed, so that when the code actually needs it, it won't have to wait.
It is worth noting that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily far; so sometimes it needs the help of the programmer to perform optimally.
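As a minimal, hypothetical sketch of that idea (not from the original answer): a hardware prefetcher cannot follow pointers, so when walking a linked list the programmer can issue the prefetch for the next node while the current one is still being processed.

#include <stddef.h>
#include <xmmintrin.h>

struct node { struct node *next; long payload; };   /* hypothetical list node */

long walk_list(struct node *head)
{
    long sum = 0;
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            /* start loading the next node now; the work below overlaps the miss */
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);
        sum += n->payload;
    }
    return sum;
}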

The cache hierarchy is, by its very nature, an aspect of the micro-architecture, not of the architecture (read: ISA), so Intel and AMD cannot give strong guarantees about what these instructions do.
Furthermore, using them correctly is not easy, as the programmer must have a clear idea of how many cycles each instruction can take. Finally, the latest CPUs are getting better and better at hiding and lowering memory latency.
So, in general, prefetching is a job for the skilled assembly programmer.

That said, the only obvious scenario is one where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always updates some state and must run as fast as possible, it is worth prefetching the state variable when setting up the hardware that uses that interrupt.

Regarding the different levels of prefetching, my understanding is that the different levels (L1 - L4) correspond to different amounts of sharing and pollution.

For example, prefetcht0 is good if the thread/core that executes the instruction is the same one that will read the variable.
However, this will take a line in all the caches, possibly evicting other, useful, lines. You can use it, for example, when you know for sure that you will need the data shortly.

prefetcht1 is good for making the data quickly available to a core or core group (depending on how L2 is shared) without polluting L1.
You can use it if you know that you may need the data, or that you'll need it after finishing another task (which takes priority in using the cache).
This is not as fast as having the data in L1, but much better than having it in memory.
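A hedged sketch of that use of prefetcht1 (names and sizes are made up): the current buffer is processed with an L1-heavy loop while the next buffer is staged in L2, where it waits without displacing the current working set from L1.

#include <stddef.h>
#include <xmmintrin.h>

unsigned long process_buffers(const unsigned char **bufs, size_t count, size_t len)
{
    unsigned long h = 0;
    for (size_t b = 0; b < count; ++b) {
        if (b + 1 < count)                                 /* stage next buffer in L2 */
            for (size_t off = 0; off < len; off += 64)     /* one prefetch per 64-byte line */
                _mm_prefetch((const char *)bufs[b + 1] + off, _MM_HINT_T1);
        for (size_t i = 0; i < len; ++i)                   /* L1-heavy work on the current buffer */
            h = h * 31 + bufs[b][i];
    }
    return h;
}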

prefetcht2 can be used to remove most of the memory access latency, since it moves the data into the L3 cache.
It doesn't pollute L1 or L2, and since L3 is shared among cores, it's good for data used by rare (but possible) code paths or for preparing data for other cores.
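A possible illustration (all names hypothetical): a rarely taken slow path uses a large table; when a check suggests the slow path may soon run, the relevant entry is staged in L3 with _MM_HINT_T2 so that the rare path gets cheaper without disturbing L1/L2 of the hot path.

#include <xmmintrin.h>

static long slow_path_table[4096];    /* hypothetical table used only on the rare path */

long lookup(long key, int fast_ok)
{
    if (!fast_ok)
        /* stage the entry in L3 only; L1/L2 contents of the hot path stay untouched */
        _mm_prefetch((const char *)&slow_path_table[key & 4095], _MM_HINT_T2);
    /* ... fast-path work here, overlapping the prefetch ... */
    return fast_ok ? key : slow_path_table[key & 4095];
}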

prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache level for data that is accessed only once.
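A minimal sketch of the non-temporal case: streaming once over a large buffer that will not be reused, prefetching with _MM_HINT_NTA so the streamed lines displace as little of the existing cache contents as possible. The prefetch distance is a made-up tuning parameter.

#include <stddef.h>
#include <xmmintrin.h>

unsigned long checksum_once(const unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((i & 63) == 0 && i + 256 < n)    /* once per 64-byte line, a few lines ahead */
            _mm_prefetch((const char *)&buf[i + 256], _MM_HINT_NTA);
        sum += buf[i];
    }
    return sum;
}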

prefetchw/prefetchwt1 are like the others but make the line Exclusive and invalidate other cores' copies of the line.
Basically, this makes a subsequent write faster, as the line is already in the optimal state of the MESI protocol (for cache coherence).
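_mm_prefetch has no portable write hint, but GCC's __builtin_prefetch takes a read/write argument; on targets that support it (e.g. built with -mprfchw) the compiler can lower a write-intent prefetch to PREFETCHW. A hedged sketch:

/* Request the line with write intent before a store that would otherwise miss;
   on CPUs/compiler flags without PREFETCHW support GCC falls back to an
   ordinary prefetch. */
void bump_counter(long *counter)
{
    __builtin_prefetch(counter, /* rw = */ 1, /* locality = */ 3);
    /* ... unrelated work that hides the miss latency ... */
    ++*counter;
}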

Finally, a prefetch can be done incrementally, first by moving the data into L3 and then into L1 (just for the threads that need it).
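A sketch of that incremental scheme, assuming a simple array walk (the FAR/NEAR distances are invented tuning knobs): far ahead of use the data is staged in the shared L3 with T2, and shortly before use it is promoted to L1 with T0.

#include <stddef.h>
#include <xmmintrin.h>

#define FAR  512   /* elements: stage in L3 this far ahead */
#define NEAR  16   /* elements: promote to L1 this far ahead */

long sum_incremental(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + FAR < n)
            _mm_prefetch((const char *)&data[i + FAR],  _MM_HINT_T2);
        if (i + NEAR < n)
            _mm_prefetch((const char *)&data[i + NEAR], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}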

In short, each instruction lets you decide the compromise between pollution, sharing and speed of access.
Since they all require you to keep track of the use of the cache very carefully (you need to know that it's not worth creating an entry in L1 but it is in L2), their use is limited to very specific environments.
In a modern OS, it's not really possible to keep track of the cache: you can do a prefetch only to find that your quantum has expired and your program has been replaced by another one that evicts the just-loaded line.


As for a concrete example, I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
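A rough user-space sketch of that idea (all names hypothetical; a real interrupt handler would live in kernel or firmware code): just before arming the device or waiting for the event, prefetch the state the handler touches first, so its first access is a cache hit and its timing stays consistent.

#include <xmmintrin.h>

struct event_state { unsigned long count; unsigned long last_timestamp; };
static struct event_state handler_state;

void on_event(void)                       /* stands in for the interrupt handler */
{
    handler_state.count++;                /* first access should now hit the cache */
}

void arm_event(void)
{
    _mm_prefetch((const char *)&handler_state, _MM_HINT_T0);
    /* ... program the device / unmask the interrupt here ... */
}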

Another, unorthodox, use of prefetching is to move the data into the cache.
This is useful if you want to test the cache system, or to unmap a device from memory while relying on the cache to keep the data around a bit longer.
In this case moving the data to L3 is enough; however, not every CPU has an L3, so we may need to move it to L2 instead.
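A toy illustration of using a prefetch just to pull a line into the cache, then observing the effect with the TSC. This is only a sketch: prefetches are asynchronous, so a serious measurement would need proper serialization, warm-up and many repetitions.

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, _mm_prefetch, _mm_lfence */

static long buffer[1024];

int main(void)
{
    _mm_prefetch((const char *)&buffer[0], _MM_HINT_T0);
    _mm_lfence();                        /* crude ordering around the timed load */
    unsigned long long t0 = __rdtsc();
    volatile long x = buffer[0];         /* demand load, hopefully now a hit */
    unsigned long long t1 = __rdtsc();
    (void)x;
    printf("access took ~%llu cycles\n", t1 - t0);
    return 0;
}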

I understand these examples are not very good, though.


1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Briefly put: at present it goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads; L2 is usually like L1, but some models may have L2 shared across pairs of cores.
3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
4 However, any resource used to access the memory can become a critical issue, as pointed out by @Leeor and proved by the Linux kernel developers.