In Agner Fog's manual Optimizing software in C++ in section 9.10 "Cahce contentions in large data structures" he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost for for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is is even larger and only fits in L2 the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essential the same thing observed at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Later he writes:
The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.
So my questions are related to prefetching data.
From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?
From what I understand trying to write code to prefetch the data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples? and it's only a O(10%) improvement (on some machines). Agner later explains this:
The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.
So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
Edit: Based on a comment by Leeor let me add to my questions and make it more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?
Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513? I ran this with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge) which has L1 32KB 8-way, L2 256 KB 8-way, and L3 10MB 20-way. The max matrix size for L1 is about 90x90, 256x256 for L3, and 1619 for L3.
Matrix Size Average Time
64x64 0.004251 0.004472 0.004412 (three times)
65x65 0.004422 0.004442 0.004632 (three times)
128x128 0.0409
129x129 0.0169
256x256 0.219 //max L2 matrix size
257x257 0.0692
512x512 2.701
513x513 0.649
1024x1024 12.8
1025x1025 10.1
I'm not seeing any performance loss in L1 however L2 clearly has the critical stride problem and maybe L3. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background (overhead) which is dominating the L1 times.
This statement :
is incorrect
In fact, the L2 prefetchers are often stronger and more aggressive than L1 prefetchers. It depends on the actual machine you use, but Intels' L2 prefetcher for e.g. can trigger 2 prefetches for each request, while the L1 is usually limited (there are several types of prefetches that can coexist in the L1, but they're likely to be competing on a more limited BW than the L2 has at its disposal, so there will probably be less prefetches coming out of the L1.
The optimization guide, in Section 2.3.5.4 (Data Prefetching) counts the following prefetcher types:
And a bit further ahead:
Of the above, only the IP-based can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cachelines, meaning up to 64byte stride (or actually up to 128 bytes if you don't mind some extra lines). To use that, make sure that loads/stores at a given address would perform strided accesses - that's usually the case already in loops going over arrays. Compiler loop-unrolling may split that into multiple different stride streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, that depends on the exact implementation.
However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 since it runs ahead faster.