I ran a test with the following loop:
for (i32 i = 0; i < 0x800000; ++i)
{
    // Hopefully this can disable hardware prefetch
    i32 k = (i * 997 & 0x7FFFFF) * 0x40;
    _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);
    for (i32 j = 0; j < 0x40; j += 0x10)
    {
        //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
        __m128 v = _mm_load_ps((float *)(data + k + j));
        a_single_chain_computation
        //_mm_stream_ps((float *)(data2 + k + j), v);
        _mm_store_ps((float *)(data2 + k + j), v);
    }
}
Results are weird.
- No matter how much time the a_single_chain_computation takes, the load latency is not hidden.
- And what's more, the additional total time taken grows as I add more computation. (With a single v = _mm_mul_ps(v, v), prefetching saves about 0.60 - 0.57 = 0.03s. And with 16 repetitions of v = _mm_mul_ps(v, v), it saves about 1.1 - 0.75 = 0.35s. WHY?)
- Non-temporal loads/stores degrade performance with or without prefetching. (I can understand the load part, but why stores, too?)
If your computation chain is very short and if you're reading memory sequentially then the CPU will prefetch well on its own and actually work faster since its decoder has less work to do.
Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncacheable write-combining (WC) memory, which is what you usually find when dealing with graphics surfaces. Explicit prefetching may work well on one architecture (CPU model) and have a negative effect on other models, so use it as a last resort when optimizing.
You need to separate two different things here (which unfortunately have similar names):
Non-temporal prefetching - This would prefetch the line, but write it as the least recently used one when it fills the caches, so it would be the first in line for eviction the next time you use the same set. That leaves you enough time to actually use it (unless you're very unlucky), but wouldn't waste more than a single way of that set, since the next prefetch to come along would just replace it. By the way, regarding your comments above: every prefetch would pollute the L3 cache. It's inclusive, so you can't get away without it.
Non-temporal (streaming) loads/stores - These also won't pollute the caches, but they use a completely different mechanism: the accesses are made uncacheable (as well as write combining). This does carry a performance penalty even if you really never need these lines again, since a cacheable write has the luxury of staying buffered in the cache until evicted, so you don't have to write it out right away. With uncacheable writes you do, and in some scenarios that might eat into your memory bandwidth. On the other hand, you get the benefit of write combining and weak ordering, which may give you an edge in several cases. The bottom line is that you should use it only when it helps; don't assume it magically improves performance (nothing does that nowadays).
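To make the distinction concrete, here is a minimal sketch of the two store flavors side by side. The helper names copy_cached and copy_streaming are made up for illustration; both assume 16-byte-aligned buffers and a length that is a multiple of 4 floats. Which one is faster depends on the CPU and the workload, so treat it as something to time, not as a recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_store_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: ordinary cacheable stores. The written lines can
   stay in the cache hierarchy until they are evicted. */
void copy_cached(float *dst, const float *src, size_t n_floats)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
    assert(n_floats % 4 == 0);
    for (size_t i = 0; i < n_floats; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}

/* Hypothetical helper: non-temporal (streaming) stores. The data goes out
   through write-combining buffers instead of being kept in the caches. */
void copy_streaming(float *dst, const float *src, size_t n_floats)
{
    assert(((uintptr_t)dst & 15u) == 0 && ((uintptr_t)src & 15u) == 0);
    assert(n_floats % 4 == 0);
    for (size_t i = 0; i < n_floats; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    /* Streaming stores are weakly ordered: fence before anything else
       relies on dst being globally visible. */
    _mm_sfence();
}
```

Both functions produce the same bytes in dst; only the path the data takes through the memory hierarchy differs, which is why the comparison has to be made by timing on the target machine.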
Regarding your questions:
1. Your prefetching should work, but it's not early enough to make an impact. Try replacing i+1 with a larger number. Actually, maybe even do a sweep; it would be interesting to see how many elements in advance you should peek.
2. I'd guess this is the same as 1: with 16 muls your iteration is long enough for the prefetch to work.
3. As I said, your stores won't have the benefit of buffering in the lower-level caches, and would have to get flushed to memory. That's the downside of streaming stores. It's implementation-specific of course, so it might improve, but at the moment it's not always effective.
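The sweep suggested in point 1 could be set up roughly like this. The function name walk_with_prefetch and its parameters are made up for illustration; it walks the buffer in the same scrambled order as the question, but takes the prefetch distance as an argument so it can be timed for several values. It assumes a 16-byte-aligned buffer and a power-of-two line count, and it uses unsigned arithmetic so that the wrap-around in i * 997 is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _mm_load_ps, SSE arithmetic */

#define LINE_BYTES 0x40u

/* Hypothetical helper: visits `lines` cache lines in scrambled order and
   prefetches the line needed `dist` iterations later. Returns the sum of
   squares so the loads cannot be optimized away. */
float walk_with_prefetch(const char *data, uint32_t lines, uint32_t dist)
{
    const uint32_t mask = lines - 1;
    __m128 acc = _mm_setzero_ps();
    float out[4];

    assert(lines != 0 && (lines & mask) == 0);  /* power of two */
    assert(((uintptr_t)data & 15u) == 0);       /* _mm_load_ps needs alignment */

    for (uint32_t i = 0; i < lines; ++i) {
        size_t k     = (size_t)((i * 997u) & mask) * LINE_BYTES;
        size_t ahead = (size_t)(((i + dist) * 997u) & mask) * LINE_BYTES;

        _mm_prefetch(data + ahead, _MM_HINT_NTA);
        for (size_t j = 0; j < LINE_BYTES; j += 16) {
            __m128 v = _mm_load_ps((const float *)(data + k + j));
            v = _mm_mul_ps(v, v);               /* stand-in computation */
            acc = _mm_add_ps(acc, v);
        }
    }

    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling it in a timed loop with dist = 1, 2, 4, 8, 16, ... on a buffer much larger than the last-level cache should show where the curve flattens out. The result is the same for every dist, so it doubles as a sanity check that only the timing changes.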