When you use non-temporal stores, e.g. movntq, and the data is already in cache, will the store update the cache instead of writing out to memory? Or will it update the cache line and write it out, evicting it? Or what?
Here's a fun dilemma. Suppose thread A is loading the cache line containing x and y. Thread B writes to x using an NT store. Thread A writes to y. There's a data race here if B's store to x can be in transit to memory while A's load is happening. If A sees the old value of x, but the write of x has already happened, then the later write of y and the eventual write-back of the cache line will clobber the unrelated value x. I assume the processor somehow prevents that from happening? I can't see how anyone could build a reliable system using NT stores if that were allowable behavior.
All of the behaviors you describe are sensible implementations of a non-temporal store. In practice, on modern x86 CPUs, the actual semantics are that there's no effect on the L1 cache but the L2 (and higher-level caches, if any) will not evict a cache line to store the non-temporal fetch results.
There is no data race because the caches are hardware coherent. This coherence is not affected in any way by the decision to evict a cache line.
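To make the scenario from the question concrete, here is a minimal sketch in C. The struct layout, the function names (`thread_a`, `thread_b`, `run_once`) and the use of pthreads are my own assumptions for illustration, not something from the question. It puts x and y in the same 64-byte line, has one thread write x with an NT store (`_mm_stream_si32`, i.e. MOVNTI) and the other write y with a normal store. On non-SSE2 targets it falls back to a plain store just so it still compiles.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI), _mm_sfence */
#endif

/* x and y deliberately share one 64-byte cache line (hypothetical layout). */
struct shared_line {
    alignas(64) int x;
    int y;
};

static struct shared_line line;

/* Thread B: non-temporal store to x. */
static void *thread_b(void *arg) {
    (void)arg;
#if defined(__SSE2__)
    _mm_stream_si32(&line.x, 1);
    _mm_sfence();  /* order the weakly-ordered NT store before anything later */
#else
    line.x = 1;    /* fallback so the sketch builds on non-x86 targets */
#endif
    return NULL;
}

/* Thread A: ordinary store to y in the same cache line. */
static void *thread_a(void *arg) {
    (void)arg;
    line.y = 2;
    return NULL;
}

/* Runs both writers once; returns 1 if neither store was lost. */
int run_once(void) {
    pthread_t a, b;
    line.x = 0;
    line.y = 0;
    int rc_a = pthread_create(&a, NULL, thread_a, NULL);
    int rc_b = pthread_create(&b, NULL, thread_b, NULL);
    assert(rc_a == 0 && rc_b == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return line.x == 1 && line.y == 2;
}
```

Calling `run_once()` in a loop should always return 1 on coherent hardware. A passing run can't prove the absence of a bug, of course; it only shows what the coherence guarantee implies for this case.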
On multi-core CPUs (i.e. newer than Pentium M), the target cache line will be evicted by an NT store if it was already present in the cache hierarchy, before the NT store happens.
This is probably inefficient if the cache line is Modified (and in need of write-back); a regular store + clflush would probably be better in that case. IDK how much it costs when the line is clean; the NT store itself moving through the cache hierarchy on the way to the memory controllers can probably do the evicting to make sure no other core can still have a stale cached copy after RAM is modified.
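For reference, the two alternatives compared above could be written with intrinsics roughly like this. It's only a sketch: the function names `store_then_flush` and `nt_store` are made up for illustration, and it says nothing about which one is actually faster on a given CPU (that would need measuring). The non-SSE2 branches are plain stores so the code still builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_clflush, _mm_stream_si32, _mm_sfence */
#endif

/* Option 1: regular store into the cache, then explicitly flush the line. */
void store_then_flush(int *p, int v) {
    assert(p != NULL);
    *p = v;
#if defined(__SSE2__)
    _mm_clflush(p);  /* write back (if dirty) and invalidate the line */
#endif
}

/* Option 2: non-temporal store (MOVNTI). */
void nt_store(int *p, int v) {
    assert(p != NULL);
#if defined(__SSE2__)
    _mm_stream_si32(p, v);
    _mm_sfence();    /* NT stores are weakly ordered; fence before publishing */
#else
    *p = v;          /* fallback so the sketch builds on non-x86 targets */
#endif
}
```

Either way, a later load from `*p` sees the new value; the difference is only in how the line moves through the cache hierarchy.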
From Intel's x86 volume 1 manual, ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data:
If a program specifies a non-temporal store with one of these instructions and the memory type of the destination region is write back (WB), write through (WT), or write combining (WC), the processor will do the following:
- If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted. [1]
- The non-temporal data is written to memory with WC semantics.
[1] Some older CPU implementations (e.g., Pentium M) allowed addresses being written with a non-temporal store instruction to be updated in-place if the memory type was not WC and line was already in the cache.
See also: Chapter 11, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
From Intel's optimization manual, 7.4.1.3 Memory Type and Non-temporal Stores. I've collapsed some parts into a summary inside [].
Memory type can take precedence over a non-temporal hint, leading to
the following considerations:
- [NT is ignored for UC and WP: strongly-ordered uncacheable memory.]
- If the programmer specifies the weakly-ordered uncacheable memory type of Write-Combining (WC), then the non-temporal store and the region have the same semantics and there is no conflict.
- If the programmer specifies a non-temporal store to cacheable memory (for example, Write-Back (WB) or Write-Through (WT) memory types), two cases may result:
— CASE 1 — If the data is present in the cache hierarchy, the instruction will ensure consistency. A
particular processor may choose different ways to implement this. The following approaches are
probable:
(a) updating data in-place in the cache hierarchy while preserving the memory type semantics assigned to that region or
(b) evicting the data from the caches and writing the new
non-temporal data to memory (with WC semantics).
The approaches (separate or combined) can be different for future processors. Pentium 4, Intel
Core Solo and Intel Core Duo processors implement the latter policy (of evicting data from all
processor caches). The Pentium M processor implements a combination of both approaches.
If the streaming store hits a line that is present in the first-level cache, the store data is combined
in place within the first-level cache. If the streaming store hits a line present in the second-level,
the line and stored data is flushed from the second-level to system memory. [I think this whole paragraph is describing Pentium M's "combined" approach]
— CASE 2 — If the data is not present in the cache hierarchy and the destination region is mapped
as WB or WT; the transaction will be weakly ordered and is subject to all WC memory semantics.
This non-temporal store will not write-allocate. Different implementations may choose to collapse and combine such stores.
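As an illustration of the WC semantics described in these quotes, here is a rough sketch of a streaming fill that writes whole 64-byte lines with four 16-byte NT stores each, then fences. The function name `stream_fill` and the 64-byte line size are my assumptions, not from the manuals. Writing full lines back-to-back gives the hardware the chance to combine the stores; the trailing `_mm_sfence` is needed because the stores are weakly ordered. On non-SSE2 targets it just uses `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Fill `bytes` bytes at `dst` with `value` using non-temporal stores.
   Assumes dst is 64-byte aligned and bytes is a multiple of 64. */
void stream_fill(void *dst, size_t bytes, uint8_t value) {
    assert(((uintptr_t)dst % 64) == 0);
    assert(bytes % 64 == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi8((char)value);
    unsigned char *p = (unsigned char *)dst;
    for (size_t off = 0; off < bytes; off += 64) {
        /* Four 16-byte MOVNTDQ stores cover one assumed 64-byte line. */
        _mm_stream_si128((__m128i *)(p + off),      v);
        _mm_stream_si128((__m128i *)(p + off + 16), v);
        _mm_stream_si128((__m128i *)(p + off + 32), v);
        _mm_stream_si128((__m128i *)(p + off + 48), v);
    }
    _mm_sfence();  /* make the weakly-ordered stores globally visible in order */
#else
    memset(dst, value, bytes);  /* fallback so the sketch builds elsewhere */
#endif
}
```

Whether the destination lines were already cached (CASE 1) or not (CASE 2) doesn't change the result the program observes, only how the CPU handles the lines internally.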