I want to store data in a large array with _mm256_stream_si256() called in a loop.
As I understand it, a memory fence is then needed to make these changes visible to other threads. The description of _mm_sfence() says:
Perform a serializing operation on all store-to-memory instructions
that were issued prior to this instruction. Guarantees that every
store instruction that precedes, in program order, is globally visible
before any store instruction which follows the fence in program order.
But will my recent stores of the current thread be visible to subsequent load instructions too (in the other threads)? Or do I have to call _mm_mfence()? (The latter seems to be slow.)
UPDATE: I saw this question earlier: when should I use _mm_sfence, _mm_lfence and _mm_mfence. The answers there focus on when to use fences in general. My question is more specific, and the answers to that question don't currently address it and are not likely to.
UPDATE2: following the comments/answers, let's define "subsequent loads" as the loads in a thread that subsequently takes the lock which the current thread currently holds.
But will my recent stores be visible to subsequent load instructions too?
This sentence makes little sense. Loads are the only way any thread can see the contents of memory. Not sure why you say "too", since there's nothing else. (Other than DMA reads by non-CPU system devices.)
The definition of a store becoming globally visible is that loads in any other thread will get the data from it. It means that the store has left the CPU's private store-buffer and is part of the coherency domain that includes the data caches of all CPUs. (https://en.wikipedia.org/wiki/Cache_coherence).
CPUs always try to commit stores from their store buffer to the globally visible cache/memory state as quickly as possible. All you can do with barriers is make this thread wait until that happens before doing later operations. That can certainly be necessary in multithreaded programs with streaming stores, and it looks like that's what you're actually asking about. But I think it's important to understand that NT stores do reliably become visible to other threads very quickly even with no synchronization.
A mutex unlock on x86 is sometimes a lock add, in which case that's already a full fence for NT stores. But if you can't rule out a mutex implementation that unlocks with a simple store, then you need at least sfence at some point after the NT stores and before the unlock.
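A minimal sketch of that locking case, matching the "next thread that takes the lock" definition from the question. The names (shared_buf, buf_mutex, filled, fill_under_lock, sum_after_fill, run_locked_demo) and the buffer size are made up for illustration, and it uses the SSE2 _mm_stream_si128 rather than the AVX _mm256_stream_si256 only so that it builds on plain x86-64 without extra compiler flags; the fencing is the same either way.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>

constexpr std::size_t kInts = 4096;                 // hypothetical buffer size
alignas(16) static std::int32_t shared_buf[kInts];  // hypothetical shared buffer
static std::mutex buf_mutex;
static bool filled = false;                         // protected by buf_mutex

// Writer: fills the buffer with NT stores while holding the lock.
void fill_under_lock(std::int32_t value) {
    std::lock_guard<std::mutex> guard(buf_mutex);
    for (std::size_t i = 0; i < kInts; i += 4) {
        _mm_stream_si128(reinterpret_cast<__m128i*>(&shared_buf[i]),
                         _mm_set1_epi32(value));
    }
    filled = true;
    // Don't rely on the unlock being a locked instruction: order the NT
    // stores ahead of whatever store the guard's destructor uses to unlock.
    _mm_sfence();
}

// Reader: keeps taking the lock until the buffer has been filled, then sums it.
std::int64_t sum_after_fill() {
    for (;;) {
        {
            std::lock_guard<std::mutex> guard(buf_mutex);
            if (filled) {
                std::int64_t sum = 0;
                for (std::size_t i = 0; i < kInts; ++i) sum += shared_buf[i];
                return sum;
            }
        }
        std::this_thread::yield();
    }
}

// Runs one reader and one writer; returns the sum the reader observed.
std::int64_t run_locked_demo() {
    std::int64_t result = 0;
    std::thread reader([&result] { result = sum_after_fill(); });
    std::thread writer(fill_under_lock, 2);
    writer.join();
    reader.join();
    assert(filled);
    return result;
}
```

If the mutex implementation happens to unlock with a locked instruction, the sfence is redundant but harmless.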
Normal x86 stores have release memory-ordering semantics (C++11 std::memory_order_release). MOVNT streaming stores have relaxed ordering, but mutex / spinlock functions and compiler support for C++11 std::atomic basically ignore them. For multi-threaded code, you have to fence them yourself to avoid breaking the synchronization behaviour of mutex / locking library functions, because they only synchronize normal x86 strongly-ordered loads and stores.
Loads in the thread that executed the stores will still always see the most recently stored value, even from movnt stores. You never need fences in a single-threaded program. The cardinal rule of out-of-order execution and memory reordering is that it never breaks the illusion of running in program order within a single thread. The same goes for compile-time reordering: since concurrent read/write access to shared data is C++ Undefined Behaviour, compilers only have to preserve single-threaded behaviour unless you use fences to limit compile-time reordering.
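To illustrate the single-threaded case, here is a small sketch that does an NT store and reads the data back in the same thread with no fence. The function name nt_store_then_load is hypothetical, and it uses the SSE2 _mm_stream_si128 (instead of the AVX variant from the question) so that it compiles on x86-64 without extra flags.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi32
#include <cassert>
#include <cstdint>

// Hypothetical helper: one 16-byte NT store, then plain loads of the same
// memory from the same thread, with no fence in between.
int nt_store_then_load() {
    alignas(16) std::int32_t buf[4] = {0, 0, 0, 0};
    // _mm_stream_si128 requires a 16-byte-aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(buf) % 16 == 0);
    _mm_stream_si128(reinterpret_cast<__m128i*>(buf), _mm_set1_epi32(42));
    // No _mm_sfence() needed here: program order is preserved within a thread.
    return buf[0] + buf[3];
}
```

The loads see 42 in both elements, so the function returns 84. A fence only starts to matter once another thread reads the buffer.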
MOVNT + SFENCE is useful in cases like producer-consumer multi-threading, or with normal locking where the unlock of a spinlock is just a release-store.
A producer thread writes a big buffer with streaming stores, then stores "true" (or the address of the buffer, or whatever) into a shared flag variable. (Jeff Preshing calls this a payload + guard variable).
A consumer thread is spinning on that synchronization variable, and starts reading the buffer after seeing it become true.
The producer must use sfence after writing the buffer, but before writing the flag, to make sure all the stores into the buffer are globally visible before the flag. (But remember, NT stores are still always locally visible right away to the current thread.)
(With a locking library function, the flag being stored to is the lock. Other threads trying to acquire the lock are using acquire-loads.)
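std::atomic<bool> buffer_ready;

void producer() {
    for (...) {
        // dst (aligned __m256i* into the buffer) and v (the data) are placeholders
        _mm256_stream_si256(dst, v);
    }
    _mm_sfence();  // make the NT stores globally visible before the flag store
    buffer_ready.store(true, std::memory_order_release);
}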
The asm would be something like
vmovntdq [buf], ymm0
...
sfence
mov byte [buffer_ready], 1
Without sfence, some of the movnt stores could be delayed until after the flag store, violating the release semantics of the normal non-NT store.
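For completeness, here is a hedged, self-contained sketch of the whole payload + guard pattern, including the consumer side. The buffer size, the consumer() function and the run_once() driver are my own additions for illustration, and it uses the SSE2 _mm_stream_si128 instead of _mm256_stream_si256 so that it builds on x86-64 without -mavx; the placement of the sfence is the point.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence, _mm_pause
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kInts = 4096;             // hypothetical payload size
alignas(16) static std::int32_t buffer[kInts];  // payload
static std::atomic<bool> buffer_ready{false};   // guard variable

void producer() {
    for (std::size_t i = 0; i < kInts; i += 4) {
        _mm_stream_si128(reinterpret_cast<__m128i*>(&buffer[i]),
                         _mm_set1_epi32(1));
    }
    _mm_sfence();  // order the NT stores before the flag store
    buffer_ready.store(true, std::memory_order_release);
}

std::int64_t consumer() {
    // Spin with acquire loads until the producer publishes the buffer.
    while (!buffer_ready.load(std::memory_order_acquire)) {
        _mm_pause();
    }
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < kInts; ++i) sum += buffer[i];
    return sum;
}

// Runs one producer and one consumer; meant to be called once per process,
// since the flag is never reset.
std::int64_t run_once() {
    std::int64_t result = 0;
    std::thread c([&result] { result = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(buffer_ready.load());
    return result;
}
```

With the sfence in place, the consumer sums to kInts (4096 here) every time. Removing it would not necessarily produce a wrong sum on any given run, which is exactly what makes this kind of bug hard to find by testing.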
If you know what hardware you're running on, and you know the buffer is always large, you might get away with skipping the sfence if you know the consumer always reads the buffer from front to back (in the same order it was written). In that case it's probably not possible for the stores to the end of the buffer to still be in flight in a store buffer in the core running the producer thread by the time the consumer thread gets to the end of the buffer.
(in comments)
by "subsequent" I mean happening later in time.
There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for sfence to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after sfence will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds".
Fences stronger than sfence work, too:
Any atomic read-modify-write operation on x86 needs a lock prefix, which is a full memory barrier (like mfence).
So if you, for example, increment an atomic counter after your streaming stores, you don't also need sfence. Unfortunately, in C++, std::atomic and _mm_sfence() don't know about each other, and compilers are allowed to optimize atomics following the as-if rule. So it's hard to be sure that a locked RMW instruction will be in exactly the place you need it in the resulting asm.
(Basically, if a certain ordering is possible in the C++ abstract machine, the compiler can emit asm that makes it always happen that way, e.g. fold two successive increments into one +=2 so that no thread can ever observe the counter being an odd number.)
Still, the default std::memory_order_seq_cst prevents a lot of compile-time reordering, and there's not much downside to using it for a read-modify-write operation when you're only targeting x86. sfence is quite cheap, though, so it's probably not worth the effort trying to avoid it between some streaming stores and a locked operation.
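In other words, the low-effort option is to keep the explicit fence even when an atomic RMW follows. A minimal sketch, assuming a hypothetical chunk buffer and progress counter (chunk, chunks_done and publish_chunk are invented names), and again using the SSE2 _mm_stream_si128 so it builds without -mavx:

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunkInts = 1024;            // hypothetical chunk size
alignas(16) static std::int32_t chunk[kChunkInts];  // hypothetical shared chunk
static std::atomic<int> chunks_done{0};             // hypothetical progress counter

// Fills the chunk with NT stores, then bumps the counter.
// Returns the counter value after the increment.
int publish_chunk(std::int32_t value) {
    assert(reinterpret_cast<std::uintptr_t>(chunk) % 16 == 0);
    for (std::size_t i = 0; i < kChunkInts; i += 4) {
        _mm_stream_si128(reinterpret_cast<__m128i*>(&chunk[i]),
                         _mm_set1_epi32(value));
    }
    // The fetch_add below compiles to a locked instruction on x86, which is
    // already a full barrier, but the explicit sfence doesn't depend on where
    // the compiler ends up placing that instruction.
    _mm_sfence();
    return chunks_done.fetch_add(1, std::memory_order_seq_cst) + 1;
}
```

A reader that sees chunks_done change (with an acquire or seq_cst load) can then read the chunk.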
Related: pthreads v. SSE weak memory ordering. The asker of that question thought that unlocking a lock would always do a locked operation, thus making sfence redundant.
C++ compilers don't try to insert sfence for you after streaming stores, even when there are std::atomic operations with ordering stronger than relaxed. It would be too hard for compilers to reliably get this right without being very conservative (e.g. sfence at the end of every function with an NT store, in case the caller uses atomics).
The Intel intrinsics predate C11 stdatomic and C++11 std::atomic.
The implementation of std::atomic pretends that weakly-ordered stores don't exist, so you have to fence them yourself with intrinsics.
This seems like a good design choice, since you only want to use movnt stores in special cases, because of their cache-evicting behaviour. You don't want the compiler ever inserting sfence where it wasn't needed, or using movnti for std::memory_order_relaxed.
But will my recent stores of the current thread be visible to
subsequent load instructions too (in the other threads)? Or do I have
to call _mm_mfence()? (The latter seems to be slow)
The answer is no. You are not guaranteed to see previous stores from one thread without also making some synchronization attempt in the other thread. Why is that?
- Your compiler can reorder instructions
- Your processor can reorder instructions (on some platforms)
In C++, the compiler is required to emit sequentially consistent code, but only for single-threaded execution. Consider the following code:
int x = 5;
int y = 7;
int z = x;
In this program the compiler can choose to put x = 5 after y = 7, but no later than int z = x, as that would be inconsistent.
Now consider the following code in another thread:
int a = y;
int b = x;
The same instruction reordering can happen here, as a and b are independent of each other. What will be the result of running those threads?
a    b
7    5
7    ? (whatever was stored in x before the assignment of 5)
...
We can get this result even if we put a memory barrier between x = 5 and y = 7, because without also putting a barrier between a = y and b = x you never know in which order they will be read.
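As a rough sketch of what synchronizing on both sides looks like in portable C++: unlike the snippets above, y is a std::atomic<int> here rather than a plain int (that change, the loop in the reader, and the names writer, reader and run_pair are my assumptions for illustration). The release store in the writer pairs with the acquire load in the reader.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int x = 0;              // plain data, published via y
static std::atomic<int> y{0};  // guard: atomic instead of a plain int

void writer() {
    x = 5;
    // Release: the store to x cannot be reordered after this store.
    y.store(7, std::memory_order_release);
}

int reader() {
    // Acquire: the load of x below cannot be reordered before this load.
    while (y.load(std::memory_order_acquire) != 7) {
        std::this_thread::yield();
    }
    return x;  // guaranteed to observe 5
}

// Runs both threads and returns the value of x the reader observed.
int run_pair() {
    int b = 0;
    std::thread r([&b] { b = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    assert(y.load() == 7);
    return b;
}
```

With only the writer side ordered, the "7 ?" row in the table above stays possible; with both sides ordered, a reader that sees 7 in y always sees 5 in x.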
This is just a rough presentation of what you can read in Jeff Preshing's blog post Memory Ordering at Compile Time.