For any std::atomic<T>
where T is a primitive type:
If I use std::memory_order_acq_rel
for fetch_xxx
operations, and std::memory_order_acquire
for load
operation and std::memory_order_release
for store
operation blindly (that is, as if I were simply changing the default memory ordering of those functions):
- Will the results be the same as if I used
std::memory_order_seq_cst
(which is being used as default) for any of the declared operations?
- If the results were the same, is this usage in any way different from using
std::memory_order_seq_cst
in terms of efficiency?
The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release
, and a load from another thread reads the value with std::memory_order_acquire
then subsequent read operations in the second thread will see any values that the first thread stored to any memory location before the store-release, or a later store to any of those memory locations.
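As a minimal sketch of that release/acquire pairing (the names payload, ready and run_message_passing are made up for illustration and are not from the question), one thread writes ordinary data and then does a store-release on a flag, while the other spins on a load-acquire of the flag before reading the data:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper for illustration: returns the value the consumer read.
int run_message_passing() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag used to publish the data
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // written before the store-release
        ready.store(true, std::memory_order_release);  // publishes everything written above
    });
    std::thread consumer([&] {
        // Spin until the load-acquire observes the value written by the store-release.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;      // ordered after the producer's write to payload
        assert(seen == 42);  // holds because the acquire synchronizes with the release
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Calling run_message_passing() from main should return 42. Using std::memory_order_seq_cst for both the store and the load would give the same result for this two-thread case.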
If both the store and subsequent load are std::memory_order_seq_cst
then the relationship between these two threads is the same. You need more threads to see the difference.
e.g. std::atomic<int>
variables x
and y
, both initially 0.
Thread 1:
x.store(1,std::memory_order_release);
Thread 2:
y.store(1,std::memory_order_release);
Thread 3:
int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);
Thread 4:
int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);
As written, there is no relationship between the stores to x
and y
, so it is quite possible to see a==1
, b==0
in thread 3, and c==1
and d==0
in thread 4.
If all the memory orderings are changed to std::memory_order_seq_cst
then this enforces an ordering between the stores to x
and y
. Consequently, if thread 3 sees a==1
and b==0
then that means the store to x
must be before the store to y
, so if thread 4 sees c==1
, meaning the store to y
has completed, then the store to x
must also have completed, so we must have d==1
.
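Here is a rough, runnable version of the four-thread example using std::memory_order_seq_cst throughout. The function names and the trial loop are my own additions for illustration. A run that never observes the forbidden outcome proves nothing on its own (the guarantee comes from the standard, not from testing), and swapping in acquire/release would make the outcome allowed without making it likely on any particular machine.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the four threads once and reports whether
// threads 3 and 4 saw the two stores in opposite orders.
bool readers_disagree_once() {
    std::atomic<int> x{0}, y{0};
    int a = 0, b = 0, c = 0, d = 0;

    std::thread t1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] {
        a = x.load(std::memory_order_seq_cst);  // x before y
        b = y.load(std::memory_order_seq_cst);
    });
    std::thread t4([&] {
        c = y.load(std::memory_order_seq_cst);  // y before x
        d = x.load(std::memory_order_seq_cst);
    });

    t1.join(); t2.join(); t3.join(); t4.join();

    // Thread 3 says "x was stored first", thread 4 says "y was stored first".
    return a == 1 && b == 0 && c == 1 && d == 0;
}

// Hypothetical helper: repeats the experiment a number of times.
bool readers_ever_disagree(int trials) {
    for (int i = 0; i < trials; ++i) {
        if (readers_disagree_once()) {
            return true;
        }
    }
    return false;
}
```

With seq_cst, readers_ever_disagree(n) must return false for any n, because all four threads agree on a single total order of the two stores.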
In practice, using std::memory_order_seq_cst
everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. For example, a common technique for x86 processors is to use XCHG
instructions rather than MOV
instructions for std::memory_order_seq_cst
stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release
a plain MOV
will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
Memory ordering is hard. I devoted almost an entire chapter to it in my book.
Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.
The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;
), then another processor may be able to see y
as 11 before it sees the value 7
in x. By using a memory ordering operation between setting x
and setting y
, the processor that you are using will guarantee that x = 7;
has been written to memory before it continues to store something in y
.
Most of the time, it's not REALLY important in which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer with integers, and we do something like:
buffer[index] = 32;
index = (index + 1) % buffersize;
and some other thread is using index
to determine that the new value has been written, then we NEED to have 32
written FIRST, then index
updated AFTER. Otherwise, the other thread may get old
data.
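A minimal sketch of that buffer with the ordering spelled out, assuming a single writer and a single reader (the Ring struct, push, wait_for_first and run_ring_demo are names I made up for illustration):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single-producer ring buffer, for illustration only.
struct Ring {
    static constexpr std::size_t buffersize = 8;
    int buffer[buffersize] = {};
    std::atomic<std::size_t> index{0};

    void push(int value) {
        // Only the producer modifies index, so a relaxed read of it is enough here.
        std::size_t i = index.load(std::memory_order_relaxed);
        buffer[i] = value;                                             // write the data FIRST
        index.store((i + 1) % buffersize, std::memory_order_release);  // update index AFTER
    }

    int wait_for_first() {
        // The load-acquire pairs with the store-release in push().
        while (index.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        return buffer[0];  // sees the value written before index was updated
    }
};

// Hypothetical driver: one thread pushes 32, the other waits for it.
int run_ring_demo() {
    Ring ring;
    int got = -1;
    std::thread consumer([&] { got = ring.wait_for_first(); });
    std::thread producer([&] { ring.push(32); });
    producer.join();
    consumer.join();
    return got;
}
```

run_ring_demo() should return 32. If the store to index were std::memory_order_relaxed instead, the reader could legally see the new index and still read stale data from buffer[0].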
The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.
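To show where the names come from, here is a toy spinlock, assuming nothing beyond std::atomic (SpinLock and run_spinlock_demo are hypothetical names, and a real program should normally just use std::mutex). Taking the lock is an acquire operation and dropping it is a release operation:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Toy spinlock for illustration only.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: work done inside the critical section cannot move above this.
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: work done inside the critical section cannot move below this.
        locked.store(false, std::memory_order_release);
    }
};

// Hypothetical driver: two threads increment a plain int under the lock.
int run_spinlock_demo() {
    SpinLock lock;
    int counter = 0;  // non-atomic, protected only by the lock
    auto work = [&] {
        for (int i = 0; i < 10000; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

run_spinlock_demo() should return 20000, since each unlock() publishes the increment to whichever thread takes the lock next.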
Now, the cst
is the most strict ordering rule - it enforces that both reads and writes of the data you've written go out to memory before the processor can continue to do more operations. This will be slower than doing the specific acquire or release barriers. It forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.
How much difference does that make? It is highly dependent on what the system architecture is. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only some small percentage slower than doing a regular memory write. X86 is pretty good at doing this fast. Some types of embedded processors (some ARM models, for example, though I'm not sure which) require a bit more work in the processor to ensure everything works.