In general, for `int num`, `num++` (or `++num`), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):
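(The original listing isn't preserved here; the following is a representative reconstruction of the kind of unoptimized GCC output in question, for a function containing `int num = 0; num++;`.)

```cpp
// Reconstruction of the code being compiled:
void f()
{
    int num = 0;
    num++;
}
```

```asm
f():
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0
        add     DWORD PTR [rbp-4], 1   ; line 5: the entire num++ in one instruction
        nop
        pop     rbp
        ret
```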
Since line 5, which corresponds to `num++`, is one instruction, can we conclude that `num++` is atomic in this case?

And if so, does it mean that the so-generated `num++` can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, `std::atomic<int>` and impose the associated costs, since it's atomic anyway)?
UPDATE
Notice that this question is not whether increment is atomic (it's not, and that was and is the opening line of the question). It's whether it can be in particular scenarios, i.e. whether the one-instruction nature can in certain cases be exploited to avoid the overhead of the `lock` prefix. And, as the accepted answer mentions in the section about uniprocessor machines, and as this answer, the conversation in its comments, and others explain, it can (although not with C or C++).
Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write, and if the memory was not also used as a DMA buffer, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn't addressed).
When it was rare to have a dual processor (e.g. a dual-socket Pentium Pro) on a customer's desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.
Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via a time slice expiring and the other thread running on the same CPU (core). That is not realistic.
With modern x86/x64 processors, the single instruction is broken up into several micro-ops, and furthermore memory reads and writes are buffered. So different threads running on different CPUs will not only see this as non-atomic, but may also see inconsistent results concerning what each reads from memory and what it assumes other threads have read up to that point in time: you need to add memory fences to restore sane behavior.
This is absolutely what C++ defines as a data race that causes undefined behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use `std::atomic` for reliable results, but you can use it with `memory_order_relaxed` if you don't care about reordering. See below for some example code and asm output using `fetch_add`.

But first, the assembly language part of the question:
Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.
Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running `add dword [num], 1` in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram.) After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.

"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity on x86, while this answer focuses on read-modify-write.
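A minimal sketch (my addition, not from the original answer) that demonstrates the lost updates on a plain `int` next to the `std::atomic<int>` fix:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                  // racy RMW: data race, undefined behaviour
std::atomic<int> atomic_counter{0};     // well-defined concurrent RMW

void work() {
    for (int i = 0; i < 40000; ++i) {
        plain_counter++;                                         // increments can be lost
        atomic_counter.fetch_add(1, std::memory_order_relaxed);  // lock'ed RMW, never lost
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // atomic_counter is always 80000; plain_counter is typically less
    // (and the program formally has UB because of the race on it).
    std::cout << plain_counter << " vs " << atomic_counter << '\n';
}
```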
The `lock` prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A.)

So `lock add dword [num], 1` is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
Without the `lock` prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that other store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without `lock` you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.

(If a `lock`ed instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)

Note that the `lock` prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)

On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a `lock` prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain `dec dword [num]` can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.

Back to C++:
It's totally bogus to use `num++` without telling the compiler that you need it to compile to a single read-modify-write implementation:
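(A sketch of equally valid, non-atomic compiler output for `num++`, assuming x86:)

```asm
;; one perfectly legal way to compile num++ without lock:
mov   eax, [num]        ; load
inc   eax               ; modify in a register
mov   [num], eax        ; store
```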
This is very likely if you use the value of `num` later: the compiler will keep it live in a register after the increment. So even if you check how `num++` compiles on its own, changing the surrounding code can affect it.

(If the value isn't needed later, `inc dword [num]` is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: `gcc -O3 -m32 -mtune=i586` will actually emit this, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the x86 tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF).)

Don't confuse the target memory model (x86) with the C++ memory model
Compile-time reordering is allowed. The other part of what you get with `std::atomic` is control over compile-time reordering, to make sure your `num++` becomes globally visible only after some other operation.

Classic example: storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder, by using `flag.store(1, std::memory_order_release);`.

You might be expecting that this code will synchronize with other threads:
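(The exact snippet isn't preserved here; a sketch of the kind of broken code meant, with `modify_stuff` as an illustrative helper:)

```cpp
// flag is just a plain int global, NOT std::atomic<int>.
int flag = 0;

void modify_stuff();   // writes shared data another thread will read

void publish() {
    modify_stuff();
    flag++;   // intended as a "data is ready" signal -- broken
}
```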
But it won't. The compiler is free to move the `flag++` across the function call (if it inlines the function or knows that it doesn't look at `flag`). Then it can optimize away the modification entirely, because `flag` isn't even `volatile`.

(And no, C++ `volatile` is not a useful substitute for `std::atomic`. `std::atomic` does make the compiler assume that values in memory can be modified asynchronously, similar to `volatile`, but there's much more to it than that. Also, `volatile std::atomic<int> foo` is not the same as `std::atomic<int> foo`, as discussed with @Richard Hodges.)

Defining data races on non-atomic variables as undefined behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog post for more about how UB enables compiler optimizations.)
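For example, a minimal sketch (my illustration) of the kind of load hoisting this enables:

```cpp
bool ready = false;   // plain bool, not std::atomic<bool>

void spin_wait() {
    // Since a data race would be UB, the compiler may assume no other
    // thread modifies `ready`, hoist the load out of the loop, and
    // effectively compile this to:  if (!ready) for (;;) {}
    while (!ready) { }
}
```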
As I mentioned, the x86 `lock` prefix is a full memory barrier, so using `num.fetch_add(1, std::memory_order_relaxed);` generates the same code on x86 as `num++` (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.

This is what GCC actually does on x86, for a few functions that operate on a `std::atomic` global variable.

See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.
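(The exact listing from the answer isn't reproduced here; a sketch of that kind of test file, with illustrative function names:)

```cpp
#include <atomic>

std::atomic<int> num;

void inc_seq_cst() {
    num++;  // x86: lock add DWORD PTR num[rip], 1
}

void inc_relaxed() {
    // Same single lock'ed instruction on x86; cheaper on weakly-ordered ISAs.
    num.fetch_add(1, std::memory_order_relaxed);
}

int load_num() {
    return num;     // x86: a plain mov load; x86 loads already have acquire semantics
}

void store_num(int val) {
    num = val;      // seq_cst store: x86 needs mov + mfence (or an xchg)
}
```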
Notice how MFENCE (a full barrier) is needed after sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic `num++; num-=2;` operations into one `num--;` instruction:

A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.
Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like `fetch_or(0)` which may be able to turn into just a `load()` (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.

The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with `std::shared_ptr<T>`, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a `shared_ptr_unsynchronized<T>` for gcc).

Getting back to `num++; num-=2;` compiling as if it were `num--`: compilers are allowed to do this, unless `num` is `volatile std::atomic<int>`. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the `num++` result).

I.e. if the ordering where nothing becomes globally visible between these operations is compatible with the ordering requirements of the source (according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single `lock dec dword [num]` instead of `lock inc dword [num]` / `lock sub dword [num], 2`.

`num++; num--` can't disappear, because it still has a Synchronizes-With relationship with other threads that look at `num`, and it's both an acquire-load and a release-store, which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a `lock add dword [num], 0` (i.e. `num += 0`).
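A concrete sketch of the merging in question (illustration only; as noted below for gcc6.2 and clang3.9, current compilers keep both `lock`ed operations):

```cpp
#include <atomic>

std::atomic<int> num;

void inc_then_sub2() {
    num++;      // x86: lock inc dword ptr [num]
    num -= 2;   // x86: lock sub dword ptr [num], 2
    // The as-if rule would permit a single  lock dec dword ptr [num]
    // here, but compilers don't currently merge atomic RMWs.
}
```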
As discussed in P0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc/dec of ref counts when a copy of a `shared_ptr` is created and destroyed, if the compiler can prove that another `shared_ptr` object exists for the entire lifespan of the temporary).

Even `num++; num--` merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.

With current gcc6.2 and clang3.9, you still get separate `lock`ed operations even with `memory_order_relaxed`
in the most obviously optimizable case. (Godbolt compiler explorer, so you can see whether the latest versions are different.)

The fact that a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile `++` to `add` when optimizing in a quick & dirty example), seems to imply that incrementing this way is atomic doesn't mean it is standard-compliant (you would cause undefined behavior when trying to access `num` from a thread), and it is wrong anyway, because `add` is not atomic on x86.

Note that atomics (using the `lock` instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably cheaper than a mutex, which isn't very appropriate in this use-case.

The following results are taken from clang++ 3.8 when compiling with `-Os`.

Incrementing an int passed by reference, the "regular" way:
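(A sketch reconstructing the answer's example function:)

```cpp
void inc(int& x)
{
    ++x;
}
```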
This compiles into:
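(Again a sketch; a memory-destination increment like this is the expected shape at `-Os`:)

```asm
inc(int&):
    incl    (%rdi)
    retq
```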
Incrementing an int passed by reference, the atomic way:
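(Sketch, under the same assumptions:)

```cpp
#include <atomic>

void inc(std::atomic<int>& x)
{
    ++x;   // seq_cst read-modify-write
}
```

which compiles into:

```asm
inc(std::atomic<int>&):
    lock incl (%rdi)
    retq
```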
This example, which is not much more complex than the regular way, just gets the `lock` prefix added to the `incl` instruction - but caution: as previously stated, this is not cheap. Just because the assembly looks short doesn't mean it's fast.

When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^
The add instruction is not atomic. It references memory, and two processor cores can each perform the load-increment-store on that memory concurrently, losing one of the increments. IIRC, the atomic variant of the add instruction is `lock xadd` (`lock add` also makes the increment itself atomic; `xadd` additionally returns the old value).
Yes, but...
Atomic is not what you meant to say. You're probably asking the wrong thing.
The store of the incremented value is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are atomically read and written, never anything different.

Smaller-than-cache-line data is, of course, also written atomically (since the surrounding cache line is).
Is it thread-safe?
This is a different question, and there are at least two good reasons to answer with a definite "No!".
First, there is the possibility that another core has a copy of that cache line (L2 and upwards is usually shared, but L1 is normally per-core!) and takes ownership of it to run its own increment between your load and your store. Each individual write still happens atomically, so you end up with two "correct" (correctly, atomically, written) values -- but one simply overwrites the other.

Cache coherency ensures the caches never disagree and sorts out a single winner, of course. But the result may not be what you expect.
Second, there is memory ordering, or, worded differently, happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.

You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.
For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
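A minimal sketch of that pattern (names and types are mine, not from the answer):

```cpp
#include <atomic>

struct Result { int value; };

std::atomic<Result*> ready{nullptr};   // the "data is ready" flag, carrying the pointer

void producer() {
    Result* r = new Result{42};                  // fill in the data first...
    ready.store(r, std::memory_order_release);   // ...then atomically publish it
}

void consumer() {
    Result* r;
    while ((r = ready.load(std::memory_order_acquire)) == nullptr) { }
    // The release store happened-before this acquire load, so *r is fully
    // constructed here and r->value is guaranteed to read 42.
}
```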