I've been reading this article about atomic operations, and it mentions 32bit integer assignment being atomic on x86, as long as the variable is naturally aligned.
Why does natural alignment assure atomicity?
"Natural" alignment means aligned to it's own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).
CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See @BeeOnRope's comments below.) See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.
First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.
Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading it every time you read it, like volatile but with actual guarantees and official support from the language standard.
Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11.
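To make the "just use std::atomic" advice concrete, here's a minimal C++ sketch (the names and the flag-passing scenario are mine, not from the article) contrasting a plain int with std::atomic<int>:

    #include <atomic>

    int plain_flag = 0;                // concurrent write+read would be a data race: UB
    std::atomic<int> ready_flag{0};    // loads/stores are guaranteed atomic

    void writer() {
        plain_flag = 1;                // compiler may reorder or combine this; racy if shared
        ready_flag.store(1, std::memory_order_release);  // single aligned store other threads will see
    }

    void reader() {
        while (ready_flag.load(std::memory_order_acquire) == 0) {
            // guaranteed to reload from memory each iteration; a plain int here
            // could legally be hoisted into a register and spin forever
        }
    }

With the plain int the compiler may keep the value in a register as described above; the atomic version forces a real load/store and also gives you ordering guarantees.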
Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t/uint64_t are lock-free on all the major x86 compilers, though.
Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.
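As a quick illustration of that side-note (a check I'm adding, not part of the original answer; is_always_lock_free needs C++17), you can ask the implementation about lock-freedom and alignment:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // true when atomic<T> is implemented with plain loads/stores (plus lock'ed
        // RMW) on every object, rather than a hidden mutex
        std::printf("atomic<int>     always lock-free: %d\n",
                    std::atomic<int>::is_always_lock_free);
        std::printf("atomic<int64_t> always lock-free: %d\n",
                    std::atomic<std::int64_t>::is_always_lock_free);
        // the atomic wrapper can be more aligned than the plain type: on a 32-bit
        // target alignof(int64_t) may be 4, but the atomic object needs 8
        std::printf("alignof(int64_t)=%zu  alignof(atomic<int64_t>)=%zu\n",
                    alignof(std::int64_t), alignof(std::atomic<std::int64_t>));
    }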
TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64 bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.
(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types, only aligning them to 4B, so atomic_llong is not actually atomic. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5, because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)
IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".
From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)
In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.
That last point (the guarantee that reading or writing a doubleword aligned on a 32-bit boundary is atomic) is the answer to your question: this behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).
The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.
The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:
AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic
So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang 4.0 -m32 unfortunately uses lock cmpxchg8b, bug 33109.
On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.
If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64-bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with gcc/clang for them to emit it.)
Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.
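Whether you actually get lock-free 16-byte atomics therefore depends on the compiler, flags like -mcx16, and the library; a small probe like this (my sketch, not from the answer; with gcc you may also need to link against libatomic) shows what your toolchain decided:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct Pair { std::uint64_t lo, hi; };   // 16 bytes, trivially copyable

    int main() {
        std::atomic<Pair> p{};
        // Depending on compiler version and flags (e.g. -mcx16) this may report
        // lock-free, or fall back to a lock inside libatomic.
        std::printf("16-byte atomic lock-free: %d\n", p.is_lock_free());
    }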
Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set by PAT or MTRR. It doesn't mean that the cache line has to already be hot in L1 cache.
AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above.) I guess lock cmpxchg16b must be handled specially.
Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.
Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.
The relevant sections of the manuals:
Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48-bit m16:32 as a memory operand into cs:eip with far call or far jmp. (And far call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16-bit and 32-bit ones.
There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.
Atomic Read-Modify-Write
I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).
To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.
The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.
(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)
See also: Can num++ be atomic for 'int num'?
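Here's a C++-level sketch of that distinction (my example; the instructions named in the comments are the typical x86-64 mapping, not a guarantee): a pure atomic load needs no lock prefix, but a read-modify-write does.

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> total{0};

    std::uint64_t peek() {
        // pure atomic load: on x86-64 just an ordinary aligned 8-byte mov
        return total.load(std::memory_order_acquire);
    }

    void add_if_even(std::uint64_t delta) {
        // atomic RMW: a CAS loop like this typically becomes a loop around
        // lock cmpxchg; total.fetch_add(delta) would be a single lock xadd / lock add
        std::uint64_t expected = total.load(std::memory_order_relaxed);
        while (expected % 2 == 0 &&
               !total.compare_exchange_weak(expected, expected + delta)) {
            // on failure, compare_exchange_weak reloads 'expected', so just retry
        }
    }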
Why lock mov [mem], reg doesn't exist for atomic unaligned stores
From the insn ref manual (Intel x86 manual vol2), cmpxchg:
This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.
The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.
Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.
Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.
xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov + mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.
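In C++ terms (my sketch, not from the answer), that cost shows up on the default sequentially-consistent store; weaker stores stay a plain mov on x86:

    #include <atomic>

    std::atomic<int> ready{0};

    void publish_seq_cst() {
        // default seq_cst store: compilers emit xchg, or mov + mfence, to get the
        // full-barrier behaviour discussed above
        ready.store(1);
    }

    void publish_release() {
        // release (and relaxed) stores need no barrier on x86: just a plain mov
        ready.store(1, std::memory_order_release);
    }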
Motivation for this design decision:
Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.
For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.
Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.
Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.
To answer your first question, a variable is naturally aligned if it exists at a memory address that is a multiple of its size.
If we consider only - as the article you linked does - assignment instructions, then alignment guarantees atomicity because MOV (the assignment instruction) is atomic by design on aligned data.
Other kinds of instructions, INC for example, need to be LOCKed (an x86 prefix that gives the current processor exclusive access to the shared memory for the duration of the prefixed operation) even if the data is aligned, because they actually execute in multiple steps (load, increment, store).
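A tiny sketch of that difference (mine, not the answerer's): the atomic increment is one LOCKed read-modify-write, while the plain one is a separate load, add and store that another core can slip between:

    #include <atomic>

    int plain_counter = 0;
    std::atomic<int> atomic_counter{0};

    void tick() {
        plain_counter += 1;           // load + add + store; not atomic, racy if shared
        atomic_counter.fetch_add(1);  // typically a single lock add / lock xadd
    }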
If you were asking why it is designed this way, I would say it's a beneficial side effect of the CPU architecture's design.
Back in the 486 days, there was no multi-core CPU or QPI link, so atomicity wasn't really a strict requirement back then (DMA may have required it?).
On x86, the data width is 32 bits (or 64 bits for x86_64), meaning the CPU can read and write up to the data width in one shot, and the memory data bus is typically the same width or wider. Combined with the fact that reading/writing an aligned address is done in one shot, there is naturally nothing that would make the read/write non-atomic. You gain speed and atomicity at the same time.
If a 32-bit or smaller object is naturally aligned within a "normal" part of memory, any 80386 or compatible processor other than the 80386sx will be able to read or write all 32 bits of the object in a single operation. The ability of a platform to do something quickly and usefully doesn't necessarily mean it won't sometimes do it some other way, and I believe many if not all x86 processors can have regions of memory which are only accessible 8 or 16 bits at a time. Even so, I don't think Intel has ever defined conditions under which requesting an aligned 32-bit access to a "normal" area of memory would cause the system to read or write part of the value without the rest, and I don't think Intel has any intention of ever defining such a thing for "normal" areas of memory.
Naturally aligned means that the address of the type is a multiple of the size of the type.
For example, a byte can be at any address, a short (assuming 16 bits) must be on a multiple of 2, an int (assuming 32 bits) must be on a multiple of 4, and a long (assuming 64 bits) must be on a multiple of 8.
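Those alignments are what alignof reports; for instance (a sketch assuming a typical x86-64 ABI, where long long is 64 bits):

    // typical x86-64 values; alignof() is what "naturally aligned" refers to
    static_assert(alignof(char) == 1, "a char can live at any address");
    static_assert(alignof(short) == 2, "16 bits: multiple of 2");
    static_assert(alignof(int) == 4, "32 bits: multiple of 4");
    static_assert(alignof(long long) == 8, "64 bits: multiple of 8");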
If you access a piece of data that is not naturally aligned, the CPU will either raise a fault or read/write the memory, but not as an atomic operation. The action the CPU takes will depend on the architecture.
For example, imagine we've got an 8-byte region of memory at addresses 0-7, with every byte holding the value X, and an int pointer data that isn't naturally aligned: it points at address 3.
When we try to read *data, the bytes that make up the value are spread across 2 int-size blocks: 1 byte is in block 0-3 and 3 bytes are in block 4-7. Now, just because the blocks are logically next to each other doesn't mean they are physically. For example, block 0-3 could be at the end of a CPU cache line, whilst block 4-7 is sitting in a page file. When the CPU goes to access block 4-7 in order to get the 3 bytes it needs, it may see that the block isn't in memory and signal that it needs the memory paged in. This will probably block the calling process whilst the OS pages the memory back in.
After the memory has been paged in, but before your process is woken back up, another process may come along and write a Y to address 4. Then your process is rescheduled and the CPU completes the read, but now it has read XYXX, rather than the XXXX you expected.
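Here's a short sketch of the setup described above (the buffer and variable names are mine; address 3 and the X/Y bytes come from the example). It shows that a pointer 3 bytes into the block is not naturally aligned for an int:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        alignas(8) unsigned char block[8];
        std::memset(block, 'X', sizeof block);      // addresses 0-7 all hold 'X'

        unsigned char *p = block + 3;               // the int would occupy bytes 3,4,5,6
        bool aligned = reinterpret_cast<std::uintptr_t>(p) % alignof(int) == 0;
        std::printf("naturally aligned for int? %s\n", aligned ? "yes" : "no");  // prints "no"

        unsigned value;
        std::memcpy(&value, p, sizeof value);       // a safe way to read it in C++; a raw
                                                    // misaligned load is where tearing can bite
        std::printf("value = 0x%08x\n", value);     // 0x58585858 ('X' repeated) if untouched
    }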