Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.
The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.
The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? Is so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.
This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html
Update:
I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.
// Compile with:
// gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.
#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;
unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));
void* thread1(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n1[mask]++;
x = (v4si){0,0,0,0};
}
return NULL;
}
void* thread2(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n2[mask]++;
x = (v4si){-1,-1,-1,-1};
}
return NULL;
}
int main() {
// Check memory alignment
if ( (((uintptr_t)&x) & 0x0f) != 0 )
abort();
memset(n1, 0, sizeof(n1));
memset(n2, 0, sizeof(n2));
pthread_t t1, t2;
pthread_create(&t1, NULL, thread1, NULL);
pthread_create(&t2, NULL, thread2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
for (unsigned i=0; i<16; i++) {
for (int j=3; j>=0; j--)
printf("%d", (i>>j)&1);
printf(" %10u %10u", n1[i], n2[i]);
if(i>0 && i<0x0f) {
if(n1[i] || n2[i])
printf(" Not a single memory access!");
}
printf("\n");
}
return 0;
}
The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test, it implements 16-byte memory read/writes with a granularity of 8 bytes. The output is:
0000 96905702 10512
0001 0 0
0010 0 0
0011 22 12924 Not a single memory access!
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 3092557 1175 Not a single memory access!
1101 0 0
1110 0 0
1111 1719 99975389
The x86 ISA doesn't guarantee atomicity for anything larger than 8B, so that implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo does: internally data is handled in 64bit halves. A 128bit store is done as two 64bit stores. The data path to/from cache is only 64b wide in the Yonah microarchitecture (Core Duo). (source:Agner Fog's microarch doc).
More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (conroe/merom) was the first Intel P6-descended microarch with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)
This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), but other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).
The diagrams on this Realworldtech writeup about Merom vs. Yonah show the 128bit path between ALU and L1 data-cache in Merom (and P4), while the low-power Yonah has a 64bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.
The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64Byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test. I forget if Skylake again widened the paths in preparation for AVX512 in Skylake-EP (the server version), or if perhaps the initial implementation of AVX512 will be like SnB/IvB's AVX, and have 512b loads/stores occupy a load/store port for 2 cycles.
As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-core system might be narrower than what you get within a shared-last-level-cache CPU. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer-family, or for Intel. (I say "logical", because even if the data is transferred in smaller chunks, it might not modify a cache line until it's fully received.)
Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:
K8 decodes
movaps reg, [mem]
to 2 m-ops, while K10 and bulldozer-family decode it to 1 m-op. AMD's low-power bobcat decodes it to 2 ops, while jaguar decodes 128b movaps to 1 m-op. (It supports AVX1 similar to bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.
Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite probably atomic.
EDIT: In the last two days I have made several tests on my three PCs and I didn't reproduce any memory error, so I can't say anything more precisely. Maybe is this memory error also dependent from OS.
EDIT: I'm programing in Delphi and not in C but I should understand C. So I have translated the code, here are you have the threads procedures where the main part is made in assembler:
Result: no mistake on my quad core PC and no mistake on my dual core PC as expected!
Can you show how debuger see your thread procedure code? Please...
Lot of answers have been posted so far and hence lot of information is already available (as a side effect lot of confusion too). I would like to site facts from Intel manual regarding hardware guaranteed atomic operations ...
In Intel's latest processors of nehalem and sandy bridge family, reading or writing to a quadword aligned to 64 bit boundary is guaranteed.
Even unaligned 2, 4 or 8 byte reads or writes are guaranteed to be atomic provided they are cached memory and fit in a cache line.
Having said that the test posted in this question passes on sandy bridge based intel i5 processor.
The "AMD Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "
CMPXCHG16B
can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that the AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use
CMPXCHG16B
.The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1 "An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed atomic by the ISA. Volume 2A of the Intel docs says of
CMPXCHG16B
: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that is the case.
In the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.2.3.1, as you note yourself, that
Now, since the above list does NOT contain the same language for double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.
That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
In the corresponding AMD document, AMD64 Technology AMD64 Architecture Programmer’s Manual Volume 2: System Programming, I can't find similar clear language.
EDIT: Test program results
(Test program modified to increase #iterations by a factor of 10)
On a Xeon X3450 (x86-64):
On a Xeon 5150 (32-bit):
On an Opteron 2435 (x86-64):
Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.
EDIT 2: How to make the test program fail
Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:
So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.
EDIT 3: ASM for the thread functions, on request of "GJ."
Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:
and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:
Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn't matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.
There is actually a warning in the Intel Architecture Manual Vol 3A. Section 8.1.1 (May 2011), under the section of guaranteed atomic operations:
thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use a single memory access (this is one reason why the memory fencing was introduced).
Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)
and that fact that none of the load or store operation for SIMD guarantee atomicity, we can come to the conclusion that Intel doesn't not support any form of atomic SIMD (yet).
As an extra bit, if the memory is split along cache lines or page boundaries (when using things like
movdqu
which permit unaligned access), the following processors will not perform atomic accesses, regardless of alignment, but later processors will (again from the Intel Architecture Manual):