I am curring looking at some assembly generated for atomic operations by gcc. I tried the following short sequence:
int x1;
int x2;
int foo;
void test()
{
__atomic_store_n( &x1, 1, __ATOMIC_SEQ_CST );
if( __atomic_load_n( &x2 ,__ATOMIC_SEQ_CST ))
return;
foo = 4;
}
Looking at Herb Sutter's atomic weapons talk on code generation, he mentions that the X86 manual mandates to use xchg
for atomic stores and a simple mov
for atomic reads. So I was expecting something along the lines of:
test():
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1, %eax
xchg %eax, x1(%rip)
movl x2(%rip), %eax
testl %eax, %eax
setne %al
testb %al, %al
je .L2
jmp .L1
.L2:
movl $4, foo(%rip)
.L1:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
Where the memory fence is implicit because of the locked xchg
instruction.
However if I compile this using gcc -march=core2 -S test.cc
I get the following:
test():
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1, %eax
movl %eax, x1(%rip)
mfence
movl x2(%rip), %eax
testl %eax, %eax
setne %al
testb %al, %al
je .L2
jmp .L1
.L2:
movl $4, foo(%rip)
.L1:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
So instead of using a xchg
operation gcc here uses a mov + mfence
combination. What is the reason here for this code generation, which differs from the one mandated by the x86 architecture according to Herb Sutter?
The xchg
instruction has implied lock semantics when the destination is a memory location. What this means is you can swap the contents of a register with the contents of a memory location atomically.
The example in the question is doing an atomic store, not a swap. The x86 architecture memory model guarantees that in a multi-processor/multi-core system stores done by one thread will be seen in that order by other threads... therefore a memory move is sufficient. Having said that, there are older Intel CPUs and some clones where there are bugs in this area, and an xchg
is required as a workaround on those CPUs. See the Significant optimizations section of this wikipedia article on spinlocks:
http://en.wikipedia.org/wiki/Spinlock#Example_implementation
Which states
The simple implementation above works on all CPUs using the x86 architecture. However, a number of performance optimizations are possible:
On later implementations of the x86 architecture, spin_unlock can safely use an unlocked MOV instead of the slower locked XCHG. This is due to subtle memory ordering rules which support this, even though MOV is not a full memory barrier. However, some processors (some Cyrix processors, some revisions of the Intel Pentium Pro (due to bugs), and earlier Pentium and i486 SMP systems) will do the wrong thing and data protected by the lock could be corrupted. On most non-x86 architectures, explicit memory barrier or atomic instructions (as in the example) must be used. On some systems, such as IA-64, there are special "unlock" instructions which provide the needed memory ordering.
The memory barrier, mfence
, ensures that all stores have completed (store buffers in the CPU core are empty and values stored in the cache or memory), it also ensures that no future loads execute out of order.
The fact a MOV is sufficient to unlock the mutex (no serialization or memory barrier required) was "officially" clarified in a reply to Linus Torvalds by an Intel architect back in 1999
http://lkml.org/lkml/1999/11/24/90.
I guess it was later discovered that didn't work for some older x86 processors.