The cost of atomic counters and spinlocks on x86(_64)

Posted 2019-03-09 12:46

Question:

Preface

I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I searched a bit more to find out how these work, and found std::memory_order and memory barriers (mfence, lfence and sfence).

So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.

Some reference

x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal

Question

What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations)? What is the difference (which memory barriers go where) and the cost (how many CPU cycles)?

Purpose

I was just wondering how bad my old code (which does not specify a memory order, so seq_cst is used) really is, and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places ... or use something from the boost library - I have avoided boost so far).
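
For illustration, a minimal sketch of what such an atomic_counter wrapper could look like (the class name and interface are mine, not from any library): it simply forces memory_order_relaxed on every operation, since the value is only informative. Note that, as the listings below show, a relaxed fetch_add still compiles to lock addl on x86, so on this architecture the wrapper mainly documents intent.

#include <atomic>

class atomic_counter {
    std::atomic<long> value;
public:
    atomic_counter() : value(0) {}
    long operator++() {                       // relaxed increment, returns the new value
        return value.fetch_add(1, std::memory_order_relaxed) + 1;
    }
    long get() const {                        // informative read, may be a few cycles stale
        return value.load(std::memory_order_relaxed);
    }
};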

My Knowledge

So far I understand that spinlocks protect more than themselves (they protect some shared resource/memory as well), so there must be something that makes the memory view coherent for multiple threads/cores (that would be those acquire/release operations and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is some LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the caches and buses really work and what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory caches and some synchronization). Thank you for the explanation.

P.S.: I have an old 32-bit PC now and can only see lock addl and a simple xchg, nothing else - all versions look the same (except unlock), and memory_order makes no difference on my old PC (except unlock, where release uses mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)

The Code:

#include <atomic>
using namespace std;

atomic_flag spinlock;
atomic<int> counter;

void inc1() {
    counter++;
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
    while(spinlock.test_and_set()) ;
}
void lock2() {
    while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
    spinlock.clear();
}
void unlock2() {
    spinlock.clear(memory_order_release);
}

int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}

g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)

__Z4inc1v:
__Z4inc2v:
    lock addl   $1, _counter    ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl    $1, %edx
L5:
    movl    %edx, %eax
    xchgb   _spinlock, %al      ; both seq_cst and acquire
    testb   %al, %al
    jne L5
    rep ret
__Z7unlock1v:
    movl    $0, %eax
    xchgb   _spinlock, %al      ; seq_cst
    ret
__Z7unlock2v:
    movb    $0, _spinlock       ; release
    ret

UPDATE for x86_64: (see the mfence in unlock1)

_Z4inc1v:
_Z4inc2v:
    lock addl   $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl    $1, %edx
.L5:
    movl    %edx, %eax
    xchgb   spinlock(%rip), %al     ; both seq_cst and acquire
    testb   %al, %al
    jne .L5
    ret
_Z7unlock1v:
    movb    $0, spinlock(%rip)
    mfence                          ; seq_cst
    ret
_Z7unlock2v:
    movb    $0, spinlock(%rip)      ; release
    ret

Answer 1:

x86 mostly has a strong memory model: all the usual stores/loads have release/acquire semantics implicitly. The only exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.

Thus on x86, we have

  • test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with a release/acquire fence (at least you'd need a standalone memory barrier, mfence, anyway).
  • clear is coded with lock and (for bit-wise operations) or lock xchg, though more efficient implementations would use a plain write (mov) instead of a locked instruction.
  • fetch_add is coded with lock add.

Removing the lock prefix does not guarantee atomicity for RMW operations, so such operations cannot strictly be interpreted as having memory_order_relaxed in the C++ view. However, in practice you might want to access an atomic variable via a faster non-atomic operation when it is safe (in the constructor, or under a lock).
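
A small sketch of that idea (the struct and names are made up for illustration): during construction, or while a mutex is held, the atomic is effectively single-threaded, so plain initialization or a relaxed store is enough and no lock-prefixed RMW is generated.

#include <atomic>
#include <mutex>

struct stats {
    std::atomic<long> hits;
    std::mutex m;

    stats() : hits(0) {}                           // construction is single-threaded
    void reset() {
        std::lock_guard<std::mutex> guard(m);      // the mutex already provides the ordering
        hits.store(0, std::memory_order_relaxed);  // plain mov, no locked instruction
    }
};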

In our experience, it does not really matter exactly which RMW atomic operation is performed; they all take almost the same number of cycles to execute (and an mfence costs roughly half as much as a locked operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).



Answer 2:

I recommend: x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.

Your x86 and x86_64 are indeed pretty "well behaved". In particular, they do not re-order write operations (and any speculative writes are discarded while they are in the cpu/core's write-queue), and they do not re-order read operations. However, they will start read operations as early as they can, which means that a read can be re-ordered ahead of an earlier write. (A read of something sitting in the write-queue reads the queued value, so reads/writes of the same location are not re-ordered.) So:

  • read-modify-write operations require LOCKs which makes them, implicitly, memory_order_seq_cst.

    So for these operations you gain nothing by weakening the memory ordering (on the x86/x86_64). The general advice is to "keep it simple" and stick with memory_order_seq_cst, which happily is not costing anything extra for the x86 and x86_64.

    For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation.

  • memory_order_acquire/_release do not require an mfence or any other overhead.

    So, for atomic load/store, if acquire/release is sufficient, then for the x86/x86_64 those operations are "tax free".

  • memory_order_seq_cst does require mfence...

...which is worth understanding.

(NB: we are here talking about what the processor does with the instructions generated by the compiler. The compiler's re-ordering of operations is a very similar issue, but not addressed here.)

An mfence stalls the cpu/core until all pending writes are cleared out of the write-queue. In particular, any read operations which follow the mfence will not start until the write-queue is empty. Consider two threads:

  initial state: wa = wb = 0

  thread 'A'                    thread 'B'
    wa = 1 ;  (mov [wa] ← 1)      wb = 1 ;   (mov [wb] ← 1)
    a  = wb ; (mov ebx ← [wb])    b  = wa ;  (mov ebx ← [wa])

Left to their own devices, the x86/x86_64 can produce any of (a = 1, b = 1), (a = 0, b = 1), (a = 1, b = 0) and (a = 0, b = 0). The last is invalid if you expect memory_order_seq_cst -- since you cannot get that by any interleaving of the operations. The reason this can happen is that the writes of wa and wb are queued in the respective cpu's/core's queue, and the reads of wa and wb can both be scheduled and can both complete before either write. To achieve memory_order_seq_cst you need an mfence:

  thread 'A'                    thread 'B'
    wa = 1 ;  (mov [wa] ← 1)      wb = 1 ;   (mov [wb] ← 1)
        mfence ;                      mfence
    a  = wb ; (mov ebx ← [wb])    b  = wa ;  (mov ebx ← [wa])

Since there is no synchronization between the threads, the result may be anything except (a = 0, b = 0). Interestingly, the mfence is for the benefit of the thread itself, because it prevents the read operation from starting before the write completes. The only thing that other threads care about is the order in which writes occur, and the x86/x86_64 does not re-order those in any case.
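
Here is a runnable C++11 sketch of the same two-thread experiment (this maps the pseudo-code above onto std::atomic; the iteration count and thread-per-iteration structure are just for illustration). With release/acquire (or relaxed) operations the outcome a == 0 && b == 0 is permitted, because each store can still sit in its core's write-queue while the other core's read completes; with the default seq_cst operations it is not.

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> wa(0), wb(0);
int a, b;

int main() {
    for (int i = 0; i < 1000000; ++i) {
        wa.store(0); wb.store(0);
        std::thread A([] { wa.store(1, std::memory_order_release);   // plain mov
                           a = wb.load(std::memory_order_acquire); });
        std::thread B([] { wb.store(1, std::memory_order_release);   // plain mov
                           b = wa.load(std::memory_order_acquire); });
        A.join(); B.join();
        if (a == 0 && b == 0)
            std::printf("store-load reordering observed at iteration %d\n", i);
    }
}

Whether you actually catch the forbidden-looking result depends on timing (starting two fresh threads per iteration makes the window small); the point is that the memory model allows it.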

So, to implement memory_order_seq_cst atomic_load() and atomic_store(), it is necessary to insert an mfence after one or more stores and before a load. Where these operations are implemented as library functions, the common convention is to add the mfence to all stores, leaving the load "naked". (The logic being that loads are more common than stores, and it seems better to add the overhead to the store.)


For spin-locks, at least, your question seems to boil down to whether a spin-unlock operation requires an mfence, or not, and what difference it makes.

The C11 atomic_flag_clear() is, implicitly, memory_order_seq_cst, for which an mfence is required. The C11 atomic_flag_test_and_set() is not only a read-modify-write operation but is also implicitly memory_order_seq_cst -- and LOCK does that.

C11 does not offer a spin-lock in the threads.h library, but you can use an atomic_flag -- though for your x86/x86_64 you also have the PAUSE instruction to deal with. The question is whether you need memory_order_seq_cst for this, in particular for the unlock. I think the answer is no, and that the trick is to do: atomic_flag_test_and_set_explicit(xxx, memory_order_acquire) and atomic_flag_clear_explicit(xxx, memory_order_release).
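
As a concrete sketch of that trick, here is a C++11 spin-lock built on std::atomic_flag with exactly those orderings, plus the PAUSE hint mentioned above (_mm_pause() from <immintrin.h> is the usual intrinsic for it on x86; substitute your platform's equivalent if needed):

#include <atomic>
#include <immintrin.h>

class spin_lock {
    std::atomic_flag flag;
public:
    spin_lock() { flag.clear(); }                              // start unlocked
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))   // xchg on x86
            _mm_pause();                                        // be polite while spinning
    }
    void unlock() {
        flag.clear(std::memory_order_release);                  // plain mov, no mfence
    }
};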

FWIW, the glibc pthread_spin_unlock() does not have an mfence. Nor does the gcc __sync_lock_release() (which is explicitly a "release" operation). But the gcc __atomic_clear() is aligned with the C11 atomic_flag_clear(), and takes a memory order parameter.

What difference does the mfence make to the unlock? Clearly it is very disruptive to the pipeline, and since it is not necessary, there is not much to be gained from working out the exact scale of its impact, which will depend on the circumstances.



Answer 3:

A spinlock does not use mfence; mfence only enforces serializing/flushing the current core's pending writes. The fence itself does not in any way relate to atomic operations.

For a spinlock you need some kind of atomic operation to exchange data with a memory location. There are many different implementations, targeted at different requirements: for example, does it work in kernel space or user space? Is it a fair lock?

A very simple and dumb spinlock for x86 looks like this (my kernel uses this):

#include <stdint.h>

typedef volatile uint32_t _SPINLOCK __attribute__ ((aligned(16)));

static inline void _SPIN_LOCK(_SPINLOCK* lock) {
    __asm__ __volatile__ (
        "cli\n"                  /* kernel-only: mask interrupts while taking the lock  */
        "lock btsl $0, %0\n"     /* atomically test-and-set bit 0; CF = old bit value   */
        "jnc 1f\n"               /* CF == 0: the lock was free, we now own it           */
        "0:\n"
        "pause\n"                /* spin-wait hint, avoids burning the core             */
        "testb $1, %0\n"         /* plain read: is the lock still held?                 */
        "jnz 0b\n"               /* still held, keep spinning without a locked op       */
        "lock btsl $0, %0\n"     /* looks free: try to grab it atomically again         */
        "jc 0b\n"                /* someone else got it first, back to spinning         */
        "1:\n"
        : "+m"(*lock)
        :
        : "cc", "memory");
}

The logic is simple:

  1. Test and set a bit atomically; if the old value was zero, the lock was not taken and we got it.
  2. If the bit was already set, the lock is held by someone else; pause is a hint recommended by the CPU manufacturers so that the core doesn't burn power in a tight loop.
  3. Loop until you get the lock.

Note 1. You may also implement the spinlock with intrinsics and compiler extensions; the result should be fairly similar.

Note 2. A spinlock is not judged by cycle counts; a sane implementation will be quite fast. For instance, with the implementation above you should grab the lock on the first try in a well-designed usage; if not, fix the algorithm or split the lock to prevent/reduce lock contention.

Note 3. You should also consider other things like fairness.



Answer 4:

Re:

and the cost (how many CPU cycles)?

On x86 at least, instructions that perform memory synchronization (atomic ops, fences) have highly variable latency in CPU cycles. They wait for the processor's store buffers to be flushed to memory, and this varies dramatically depending on the store buffer content.

E.g., if an atomic op comes straight after a memcpy() that pushes multiple cache lines out to main memory, the delay may be in the hundreds of nanoseconds. The same atomic op, but after a series of register-only arithmetic instructions, may take only a few clock cycles.
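
If you want to see this effect yourself, a rough (and deliberately naive) sketch follows: it times the same lock-prefixed fetch_add once right after a large memcpy() and once after pure ALU work, using the rdtsc intrinsic from <x86intrin.h> (GCC/clang on x86). The buffer sizes and loop counts are arbitrary, and rdtsc-based timing has its own caveats, so treat the numbers as an illustration only.

#include <atomic>
#include <cstring>
#include <cstdio>
#include <x86intrin.h>

std::atomic<long> counter(0);
char src[1 << 20], dst[1 << 20];

static unsigned long long timed_fetch_add() {
    unsigned long long t0 = __rdtsc();
    counter.fetch_add(1);                      // lock add; waits for the store buffer
    return __rdtsc() - t0;
}

int main() {
    std::memcpy(dst, src, sizeof dst);         // leaves lots of pending stores behind
    unsigned long long after_memcpy = timed_fetch_add();

    long x = 1;
    for (int i = 1; i < 1000; ++i)             // register-only arithmetic
        x += i * 3;
    unsigned long long after_alu = timed_fetch_add();

    std::printf("cycles after memcpy: %llu, after ALU work: %llu (x=%ld)\n",
                after_memcpy, after_alu, x);
}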