CPU Relax instruction and C++11 primitives

Posted 2020-06-01 05:45

I've noticed that many lockless algorithms implemented using OS-specific primitives, such as the spin locks described here (which use Linux-specific atomic primitives), often make use of a "cpu relax" instruction. With GCC, this can be achieved with:

asm volatile("pause\n": : :"memory");

Specifically, this instruction is often used in the body of spin-lock while loops, while waiting for a variable to be set to a certain value.
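
For concreteness, here is a minimal sketch of the kind of loop I mean. The names cpu_relax(), spin_until_set() and demo_wait() are made up for illustration and are not part of any standard library; the x86 branch simply reuses the GCC inline asm above, and on other targets the helper is assumed to do nothing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: emits PAUSE on x86 with GCC/Clang, otherwise a no-op.
inline void cpu_relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm volatile("pause\n" : : : "memory");
#endif
}

// Spin until another thread sets `flag` to true.
inline void spin_until_set(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire)) {
        cpu_relax();  // hint to the CPU that this is a spin-wait loop
    }
}

// Demo: a producer thread publishes a value and then sets the flag.
inline int demo_wait() {
    std::atomic<bool> ready(false);
    int payload = 0;

    std::thread producer([&] {
        payload = 42;                                   // plain write...
        ready.store(true, std::memory_order_release);  // ...published here
    });

    spin_until_set(ready);
    assert(ready.load());
    int seen = payload;  // safe to read: acquire load saw the release store
    producer.join();
    return seen;
}
```

Calling demo_wait() spins in the calling thread until the producer has stored the flag, then returns the published value.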

C++11 doesn't seem to provide any kind of portable "cpu_relax" type instruction. Is there some reason for this? And does the "pause" statement actually accomplish anything useful?

Edit:

Also, I'd ask: why did the C++11 standards committee not decide to include a generic std::cpu_relax() or whatever? Is it too difficult to guarantee portability?

1 answer
贼婆χ
#2 · 2020-06-01 06:11

The PAUSE instruction is x86 specific. Its sole use is in spin-lock wait loops, where it:

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop.

Also:

Inserting a pause instruction in a spinwait loop greatly reduces the processor’s power consumption.

Where you put this instruction in a spin-lock loop is also x86_64 specific. I cannot speak for the C++11 standards folk, but I think it is reasonable for them to conclude that the right place for this magic is in the relevant library... along with all the other magic required to implement atomics, mutexes etc.

NB: the PAUSE does not release the processor to allow another thread to run. It is not a "low-level" pthread_yield(). (Although on Intel Hyperthreaded cores, it does prevent the spin-lock thread from hogging the core.) The essential function of the PAUSE appears to be to turn off the usual instruction execution optimisations and pipelining, which slows the thread down (a bit). Once the waiter has discovered that the lock is busy, this reduces the rate at which it touches the lock variable, so that the cache system is not being pounded by the waiter while the current owner of the lock is trying to get on with real work.
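
To make the distinction concrete, here is a rough sketch of a waiter that first spins with PAUSE and only afterwards gives up the processor with std::this_thread::yield(). The helper names and the spin_limit parameter are invented for illustration, the limit used in the demo is arbitrary, and cpu_relax() is assumed to be a no-op on non-x86 targets.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: emits PAUSE on x86 with GCC/Clang, otherwise a no-op.
inline void cpu_relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm volatile("pause\n" : : : "memory");
#endif
}

// Waits for `flag`. The first `spin_limit` iterations only PAUSE (the thread
// keeps its core); after that each iteration asks the scheduler to run
// something else. Returns how many times the thread yielded.
inline unsigned wait_with_fallback(const std::atomic<bool>& flag,
                                   unsigned spin_limit) {
    unsigned spins = 0;
    unsigned yields = 0;
    while (!flag.load(std::memory_order_acquire)) {
        if (spins < spin_limit) {
            cpu_relax();                 // stays on the CPU
            ++spins;
        } else {
            std::this_thread::yield();   // actually offers the CPU to others
            ++yields;
        }
    }
    return yields;
}

// Demo: another thread sets the flag; returns true once the wait completes.
inline bool demo_fallback() {
    std::atomic<bool> done(false);
    std::thread setter([&] { done.store(true, std::memory_order_release); });
    wait_with_fallback(done, 1000);      // 1000 is an arbitrary choice
    setter.join();
    return done.load();
}
```

The point of the sketch is only that the two calls do different jobs: PAUSE throttles the spinning thread in place, while yield() involves the scheduler.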

Note that the primitives being used to "hand roll" spin-locks, mutexes etc. are not OS specific, but processor-specific.

I'm not sure I would describe a "hand rolled" spin-lock as "lockless"!

FWIW, the Intel recommendation for a spin-lock ("Intel® 64 and IA-32 Architectures Optimization Reference Manual") is:

  Spin_Lock:
    CMP   lockvar, 0     // Check if lock is free.
    JE    Get_Lock
    PAUSE                // Short delay.
    JMP   Spin_Lock
  Get_Lock:
    MOV   EAX, 1
    XCHG  EAX, lockvar  // Try to get lock.
    CMP   EAX, 0        // Test if successful.
    JNE   Spin_Lock

Clearly one can write something which compiles to this, using a std::atomic_flag... or use pthread_spin_lock(), which on my machine is:

  pthread_spin_lock:
    lock decl (%rdi)
    jne    wait
    xor    %eax, %eax
    ret
  wait:
    pause
    cmpl   $0, (%rdi)
    jg     pthread_spin_lock
    jmp    wait

which is hard to fault, really.
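
For what it's worth, here is one way the Intel sequence above might be sketched in C++11. This is an illustration, not a claim about what any particular compiler emits. It uses std::atomic<bool> rather than std::atomic_flag, because in C++11 atomic_flag offers only test_and_set() and clear(), so the read-only "check if lock is free" step cannot be expressed with it. SpinLock, cpu_relax() and demo_counter() are made-up names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: emits PAUSE on x86 with GCC/Clang, otherwise a no-op.
inline void cpu_relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm volatile("pause\n" : : : "memory");
#endif
}

class SpinLock {
public:
    SpinLock() : locked_(false) {}

    void lock() {
        for (;;) {
            // Spin_Lock: read-only check, with PAUSE while the lock is busy.
            while (locked_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
            // Get_Lock: atomic exchange; a previous value of false means
            // this thread now owns the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_;
};

// Demo: several threads increment a plain counter under the lock.
inline long demo_counter(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<SpinLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Because SpinLock provides lock() and unlock(), it works with std::lock_guard. If the lock is correct, demo_counter(4, 10000) loses no increments.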
