Is x86 CMPXCHG atomic, if so why does it need LOCK

2020-01-24 07:08发布

The Intel documentation says

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

My question is

  1. Can CMPXCHG operate with memory address? From the document it seems not but can anyone confirm that only works with actual VALUE in registers, not memory address?

  2. If CMPXCHG isn't atomic and a high level language level CAS has to be implemented through LOCK CMPXCHG (with LOCK prefix), what's the purpose of introducing such an instruction at all?

3条回答
够拽才男人
2楼-- · 2020-01-24 07:51

You are mixing up high-level locks with the low-level CPU feature that happened to be named LOCK.

The high-level locks that lock-free algorithms try to avoid can guard arbitrary code fragments whose execution may take arbitrary time and thus, these locks will have to put threads into wait state until the lock is available which is a costly operation, e.g. implies maintaining a queue of waiting threads.

This is an entirely different thing than the CPU LOCK prefix feature which guards a single instruction only and thus might hold other threads for the duration of that single instruction only. Since this is implemented by the CPU itself, it doesn’t require additional software efforts.

Therefore the challenge of developing lock-free algorithms is not the removal of synchronization entirely, it boils down to reduce the critical section of the code to a single atomic operation which will be provided by the CPU itself.

查看更多
Evening l夕情丶
3楼-- · 2020-01-24 07:53

The LOCK prefix is to lock the memory access for the current command, so that other commands that are in the CPU pipeline can access the memory at this time. Using the LOCK prefix, the execution of the command won't be interrupted by another command in the CPU pipeline due to memory access of other commands that are executed at the same time. The INTEL manual says:

The LOCK prefix can be prepended only to the following in structions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated.

查看更多
走好不送
4楼-- · 2020-01-24 08:08

It seems like part what you're really asking is:

Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg?

The simple answer (that others have given) is simply that Intel designed it this way. But this leads to the question:

Why did Intel do that? Is there a use-case for cmpxchg without lock?

On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs).

Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.


For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it will patch the lock prefix to a nop everywhere that code was inlined, since nop cmpxchg runs much faster than lock cmpxchg. For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.

Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on Can num++ be atomic for int num. See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.


(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which usually only used for synchonization / atomic ops, not to make single-threaded code run faster. Obviously memory-destination add [mem], reg is useful outside of the lock add [mem], reg case.)

查看更多
登录 后发表回答