Do locked instructions provide a barrier between w

Posted 2019-02-22 19:50

Question:

On x86, lock-prefixed instructions such as lock cmpxchg provide barrier semantics in addition to their atomic operation: for normal memory access on write-back memory regions, reads and writes are not re-ordered across lock-prefixed instructions, per section 8.2.2 of Volume 3 of the Intel SDM:

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

This section applies only to write-back memory types. In the same list, you find an exception where it notes that weakly ordered stores are not ordered:

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:
      ◦ streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
      ◦ string operations (see Section 8.2.4.1).

Note that there is no exception made for non-temporal instructions in any other items in the list, e.g., in the item referring to lock-prefixed instructions.

Various other sections of the guide mention that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instructions as an alternative.

All that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) instructions on WB memory? The same question applies again but to any type of access on WC memory.
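To make the scenario concrete, here is a minimal C sketch of the two variants being compared. It assumes an x86 target with SSE2 and a compiler that provides `<emmintrin.h>`; the names `payload`, `ready`, `dummy`, and the three functions are hypothetical and not from the manual. Variant A is the pattern the manual documents (non-temporal store, then sfence). Variant B replaces the fence with a lock-prefixed read-modify-write, and whether that is sufficient is exactly what the question asks.

```c
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence (x86 SSE2) */
#include <stdatomic.h>

static int payload;        /* written with a non-temporal store */
static atomic_int ready;   /* flag a consumer would poll */
static atomic_int dummy;   /* hypothetical target for the locked RMW */

/* Variant A: documented pattern, NT store, then SFENCE, then the flag. */
void publish_with_sfence(int value) {
    _mm_stream_si32(&payload, value);  /* compiles to MOVNTI */
    _mm_sfence();                      /* orders the NT store before later stores */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Variant B: a locked RMW stands in for the fence. On x86 compilers
 * typically emit a lock-prefixed add or xadd here. Whether this orders
 * the NT store is the open question, so treat it as an assumption. */
void publish_with_locked_rmw(int value) {
    _mm_stream_si32(&payload, value);
    atomic_fetch_add_explicit(&dummy, 1, memory_order_seq_cst);
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer side: spin until the flag is set, then read the payload. */
int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin */
    }
    return payload;
}
```

Run on a single thread, both variants trivially return the published value, because a core always sees its own stores in program order. The difference can only show up when `consume` runs on another core.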

Answer 1:

Lock-prefixed instructions are strictly more powerful than Intel's mfence. AMD64's mfence, however, is a fully serializing instruction, so it is strictly stronger than lock-prefixed instructions. There are also 32-bit x86 AMD processors that support mfence, and I would expect it to behave the same way there as on AMD64. The rest of this answer applies only to Intel's mfence.

An mfence that immediately precedes or follows a lock-prefixed instruction is redundant. From Section 8.2.5:

The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

(Locking instructions here refer to those with an implicit lock. Elsewhere in the manual, the term also covers instructions explicitly prefixed with lock.)

"Stronger" here means stronger than the default ordering discussed in Section 8.2.2 (quoted in the question). Also from 8.2.5:

Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

Section 8.3, which discusses serializing instructions, does not mention the lock prefix at all. But it says this:

The following instructions are memory-ordering instructions, not serializing instructions. These drain the data memory subsystem. They do not serialize the instruction execution stream:

• Non-privileged memory-ordering instructions — SFENCE, LFENCE, and MFENCE.

It's critical to note that the lock prefix does not make an instruction serializing like the ones listed in Section 8.3. The main difference is that the lock prefix allows the instructions that follow it to be fetched. In addition, a lock-prefixed instruction is not ordered with respect to software prefetch instructions. From the Intel manual, Volume 2:

A PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also unordered with respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHh instructions, or any other general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR.

The same applies to all software prefetch instructions, not just PREFETCHh.
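As a small illustration of why a prefetch is only a hint, here is a sketch in C, assuming an x86 compiler that provides `_mm_prefetch` in `<xmmintrin.h>`. The function name and the `PREFETCH_AHEAD` distance are hypothetical. The prefetch may or may not bring the line in early, but it never changes the computed result, which is why ordering it against fences or locked instructions has no architectural meaning.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_AHEAD 16   /* hypothetical distance, in elements */

/* Sums an array while hinting that later elements will be needed soon.
 * Removing the _mm_prefetch call changes timing at most, never the sum. */
long sum_with_prefetch(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Hint only: the CPU is free to ignore or reorder it. */
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        }
        total += data[i];
    }
    return total;
}
```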

What makes the lock prefix more powerful than mfence is that it not only provides ordering with respect to all other instructions (except software prefetches), but also locks access to the shared memory so that no other logical processor can access it until the locked instruction retires, thereby providing atomicity.
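The atomicity half of that claim is what lets a locked instruction serve as a lock primitive. A minimal sketch in C11, assuming a compiler that lowers `atomic_exchange` to xchg with a memory operand on x86 (xchg is implicitly locked); the `demo_lock` type and function names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_lock {
    atomic_int locked;   /* 0 = free, 1 = held */
};

/* Tries to take the lock once. The exchange reads the old value and
 * writes 1 as a single atomic step, so only one caller can observe 0. */
bool demo_trylock(struct demo_lock *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Releases the lock. On x86 this is typically a plain store. */
void demo_unlock(struct demo_lock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A second `demo_trylock` call fails until `demo_unlock` runs, because the exchange already replaced the 0 with 1 in one indivisible step.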

Now, I've seen the quote in Necrolis's answer, which says that the lock prefix may not serialize load operations that reference weakly ordered memory types. But I think this statement is very old and was written for very old processors, at a time when Intel did not want to expose the full guarantees of the lock prefix. Also, the quote only says "may not", which is not really contradictory.

This can also be confirmed from Section 7.4.2 of the AMD manual, Volume 2:

All previous loads and stores complete to memory or I/O space before a memory access for an I/O, locked or serializing instruction is issued.

All loads and stores associated with the I/O and locked instructions complete to memory (no buffered stores) before a load or store from a subsequent instruction is issued.


@PeterCordes's experiments show that, on Skylake, locking instructions don't seem to block ALU instructions from being executed out-of-order while mfence does serialize ALU instructions (potentially behaving identically to lfence + a store-buffer flush like a locked instruction).



Answer 2:

Bus locks (via the LOCK opcode prefix) produce a full fence*. On WC memory, however, they don't provide the load fence. This is documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.1.2:

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

*See the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.2.3.9 for an example.
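A rough litmus test in the spirit of that section can be sketched in C with pthreads. It is modeled on the manual's example as I recall it: each processor does a locked exchange on one variable, then loads the other, and the outcome where both loads return 0 is not allowed. The function names are hypothetical, and sequentially consistent C11 atomics are used so the guarantee also holds at the language level. On x86, `atomic_exchange` typically compiles to xchg and the loads to plain mov. Note that this covers ordinary WB memory only; it says nothing about WC memory or non-temporal accesses.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r2, r4;   /* results, read only after both threads are joined */

static void *cpu0(void *arg) {
    (void)arg;
    atomic_exchange(&x, 1);   /* locked instruction on x86 (xchg) */
    r2 = atomic_load(&y);     /* must not be reordered before the exchange */
    return NULL;
}

static void *cpu1(void *arg) {
    (void)arg;
    atomic_exchange(&y, 1);
    r4 = atomic_load(&x);
    return NULL;
}

/* Runs the litmus test `iterations` times and returns how often the
 * disallowed outcome r2 == 0 && r4 == 0 was seen, or -1 on thread error. */
int count_forbidden(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        if (pthread_create(&a, NULL, cpu0, NULL) != 0) return -1;
        if (pthread_create(&b, NULL, cpu1, NULL) != 0) {
            pthread_join(a, NULL);
            return -1;
        }
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r2 == 0 && r4 == 0) forbidden++;
    }
    return forbidden;
}
```

Replacing the exchanges with plain relaxed stores would turn this into the classic store-buffering test, where the both-zero outcome is permitted on x86.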