Question:

I am trying to understand what exactly a memory barrier is. Based on what I know so far, a memory barrier (for example, mfence) is used to prevent instructions from being reordered across it, in either direction.

This is an example of a memory barrier in use:

instruction 1
instruction 2
instruction 3
mfence
instruction 4
instruction 5
instruction 6

Now my question is: is the mfence instruction just a marker telling the CPU in what order to execute the instructions, or is it an instruction that the CPU actually executes like any other instruction (for example, mov)?

Answer 1:

Every byte sequence that the CPU encounters amongst its code is an instruction that the CPU executes. There are no other kinds of instructions.

You can see this clearly in both the Intel instruction set reference and the specific page for mfence.

MFENCE
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.

The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream. Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data.

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction.

As you can see from the excerpt, the MFENCE instruction does quite a bit of work rather than just being a marker of some sort.

Answer 2:

I'll explain the impact that mfence has on the flow of the pipeline. Take the Skylake pipeline as an example, and consider the following sequence of instructions:

inst1
store1
inst2
load1
inst3
mfence
inst4
store2
load2
inst5

The instructions get decoded into a sequence of uops in program order, and all uops are passed in order to the scheduler. Normally, without fences, uops are issued for execution out of order. However, when the scheduler receives the mfence uop, it needs to make sure that no memory uops downstream of the mfence get executed until all upstream memory uops become globally visible (which means that the stores have retired and the loads have at least completed). This applies to all memory accesses irrespective of the memory type of the region being accessed. It can be achieved in one of two ways: either the scheduler does not issue any downstream store or load uops to the store or load buffers, respectively, until the buffers have drained, or it issues the downstream store or load uops but marks them so that they can be distinguished from the memory uops already in the buffers. All non-memory uops above or below the fence can still be executed out of order. In the example, once store1 retires and load1 completes (by receiving the data and holding it in some internal register), the mfence instruction is considered to have completed execution. I think that mfence may or may not occupy resources in the backend (ROB or RS), and it may get translated to more than one uop.
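
The issue rule described above can be sketched as a toy model in C. This is only an illustration of the first option (holding back downstream memory uops), not a description of real hardware: the types uop_kind and uop, the function may_issue, and the example array are all hypothetical names made up for this sketch, and the dispatch of the mfence uop itself is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical uop kinds for this toy model. */
typedef enum { UOP_ALU, UOP_LOAD, UOP_STORE, UOP_MFENCE } uop_kind;

typedef struct {
    uop_kind kind;
    bool globally_visible; /* memory uops only: store retired / load completed */
} uop;

static bool is_memory(uop_kind k) { return k == UOP_LOAD || k == UOP_STORE; }

/* May uops[idx] be issued? A memory uop is held back if some older mfence
   still has an older memory uop that is not yet globally visible. */
bool may_issue(const uop *uops, size_t n, size_t idx) {
    assert(uops != NULL && idx < n);
    if (!is_memory(uops[idx].kind))
        return true; /* non-memory uops are never held back in this model */
    bool pending_older_memory = false;
    for (size_t i = 0; i < idx; i++) {
        if (is_memory(uops[i].kind) && !uops[i].globally_visible)
            pending_older_memory = true;
        if (uops[i].kind == UOP_MFENCE && pending_older_memory)
            return false; /* fence still waiting on an upstream access */
    }
    return true;
}

/* The instruction sequence from the answer, in program order. */
uop example[10] = {
    { UOP_ALU,    false }, /* inst1  */
    { UOP_STORE,  false }, /* store1 */
    { UOP_ALU,    false }, /* inst2  */
    { UOP_LOAD,   false }, /* load1  */
    { UOP_ALU,    false }, /* inst3  */
    { UOP_MFENCE, false }, /* mfence */
    { UOP_ALU,    false }, /* inst4  */
    { UOP_STORE,  false }, /* store2 */
    { UOP_LOAD,   false }, /* load2  */
    { UOP_ALU,    false }, /* inst5  */
};
```

In this model, store2 and load2 stay blocked until both store1 and load1 are marked globally visible, while inst4 and inst5 can issue at any time.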

Intel has a patent, filed in 1999, that describes how mfence works. Since this is a very old patent, the implementation might have changed, or it might differ between processors. I'll summarize the patent here. mfence gets decoded into three uops. Unfortunately, it's not clear exactly what these uops are used for. Entries are then allocated from the reservation station to hold the uops, and entries are also allocated from the load and store buffers. This means that a load buffer can hold entries for either true load requests or for fences (which are basically bogus load requests). Similarly, the store buffer can hold entries for true store requests and for fences. The mfence uop is not dispatched until all earlier load or store uops (in the respective buffers) have been retired. When that happens, the mfence uop itself gets sent to the L1 cache controller as a memory request. The controller checks whether all previous requests have completed. If so, the uop is simply treated as a NOP and gets deallocated from the buffers. Otherwise, the cache controller rejects the mfence uop.

Answer 3:

mfence is an instruction.

To see this on Linux:

1/ Write a file mfence.c

#include <stdio.h>

int main(){
    printf("Disass me\n");
    asm volatile ("mfence" ::: "memory");
    return 0;
}

2/ Compile

gcc mfence.c -o mfence

3/ Disassemble

objdump -d mfence | grep -A 10 "<main>:"

000000000000063a <main>:
 63a:   55                      push   %rbp
 63b:   48 89 e5                mov    %rsp,%rbp
 63e:   48 8d 3d 9f 00 00 00    lea    0x9f(%rip),%rdi        # 6e4 <_IO_stdin_used+0x4>
 645:   e8 c6 fe ff ff          callq  510 <puts@plt>
 64a:   0f ae f0                mfence 
 64d:   b8 00 00 00 00          mov    $0x0,%eax
 652:   5d                      pop    %rbp
 653:   c3                      retq   
 654:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 65b:   00 00 00

4/ Observe that at address 64a, mfence is a 3-byte instruction (0f ae f0)

So it is a CPU instruction (like mov): the processor needs to decode the preceding instructions before getting to it, otherwise it could not determine its alignment (where the instruction starts).

For example, the bytes 0f ae f0 could appear inside an address, so the CPU cannot treat them as a marker.

Finally, it is just an old-school instruction: at its execution point in the pipeline, it synchronizes the memory accesses further along in the pipeline before the next instruction executes.

Note: with MSVC on Windows, use the _mm_mfence intrinsic to emit an mfence. _ReadWriteBarrier is only a compiler barrier and does not emit an mfence instruction.
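
The point about 0f ae f0 showing up inside other instructions can be made concrete with a small sketch. The function find_mfence_bytes and the two byte arrays are hypothetical names made up for this illustration. The first array encodes mov $0x00f0ae0f,%eax followed by ret, so it contains the byte pattern but no mfence instruction; the second one is a real mfence followed by ret.

```c
#include <assert.h>
#include <stddef.h>

/* Naive scan: offset of the first 0f ae f0 byte pattern in buf, or -1.
   This looks at raw bytes only and knows nothing about instruction
   boundaries, which is exactly why it can be fooled. */
long find_mfence_bytes(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i + 3 <= len; i++) {
        if (buf[i] == 0x0f && buf[i + 1] == 0xae && buf[i + 2] == 0xf0)
            return (long)i;
    }
    return -1;
}

/* b8 0f ae f0 00 = mov $0x00f0ae0f,%eax ; c3 = ret (no mfence here) */
const unsigned char mov_with_lookalike_imm[6] = {
    0xb8, 0x0f, 0xae, 0xf0, 0x00, 0xc3
};

/* 0f ae f0 = mfence ; c3 = ret */
const unsigned char real_mfence[4] = { 0x0f, 0xae, 0xf0, 0xc3 };
```

The scan reports a match at offset 1 in the first array even though those bytes are just the immediate operand of a mov. Only by decoding from a known instruction start can the bytes be told apart.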

Answer 4:

Your question rests on a wrong assumption. MFENCE does not prevent the reordering of instructions (see the quote below). For example, if there is a stream of 1000 instructions that operate only on registers and an MFENCE instruction is placed in the middle, it has no effect on how the CPU reorders those instructions.

The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

Instead, the MFENCE instruction prevents the reordering of loads and stores to the cache and main memory.
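
As a rough illustration of ordering loads and stores rather than instructions in general, here is a minimal sketch of a store, then full fence, then load pattern between two threads. It uses the portable C11 atomic_thread_fence instead of inline mfence; which instruction the compiler emits for it on x86 (mfence or a locked instruction) is up to the compiler. The names want, try_enter, and leave are hypothetical and only serve this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* One intent flag per thread; this sketch assumes exactly two threads,
   numbered 0 and 1. Static storage, so both flags start at 0. */
static atomic_int want[2];

/* Announce intent, then check the other thread's flag.
   Returns 1 if the caller may proceed, 0 if it must back off. */
int try_enter(int me) {
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 1, memory_order_relaxed);
    /* Full fence: the store above must be ordered before the load below.
       Without it, the load could be satisfied while the store is still
       sitting in the store buffer, and both threads could proceed. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&want[1 - me], memory_order_acquire)) {
        atomic_store_explicit(&want[me], 0, memory_order_relaxed);
        return 0; /* the other thread announced first (or concurrently) */
    }
    return 1;
}

/* Withdraw intent after the protected work is done. */
void leave(int me) {
    assert(me == 0 || me == 1);
    atomic_store_explicit(&want[me], 0, memory_order_release);
}
```

With the fence in place, two threads calling try_enter concurrently cannot both get 1, although both may get 0 and have to retry. Any register-only arithmetic around these calls is unaffected by the fence.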