I am trying to understand what exactly a memory barrier is.
Based on what I know so far, a memory barrier (for example, mfence) is used to prevent instructions from being reordered across it, in either direction (from before the barrier to after it, and from after it to before).
This is an example of a memory barrier in use:
instruction 1
instruction 2
instruction 3
mfence
instruction 4
instruction 5
instruction 6
Now my question is: is the mfence instruction just a marker telling the CPU in what order to execute the instructions, or is it an instruction that the CPU actually executes like any other instruction (for example, mov)?
Every byte sequence that the CPU encounters amongst its code is an instruction that the CPU executes. There are no other kinds of instructions.
You can see this clearly in both the Intel instruction set reference and the specific page for mfence.
MFENCE
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data.
Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction.
As you can see from the excerpt, the MFENCE instruction does quite a bit of work rather than just being a marker of some sort.
I'll explain the impact that mfence has on the flow of the pipeline. Consider the Skylake pipeline, for example, and the following sequence of instructions:
inst1
store1
inst2
load1
inst3
mfence
inst4
store2
load2
inst5
The instructions get decoded into a sequence of uops in program order, and all uops are then passed in order to the scheduler. Normally, without fences, uops are issued for execution out of order. However, when the scheduler receives the mfence uop, it needs to make sure that no memory uops downstream of the mfence get executed until all upstream memory uops have become globally visible (which means that the stores have retired and the loads have at least completed). This applies to all memory accesses irrespective of the memory type of the region being accessed. This can be achieved in one of two ways: either the scheduler does not issue any downstream store or load uops to the store or load buffers, respectively, until the buffers have drained, or it issues them but marks them so that they can be distinguished from the memory uops already in the buffers. All non-memory uops above or below the fence can still be executed out of order.
In the example, once store1 retires and load1 completes (by receiving the data and holding it in some internal register), the mfence instruction is considered to have completed execution. I think that mfence may or may not occupy any resources in the backend (ROB or RS), and it may get translated to more than one uop.
Intel has a patent, filed in 1999, that describes how mfence works. Since this is a very old patent, the implementation might have changed, or it might differ between processors. I'll summarize the patent here. mfence gets decoded into three uops. Unfortunately, it's not clear exactly what these uops are used for. Entries are then allocated in the reservation station to hold the uops, and entries are also allocated in the load and store buffers. This means that the load buffer can hold entries for either true load requests or fences (which are basically bogus load requests). Similarly, the store buffer can hold entries for true store requests and for fences. The mfence uop is not dispatched until all earlier load and store uops (in the respective buffers) have retired. When that happens, the mfence uop itself gets sent to the L1 cache controller as a memory request. The controller checks whether all previous requests have completed. If so, the uop is simply treated as a NOP and gets deallocated from the buffers. Otherwise, the cache controller rejects the mfence uop.
mfence is an instruction.
To see this for yourself on Linux:
1/ Write a file mfence.c
#include <stdio.h>
int main(){
    printf("Disass me\n");
    asm volatile ("mfence" ::: "memory");
    return 0;
}
2/ Compile
gcc mfence.c -o mfence
3/ Disassemble
objdump -d mfence | grep -A 10 "<main>:"
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 6e4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts@plt>
64a: 0f ae f0 mfence
64d: b8 00 00 00 00 mov $0x0,%eax
652: 5d pop %rbp
653: c3 retq
654: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
65b: 00 00 00
4/ Observe that at address 64a, mfence is a 3-byte instruction (0f ae f0).
So it is a CPU instruction (like mov): the processor has to decode the preceding instructions before it gets to it, otherwise it could not know where the instruction starts.
For example, the bytes 0f ae f0 could appear inside an address or an immediate operand, so the CPU cannot treat the byte pattern as a marker.
Finally, it is just an ordinary instruction: at its execution point in the pipeline, it synchronizes the memory accesses further along the pipeline before later loads and stores are performed.
Note: on Windows (MSVC), _ReadWriteBarrier only restricts compiler reordering and does not emit an mfence; the _mm_mfence intrinsic is what emits the instruction.
Your question rests on a wrong assumption. MFENCE does not prevent the reordering of instructions in general (see the quote below). For example, if there is a stream of 1000 instructions that operate only on registers and an MFENCE instruction is placed in the middle, it will have no effect on how the CPU reorders those instructions.
The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
Instead, the MFENCE instruction prevents the reordering of loads and stores to the cache and main memory.