Ok, I have been reading the following Qs from SO regarding x86 CPU fences (LFENCE, SFENCE and MFENCE):
Does it make any sense instruction LFENCE in processors x86/x86_64?
What is the impact SFENCE and LFENCE to caches of neighboring cores?
Is the MESI protocol enough, or are memory barriers still required? (Intel CPUs)
and:
and I must be honest, I am still not totally sure when a fence is required. I am trying to understand this from the perspective of removing full-blown locks and using finer-grained synchronisation via fences instead, to minimise latency.
Firstly here are two specific questions I do not understand:
Sometimes when doing a store a CPU will write to its store buffer instead of the L1 cache. However, I do not understand the conditions under which a CPU will do this.
CPU2 may wish to load a value which has been written into CPU1's store buffer. As I understand it, the problem is that CPU2 cannot see the new value in CPU1's store buffer. Why can't the MESI protocol just include flushing store buffers as part of its protocol?
More generally, could somebody please attempt to describe the overall scenario and help explain when LFENCE/MFENCE and SFENCE instructions are required?
NB One of the problems reading around this subject is the number of articles written "generally" for multiple CPU architectures, when I am only interested in the Intel x86-64 architecture specifically.
The simplest answer: you must use one of 3 fences (LFENCE, SFENCE, MFENCE) to provide one of the 6 kinds of data Consistency: Relaxed, Consume, Acquire, Release, Acquire-Release, Sequential.
C++11:
First, you should consider this problem in terms of memory-access ordering, which is well documented and standardized in C++11. You should read this first: http://en.cppreference.com/w/cpp/atomic/memory_order
x86/x86_64:
1. Acquire-Release Consistency: Then, it is important to understand that on x86, access to conventional RAM (marked by default as WB - Write Back, and the same effect with WT (Write Through) or UC (Uncacheable)) using a plain asm MOV without any additional commands automatically provides memory ordering for Acquire-Release Consistency - std::memory_order_acq_rel. I.e. for this memory it makes sense to use std::memory_order_seq_cst only to provide Sequential Consistency. I.e. when you are using std::memory_order_relaxed or acquire/release ordering (std::memory_order_acquire for a load, std::memory_order_release for a store), the compiled assembler code for std::atomic::store() (or std::atomic::load()) will be the same - only MOV without any L/S/MFENCE.
Note: But you must know that not only the CPU but also the C++ compiler can reorder memory operations, and all 6 memory barriers always affect the C++ compiler regardless of CPU architecture.
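To make the Acquire-Release case concrete, here is a minimal sketch of message passing between two threads. The names (payload, ready, producer, consumer, run_message_passing) are made up for illustration. The comments about plain MOV describe what the answer says compilers emit on x86-64, not something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

void producer() {
    payload = 42;                                  // plain store
    // Release: the compiler may not sink the payload write below this store.
    // On x86-64 this is expected to be a plain MOV, with no fence instruction.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: the compiler may not hoist the payload read above this load.
    // On x86-64 this is expected to be a plain MOV as well.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;   // happens-after the producer's write, so it reads 42
}

// Illustrative driver: runs one producer and one consumer to completion.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

If both orderings are replaced with std::memory_order_relaxed, the x86 instructions would likely stay the same, but the compiler is then free to reorder the payload access around the flag, which is the point of the note above.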
Then, you must know how it can be compiled from C++ to ASM (native machine code) or how you can write it in assembler. To provide any Consistency except Sequential you can simply write MOV, for example MOV reg, [addr] and MOV [addr], reg etc.
2. Sequential Consistency: But to provide Sequential Consistency you must use implicit (LOCK) or explicit fences (L/S/MFENCE) as described here: Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency?
   1. LOAD (without fence) and STORE + MFENCE
   2. LOAD (without fence) and LOCK XCHG
   3. MFENCE + LOAD and STORE (without fence)
   4. LOCK XADD ( 0 ) and STORE (without fence)
For example, GCC uses 1, but MSVC uses 2. (But you must know that MSVS2012 has a bug: Does the semantics of `std::memory_order_acquire` requires processor instructions on x86/x86_64? )
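As a rough illustration of why Sequential Consistency needs that extra fence or LOCK'ed instruction, here is a sketch of the classic store-buffering (Dekker-style) litmus test. The function names are hypothetical, and the comments about which instructions get emitted are assumptions based on the mappings listed above, since the actual choice depends on the compiler and version.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

// One round of the store-buffering test: each thread stores to one variable
// and then loads the other. Returns true if BOTH threads read 0.
bool both_read_zero_once() {
    x.store(0);
    y.store(0);
    int r1 = -1;
    int r2 = -1;
    std::thread a([&] {
        // seq_cst store: on x86-64 typically MOV + MFENCE, or a LOCK'ed XCHG.
        x.store(1, std::memory_order_seq_cst);
        // seq_cst load: typically a plain MOV.
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts how often the forbidden outcome shows up.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_read_zero_once()) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

With std::memory_order_seq_cst the outcome r1 == 0 && r2 == 0 is forbidden, so the count stays at zero. With release stores and acquire loads (plain MOVs on x86) that outcome becomes legal, because each store can still sit in its core's store buffer while the following load executes.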
Then, you can read Herb Sutter, your link: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
Exception to the rule:
This rule is true for access using MOV to conventional RAM marked by default as WB - Write Back. Memory is marked in the Page Table, in each PTE (Page Table Entry) for each Page (4 KB of contiguous memory).
But there are some exceptions:
1. If we mark memory in the Page Table as Write Combined (ioremap_wc() in the Linux kernel), then only Acquire Consistency is provided automatically, and we must act as in the following paragraph.
2. See the answer to my question: https://stackoverflow.com/a/27302931/1558037
In both cases 1 & 2 you must use an additional SFENCE between two writes to the same address even if you want Acquire-Release Consistency, because here only Acquire Consistency is provided automatically and you must do Release (SFENCE) yourself.
Answer to your two questions:
From the user's point of view, the L1 cache and the Store Buffer act differently. L1 is fast, but the Store Buffer is faster.
Store-Buffer - a simple queue that holds only writes, which cannot be reordered. It exists to increase performance and hide the latency of access to the cache (L1 - 1ns, L2 - 3ns, L3 - 10ns). The CPU core behaves as if the write has already been stored to the cache and executes the next command, while the write is actually only saved in the Store-Buffer and will be saved to the L1/2/3 cache later, i.e. the CPU core doesn't need to wait until writes have been stored to the cache.
Cache L1/2/3 - looks like a transparent associative array (address - value). It is fast but not the fastest, because x86 automatically provides Acquire-Release Consistency by using the cache-coherence protocols MESIF/MOESI. This is done to make multithreaded programming simpler, but it decreases performance. (Truly, we can use write-contention-free algorithms and data structures without cache coherence, i.e. without MESIF/MOESI, for example over PCI Express.) The MESIF/MOESI protocols work over QPI, which connects cores in a CPU and cores between different CPUs in multiprocessor systems (ccNUMA).
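The difference between the two can be sketched with a toy software model. This is only an illustration under simplifying assumptions: ToyCore, Memory and fence() are invented names, the shared map stands in for the coherent caches plus RAM, and a real store buffer is a hardware structure that drains on its own and follows many more rules.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Stand-in for coherent cache + RAM: every core sees the same contents.
using Memory = std::unordered_map<std::uintptr_t, int>;

struct ToyCore {
    Memory& mem;                                               // shared, coherent
    std::deque<std::pair<std::uintptr_t, int>> store_buffer;   // private FIFO of pending writes

    explicit ToyCore(Memory& m) : mem(m) {}

    // A store goes into the private buffer; the core does not wait for the cache.
    void store(std::uintptr_t addr, int value) {
        store_buffer.emplace_back(addr, value);
    }

    // A load checks this core's own buffer first (newest entry wins),
    // then falls back to shared memory. Unwritten addresses read as 0.
    int load(std::uintptr_t addr) const {
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it) {
            if (it->first == addr) {
                return it->second;
            }
        }
        auto found = mem.find(addr);
        return found == mem.end() ? 0 : found->second;
    }

    // In this model a fence drains the buffer to shared memory in FIFO order.
    void fence() {
        while (!store_buffer.empty()) {
            mem[store_buffer.front().first] = store_buffer.front().second;
            store_buffer.pop_front();
        }
    }
};
```

In this model, if one ToyCore stores a value, that core reads it back immediately from its own buffer, while a second ToyCore sharing the same Memory keeps reading the old value until the first one calls fence(). That is the situation described in the second question: the coherence protocol only ever sees what has reached the cache.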
Yes.
MESI protocol can't just include flushing store buffers as part of its protocol, because:
But manually flushing the Store Buffer on the current CPU-Core - yes, you can do it by executing the SFENCE command. You can use SFENCE in two cases:
Note:
Do we need LFENCE in any cases on x86/x86_64? - the question is not always clear: Does it make any sense instruction LFENCE in processors x86/x86_64?
Other platform:
Then, you can read how it works in theory (for a spherical processor in a vacuum) with a Store-Buffer and an Invalidate-Queue, your link: http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
And how you can provide Sequential Consistency on other platforms, not only with L/S/MFENCE and LOCK but also with LL/SC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html