The C++11 standard defines a memory model (1.7, 1.10) which contains memory orderings, which are, roughly, "sequentially-consistent", "acquire", "consume", "release", and "relaxed". Equally roughly, a program is correct only if it is race-free, which happens if all actions can be put in some order in which one action happens-before another one. The way that an action X happens-before an action Y is that either X is sequenced before Y (within one thread), or X inter-thread-happens-before Y. The latter condition is given, among others, when
- X synchronizes with Y, or
- X is dependency-ordered before Y.
Synchronizing-with happens when X is an atomic store with "release" ordering on some atomic variable, and Y is an atomic load with "acquire" ordering on the same variable. Being dependency-ordered-before happens for the analogous situation where Y is load with "consume" ordering (and a suitable memory access). The notion of synchronizes-with extends the happens-before relationship transitively across actions being sequenced-before one another within a thread, but being dependency-ordered-before is extended transitively only through a strict subset of sequenced-before called carries-dependency, which follows a largish set of rules, and notably can be interrupted with std::kill_dependency
.
Now then, what is the purpose of the notion of "dependency ordering"? What advantage does it provide over the simpler sequenced-before / synchronizes-with ordering? Since the rules for it are stricter, I assume that can be implemented more efficiently.
Can you give an example of a program where switching from release/acquire to release/consume is both correct and provides a non-trivial advantage? And when would std::kill_dependency
provide an improvement? High-level arguments would be nice, but bonus points for hardware-specific differences.
I'd like to record a partial finding, even though it's not a real answer and doesn't mean that there won't be a big bounty for a proper answer.
After staring at 1.10 for a while, and in particular the very helpful note in paragraph 11, I think this isn't actually so hard. The big difference between synchronizes-with (henceforth: s/w) and dependency-ordered-before (dob) is that a happens-before relationship can be established by concatenating sequenced-before (s/b) and s/w arbitrarily, but not so for dob. Note one of the definitions for inter-thread happens before:
But the analogous statement for
is missing!A
is dependency-ordered beforeX
So with release/acquire (i.e. s/w) we can order arbitrary events:
But now consider an arbitrary sequence of events like this:
In this sequenece, it is still true that
A2
happens-beforeC2
(becauseA2
is s/bB2
andB2
inter-thread happens beforeC2
on account of dob; but we could argue that you can never actually tell!). However, it is not true thatA2
happens-beforeD2
. The eventsA2
andD2
are not ordered with respect to one another, unless it actually holds thatC2
carries dependency toD2
. This is a stricter requirement, and absent that requirement,A2
-to-D2
cannot be ordered "across" the release/consume pair.In other words, a release/consume pair only propagates an ordering of actions which carry a dependency from one to the next. Everything that's not dependent is not ordered across the release/consume pair.
Furthermore, note that the ordering is restored if we append a final, stronger release/acquire pair:
Now, by the quoted rule,
D2
inter-thread happens beforeF2
, and therefore so doC2
andB2
, and soA2
happens-beforeF2
. But note that there is still no ordering betweenA2
andD2
— the ordering is only betweenA2
and later events.In summary and in closing, dependency carrying is a strict subset of general sequencing, and release/consume pairs provide an ordering only among actions that carry dependency. As long as no stronger ordering is required (e.g. by passing through a release/acquire pair), there is theoretically a potential for additional optimization, since everything that is not in the dependency chain may be reordered freely.
Maybe here is an example that makes sense?
As written, the code is race-free and the assertion will hold, because the release/acquire pair orderes the store
x = 51
before the load in the assertion. However, by changing "acquire" into "consume", this would no longer be true and the program would have a data race onx
, sincex = 51
carries no dependency into the store tofoo
. The optimization point is that this store can be reordered freely without concern to whatfoo
is doing, because there is no dependency.Jeff Preshing has a great blog post answering this question. I can't add anything myself, but think anyone wondering about consume vs. acquire should read his post:
http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
He shows a specific C++ example with corresponding benchmarked assembly code across three different architectures. Compared to
memory_order_acquire
,memory_order_consume
potentially offers a 3x speedup on PowerPC, 1.6x speedup on ARM, and negligible speedup on x86 which has strong consistency anyway. The catch is that as of when he wrote it, only GCC actually treated consume semantics any differently from acquire, and probably because of a bug. Nonetheless, it demonstrates that a speedup is available if the compiler writers can figure out how to take advantage of it.Load-consume is much like load-acquire, except that it induces happens-before relationships only to expression evaluations that are data-dependent on the load-consume. Wrapping an expression with
kill_dependency
results in a value that no longer carries a dependency from the load-consume.The key use case is for the writer to construct a data structure sequentially, then swing a shared pointer to the new structure (using a
release
oracq_rel
atomic). The reader uses load-consume to read the pointer, and dereferences it to get to the data structure. The dereference creates a data dependency, so the reader is guaranteed to see the initialized data.There are two reasons for providing load-consume. The primary reason is that ARM and Power loads are guaranteed to consume, but require additional fencing to turn them into acquires. (On x86, all loads are acquires, so consume provides no direct performance advantage under naive compilation.) The secondary reason is that the compiler can move later operations without data dependence up to before the consume, which it can't do for an acquire. (Enabling such optimizations is the big reason for building all of this memory ordering into the language.)
Wrapping a value with
kill_dependency
allows computation of an expression that depends on the value to be moved to before the load-consume. This is useful e.g. when the value is an index into an array that was previously read.Note that the use of consume results in a happens-before relation that is no longer transitive (though it is still guaranteed to be acyclic). For example, the store to
bar
happens before the store to foo, which happens before the dereference ofy
, which happens before the read ofbar
(in the commented-out assert), but the store tobar
doesn't happen before the read ofbar
. This leads to a rather more complicated definition of happens-before, but you can imagine how it works (start with sequenced-before, then propagate through any number of release-consume-dataDependency or release-acquire-sequencedBefore links)Data dependency ordering was introduced by N2492 with the following rationale:
emphasis mine
the motivating use case presented there is
rcu_dereference()
from the Linux kernel