I'm trying to see how the fence is applied.
I have this code (which blocks indefinitely):
static void Main()
{
    bool complete = false;
    var t = new Thread(() =>
    {
        bool toggle = false;
        while (!complete) toggle = !toggle;
    });
    t.Start();
    Thread.Sleep(1000);
    complete = true;
    t.Join(); // Blocks indefinitely
}
Writing volatile bool _complete; solves the issue.
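One way to apply that fix, sketched here with the flag hoisted to a static field, since C# allows volatile only on fields, not on locals (the structure otherwise mirrors the snippet above):

```csharp
using System;
using System.Threading;

static class Program
{
    // volatile is valid only on fields in C#, so the captured local
    // from the original snippet becomes a static field here.
    static volatile bool complete;

    static void Main()
    {
        var t = new Thread(() =>
        {
            bool toggle = false;
            // Each iteration performs a volatile (acquire) read of complete,
            // so the loop cannot cache the value and exits once it sees true.
            while (!complete) toggle = !toggle;
        });
        t.Start();
        Thread.Sleep(1000);
        complete = true;
        t.Join(); // now returns shortly after complete is set
        Console.WriteLine("joined");
    }
}
```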
Acquire fence:
An acquire-fence prevents other reads/writes from being moved before the fence. If I illustrate it using an arrow ↓ (think of the arrowhead as pushing everything away), the code can look like this:
var t = new Thread(() =>
{
    bool toggle = false;
    while (!complete)
    ↓↓↓↓↓↓↓ // instructions can't go up before this fence
    {
        toggle = !toggle;
    }
});
I don't understand how the illustrated drawing represents a solution to this issue. I do know that while (!complete) now reads the real value, but how is that related to the location of complete = true; relative to the fence?
Like most of my answers pertaining to memory barriers I will use an arrow notation where ↓ represents an acquire-fence (volatile read) and ↑ represents a release-fence (volatile write). Remember, no other read or write can move past an arrow head (though they can move past the tail).
Let us first analyze the writing thread. I will assume that complete is declared as volatile.1 Thread.Start, Thread.Sleep, and Thread.Join will generate full fences, and that is why I have up and down arrows on either side of each of those calls. One important thing to notice here is that it is the Thread.Join call that is preventing the write to complete from floating any further down. The effect here is that the write gets committed to main memory immediately. It is not the volatility of complete itself that is causing it to get flushed to main memory. It is the Thread.Join call and the memory barrier it generates that is causing that behavior.

Now we will analyze the reading thread. This is a bit trickier to visualize because of the while loop, but let us start with this.
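In the arrow notation, the loop with a volatile read of complete looks something like this:

```
bool toggle = false;
while (!complete)     // volatile read
↓                     // acquire fence: no later read or write may move above it
{
    toggle = !toggle;
}
```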
Maybe we can visualize it better if we unwind the loop. For brevity I will only show the first 4 iterations.
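Unwound, the first four iterations look something like this; each iteration performs its own volatile read, and each read carries its own fence:

```
read complete     // iteration 1
↓
toggle = !toggle;
read complete     // iteration 2
↓
toggle = !toggle;
read complete     // iteration 3
↓
toggle = !toggle;
read complete     // iteration 4
↓
toggle = !toggle;
...
```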
Now that we have the loop unwound I think you can see how any potential movement of the read of complete is going to be severely limited.2 Yes, it can get shuffled around a little bit by the compiler or hardware, but it is pretty much locked into being read on every iteration. Remember, the read of complete is still free to move, but the fence that it created does not move with it. That fence is locked into place. This is what causes the behavior often called a "fresh read". If volatile were omitted on complete then the compiler would be free to use an optimization technique called "lifting". That is where a read of a memory address can get extracted, or lifted, outside the loop. In the absence of volatile that optimization would be legal because all of the reads of complete would be allowed to float up (or be lifted) until they are all ultimately outside of the loop. At that point the compiler would then coalesce them all into a one-time read just before starting the loop.3

Let me summarize a few important points right now.
It is the Thread.Join call that is causing the write to complete to get committed to main memory so that the worker thread will eventually pick it up. The volatility of complete is irrelevant on the writing thread (which is probably surprising to most).

It is the volatile read of complete that is preventing that read from getting lifted outside of the loop, which in turn creates the "fresh read" behavior. The volatility of complete on the reading thread makes a huge difference (which is probably obvious to most).

1 Marking complete as volatile on the writing thread is not necessary because x86 writes already have volatile semantics, but more importantly because the fence that is created by it does not cause the "committed write" behavior anyway.

2 Keep in mind that reads and writes can move through the tail of an arrow, but the arrow is locked in place. That is why you cannot bubble all of the reads up outside of the loop.
3 The lifting optimization must also ensure that the actual behavior of the thread is consistent with what the programmer originally intended. That requirement is easy to satisfy in this case because the compiler can easily see that complete is never written to on that thread.

Making complete volatile does two things:

It prevents the C# compiler or the jitter from making optimizations that would cache the value of complete.

It introduces a fence that tells the processor that caching optimizations of other reads and writes that involve either pre-fetching reads or delaying writes need to be de-optimized to ensure consistency.
Let's consider the first. The jitter is perfectly within its rights to see that the body of the loop does not modify complete, and that therefore whatever value complete has at the beginning of the loop is the value that it is going to have forever. So the jitter is allowed to generate code as though you'd written

while (true) toggle = !toggle;

or, more likely:

if (!complete) while (true) toggle = !toggle;
Making complete volatile prevents both optimizations.

But what you are looking for is the second effect of volatile. Suppose your two threads are running on different processors. Each has its own processor cache, which is a copy of main memory. Let's suppose that both processors have made a copy of main memory in which complete is false. When one processor sets complete to true, if complete is not volatile then the "toggling" processor is not required to notice that fact; it has its own cache in which complete is still false, and it would be expensive to go back to main memory every time.

Marking complete as volatile eliminates this optimization. How it eliminates it is an implementation detail of the processor. Perhaps on every volatile write the write gets written to main memory and every other processor discards its cache. Or perhaps there is some other strategy. How the processors choose to make it happen is up to the manufacturer.

The point is that any time you make a field volatile and then read or write it, you are massively disrupting the ability of the compiler, the jitter, and the processor to optimize your code. Try not to use volatile fields in the first place; use higher-level constructs, and don't share memory between threads.
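As a sketch of the higher-level approach, the same loop can be stopped with a CancellationTokenSource instead of a shared volatile flag; IsCancellationRequested handles the visibility concerns internally, so no field is shared between the threads at all:

```csharp
using System;
using System.Threading;

static class Program
{
    static void Main()
    {
        using var cts = new CancellationTokenSource();
        CancellationToken token = cts.Token;

        var t = new Thread(() =>
        {
            bool toggle = false;
            // IsCancellationRequested has the necessary memory semantics
            // built in; no volatile field is needed.
            while (!token.IsCancellationRequested) toggle = !toggle;
        });
        t.Start();
        Thread.Sleep(1000);
        cts.Cancel();
        t.Join();
        Console.WriteLine("stopped");
    }
}
```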
Thinking about instructions is probably counterproductive. Rather than thinking about a bunch of instructions just concentrate on the sequence of reads and writes. Everything else is irrelevant.
Suppose you have a block of memory, and part of it is copied to two caches. For performance reasons, you read and write mostly to the caches. Every now and then you re-synchronize the caches with main memory. What effect does this have on a sequence of reads and writes?
Suppose we want this to happen to a single integer variable:

write, read, write, read

Suppose what really happens, because of caching, is this:

write to cache, read from cache, read from cache, write cache back to main memory

How is what really happened in any way different from this?

write, read, read, write

It isn't different. Caching turns "write read write read" into "write read read write". It moves one of the reads backwards in time and, in this case equivalently, moves one of the writes forwards in time.
This example just involves two reads and two writes to one location, but you can imagine a scenario where there are many reads and many writes to many locations. The processor has wide latitude to move reads backwards in time and to move writes forwards in time. The precise rules for which moves are legal and which are not differ from processor to processor.
A fence is a barrier that prevents reads from moving backwards or writes from moving forwards past it. So if we had:

write 1, read 2, write 3, FENCE, read 4, write 5

then no matter what caching strategy a processor uses, it is now not allowed to move read 4 to any point before the fence. Similarly, it is not allowed to move write 3 ahead in time to any point after the fence. How a processor implements a fence is up to it.
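In C#, Thread.MemoryBarrier() generates such a full fence explicitly. Here is a sketch of the original program fixed with explicit fences instead of volatile; the barrier in the loop plays the role of the fence that keeps a later read (like read 4) from moving backwards, and the barrier after the write keeps an earlier write (like write 3) from moving forwards:

```csharp
using System;
using System.Threading;

static class Program
{
    static bool complete; // deliberately not volatile

    static void Main()
    {
        var t = new Thread(() =>
        {
            bool toggle = false;
            while (true)
            {
                // Full fence: the read of complete below cannot move
                // backwards past this point, so it cannot be lifted
                // out of the loop.
                Thread.MemoryBarrier();
                if (complete) break;
                toggle = !toggle;
            }
        });
        t.Start();
        Thread.Sleep(1000);
        complete = true;
        // Full fence: the write above cannot move forwards past this point.
        Thread.MemoryBarrier();
        t.Join();
        Console.WriteLine("exited");
    }
}
```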