In the article Memory Barriers and JVM Concurrency!, I was told that volatile is implemented with memory barrier instructions, while synchronized and atomic are implemented with lock-prefixed instructions. But I found the code below in another article:
Java code:
volatile Singleton instance = new Singleton();
assembly instructions (x86):
0x01a3de1d: movb $0x0,0x1104800(%esi);
0x01a3de24: lock addl $0x0,(%esp);
So which one is right? And what is the difference between memory barriers and lock-prefixed instructions? (Please excuse my poor English.)
Short answer
Lock-prefixed instructions are used to perform complex memory operations atomically.
Memory barriers are used to order (partially or totally) memory accesses.
Java volatile
The Java volatile keyword guarantees that changes to volatile variables are seen by all threads in the order they are written in the program. The whole point of volatile is that accesses to volatile variables are totally ordered: if you access variable X and then variable Y, both volatile, the access to X is seen before the one to Y by all processors.
This requires ordering the memory accesses, and so requires memory barriers.
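This ordering guarantee can be observed from Java itself. Below is a minimal sketch (class, field, and method names are mine): one thread writes two volatile variables in order, and a reader that has seen the second write is guaranteed to also see the first.

```java
// Sketch: volatile accesses are totally ordered, so once the reader
// observes y == 1 it must also observe x == 1, because the (volatile)
// write to x precedes the (volatile) write to y in program order.
public class VolatileOrder {
    static volatile int x = 0;
    static volatile int y = 0;
    static volatile int seen = -1;   // what the reader observed in x

    static int demo() {
        Thread writer = new Thread(() -> {
            x = 1;   // volatile write, ordered before...
            y = 1;   // ...this volatile write
        });
        Thread reader = new Thread(() -> {
            while (y == 0) { }   // spin until the write to y is visible
            seen = x;            // must be 1: x's write precedes y's
        });
        reader.start();
        writer.start();
        try { writer.join(); reader.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("reader saw x == " + demo()); // always 1
    }
}
```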
A memory barrier on IA32e may be implemented with the "fence" instructions (mfence, lfence, sfence) or with a lock-prefixed instruction. But this latter option is just a side effect of lock and not its primary use.
Locked instructions are serialized and so have a total order. Using them only to order memory accesses is inefficient, but it works, and it was used on older processors that lack the "fences".
So the lock you see is actually acting as a barrier (the Linux kernel used the same instruction too).
Complete answer
By "complex memory instructions" above I mean read-modify-write instructions (in Intel naming). These are instructions that internally consist of three operations: take the value from memory, change it, and store it back.
If the bus is not held locked during the instruction, another processor can change the value after it has been read from memory but before it is stored back.
Example
X = 0
CPU 1           CPU 2
loop:           loop:
  inc [X]         inc [X]
  j loop          j loop
If each CPU executes its own loop 10 times, what value will be stored in X? You cannot tell in a deterministic way. The pseudo-instruction inc [X] must be implemented with three micro-operations, as
CPU 1           CPU 2
loop:           loop:
  mov r, [X]      mov r, [X]
  inc r           inc r
  mov [X], r      mov [X], r
  j loop          j loop
This situation may have happened:
CPU1: mov r, [X] X is 0, CPU1 r is 0, CPU2 r is 0
CPU1: inc r X is 0, CPU1 r is 1, CPU2 r is 0
CPU2: mov r, [X] X is 0, CPU1 r is 1, CPU2 r is 0
CPU1: mov [X], r X is 1, CPU1 r is 1, CPU2 r is 0
CPU1: mov r, [X] X is 1, CPU1 r is 1, CPU2 r is 0
CPU1: inc r X is 1, CPU1 r is 2, CPU2 r is 0
CPU1: mov [X], r X is 2, CPU1 r is 2, CPU2 r is 0
CPU2: inc r X is 2, CPU1 r is 2, CPU2 r is 1
CPU2: mov [X], r X is 1, CPU1 r is 2, CPU2 r is 1
Note how X is 1 instead of 3.
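The lost update in the trace above can be replayed deterministically in Java by performing the three micro-operations by hand in one thread (the class and method names are mine, and the interleaving is a shortened two-increment version of the trace):

```java
// Replays the unlucky interleaving: both "CPUs" read X before either
// stores back, so one increment is lost and X ends up 1 instead of 2.
public class LostUpdate {
    static int x = 0;

    static int demo() {
        int r1 = x;      // CPU1: mov r, [X] -> r1 = 0
        r1 = r1 + 1;     // CPU1: inc r      -> r1 = 1
        int r2 = x;      // CPU2: mov r, [X] -> r2 = 0 (stale: CPU1 has not stored yet)
        x = r1;          // CPU1: mov [X], r -> X = 1
        r2 = r2 + 1;     // CPU2: inc r      -> r2 = 1
        x = r2;          // CPU2: mov [X], r -> X = 1, CPU1's increment is lost
        return x;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1, not 2
    }
}
```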
By locking the inc instruction, the CPU asserts a lock on the system bus from when inc starts until it retires. This forces a pattern like this (for example):
CPU1: mov r, [X] X is 0, CPU1 r is 0, CPU2 r is 0, CPU2 cannot use bus
CPU1: inc r X is 0, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov [X], r X is 1, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov r, [X] X is 1, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: inc r X is 1, CPU1 r is 2, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov [X], r X is 2, CPU1 r is 2, CPU2 r is 0, CPU2 cannot use bus
CPU2: mov r, [X] X is 2, CPU1 r is 2, CPU2 r is 2, CPU1 cannot use bus
CPU2: inc r X is 2, CPU1 r is 2, CPU2 r is 3, CPU1 cannot use bus
CPU2: mov [X], r X is 3, CPU1 r is 2, CPU2 r is 3, CPU1 cannot use bus
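On the JVM, this bus-locked read-modify-write is what the java.util.concurrent.atomic classes expose; on x86, AtomicInteger.incrementAndGet is compiled down to a lock-prefixed instruction. A minimal sketch mirroring the two 10-iteration loops above (class and method names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Two threads each increment a shared counter 10 times. Because
// incrementAndGet is an atomic (locked) read-modify-write, no update
// can be lost: the final value is always 20.
public class LockedInc {
    static final AtomicInteger x = new AtomicInteger(0);

    static int demo() {
        Runnable loop = () -> {
            for (int i = 0; i < 10; i++) x.incrementAndGet();
        };
        Thread cpu1 = new Thread(loop);
        Thread cpu2 = new Thread(loop);
        cpu1.start(); cpu2.start();
        try { cpu1.join(); cpu2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return x.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // always 20
    }
}
```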
Memory barriers instead are used to order memory accesses.
Processors execute instructions out of order: even if you send a CPU the instructions A B C, it can execute them as C A B. However, processors are required to respect dependencies, and instructions are executed out of order only when this won't change the program behavior.
A very important aspect to remember is the distinction between instruction execution and instruction retirement, because a processor keeps its architectural state (the state a program can see) consistent only for retired instructions.
Normally a program sees the result of an instruction only when it is retired, even if it has already been executed. But with memory accesses the matter is slightly different, as they have the globally visible side effect of modifying main memory, and that cannot be undone!
So, as seen from a program running on a CPU, all memory accesses of that CPU happen in program order; however, the processor makes no effort to guarantee that other processors see those memory accesses in the same order. They see the execution order or, worse, the propagation order due to the cache hierarchy and memory topology.
The order of memory accesses is different as observed by different processors.
So CPUs let the programmer control how memory accesses are ordered with barriers: a barrier stops other memory instructions (on the same CPU) from being executed until all the previous ones are executed/retired/propagated (exactly which depends on the architecture and the barrier type).
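Since Java 9 these per-architecture barriers are exposed directly through the static fence methods of java.lang.invoke.VarHandle; which hardware instruction each one emits depends on the platform. A single-threaded sketch just to show the API (class and method names other than the fences are mine):

```java
import java.lang.invoke.VarHandle;

// The fences constrain how this thread's memory accesses may be
// reordered. On IA32e, releaseFence/acquireFence typically emit no
// instruction at all, while fullFence becomes a lock-prefixed
// instruction or mfence.
public class Fences {
    static int x, y;

    static int demo() {
        x = 1;
        VarHandle.releaseFence();  // prior writes may not move past this point
        y = 1;

        int r = y;
        VarHandle.acquireFence();  // later reads may not move before this point
        int s = x;
        return r + s;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```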
Example
x = 0, y = 0
CPU 1           CPU 2
mov [x], 1      loop:
mov [y], 1        mov r, [y]
                  jrz loop    ; jump if r is 0
                mov s, [x]
There is no need for locks. However, without barriers it is possible that CPU2's s is 0 after the program.
This is due to the fact that CPU1's write mov [y], 1 can be reordered and executed before the write to x! From CPU1's perspective nothing changed, but for CPU2 the order has changed!
With barriers
x = 0, y = 0
CPU 1           CPU 2
mov [x], 1      loop:
sync              mov r, [y]
mov [y], 1        jrz loop    ; jump if r is 0
                mov s, [x]
Here sync is used as a memory-barrier pseudo-instruction. Now the write to y cannot be reordered and must wait for the write to x to be visible to CPU2.
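In Java the role of sync is played by declaring y volatile: the volatile write acts as a release barrier and the volatile read as an acquire barrier, so the reader is guaranteed to see s = 1. A sketch of the same two-CPU example (class, field, and method names are mine):

```java
// Java version of the barrier example: the volatile write to y may not
// be reordered before the plain write to x, and once the reader sees
// y == 1 it is guaranteed to read x == 1.
public class MessagePassing {
    static int x = 0;           // plain field, like [x] above
    static volatile int y = 0;  // the "sync" point
    static volatile int s = -1; // what "CPU 2" read from x

    static int demo() {
        Thread cpu1 = new Thread(() -> {
            x = 1;  // ordinary store
            y = 1;  // volatile store: the barrier
        });
        Thread cpu2 = new Thread(() -> {
            while (y == 0) { }  // spin until the flag is visible
            s = x;              // guaranteed to see 1
        });
        cpu2.start(); cpu1.start();
        try { cpu1.join(); cpu2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return s;
    }

    public static void main(String[] args) {
        System.out.println("s = " + demo()); // always s = 1
    }
}
```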
Things are a little more elaborate than this simple picture of mine: different processors have different kinds of barriers and memory ordering, and different architectures have different cache/memory topologies which require special handling.
Abstracting this is not easy. Java has a simple memory model, which makes the generated code more complex; C++11 has a more elaborate memory model that lets you explore the effects of memory barriers better.
Before reading abstract notations like happens-before, it is useful to search for the memory-ordering problems of common architectures (IA32e, IA64, ARM, SPARC, Power, Alpha) so that you can see what the real problem is and how it can be solved.
And IA32e is a bad architecture to test on, as its memory ordering is indeed quite strong and most of the problems cannot happen on it. If you have a multiprocessor phone you can test on ARM. If you want an extreme example, take the Alpha architecture, where even dependent accesses are reordered!