In the article Memory Barriers and JVM Concurrency!, I was told that volatile is implemented with memory barrier instructions, while synchronized and atomic are implemented with lock-prefixed instructions. But I found the code below in another article:
Java code:
volatile Singleton instance = new Singleton();
assembly instructions (x86):
0x01a3de1d: movb $0x0,0x1104800(%esi);
0x01a3de24: lock addl $0x0,(%esp);
So which one is right? And what is the difference between memory barriers and lock-prefixed instructions? (Please excuse my poor English.)
Short answer
Lock-prefixed instructions are used to perform complex memory operations atomically.
Memory barriers are used to order (partially or totally) memory accesses.
Java volatile
The Java volatile keyword guarantees that changes to volatile variables are seen by all threads in the order they are written in the program. The whole point of volatile is that accesses to volatile variables are totally ordered: if you access variable X and then variable Y, both volatile, the access to X is seen before the one to Y by all processors.
This requires ordering the memory accesses, and so requires memory barriers.
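This ordering guarantee can be observed from Java itself. Below is a minimal sketch (class, field, and method names are mine): one thread writes two volatile variables in order, and a reader that has seen the second write is guaranteed to also see the first.

```java
// Sketch: volatile accesses are totally ordered, so once the reader
// observes y == 1 it must also observe x == 1, because the (volatile)
// write to x precedes the (volatile) write to y in program order.
public class VolatileOrder {
    static volatile int x = 0;
    static volatile int y = 0;
    static volatile int seen = -1;   // what the reader observed in x

    static int demo() {
        Thread writer = new Thread(() -> {
            x = 1;   // volatile write, ordered before...
            y = 1;   // ...this volatile write
        });
        Thread reader = new Thread(() -> {
            while (y == 0) { }   // spin until the write to y is visible
            seen = x;            // must be 1: x's write precedes y's
        });
        reader.start();
        writer.start();
        try { writer.join(); reader.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("reader saw x == " + demo()); // always 1
    }
}
```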
A memory barrier on IA32e may be implemented with the "fence" instructions (mfence, lfence, sfence) or with a lock-prefixed instruction. But this latter option is just a side effect of lock and not its primary use.
Locked instructions are serialized and so have a total order. Using them only to order memory accesses is inefficient, but it works, and it was used on older processors that lack the "fences".
So the lock you see is actually acting as a barrier (the Linux kernel used the same instruction too).
Complete answer
By "complex memory instructions" above I mean read-modify-write instructions (in Intel naming). These are instructions that internally consist of three operations: take the value from memory, change it, and store it back.
If the bus is not held locked during the instruction, another processor can change the value after it has been read from memory but before it is stored back.
Example
X = 0
CPU 1           CPU 2
loop:           loop:
  inc [X]         inc [X]
  j loop          j loop
If each CPU executes its own loop 10 times, what value will be stored in X? You cannot tell in a deterministic way. The pseudo-instruction inc [X] must be implemented with three micro-operations, as
CPU 1           CPU 2
loop:           loop:
  mov r, [X]      mov r, [X]
  inc r           inc r
  mov [X], r      mov [X], r
  j loop          j loop
This situation may have happened:
CPU1: mov r, [X] X is 0, CPU1 r is 0, CPU2 r is 0
CPU1: inc r X is 0, CPU1 r is 1, CPU2 r is 0
CPU2: mov r, [X] X is 0, CPU1 r is 1, CPU2 r is 0
CPU1: mov [X], r X is 1, CPU1 r is 1, CPU2 r is 0
CPU1: mov r, [X] X is 1, CPU1 r is 1, CPU2 r is 0
CPU1: inc r X is 1, CPU1 r is 2, CPU2 r is 0
CPU1: mov [X], r X is 2, CPU1 r is 2, CPU2 r is 0
CPU2: inc r X is 2, CPU1 r is 2, CPU2 r is 1
CPU2: mov [X], r X is 1, CPU1 r is 2, CPU2 r is 1
Note how X is 1 instead of 3.
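The lost update in the trace above can be replayed deterministically in Java by performing the three micro-operations by hand in one thread (the class and method names are mine, and the interleaving is a shortened two-increment version of the trace):

```java
// Replays the unlucky interleaving: both "CPUs" read X before either
// stores back, so one increment is lost and X ends up 1 instead of 2.
public class LostUpdate {
    static int x = 0;

    static int demo() {
        int r1 = x;      // CPU1: mov r, [X] -> r1 = 0
        r1 = r1 + 1;     // CPU1: inc r      -> r1 = 1
        int r2 = x;      // CPU2: mov r, [X] -> r2 = 0 (stale: CPU1 has not stored yet)
        x = r1;          // CPU1: mov [X], r -> X = 1
        r2 = r2 + 1;     // CPU2: inc r      -> r2 = 1
        x = r2;          // CPU2: mov [X], r -> X = 1, CPU1's increment is lost
        return x;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1, not 2
    }
}
```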
By locking the inc instruction, the CPU asserts a lock on the system bus from when inc starts until it retires. This forces a pattern like this (for example):
CPU1: mov r, [X] X is 0, CPU1 r is 0, CPU2 r is 0, CPU2 cannot use bus
CPU1: inc r X is 0, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov [X], r X is 1, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov r, [X] X is 1, CPU1 r is 1, CPU2 r is 0, CPU2 cannot use bus
CPU1: inc r X is 1, CPU1 r is 2, CPU2 r is 0, CPU2 cannot use bus
CPU1: mov [X], r X is 2, CPU1 r is 2, CPU2 r is 0, CPU2 cannot use bus
CPU2: mov r, [X] X is 2, CPU1 r is 2, CPU2 r is 2, CPU1 cannot use bus
CPU2: inc r X is 2, CPU1 r is 2, CPU2 r is 3, CPU1 cannot use bus
CPU2: mov [X], r X is 3, CPU1 r is 2, CPU2 r is 3, CPU1 cannot use bus
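On the JVM, this bus-locked read-modify-write is what the java.util.concurrent.atomic classes expose; on x86, AtomicInteger.incrementAndGet is compiled down to a lock-prefixed instruction. A minimal sketch mirroring the two 10-iteration loops above (class and method names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Two threads each increment a shared counter 10 times. Because
// incrementAndGet is an atomic (locked) read-modify-write, no update
// can be lost: the final value is always 20.
public class LockedInc {
    static final AtomicInteger x = new AtomicInteger(0);

    static int demo() {
        Runnable loop = () -> {
            for (int i = 0; i < 10; i++) x.incrementAndGet();
        };
        Thread cpu1 = new Thread(loop);
        Thread cpu2 = new Thread(loop);
        cpu1.start(); cpu2.start();
        try { cpu1.join(); cpu2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return x.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // always 20
    }
}
```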
Memory barriers instead are used to order memory accesses.
Processors execute instructions out of order: even if you send a CPU the instructions A B C, it can execute them as C A B. However, processors are required to respect dependencies, and instructions are executed out of order only when this won't change the program behavior.
A very important aspect to remember is the distinction between instruction execution and instruction retirement, because a processor keeps its architectural state (the state a program can see) consistent only for retired instructions.
Normally a program sees the result of an instruction only when it is retired, even if it has already been executed. But with memory accesses the matter is slightly different, as they have the globally visible side effect of modifying main memory, and that cannot be undone!
So, as seen from a program running on a CPU, all memory accesses of that CPU happen in program order; however, the processor makes no effort to guarantee that other processors see those memory accesses in the same order. They see the execution order or, worse, the propagation order due to the cache hierarchy and memory topology.
The order of memory accesses is different as observed by different processors.
So CPUs let the programmer control how memory accesses are ordered with barriers: a barrier stops other memory instructions (on the same CPU) from being executed until all the previous ones are executed/retired/propagated (exactly which depends on the architecture and the barrier type).
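Since Java 9 these per-architecture barriers are exposed directly through the static fence methods of java.lang.invoke.VarHandle; which hardware instruction each one emits depends on the platform. A single-threaded sketch just to show the API (class and method names other than the fences are mine):

```java
import java.lang.invoke.VarHandle;

// The fences constrain how this thread's memory accesses may be
// reordered. On IA32e, releaseFence/acquireFence typically emit no
// instruction at all, while fullFence becomes a lock-prefixed
// instruction or mfence.
public class Fences {
    static int x, y;

    static int demo() {
        x = 1;
        VarHandle.releaseFence();  // prior writes may not move past this point
        y = 1;

        int r = y;
        VarHandle.acquireFence();  // later reads may not move before this point
        int s = x;
        return r + s;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```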
Example
x = 0, y = 0
CPU 1           CPU 2
mov [x], 1      loop:
mov [y], 1        mov r, [y]
                  jrz loop    ; jump if r is 0
                mov s, [x]
There is no need for locks. However, without barriers it is possible that CPU2's s is 0 after the program.
This is due to the fact that CPU1's write mov [y], 1 can be reordered and executed before the write to x! From CPU1's perspective nothing changed, but for CPU2 the order has changed!
With barriers
x = 0, y = 0
CPU 1           CPU 2
mov [x], 1      loop:
sync              mov r, [y]
mov [y], 1        jrz loop    ; jump if r is 0
                mov s, [x]
Here sync is used as a memory-barrier pseudo-instruction. Now the write to y cannot be reordered and must wait for the write to x to be visible to CPU2.
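In Java the role of sync is played by declaring y volatile: the volatile write acts as a release barrier and the volatile read as an acquire barrier, so the reader is guaranteed to see s = 1. A sketch of the same two-CPU example (class, field, and method names are mine):

```java
// Java version of the barrier example: the volatile write to y may not
// be reordered before the plain write to x, and once the reader sees
// y == 1 it is guaranteed to read x == 1.
public class MessagePassing {
    static int x = 0;           // plain field, like [x] above
    static volatile int y = 0;  // the "sync" point
    static volatile int s = -1; // what "CPU 2" read from x

    static int demo() {
        Thread cpu1 = new Thread(() -> {
            x = 1;  // ordinary store
            y = 1;  // volatile store: the barrier
        });
        Thread cpu2 = new Thread(() -> {
            while (y == 0) { }  // spin until the flag is visible
            s = x;              // guaranteed to see 1
        });
        cpu2.start(); cpu1.start();
        try { cpu1.join(); cpu2.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return s;
    }

    public static void main(String[] args) {
        System.out.println("s = " + demo()); // always s = 1
    }
}
```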
Things are a little more elaborate than this simple picture of mine: different processors have different kinds of barriers and memory ordering, and different architectures have different cache/memory topologies which require special handling.
Abstracting this is not easy. Java has a simple memory model, which makes the generated code more complex; C++11 has a more elaborate memory model that lets you explore the effects of memory barriers better.
Before reading abstract notations like happens-before, it is useful to search for the memory-ordering problems of common architectures (IA32e, IA64, ARM, SPARC, Power, Alpha) so that you can see what the real problem is and how it can be solved.
And IA32e is a bad architecture to test on, as its memory ordering is indeed quite strong and most of the problems cannot happen on it. If you have a multiprocessor phone you can test on ARM. If you want an extreme example, take the Alpha architecture, where even dependent accesses are reordered!