I've had this question for quite a while now; I've read a lot of resources trying to understand what is going on, but I still haven't reached a good understanding of why things are the way they are.

Simply put, I'm trying to test how CAS would perform vs. synchronized in contended and uncontended environments. I've put together this JMH test:
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SandBox {

    Object lock = new Object();

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(SandBox.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class Holder {

        private long number;
        private AtomicLong atomicLong;

        @Setup
        public void setUp() {
            number = ThreadLocalRandom.current().nextLong();
            atomicLong = new AtomicLong(number);
        }
    }

    @Fork(1)
    @Benchmark
    public long sync(Holder holder) {
        long n = holder.number;
        synchronized (lock) {
            n = n * 123;
        }
        return n;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong cas(Holder holder) {
        AtomicLong al = holder.atomicLong;
        al.updateAndGet(x -> x * 123);
        return al;
    }

    private Object anotherLock = new Object();
    private long anotherNumber = ThreadLocalRandom.current().nextLong();
    private AtomicLong anotherAl = new AtomicLong(anotherNumber);

    @Fork(1)
    @Benchmark
    public long syncShared() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong casShared() {
        anotherAl.updateAndGet(x -> x * 123);
        return anotherAl;
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    public long syncSharedNonBiased() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }
}
And the results:
Benchmark Mode Cnt Score Error Units
spinLockVsSynchronized.SandBox.cas avgt 5 212.922 ± 18.011 ns/op
spinLockVsSynchronized.SandBox.casShared avgt 5 4106.764 ± 1233.108 ns/op
spinLockVsSynchronized.SandBox.sync avgt 5 2869.664 ± 231.482 ns/op
spinLockVsSynchronized.SandBox.syncShared avgt 5 2414.177 ± 85.022 ns/op
spinLockVsSynchronized.SandBox.syncSharedNonBiased avgt 5 2696.102 ± 279.734 ns/op
In the non-shared case CAS is by far faster, which I would expect. But in the shared case, things are the other way around, and this I can't understand. I don't think this is related to biased locking, as that only kicks in after the JVM's biased-locking startup delay (about four seconds by default, AFAIK), which doesn't apply here, and the test is proof of that.

I honestly hope it's just my tests that are wrong, and that someone with JMH expertise will come along and point out the flaw in my set-up.
You should read, re-read, and accept @Holger's excellent answer, as the insights it provides are far more valuable than a single set of benchmark numbers from one developer's workstation.
I tweaked your benchmarks to make them a bit more apples-to-apples, but if you read @Holger's answer, you'll see why this isn't a terribly useful test. I'm going to include my changes and my results simply to show how results can vary from one machine (or one JRE version) to another.
First, my version of the benchmarks:
And then my first batch of results:
Recall that you saw synchronized coming out ahead under high contention. On my workstation, the atomic version fared better. If you use my version of your benchmarks, what results do you see? It won't surprise me in the least if they're substantially different.

Here's another set, run under a months-old Java 9 EA release:

The difference is less dramatic here. It's not terribly unusual to see a difference across major JRE versions, but who's to say you won't see one across minor releases too?

At the end of the day, the results are close. Very close. The performance of synchronized has come a long way since the early Java versions. If you are not writing HFT algorithms or something else that's incredibly latency-sensitive, you should prefer the solution that's most easily proven correct. It is generally easier to reason about synchronized than about lock-free algorithms and data structures. If you cannot demonstrate a measurable difference in your application, then synchronized is what you should use.

The main misconception is the assumption that you are comparing “
CAS vs. synchronized”. Given how sophisticated JVMs implement synchronized, you are actually comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.

Similar to Lock, the internal information for an object monitor basically consists of an int status telling whether it is owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take significant time.

In the uncontended case, however, the waiting queue is of course not involved. Acquiring the monitor consists of a single CAS to change the status from “unowned” (usually zero) to “owned, acquired once” (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release, which implies just writing the “unowned” state with the necessary memory visibility and waking up another blocked thread, if there is one.

Since the wait queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is “unfair”, allowing a spinning thread to proceed immediately, even if there are already enqueued threads that have been waiting far longer.

If you compare the fundamental operations performed by synchronized (lock) { n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123);, you'll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention, while for the synchronized approach there is a risk of being put into the queue if no progress has been made during spinning.

But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the syncShared method. Unless there's also a way to fuse multiple CAS updates of an AtomicLong, this can give synchronized a dramatic advantage. (See also this article covering several of the aspects discussed above.)

Note that due to the “unfair” nature of synchronized, creating far more threads than CPU cores doesn't have to be a problem. In the best case, “number of threads minus number of cores” threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can't reduce the performance of the AtomicLong update, as they can neither invalidate the current value for other threads nor make a failed CAS attempt.

In either case, when CASing on the member variable of an unshared object, or when performing synchronized on an unshared object, the JVM may detect the local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.

The bottom line is that there is no easy decision between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which raise the likelihood of threads getting enqueued in the contended case of synchronized, and which may make it acceptable that the operation has to be repeated in the contended case of an atomic update.

Note that CAS can give you more fine-grained ordering (non-)guarantees than a synchronized block, especially with Java 9 VarHandles, which provide ordering options aligned with the C++11 memory model.
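The monitor-acquisition scheme Holger describes above — a single CAS on a status word in the uncontended case, a bounded spinning phase, and a queue of parked threads as the expensive fallback — can be sketched as a toy lock. This is purely illustrative and assumes nothing about HotSpot's actual implementation; all names are made up:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Toy illustration (not HotSpot's code) of CAS-on-status acquisition,
// a spinning phase, and a wait queue as the expensive fallback.
public class ToyMonitor {
    private final AtomicInteger status = new AtomicInteger(0); // 0 = unowned, 1 = owned
    private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

    public void lock() {
        // Fast path: uncontended acquisition is a single CAS.
        if (status.compareAndSet(0, 1)) return;
        // Spinning phase: a few retries before the expensive fallback.
        for (int i = 0; i < 64; i++) {
            if (status.compareAndSet(0, 1)) return;
            Thread.onSpinWait();
        }
        // Slow path: enqueue and park until woken by a releasing owner.
        Thread self = Thread.currentThread();
        waiters.add(self);
        while (!status.compareAndSet(0, 1)) {
            LockSupport.park();
        }
        waiters.remove(self);
    }

    public void unlock() {
        status.set(0);                 // release: volatile write of "unowned"
        Thread next = waiters.peek();  // wake one parked thread, if any
        if (next != null) LockSupport.unpark(next);
    }

    // Demo: 4 threads each add 1000 under the toy lock.
    public static void main(String[] args) throws InterruptedException {
        ToyMonitor m = new ToyMonitor();
        long[] counter = {0};
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    m.lock();
                    try { counter[0]++; } finally { m.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(counter[0]); // 4000
    }
}
```

Note how the slow path makes the lock "unfair": a freshly arriving spinner can barge past threads that are already parked in the queue.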
If all you want to do is some statistics-keeping from multiple threads, then a read-compute-update loop with the most relaxed memory orderings available (a plain read; a plain, weak CAS) may perform better on weakly ordered platforms, since it won't need any barriers and the CAS won't have to do wasteful internal looping if it's implemented on top of LL/SC. Additionally, it will give the JIT more freedom to reorder the instructions around those atomics.
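Such a relaxed read-compute-update loop might look like the following sketch using Java 9+ VarHandles (the counter class and field names are made up for illustration):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical stats counter using relaxed orderings: an opaque read
// plus a weak CAS with plain (no-fence) semantics. The CAS is still
// atomic, but imposes no ordering on surrounding accesses and may
// fail spuriously, hence the retry loop.
public class RelaxedCounter {
    private volatile long count; // accessed via VarHandle below

    private static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup()
                    .findVarHandle(RelaxedCounter.class, "count", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void increment() {
        long cur;
        do {
            cur = (long) COUNT.getOpaque(this);          // relaxed read
        } while (!COUNT.weakCompareAndSetPlain(this, cur, cur + 1));
    }

    public long get() {
        return (long) COUNT.getVolatile(this);           // full-strength read
    }

    public static void main(String[] args) {
        RelaxedCounter c = new RelaxedCounter();
        for (int i = 0; i < 10; i++) c.increment();
        System.out.println(c.get()); // 10
    }
}
```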
compareAndExchange can eliminate an additional read on loop repetition, since it returns the witness value on failure.

Another complication is how you measure performance. All of the implementations should have progress guarantees, i.e. even under contention at least one thread can finish at a time. So in principle you could be wasting CPU cycles on multiple threads attempting to update your variable concurrently, yet still do better on 99th-percentile latency, because the atomic operations won't resort to descheduling the thread — and worse on worst-case latency, because they're not fair. So just measuring averages might not tell the whole story here.
First of all, the code you are writing is Java, which compiles to Java bytecode, which in turn translates to different atomic instructions on different instruction sets (ARM vs. PowerPC vs. x86, ...). These can behave differently across vendors' implementations, and even between architectures from the same vendor (such as the Intel Core 2 Duo and Skylake). So it's really hard to answer your question!

This paper states that, for the tested x86 architectures, one execution of any atomic operation performs similarly (there is very little difference between CAS, fetch-and-add, and swap), while CAS may fail and need to be executed multiple times. With one thread, however, it will never fail.
This Stack Overflow post states:

Let's look at the necessary operations in the case of CAS:

fetch x, multiply x, CAS on x, check whether the CAS succeeded

Now this is efficient in the non-contended case, because a minimal number of operations is needed. But when the cache line is contended it is not very efficient, because all threads are actively spinning while most of them fail. Furthermore, I remember that spinning on a contended cache line with an atomic operation is very expensive.
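Those four steps, rendered as the classic retry loop on an AtomicLong (the helper name is illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasLoop {
    static long multiplyBy123(AtomicLong x) {
        while (true) {
            long cur  = x.get();              // 1. fetch x
            long next = cur * 123;            // 2. multiply x
            if (x.compareAndSet(cur, next)) { // 3. CAS on x, 4. check result
                return next;                  // success: nobody intervened
            }
            // failure: another thread changed x in between; retry
        }
    }

    public static void main(String[] args) {
        AtomicLong x = new AtomicLong(3);
        System.out.println(multiplyBy123(x)); // 369
    }
}
```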
Now the important part in synchronized is:

It depends on how this waiting is implemented.

The synchronized implementation could put the thread to sleep for a random time after failing to acquire the monitor. Also, instead of using an atomic operation to check whether the monitor is free, it can do so with a simple read (this is quicker, but I couldn't find a link to prove it).
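That "simple read before the atomic operation" idea is the classic test-and-test-and-set pattern. A purely illustrative sketch (not HotSpot's actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Test-and-test-and-set: spin on a cheap read and only attempt the
// expensive CAS once the lock word looks free. While spinning, the
// cache line stays in shared state instead of being bounced between
// cores by failed CAS attempts.
public class TtasLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = free, 1 = held

    public void lock() {
        while (true) {
            if (state.get() == 0 && state.compareAndSet(0, 1)) {
                return; // acquired
            }
            Thread.onSpinWait(); // hint to the CPU that we are busy-waiting
        }
    }

    public void unlock() {
        state.set(0);
    }

    // Demo: 4 threads each add 10_000 under the lock.
    public static void main(String[] args) throws InterruptedException {
        TtasLock lock = new TtasLock();
        long[] counter = {0};
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(counter[0]); // 40000
    }
}
```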
My bet is that the waiting in synchronized is implemented in a smart way, optimized for contended situations using one of the methods above or something similar, and that it is therefore quicker in the contended scenario. The trade-off is that in non-contended situations it is slower.

I still admit I have no proof.