I'm playing around with JMH ( http://openjdk.java.net/projects/code-tools/jmh/ ) and I just stumbled on a strange result.
I'm benchmarking ways to make a shallow copy of an array, and I observe the expected results: looping through the array is a bad idea, and there is no significant difference, performance-wise, between #clone(), System#arraycopy() and Arrays#copyOf().
Except that System#arraycopy() is about one-quarter slower when the array's length is hard-coded... Wait, what? How can this be slower?
Does anyone have an idea of what could be the cause?
The results (throughput):
# JMH 1.11 (released 17 days ago)
# VM version: JDK 1.8.0_05, VM 25.5-b02
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
Benchmark                                            Mode  Cnt         Score         Error  Units
ArrayCopyBenchmark.ArraysCopyOf                     thrpt   20  67100500,319 ±  455252,537  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class               thrpt   20  65246374,290 ±  976481,330  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize  thrpt   20  65068143,162 ± 1597390,531  ops/s
ArrayCopyBenchmark.ArraysCopyOf_ConstantSize        thrpt   20  64463603,462 ±  953946,811  ops/s
ArrayCopyBenchmark.Clone                            thrpt   20  64837239,393 ±  834353,404  ops/s
ArrayCopyBenchmark.Loop                             thrpt   20  21070422,097 ±  112595,764  ops/s
ArrayCopyBenchmark.Loop_ConstantSize                thrpt   20  24458867,274 ±  181486,291  ops/s
ArrayCopyBenchmark.SystemArrayCopy                  thrpt   20  66688368,490 ±  582416,954  ops/s
ArrayCopyBenchmark.SystemArrayCopy_ConstantSize     thrpt   20  48992312,357 ±  298807,039  ops/s
And the benchmark class:
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayCopyBenchmark {

    private static final int LENGTH = 32;

    private Object[] array;

    @Setup
    public void before() {
        array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = new Object();
        }
    }

    @Benchmark
    public Object[] Clone() {
        Object[] src = this.array;
        return src.clone();
    }

    @Benchmark
    public Object[] ArraysCopyOf() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public Object[] ArraysCopyOf_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length, Object[].class);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH, Object[].class);
    }

    @Benchmark
    public Object[] SystemArrayCopy() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        System.arraycopy(src, 0, array, 0, length);
        return array;
    }

    @Benchmark
    public Object[] SystemArrayCopy_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        System.arraycopy(src, 0, array, 0, LENGTH);
        return array;
    }

    @Benchmark
    public Object[] Loop() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        for (int i = 0; i < length; i++) {
            array[i] = src[i];
        }
        return array;
    }

    @Benchmark
    public Object[] Loop_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = src[i];
        }
        return array;
    }
}
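As a sanity check on what is actually being compared, here is a minimal plain-Java sketch (no JMH, no timing) showing that the four approaches benchmarked above all produce the same kind of result: a distinct array holding the very same element references. The class and method names (ShallowCopyDemo, viaClone, viaCopyOf, viaArraycopy, viaLoop, isShallowCopy) are made up for illustration and are not part of the benchmark.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only; it measures nothing.
class ShallowCopyDemo {

    public static Object[] viaClone(Object[] src) {
        return src.clone();
    }

    public static Object[] viaCopyOf(Object[] src) {
        return Arrays.copyOf(src, src.length);
    }

    public static Object[] viaArraycopy(Object[] src) {
        Object[] dst = new Object[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static Object[] viaLoop(Object[] src) {
        Object[] dst = new Object[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    // True when copy is a different array of the same length whose slots
    // reference exactly the same objects as src (compared with ==, not equals).
    public static boolean isShallowCopy(Object[] src, Object[] copy) {
        if (copy == src || copy.length != src.length) {
            return false;
        }
        for (int i = 0; i < src.length; i++) {
            if (copy[i] != src[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Object[] src = new Object[32];
        for (int i = 0; i < src.length; i++) {
            src[i] = new Object();
        }
        System.out.println(isShallowCopy(src, viaClone(src)));     // true
        System.out.println(isShallowCopy(src, viaCopyOf(src)));    // true
        System.out.println(isShallowCopy(src, viaArraycopy(src))); // true
        System.out.println(isShallowCopy(src, viaLoop(src)));      // true
    }
}
```

Since the observable result is identical in every case, any difference in the scores has to come from how the JIT compiles each pattern, not from what the methods compute.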
As usual, these kinds of questions are quickly answered by studying the generated code. JMH provides you with -prof perfasm on Linux, and -prof xperfasm on Windows. If you run the benchmark on JDK 8u40, then you will see (note I used -bm avgt -tu ns to make the scores more comprehensible):

Why do these benchmarks perform differently? Let's first run -prof perfnorm to dissect (I dropped the lines that do not matter):

So, ConstantSize somehow does more L1-dcache-stores, but one less LLC-load. Hm, so that's what we are looking for: more stores in the constant case. -prof perfasm conveniently highlights the hot parts in the assembly:

default:

ConstantSize:

So there is that pesky rex.W rep stos %al,%es:(%rdi) consuming a significant amount of time. This zeroes the newly allocated array. In the ConstantSize test, the JVM could not work out that you are overwriting the entire target array, and so it had to pre-zero it before diving into the actual array copy.

If you look at the generated code on JDK 9b82 (the latest available), then you will see it folds both patterns into a non-zeroed copy, as you can see with -prof perfasm, and can also confirm with -prof perfnorm:

Of course, all these nanobenchmarks for arraycopy are susceptible to weird alignment-induced performance differences in the vectorized copying stubs, but that's another (horror) story that I don't have the courage to tell.
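To illustrate why the pre-zeroing can only be skipped when the copy provably covers the whole target, here is a small plain-Java sketch of the language-level guarantee involved. It does not reproduce or measure the JIT behavior described above; it only shows that a freshly allocated Object[] must appear null-filled wherever the copy does not overwrite it. The class and method names (ZeroingDemo, nullsAfterCopy) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only; it measures nothing.
class ZeroingDemo {

    // Allocates a target of dstLength, copies the first `count` elements of
    // src into it, and returns how many target slots are still null.
    public static int nullsAfterCopy(Object[] src, int dstLength, int count) {
        Object[] dst = new Object[dstLength]; // Java guarantees every slot starts as null
        System.arraycopy(src, 0, dst, 0, count);
        int nulls = 0;
        for (Object o : dst) {
            if (o == null) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        Object[] src = new Object[32];
        Arrays.fill(src, "x"); // non-null content, so leftover nulls are visible

        // Full overwrite: the initial nulls are never observable,
        // so zeroing the array first is redundant work.
        System.out.println(nullsAfterCopy(src, 32, 32)); // 0

        // Partial overwrite: the tail must still read as null,
        // so the allocation has to be zeroed.
        System.out.println(nullsAfterCopy(src, 32, 16)); // 16
    }
}
```

In the full-overwrite case the zeroing is pure overhead, which is the work the rep stos instruction is doing in the ConstantSize variant on 8u40.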