The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}
If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 s to run. How come?
I ran each version of the program 15 times, alternating between the two. Here are the results:
2*(i*i) | 2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149 | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412 | 0.6393969
0.5466744 | 0.6608845
0.531159 | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526
The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If the two versions were equally efficient, the probability of this happening would be less than 1/2^15 ≈ 0.00305%.
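As a quick sanity check of that bound, here is a throwaway snippet (the class name is purely illustrative, and this is not part of the benchmark itself):

public class SeparationBound {
    public static void main(String[] args) {
        // Under the null hypothesis that both variants are equally fast, each of the
        // 15 alternating pairs of runs is a fair coin flip. Complete separation of the
        // two columns (fastest 2 * i * i slower than slowest 2 * (i * i)) implies every
        // pair came out the same way, so its probability is at most 0.5^15.
        double p = Math.pow(0.5, 15);
        System.out.printf("p <= %.5f%%%n", p * 100);   // prints roughly 0.00305%
    }
}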
There is a slight difference in the ordering of the bytecode.

For 2 * (i * i), javac pushes the constant 2 and both copies of i and then performs the two imul instructions back to back; for 2 * i * i, it multiplies 2 * i first and only then pushes i again for the second imul. At first sight this should not make a difference; if anything the second version looks slightly better, since it uses one operand-stack slot less.
So we need to dig deeper into the lower level: the JIT-generated code.1
Remember that the JIT tends to unroll small loops very aggressively. Indeed, we observe a 16x unrolling for the 2 * (i * i) case, and in the generated code only one register is "spilled" onto the stack.
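To make the 16x unrolling concrete, here is a purely conceptual, source-level sketch of what one unrolled iteration computes (an illustration only; the actual JIT output works on registers and strength-reduces the multiplications instead of recomputing them from scratch):

int n = 0;
// One iteration of the unrolled loop handles 16 consecutive values of i.
// 1000000000 is divisible by 16, so no scalar tail loop is needed here.
for (int i = 0; i < 1000000000; i += 16) {
    n += 2 * (i * i);
    n += 2 * ((i + 1) * (i + 1));
    n += 2 * ((i + 2) * (i + 2));
    n += 2 * ((i + 3) * (i + 3));
    n += 2 * ((i + 4) * (i + 4));
    n += 2 * ((i + 5) * (i + 5));
    n += 2 * ((i + 6) * (i + 6));
    n += 2 * ((i + 7) * (i + 7));
    n += 2 * ((i + 8) * (i + 8));
    n += 2 * ((i + 9) * (i + 9));
    n += 2 * ((i + 10) * (i + 10));
    n += 2 * ((i + 11) * (i + 11));
    n += 2 * ((i + 12) * (i + 12));
    n += 2 * ((i + 13) * (i + 13));
    n += 2 * ((i + 14) * (i + 14));
    n += 2 * ((i + 15) * (i + 15));
}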
And for the 2 * i * i version, we observe much more "spilling" and many more accesses to the stack via [RSP + ...], because more intermediate results need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.

But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.
So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.
In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:
Regarding those load times: even the fastest L1d cache hit costs about 4 cycles of latency, plus an extra register and a load µop, so yes, even a few accesses to memory will hurt performance in tight loops.
But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)2:
With run times:
1 To get JIT-generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly.

2 The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer
On my JDK (Windows 10 x64, 1.8.0_65-b17) I can reproduce this and explain it:
Output:
So why? The bytecode is this:
The difference being: with brackets (2 * (i * i)), all the operands are pushed onto the stack first and the two multiplications then run back to back, whereas without brackets (2 * i * i) the code alternates between pushing an operand and multiplying. Loading everything onto the stack and then working back down is faster than switching between putting values on the stack and operating on them.
When the multiplication is 2 * (i * i), the JVM is able to factor the multiplication by 2 out of the loop, resulting in equivalent but more efficient code:
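Something along the lines of this sketch (the transformation happens in the JIT-compiled code; the Java source below is only an illustration of the equivalent computation):

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;   // only accumulate i * i inside the loop
}
n *= 2;           // the multiplication by 2 is hoisted out of the loop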
But when the multiplication is (2 * i) * i, the JVM doesn't optimize it, since the multiplication by a constant is no longer right before the addition. Here are a few reasons why I think this is the case:
- Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being equally efficient, since factoring out the multiplication no longer guarantees that the result will be the same.
- The optimized version (with the multiplication by 2 factored out of the loop) is exactly as fast as the 2 * (i * i) version.

Here is the test code that I used to draw these conclusions:
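Something along the following lines (a sketch of such a harness; the class and method names here are illustrative):

public class MulBench {

    // Each variant returns n so that the JIT cannot throw the loop away.
    static int withBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) n += 2 * (i * i);
        return n;
    }

    static int withoutBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) n += 2 * i * i;
        return n;
    }

    static int factoredOut() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) n += i * i;
        return 2 * n;
    }

    static void time(String label, java.util.function.IntSupplier variant) {
        long start = System.nanoTime();
        int n = variant.getAsInt();
        System.out.printf("%-18s %.4f s (n = %d)%n",
                label, (System.nanoTime() - start) / 1e9, n);
    }

    public static void main(String[] args) {
        for (int round = 0; round < 5; round++) {  // several rounds so the JIT warms up
            time("2 * (i * i)", MulBench::withBrackets);
            time("2 * i * i", MulBench::withoutBrackets);
            time("2 factored out", MulBench::factoredOut);
        }
    }
}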
And here are the results:
More of an addendum. I reproduced the experiment using the latest Java 8 JVM from IBM:
and this shows very similar results:
(the second set of results uses 2 * i * i).
Interestingly enough, when running on the same machine, but using Oracle java:
results are on average a bit slower:
Long story short: even the minor version number of HotSpot matters here, as subtle differences within the JIT implementation can have notable effects.
The two methods of adding do generate slightly different bytecode: one instruction sequence for 2 * (i * i) and a slightly different one for 2 * i * i. And when using a JMH benchmark along the lines of the sketch below:
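A minimal JMH sketch of such a benchmark (class and method names are illustrative):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class MulJmhBenchmark {

    @Benchmark
    public int withBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;   // returning n prevents dead-code elimination
    }

    @Benchmark
    public int withoutBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }
}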
The difference is clear:
What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup; see "How do I write a correct micro-benchmark in Java?").
Running again with Graal:
You can see that the results are much closer, which makes sense, since Graal is an overall better-performing, more modern compiler.
So this is really just down to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a deeper logical reason behind it.
While not directly related to the question's environment, just out of curiosity, I ran the same test on .NET Core 2.1, x64, release mode. Here is the interesting result, confirming that a similar phenomenon (the other way around) happens over on the dark side of the force. Code:
Result:
2 * (i * i)
2 * i * i