The following Java program takes on average between 0.50s and 0.55s to run:
public static void main(String[] args) {
long startTime = System.nanoTime();
int n = 0;
for (int i = 0; i < 1000000000; i++) {
n += 2 * (i * i);
}
System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
System.out.println("n = " + n);
}
If I replace 2 * (i * i)
with 2 * i * i
, it takes between 0.60 and 0.65s to run. How come?
I ran each version of the program 15 times, alternating between the two. Here are the results:
2*(i*i) | 2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149 | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412 | 0.6393969
0.5466744 | 0.6608845
0.531159 | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526
The fastest run of 2 * i * i
took longer than the slowest run of 2 * (i * i)
. If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.
I got similar results:
I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.
Finally, here is a
javap -c -v <.java>
decompile of each:vs.
FYI -
Kasperd asked in a comment of the accepted answer:
I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.
R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.
I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .
The result are here:
On my PC (Core i7 860, doing nothing much apart reading on my smartphone):
n += i*i
thenn*2
is first2 * (i * i)
is second.The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).
Now then, reading bytecode:
javap -c -v ./target/classes/org/sample/MyBenchmark.class
I am not expert on bytecode but we
iload_2
before weimul
: that's probably where you get the difference: I can suppose that the JVM optimize readingi
twice (i
is already here, there is no need to load it again) whilst in the2*i*i
it can't.Interesting observation using Java 11 and switching off loop unrolling with the following VM option:
The loop with the
2 * (i * i)
expression results in a more compact native code1:in comparison with the
2 * i * i
version:Java version:
Benchmark results:
Benchmark source code:
1 - VM options used:
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0