I'm playing with JMH, and in the section about looping they say:
You might notice the larger the repetitions count, the lower the "perceived" cost of the operation being measured. Up to the point we do each addition with 1/20 ns, well beyond what hardware can actually do. This happens because the loop is heavily unrolled/pipelined, and the operation to be measured is hoisted from the loop. Morale: don't overuse loops, rely on JMH to get the measurement right.
I tried it myself:
@Benchmark
@OperationsPerInvocation(1)
public int measurewrong_1() {
    return reps(1);
}

@Benchmark
@OperationsPerInvocation(1000)
public int measurewrong_1000() {
    return reps(1000);
}
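where reps is the loop helper from the JMH loops sample, roughly:

int x = 1;
int y = 2;

private int reps(int reps) {
    int s = 0;
    for (int i = 0; i < reps; i++) {
        s += (x + y);   // the addition whose cost we are trying to measure
    }
    return s;
}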
Running it, I got the following result:
Benchmark                       Mode  Cnt  Score   Error  Units
MyBenchmark.measurewrong_1      avgt   15  2.425 ± 0.137  ns/op
MyBenchmark.measurewrong_1000   avgt   15  0.036 ± 0.001  ns/op
It indeed shows that MyBenchmark.measurewrong_1000 appears dramatically faster per operation than MyBenchmark.measurewrong_1. But I cannot really understand what optimization the JVM performs to produce this improvement.
What do they mean by the loop being unrolled/pipelined?
Loop unrolling makes pipelining possible: a pipelined CPU (a classic RISC design, for example) can overlap the execution of the independent instructions in the unrolled body.
So if your CPU has a classic 5-stage pipeline, the unrolled instructions can be scheduled like this:
IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back
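A sketch of the standard timing chart: each instruction advances one stage per cycle, and a new instruction starts every cycle, so up to five instructions are in flight at once.

cycle:    1    2    3    4    5    6    7    8    9
instr 1:  IF   ID   EX   MEM  WB
instr 2:       IF   ID   EX   MEM  WB
instr 3:            IF   ID   EX   MEM  WB
instr 4:                 IF   ID   EX   MEM  WB
instr 5:                      IF   ID   EX   MEM  WB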
[pipeline diagram from an Oracle white paper]
More information about pipelining: Classic RISC pipeline
Loop Pipelining = Software Pipelining.
Basically, it's a technique used to improve the efficiency of sequential loop iterations by executing some of the instructions from the loop body in parallel.
Of course, this can only be done when certain conditions are met, such as the iterations not depending on one another.
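As a rough hand-written illustration of the idea (the sumOfSquares method and its array are made up for this sketch, and it is not actual compiler output): the load for the next iteration is issued while the computation for the current one is still in flight, so on hardware the two can overlap.

// Sketch only: software pipelining applied by hand to a made-up loop.
static int sumOfSquares(int[] a) {
    int n = a.length;            // assumes n >= 1 to keep the sketch short
    int sum = 0;
    int v = a[0];                // prologue: load for the first iteration
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];     // load for the NEXT iteration...
        sum += v * v;            // ...overlaps with the compute for the CURRENT one
        v = next;
    }
    return sum + v * v;          // epilogue: compute for the last element
}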
[software pipelining illustration from insidehpc.com]
See more here:
Software pipelining explained
Software pipelining - Wikipedia
Loop unrolling is a technique that flattens multiple loop iterations by repeating the loop body.
E.g. the core loop of reps in the given example
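// x and y are int fields of the benchmark state
int s = 0;
for (int i = 0; i < reps; i++) {
    s += (x + y);
}
return s;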
can be unrolled by the JIT compiler into something like this (a sketch with an unroll factor of 16; the tail iterations for counts that are not a multiple of 16 are omitted):
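int s = 0;
for (int i = 0; i < reps; i += 16) {
    s += (x + y);
    s += (x + y);
    // ... 16 copies of the addition in total ...
    s += (x + y);
}
return s;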
Then the extended loop body can be further optimized to
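int s = 0;
for (int i = 0; i < reps; i += 16) {
    s += 16 * (x + y);   // 16 additions collapsed into one multiply
}
return s;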
Obviously, computing 16 * (x + y) is much faster than computing (x + y) 16 times.