The code below demonstrates the problem unequivocally:

The exact same block of code becomes slower after a busy spin pause.

Note that of course I'm not using Thread.sleep. Also note that there are no conditionals leading to a HotSpot/JIT de-optimization, as I change the pause using a math operation, not an if.
- There is a block of math operations that I want to time.
- First I time the block pausing 1 nanosecond before I start my measurement. I do that 20,000 times.
- Then I change the pause from 1 nanosecond to 5 seconds and proceed to measure the latency as usual. I do that 15 times.
- Then I print the last 30 measurements, so you can see 15 measurements with the pause of 1 nanosecond and 15 measurements with the pause of 5 seconds.
As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?
$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754
The code:
import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while (System.nanoTime() < until) {
            until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {
        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() + " => Please wait " + testDuration + " seconds for the results...");
        while (count < results.length) {
            double x = busyPause(interval);
            long latency = System.nanoTime();
            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);
            latency = System.nanoTime() - latency;
            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when count reaches WARMUP
        }
        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for (long t : array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}
TL;DR
http://www.brendangregg.com/activebenchmarking.html
casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
Problem N1. The very first measurement after the pause change.
It looks like you are facing on-stack replacement (OSR). When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

The root cause is a wrong microbenchmark: it was not properly warmed up. Just insert the following line into your benchmark before the while loop in order to fix it:
System.out.println("WARMUP = " + busyPause(5000000000L));
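In context, the start of run() would then look like this (a sketch; the rest of the method is unchanged from your listing):

public void run() {
    // warm up busyPause() with the long argument too, so the compiled
    // code already covers the slow path before measurement starts
    System.out.println("WARMUP = " + busyPause(5000000000L));
    long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
    // ... the rest is identical to the original listing
}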
How to check this? Just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code so that it now prints the interval before every call:
interval = 1
interval = 1
interval = 5000000000
689 145 4 JvmPauseLatency::busyPause (19 bytes) made not entrant
689 146 3 JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J
698 147 % 4 JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
702 148 4 JvmPauseLatency::busyPause (19 bytes)
705 146 3 JvmPauseLatency::busyPause (19 bytes) made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000
Usually OSR occurs at tier 4, so in order to disable it you can use the following options:

- -XX:-TieredCompilation: disable tiered compilation entirely
- -XX:+TieredCompilation -XX:TieredStopAtLevel=3: stop tiered compilation at level 3, so tier 4 is never reached
- -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement: keep tier 4 but disable OSR
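For example, to check whether OSR is the culprit, you could re-run the original test with the last combination (a sketch; adjust the classpath to your setup):

$ java -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement -cp . JvmPauseLatency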
Problem N2. How to measure.
Let's start with the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In a few words:

- the JIT can only compile whole methods - your test is one big loop, so only OSR is available for your test
- you are trying to measure something small, possibly smaller than the nanoTime() call itself (see "What is the cost of volatile write?" in the article above)
- the microarchitecture level matters: caches and CPU pipeline stalls are important, and for example a TLB miss or a branch misprediction can take more time than the code under test itself
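To get a feel for the second pitfall, here is a minimal probe of the cost of the nanoTime() call itself (deliberately not JMH, so treat the numbers as ballpark only):

public class NanoTimeCost {
    public static void main(String[] args) {
        // two back-to-back calls: the delta is roughly the probe overhead itself
        for (int i = 0; i < 10; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime();
            System.out.println("nanoTime() back-to-back delta: " + (t1 - t0) + " ns");
        }
    }
}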
So in order to avoid all these pitfalls you can use a JMH-based benchmark like this:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L; // tokens for Blackhole.consumeCPU(), not real nanos
    public static final long SHORT_PAUSE = 1L;

    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}
Please note that I've changed the busy-loop implementation to Blackhole#consumeCPU() in order to avoid OS-related effects. My results are:
Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op
Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort (97.3 + 6.5 ≈ 100.5 ns/op).
Problem N3. The discrepancy is big.
What is wrong with your test? It does not warm up the JVM properly, i.e. it uses one parameter to warm up and another to test. Why is this important? The JVM uses profile-guided optimizations: for example, it counts how often a branch has been taken and generates "the best" (branch-free) code for that particular profile. So when we warm up our benchmark with parameter 1, the JVM generates "optimal" code in which the branch in the while loop has almost never been taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):
<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>
After the parameter changes, the JIT falls into an uncommon trap and runs code that is no longer optimal for the new profile. I've created a benchmark based on the original one, with minor changes:

- busyPause is replaced by consumeCPU from JMH in order to have a pure Java benchmark without interactions with the system (nanoTime actually goes through the userland vDSO function clock_gettime, and we are unable to profile that code)
- all calculations are removed
import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in Blackhole tokens

    private static volatile long consumedCPU = System.nanoTime();

    // org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while (count < results.length) {
            consumeCPU(interval);
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }
        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount++;
        }
    }
}
And the results are:
Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
In order to fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE).
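For clarity, that one-line change lives in main() (a sketch; everything else is unchanged):

public static void main(String[] args) {
    int totalCount = 0;
    while (totalCount < 100) {
        // warm up with the same pause that is used during measurement
        new JvmPauseLatency().run(PAUSE);
        totalCount++;
    }
}

With that change, here are the results: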
Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]
If you want to change the "pause" dynamically, you have to warm up the JVM dynamically too, i.e. train the branch profile with both pause values during warm-up:
while (count < results.length) {
    consumeCPU(interval);
    long latency = System.nanoTime();
    latency = System.nanoTime() - latency;
    results[count++] = latency;
    if (count >= WARMUP) {
        interval = PAUSE;
    } else {
        // train both sides of the branch profile (rnd is a java.util.Random)
        interval = rnd.nextBoolean() ? PAUSE : 0;
    }
}
Problem N4. What about the interpreter (-Xint)?
In the case of a switch-based interpreter we have a lot of problems, and the main one is indirect branch instructions. I've made 3 experiments:

- random warm-up
- constant warm-up with pause 0
- the whole test uses pause 0, including the measured part
Each experiment was started with the following command:

sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency

and the results are:
Random warm-up:

Performance counter stats for 'java -Xint JvmPauseLatency':
272,822,274,275 cycles
723,420,125,590 instructions # 2.65 insn per cycle
26,994,494 cache-references
8,575,746 cache-misses # 31.769 % of all cache refs
2,060,138,555 bus-cycles
2,930,155 branch-misses
86.808481183 seconds time elapsed
Constant warm-up with pause 0:

Performance counter stats for 'java -Xint JvmPauseLatency':
2,812,949,238 cycles
7,267,497,946 instructions # 2.58 insn per cycle
6,936,666 cache-references
1,107,318 cache-misses # 15.963 % of all cache refs
21,410,797 bus-cycles
791,441 branch-misses
0.907758181 seconds time elapsed
The whole test with pause 0:

Performance counter stats for 'java -Xint JvmPauseLatency':
126,157,793 cycles
158,845,300 instructions # 1.26 insn per cycle
6,650,471 cache-references
909,593 cache-misses # 13.677 % of all cache refs
1,635,548 bus-cycles
775,564 branch-misses
0.073511817 seconds time elapsed
In the case of branch misses, latency and footprint grow non-linearly, due to the interpreter's huge memory footprint.