Benchmarking Java HashMap Get (JMH vs Looping)

My ultimate goal is to create a comprehensive set of benchmarks for several Java primitive collection libraries using the standard Java collections as a baseline. In the past I have used the looping method of writing these kinds of micro-benchmarks. I put the function I am benchmarking in a loop and iterate 1 million+ times so the jit has a chance to warmup. I take the total time of the loop and then divide by the number of iterations to get an estimate for the amount of time a single call to the function I am benchmarking would take. After recently reading about the JMH project and specifically this example: JMHSample_11_Loops I see the issue with this approach.

My Machine:

Windows 7 64-bit
Core i7-2760QM @ 2.40 GHz
8.00 GB Ram
jdk1.7.0_45 64-bit

Here is a stripped down simple example of the looping method code described above:

    public static void main(String[] args) {
    HashMap<Long, Long> hmap = new HashMap<Long, Long>();
    long val = 0;

    //populating the hashmap
    for (long idx = 0; idx < 10000000; idx++) {
        hmap.put(idx, idx);
    }


    Stopwatch s = Stopwatch.createStarted();
    long x = 0;
    for (long idx = 0; idx < 10000000; idx++) {
       x =  hmap.get(idx);
    }
    s.stop();
    System.out.println(s); //5.522 s
    System.out.println(x); //9999999

    //5.522 seconds / 10000000 = 552.2 nanoseconds
}

Here is my attempt at rewriting this benchmark using JMH:

package com.test.benchmarks;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.HashMap;
import java.util.concurrent.TimeUnit;


@State(Scope.Thread)
public class MyBenchmark {


    private HashMap<Long, Long> hmap = new HashMap<Long, Long>();
    private long key;

    @Setup(Level.Iteration)
    public void setup(){

        key = 0;

        for(long i = 0; i < 10000000; i++) {
            hmap.put(i, i);
        }
    }


    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public long testGetExistingKey() throws InterruptedException{

        if(key >= 10000000) key=0;
        return hmap.get(key++);
    }


    public static void main(String[] args) throws RunnerException {

        Options opt = new OptionsBuilder()
                .include(".*" + MyBenchmark.class.getSimpleName() + ".*")
                .warmupIterations(5)
                .measurementIterations(25)
                .forks(1)
                .build();

        new Runner(opt).run();

    }

}

Here are the results:

 Result: 31.163 ±(99.9%) 11.732 ns/op [Average]
   Statistics: (min, avg, max) = (0.000, 31.163, 939008.000), stdev = 1831.428
   Confidence interval (99.9%): [19.431, 42.895]
  Samples, N = 263849
        mean =     31.163 ±(99.9%) 11.732 ns/op
         min =      0.000 ns/op
  p( 0.0000) =      0.000 ns/op
  p(50.0000) =      0.000 ns/op
  p(90.0000) =      0.000 ns/op
  p(95.0000) =    427.000 ns/op
  p(99.0000) =    428.000 ns/op
  p(99.9000) =    428.000 ns/op
  p(99.9900) =    856.000 ns/op
  p(99.9990) =   9198.716 ns/op
  p(99.9999) = 939008.000 ns/op
         max = 939008.000 ns/op


# Run complete. Total time: 00:02:07

Benchmark                                Mode   Samples        Score  Score error    Units
c.t.b.MyBenchmark.testGetExistingKey   sample    263849       31.163       11.732    ns/op

As far as I can tell, the same benchmark in JMH has hashmap gets at 31 nanoseconds vs 552 nanoseconds for the looping test. 31 nanoseconds seems a little too fast for me. Looking at Latency Numbers Every Programmer Should Know a main memory reference is around 100 nanoseconds. L2 cache reference is roughly 7 nanoseconds, but the HashMap with 10 million Long keys and values well exceeds L2. Also the JMH results look strange to me. 90% of the get calls take 0.0 nanoseconds?

I am assuming this is user error. Any help/pointers would be appreciated. Thanks.

UPDATE

Here are the results from doing AverageTime run. This is much more inline with my expectations. Thanks @oleg-estekhin! In the comments below I mentioned that I had done the AverageTime test previously and had similar results as SampleTime. I believe doing that run I had used a HashMap with far fewer entries and that the faster lookups did make sense.

Result: 266.306 ±(99.9%) 139.359 ns/op [Average]
  Statistics: (min, avg, max) = (27.266, 266.306, 1917.271), stdev = 410.904
  Confidence interval (99.9%): [126.947, 405.665]


# Run complete. Total time: 00:07:17

Benchmark                                Mode   Samples        Score  Score error    Units
c.t.b.MyBenchmark.testGetExistingKey     avgt       100      266.306      139.359    ns/op

First, looping test measures average time, while your JMH code is configured for sample time. From the Mode.SampleTime javadoc:

Sample time: samples the time for each operation.

Individual executions of Map.get() are pretty fast to the point when the underlying time measurement system will report 0 for some of the executions due to time measurement granularity (read Nanotrusting the Nanotime blog post by the JMH author for more information).

In the sample mode the benchmarks collects individual sample times into an array and then calculates averages and percentiles using that array. When more then half of the array values are zero (in your particular setup more than 90% of array values are zero, as indicated by the p(90.0000) = 0.000 ns/op) the average is bound to be pretty low but when you see p(50) = 0 (and especially p(90) = 0) in your output the only conclusion you can reliable make is that these results are garbage and you need to find another way to measure that code.

You should use Mode.AverageTime (or Mode.Throughput) benchmark mode. Leave Mode.SampleTime for situations when individual invocation takes substantial time.
You could add a "baseline" benchmark which executes the if () and key++ in order to isolate the time required for key bookkeeping and the actual Map.get() time, but you will need to explain the results (the blog post linked above describes pitfalls with subtracting "baselines" from "real" measurements).
You could try to use Blackhole.consumeCPU() to increase the execution time of individual invocation (see the previous point about "baselines" and associated pitfalls).