
Rust benchmark optimized out

Posted 2019-03-02 10:19

Question:

I am trying to benchmark getting keys from a Rust hash map. I have the following benchmark:

#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}

where get_random_hash is defined as:

fn get_random_hash<T>(
    new: &dyn Fn(usize) -> T,
    insert: &dyn Fn(&mut T, String, usize) -> (),
) -> (T, Vec<String>) {
    let mut keys = Vec::with_capacity(HASH_SIZE);
    let mut hash = new(HASH_CAPACITY);
    for i in 0..HASH_SIZE {
        let k: String = format!("{}", Uuid::new_v4());
        keys.push(k.clone());
        insert(&mut hash, k, i);
    }
    (hash, keys)
}

and rust_insert_fn is:

fn rust_insert_fn(map: &mut HashMap<String, usize>, key: String, value: usize) {
    map.insert(key, value);
}

However, when I run the benchmark, it is clearly optimized out:

test benchmarks::benchmarks::rust_get        ... bench:           1 ns/iter (+/- 0)

I thought test::black_box would solve the problem, but it doesn't look like it does. I have even tried wrapping the hash.get(&k) in the for loop with test::black_box, but the code is still optimized out. How should I correctly get the code to run without being optimized out?

EDIT - Even the following still optimizes out the get operation:

#[bench]
fn rust_get(b: &mut Bencher) {
  let (hash, keys) = get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
  let mut keys = test::black_box(keys);
  b.iter(|| {
    let mut n = 0;
    for k in keys.drain(..) {
      hash.get(&k);
      n += 1;
    };
    return n;
  });
}

Interestingly, the following benchmarks work:

#[bench]
fn rust_get_random(b: &mut Bencher) {
  let (hash, _) = get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
  b.iter(|| {
    for _ in 0..HASH_SIZE {
      hash.get(&format!("{}", Uuid::new_v4()));
    }
  });
}

#[bench]
fn rust_insert(b: &mut Bencher) {
  b.iter(|| {
    let mut hash = HashMap::with_capacity(HASH_CAPACITY);
    for i in 0..HASH_SIZE {
      let k: String = format!("{}", Uuid::new_v4());
      hash.insert(k, i);
    }
  });
}

but this also does not:

#[bench]
fn rust_del(b: &mut Bencher) {
  let (mut hash, keys) = get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
  let mut keys = test::black_box(keys);
  b.iter(|| {
    for k in keys.drain(..) {
      hash.remove(&k);
    };
  });
}

Here is the full gist.

Answer 1:

How does a compiler optimizer work?

An optimizer is nothing more than a pipeline of analyses and transformations. Each individual analysis or transformation is relatively simple, and the optimal order to apply them is unknown and generally determined by heuristics.

How does this affect my benchmark?

Benchmarks are complicated in that in general you wish to measure optimized code, but at the same time some analyses or transformations may remove the code you were interested in, rendering the benchmark useless.

It is therefore important to have a passing acquaintance with the analyses and transformation passes of the particular optimizer you are using so as to be able to understand:

  • which ones are undesirable,
  • how to foil them.

As mentioned, most passes are relatively simple, and therefore foiling them is relatively simple as well. The difficulty lies in the fact that there are a hundred or more of them and you have to know which one is kicking in to be able to foil it.

What optimizations am I running afoul of?

There are a few specific optimizations which very often interfere with benchmarks (each is illustrated in the sketch after this list):

  • Constant Propagation: allows evaluating part of the code at compile-time,
  • Loop Invariant Code Motion: allows lifting the evaluation of some piece of code outside the loop,
  • Dead Code Elimination: removes code that is not useful.
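
As a rough illustration, consider the hypothetical function below (written for exposition, not taken from the question); the comments mark what each of these passes could do to it:

fn example(data: &[u64]) -> u64 {
    // Constant Propagation: `factor` is known at compile time and is folded to 5.
    let factor: u64 = 2 + 3;
    let mut sum = 0;
    for &x in data {
        // Loop Invariant Code Motion: `offset` does not depend on the loop
        // variable, so its computation is hoisted out of the loop.
        let offset = factor * 10;
        sum += x + offset;
    }
    // Dead Code Elimination: `_unused` is never read, so the multiply is removed.
    let _unused = sum * 7;
    sum
}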

What? How dare the optimizer mangle my code so?

The optimizer operates under the so-called as-if rule. This basic rule allows the optimizer to perform any transformation which does not change the output of the program. That is, it should not change the observable behavior of the program in general.

On top of that, a few changes are generally explicitly allowed. The most obvious is that the run-time is expected to shrink; this in turn means that thread interleaving may differ, and some languages give even more wiggle room.
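
As a hypothetical illustration of the as-if rule, an optimizer is free to turn the loop below into a closed-form computation: as long as the returned value is the same, no caller can tell the difference, so the transformation is legal. Optimizing compilers really do perform this kind of rewrite for simple counting loops.

// As written in the source.
fn sum_to(n: u64) -> u64 {
    let mut total = 0;
    for i in 1..=n {
        total += i;
    }
    total
}

// What the optimizer may effectively produce under the as-if rule:
// the same result (for inputs where the sum fits in a u64) with no loop at all.
fn sum_to_closed_form(n: u64) -> u64 {
    n * (n + 1) / 2
}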

I used black_box!

What is black_box? It is a function whose definition is specifically opaque to the optimizer. Since it may have side-effects, this restricts the optimizations the compiler is allowed to perform. It therefore means:

  • the transformed code must perform the same number of calls to black_box as the original code,
  • the transformed code must perform said calls in the same order and with the same arguments,
  • no assumption can be made on the value returned by black_box.

Thus, surgical use of black_box can foil certain optimizations. Blind use, however, may not foil the right ones.
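
As a minimal sketch of that surgical use, assuming the nightly test crate the question already relies on (the function and numbers below are made up for illustration), the idea is to wrap black_box around the input you want kept opaque and around the result you want kept alive:

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

#[bench]
fn bench_sum_of_squares(b: &mut Bencher) {
    b.iter(|| {
        // black_box(1_000) keeps the input from being treated as a compile-time
        // constant; the outer black_box keeps the result from being discarded.
        black_box(sum_of_squares(black_box(1_000)))
    });
}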

So which optimizations is my benchmark running afoul of?

Let's start from the naive code:

#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, mut keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);

    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}

The assumption is that the loop inside b.iter() will iterate over all keys and perform a hash.get() for each of them. The optimizer, however, notices that:

  1. The result of hash.get() is unused,
  2. hash.get() is a pure function, meaning that it has no side-effects.

Thus, this loop can be rewritten as:

b.iter(|| { for k in keys.drain(..) {} })

We are running afoul of Dead Code Elimination (or some variant): the code serves no purpose, thus it is eliminated.

It may even be that the compiler is smart enough to realize that for k in keys.drain(..) {} can be optimized into drop(keys).

A surgical application of black_box can, however, foil DCE:

b.iter(|| {
    for k in keys.drain(..) {
        black_box(hash.get(&k));
    }
});

As per the effects of black_box described above:

  • the loop can no longer be optimized out, as it would change the number of calls to black_box,
  • each call to black_box must be performed with the expected argument.

There is still one possible hurdle: Constant Propagation. Specifically if the compiler realizes that all keys yield the same value, it could optimize out hash.get(&k) and replace it by said value.

This can be achieved by obfuscating the keys (let mut keys = black_box(keys);, as you did above) or by obfuscating the map. If you were benchmarking an empty map, obfuscating the map would be necessary; here, the two are equivalent.

We thus get:

#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);

    b.iter(|| {
        for k in keys.drain(..) {
            test::black_box(hash.get(&k));
        }
    });
}

A final tip.

Benchmarks are complicated enough that you should be extra careful to only benchmark what you wish to benchmark.

In this particular case, there are two method calls:

  • keys.drain(),
  • hash.get().

Since the benchmark name suggests, to me, that what you aim for is to measure the performance of get, I can only assume that the call to keys.drain(..) is a mistake.

Thus, the benchmark really should be:

#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let keys = test::black_box(keys);

    b.iter(|| {
        for k in &keys {
            test::black_box(hash.get(k));
        }
    });
}

In this instance, this is even more critical in that the closure passed to b.iter() is expected to run multiple times: if you drain the keys the first time, what's left afterward? An empty Vec...

... which may actually be all that is really happening here; since b.iter() runs the closure until its time stabilizes, it may just be draining the Vec in the first run and then time an empty loop.
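
For completeness, if draining really were the behavior you wanted to measure, a sketch (the name rust_get_drain is just for illustration) would have to replenish the keys on every run of the closure, e.g. by cloning, with the caveat that the clone and the drop of the drained Vec are then included in the measured time:

#[bench]
fn rust_get_drain(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let keys = test::black_box(keys);

    b.iter(|| {
        // Clone so that every run of the closure drains a full Vec rather than
        // an empty one; the clone itself is now part of what is being timed.
        let mut keys = keys.clone();
        for k in keys.drain(..) {
            test::black_box(hash.get(&k));
        }
    });
}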