I have a simple idea I'm trying to benchmark in Rust. However, when I go to measure it using test::Bencher
, the base case that I'm trying to compare against:
#![feature(test)]
extern crate test;
#[cfg(test)]
mod tests {
use test::black_box;
use test::Bencher;
const ITERATIONS: usize = 100_000;
struct CompoundValue {
pub a: u64,
pub b: u64,
pub c: u64,
pub d: u64,
pub e: u64,
}
#[bench]
fn bench_in_place(b: &mut Bencher) {
let mut compound_value = CompoundValue {
a: 0,
b: 2,
c: 0,
d: 5,
e: 0,
};
let val: &mut CompoundValue = &mut compound_value;
let result = b.iter(|| {
let mut f : u64 = black_box(0);
for _ in 0..ITERATIONS {
f += val.a + val.b + val.c + val.d + val.e;
}
f = black_box(f);
return f;
});
assert_eq!((), result);
}
}
is optimized away entirely by the compiler, resulting in:
running 1 test
test tests::bench_in_place ... bench: 0 ns/iter (+/- 1)
As you can see in the gist, I have tried to employ the suggestions set forth in the documentation, namely:
- Making use of the
test::black_box
method to hide implementation details from the compiler. - Returning the calculated value from the closure passed to the
iter
method.
Are there any other tricks I can try?
The problem here is the compiler can see that the result of the loop is the same every time
iter
calls the closure (just add some constant tof
) becauseval
never changes.Looking at the assembly (by passing
--emit asm
to the compiler) demonstrates this:The section between
.LBB0_2:
andjne .LBB0_2
is what the call toiter
compiles down to, it is repeatedly running the code in the closure that you passed to it. The#APP
#NO_APP
pairs are theblack_box
calls. You can see that theiter
loop doesn't do much:movq
is just moving data from register to/from other registers and the stack, andaddq
/decq
are just adding and decrementing some integers.Looking above that loop, there's
movl $700000, %edx
: This is loading the constant700_000
into the edx register... and, suspiciously,700000 = ITEARATIONS * (0 + 2 + 0 + 5 + 0)
. (The other stuff in the code isn't so interesting.)The way to disguise this is to
black_box
the input, e.g. I might start off with the benchmark written like:In particular,
val
isblack_box
'd inside the closure, so that the compiler can't precompute the addition and reuse it for each call.However, this is still optimised to be very fast: 1 ns/iter for me. Checking the assembly again reveals the problem (I've trimmed the assembly down to just the loop that contains the
APP
/NO_APP
pairs, i.e. the calls toiter
's closure):Now the compiler has seen that
val
doesn't change over the course of thefor
loop, so it has correctly transformed the loop into just summing all the elements ofval
(that's the sequence of 4addq
s) and then multiplying that byITERATIONS
(theimulq
).To fix this, we can do the same thing: move the
black_box
deeper, so that the compiler can't reason about the value between different iterations of the loop:This version now takes 137,142 ns/iter for me, although the repeated calls to
black_box
probably cause non-trivial overhead (having to repeatedly write to the stack, and then read it back).We can look at the asm, just to be sure:
Now the call to
iter
is two loops: the outer loop that calls the closure many times (.LBB0_2:
tojne .LBB0_2
), and thefor
loop inside the closure (.LBB0_3:
tojne .LBB0_3
). The inner loop is indeed doing a call toblack_box
(APP
/NO_APP
) followed by 5 additions. The outer loop is settingf
to zero (xorl %edi, %edi
), running the inner loop, and thenblack_box
ingf
(the secondAPP
/NO_APP
).(Benchmarking exactly what you want to benchmark can be tricky!)
The problem with your benchmark is that the optimizer knows that your CompoundValue is going to be immutable during the benchmark, thus it can strengh-reduce the loop and thus compile it down to a constant value.
The solution is to use test::black_box on the parts of your CompoundValue. Or even better, try to get rid of the loop (unless you want to benchmark loop performance), and let Bencher.iter(..) do it's job.