benchmarking, code reordering, volatile

2019-02-01 00:20发布

I decide I want to benchmark a particular function, so I naïvely write code like this:

#include <ctime>
#include <iostream>

int SlowCalculation(int input) { ... }

int main() {
    std::cout << "Benchmark running..." << std::endl;
    std::clock_t start = std::clock();
    int answer = SlowCalculation(42);
    std::clock_t stop = std::clock();
    double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
    std::cout << "Benchmark took " << delta << " seconds, and the answer was "
              << answer << '.' << std::endl;
    return 0;
}

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:

    std::clock_t start = std::clock();
    std::clock_t stop = std::clock();
    int answer = SlowCalculation(42);

At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.

But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?

Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.

My question is:

What is the best way to prevent code reordering in benchmarking code like this?

My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.

Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to the Maxim.

I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.

For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.

8条回答
SAY GOODBYE
2楼-- · 2019-02-01 00:39

Related problem: how to stop the compiler from hoisting a tiny repeated calculation out of a loop

I couldn't find this anywhere - so adding my own answer 11 years after the question was asked ;).

Using volatile on variables is not what you want for that. That will cause the compiler to load and store those variable from and to RAM every single time (assuming there is a side effect of that that must be preserved: aka - good for I/O registers). When you are bench marking you are not interested in measuring how long it takes to get something from memory, or write it there. Often you just want your variable to be in CPU registers.

volatile is usable if you assign to it once outside a loop that doesn't get optimized away (like summing an array), as an alternative to printing the result. (Like the long-running function in the question). But not inside a tiny loop; that will introduce store/reload instructions and store-forwarding latency.


I think that the ONLY way to submit your compiler into not optimizing your benchmark code to hell is by using asm. This allows you to fool the compiler into thinking it doesn't know anything about your variables content, or usage, so it has to do everything every single time, as often as your loop asks it to.

For example, if I wanted to benchmark m & -m where m is some uint64_t, I could try:

uint64_t const m = 0x0000080e70100000UL;
for (int i = 0; i < loopsize; ++i)
{
  uint64_t result = m & -m;
}

The compiler would obviously say: I'm not even going to calculate that; since you're not using the result. Aka, it would actually do:

for (int i = 0; i < loopsize; ++i)
{
}

Then you can try:

uint64_t const m = 0x0000080e70100000UL;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
  result = m & -m;
}

and the compiler says, ok - so you want me to write to result every time and do

uint64_t const m = 0x0000080e70100000UL;
uint64_t tmp = m & -m;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
  result = tmp;
}

Spending a lot of time writing to the memory address of result loopsize times, just as you asked.

Finally you could also make m volatile, but the result would look like this in assembly:

507b:   ba e8 03 00 00          mov    $0x3e8,%edx
  # top of loop
5080:   48 8b 05 89 ef 20 00    mov    0x20ef89(%rip),%rax        # 214010 <m_test>
5087:   48 8b 0d 82 ef 20 00    mov    0x20ef82(%rip),%rcx        # 214010 <m_test>
508e:   48 f7 d8                neg    %rax
5091:   48 21 c8                and    %rcx,%rax
5094:   48 89 44 24 28          mov    %rax,0x28(%rsp)
5099:   83 ea 01                sub    $0x1,%edx
509c:   75 e2                   jne    5080 <main+0x120>

Reading from memory twice and writing to it once, besides the requested calculation with registers.

The correct way to do this is therefore:

for (int i = 0; i < loopsize; ++i)
{
  uint64_t result = m & -m;
  asm volatile ("" : "+r" (m) : "r" (result));
}

which results in the assembly code (from gcc8.2 on the Godbolt compiler explorer):

 # gcc8.2 -O3 -fverbose-asm
    movabsq $8858102661120, %rax      #, m
    movl    $1000, %ecx     #, ivtmp_9     # induction variable tmp_9
.L2:
    mov     %rax, %rdx      # m, tmp91
    neg     %rdx            # tmp91
    and     %rax, %rdx      # m, result
       # asm statement here,  m=%rax   result=%rdx
    subl    $1, %ecx        #, ivtmp_9
    jne     .L2
    ret     

Doing exactly the three requested assembly instructions inside the loop, plus a sub and jne for the loop overhead.

The trick here is that by using the asm volatile1 and tell the compiler

  1. "r" input operand: it uses the value of result as input so the compiler has to materialize it in a register.
  2. "+r" input/output operand: m stays in the same register but is (potentially) modified.
  3. volatile: it has some mysterious side effect and/or is not a pure function of the inputs; the compiler must execute it as many times as the source does. This forces the compiler to leave your test snippet alone and inside the loop. See the gcc manual's Extended Asm#Volatile section.

footnote 1: The volatile is required here or the compiler will turn this into an empty loop. Non-volatile asm (with any output operands) is considered a pure function of its inputs that can be optimized away if the result is unused. Or CSEd to only run once if used multiple times with the same input.


Everything below is not mine-- and I do not necessarily agree with it. --Carlo Wood

If you had used asm volatile ("" : "=r" (m) : "r" (result)); (with an "=r" write-only output), the compiler might choose the same register for m and result, creating a loop-carried dependency chain that tests the latency, not throughput, of the calculation.

From that, you'd get this asm:

5077:   ba e8 03 00 00          mov    $0x3e8,%edx
507c:   0f 1f 40 00             nopl   0x0(%rax)    # alignment padding
  # top of loop
5080:   48 89 e8                mov    %rbp,%rax    # copy m
5083:   48 f7 d8                neg    %rax         # -m
5086:   48 21 c5                and    %rax,%rbp    # m &= -m   instead of using the tmp as the destination.
5089:   83 ea 01                sub    $0x1,%edx
508c:   75 f2                   jne    5080 <main+0x120>

This will run at 1 iteration per 2 or 3 cycles (depending on whether your CPU has mov-elimination or not.) The version without a loop-carried dependency can run at 1 per clock cycle on Haswell and later, and Ryzen. Those CPUs have the ALU throughput to run at least 4 uops per clock cycle.

This asm corresponds to C++ that looks like this:

for (int i = 0; i < loopsize; ++i)
{
  m = m & -m;
}

By misleading the compiler with a write-only output constraint, we've created asm that doesn't look like the source (which looked like it was computing a new result from a constant every iteration, not using result as an input to the next iteration..)

You might want to microbenchmark latency, so you can more easily detect the benefit of compiling with -mbmi or -march=haswell to let the compiler use blsi %rax, %rax and calculate m &= -m; in one instruction. But it's easier to keep track of what you're doing if the C++ source has the same dependency as the asm, instead of fooling the compiler into introducing a new dependency.

查看更多
放我归山
3楼-- · 2019-02-01 00:43

You could make two C files, SlowCalculation compiled with g++ -O3 (high level of optimization), and the benchmark one compiled with g++ -O1 (lower level, still optimized - that may be sufficient for that benchmarking part).

According to the man page, reordering of code happens during -O2 and -O3 optimizations levels.

Since optimization happens during compilation, not linkage, the benchmark side should not be affected by code reordering.

Assuming you are using g++ - but there should be something equivalent in another compiler.

查看更多
登录 后发表回答