benchmarking, code reordering, volatile

2019-02-01 00:20发布

I decide I want to benchmark a particular function, so I naïvely write code like this:

#include <ctime>
#include <iostream>

int SlowCalculation(int input) { ... }

int main() {
    std::cout << "Benchmark running..." << std::endl;
    std::clock_t start = std::clock();
    int answer = SlowCalculation(42);
    std::clock_t stop = std::clock();
    double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
    std::cout << "Benchmark took " << delta << " seconds, and the answer was "
              << answer << '.' << std::endl;
    return 0;
}

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:

    std::clock_t start = std::clock();
    std::clock_t stop = std::clock();
    int answer = SlowCalculation(42);

At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.

But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?

Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.

My question is:

What is the best way to prevent code reordering in benchmarking code like this?

My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.

Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to the Maxim.

I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.

For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.

8条回答
太酷不给撩
2楼-- · 2019-02-01 00:24

The usual way to prevent reordering is a compile barrier i.e asm volatile ("":::"memory"); (with gcc). This is an asm instruction which does nothing, but we tell the compiler it will clobber memory, so it is not permitted to reorder code across it. The cost of this is only the actual cost of removing the reorder, which obviously is not the case for changing the optimisation level etc as suggested elsewhere.

I believe _ReadWriteBarrier is equivalent for Microsoft stuff.

Per Maxim Yegorushkin's answer, reordering is unlikely to be the cause of your issues though.

查看更多
走好不送
3楼-- · 2019-02-01 00:28

The correct way to do this in C++ is to use a class, e.g. something like

class Timer
{
    std::clock_t startTime;
    std::clock_t* targetTime;

public:
    Timer(std::clock_t* target) : targetTime(target) { startTime = std::clock(); }
    ~Timer() { *target = std::clock() - startTime; }
};

and use it like this:

std::clock_t slowTime;
{
    Timer timer(&slowTime);
    int answer = SlowCalculation(42);
}

Mind you, I don't actually believe your compiler will ever re-order like this.

查看更多
啃猪蹄的小仙女
4楼-- · 2019-02-01 00:30

Reordering described by your colleague just breaks 1.9/13

Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread (1.10), which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before B and B is not sequenced before A, then A and B are unsequenced . [ Note: The execution of unsequenced evaluations can overlap. —end note ] Evaluations A and B are indeterminately sequenced when either A is sequenced before B or B is sequenced before A, but it is unspecified which. [ Note: Indeterminately sequenced evaluations cannot overlap, but either could be executed first. —end note ]

So basically you should not think about reordering while you don't use threads.

查看更多
冷血范
5楼-- · 2019-02-01 00:35

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering.

Sorry, but your colleague is wrong.

The compiler does not reorder calls to functions whose definitions are not available at compile time. Simply imagine the hilarity that would ensue if compiler reordered such calls as fork and exec or moved code around these.

In other words, any function with no definition is a compile time memory barrier, that is, the compiler does not move subsequent statements before the call or prior statements after the call.

In your code calls to std::clock end up calling a function whose definition is not available.

I can not recommend enough watching atomic Weapons: The C++ Memory Model and Modern Hardware because it discusses misconceptions about (compile time) memory barriers and volatile among many other useful things.

Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms

Not sure if volatile is to blame here.

The reported run-time depends on how the benchmark is run. Make sure you disable CPU frequency scaling so that it does not turn on turbo mode or switches frequency in the middle of the run. Also, micro-benchmarks should be run as real-time priority processes to avoid scheduling noise. It could be that during another run some background file indexer starts competing with your benchmark for the CPU time. See this for more details.

A good practice is to measure times it takes to execute the function a number of times and report min/avg/median/max/stdev/total time numbers. High standard deviation may indicate that the above preparations are not performed. The first run often is the longest because the CPU cache may be cold and it may take many cache misses and page faults and also resolve dynamic symbols from shared libraries on the first call (lazy symbol resolution is the default run-time linking mode on Linux, for example), while subsequent calls are going to execute with much less overhead.

查看更多
在下西门庆
6楼-- · 2019-02-01 00:35

There are a couple of ways that I can think of. The idea is to create compile time barriers so that compiler does not reorder a set of instructions.

One possible way to avoid reordering would be to enforce dependency among instructions that cannot be resolved by compiler (e.g. passing a pointer to the function and using that pointer in later instruction). I'm not sure how that would affect the performance of the actual code that you are interested in benchmarking.

Another possibility is to make the function SlowCalculation(42); an extern function (define this function in a separate .c/.cpp file and link the file to your main program) and declare start and stop as global variables. I do not know what are the optimizations offered by the link-time/inter-procedural optimizer of your compiler.

Also, if you compile at O1 or O0, most probably the compiler would not bother reordering instructions. Your question is somewhat related to (Compile time barriers - compiler code reordering - gcc and pthreads)

查看更多
做个烂人
7楼-- · 2019-02-01 00:36

Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".

In order to prevent compiler reordering you may use so called compiler fences. MSVC includes 3 compiler fences:

_ReadWriteBarrier() - full fence

_ReadBarrier() - two-sided fence for loads

_WriteBarrier() - two-sided fence for stores

ICC includes __memory_barrier() full fence.

Full fences are usually the best choice because there is no need in finer-granularity on this level (compiler fences are basically costless in run-time).

Statment reordering (which most compiler do when optimization is enabled), thats the also main reason why certain program fail to operation operation when compiled with compiler optimzation.

Will suggest to read http://preshing.com/20120625/memory-ordering-at-compile-time to see potential issues we can face with compiler reoredering etc.

查看更多
登录 后发表回答