Is it realistic to use -O3 or -Ofast to compile yo

Posted 2019-02-23 15:27

When compiling the benchmark code below with -O3, I was impressed by the difference it made in latency, so I began to wonder whether the compiler is "cheating" by removing code somehow. Is there a way to check for that? Am I safe to benchmark with -O3? Is it realistic to expect 15x gains in speed?

Results without -O3: Average: 239 nanos Min: 230 nanos (9 million iterations)
Results with -O3: Average: 14 nanos Min: 12 nanos (9 million iterations)

int iterations = stoi(argv[1]);
int load = stoi(argv[2]);

long long x = 0;

for(int i = 0; i < iterations; i++) {

    long start = get_nano_ts(); // START clock

    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    long end = get_nano_ts(); // STOP clock

    // (omitted for clarity)
}

cout << "My result: " << x << endl;

Note: I am using clock_gettime to measure:

long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000 + ts.tv_nsec;
}

3 answers
Anthone
#2 · 2019-02-23 16:14

You should always benchmark with optimizations turned on. However, it is important to make sure that the things you want to time do not get optimized away by the compiler.

One way to do this is to print the calculation results after the timer has stopped:

long long x = 0;

for(int i = 0; i < iterations; i++) {

    long start = get_nano_ts(); // START clock

    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    long end = get_nano_ts(); // STOP clock

    // now print out x so the compiler doesn't just ignore it:
    std::cout << "check: " << x << '\n';

    // (omitted for clarity)
}

When comparing benchmarks for several different algorithms, the printed value can also serve as a check that each algorithm produces the same result.
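A variation on the same idea, as a minimal sketch: keep the printing out of the loop entirely, collect the timings, and report the checksum once at the end. The names BenchResult, run_benchmark and report are hypothetical, and std::chrono::steady_clock stands in for the question's get_nano_ts().

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <limits>

// Hypothetical result type: the checksum proves the work was done,
// the timings are what the benchmark is actually after.
struct BenchResult {
    long long checksum;
    long long min_ns;
    double avg_ns;
};

// Runs the question's loop and times only the inner loop of each iteration.
BenchResult run_benchmark(int iterations, int load) {
    using clock = std::chrono::steady_clock;
    long long x = 0;
    long long total_ns = 0;
    long long min_ns = std::numeric_limits<long long>::max();

    for (int i = 0; i < iterations; i++) {
        auto start = clock::now();  // START clock
        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
        auto end = clock::now();    // STOP clock

        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        total_ns += ns;
        min_ns = std::min(min_ns, ns);
    }

    if (iterations <= 0) {
        return {x, 0, 0.0};
    }
    return {x, min_ns, static_cast<double>(total_ns) / iterations};
}

// Printing the checksum makes x observable, so the compiler has to produce it.
void report(const BenchResult& r) {
    std::cout << "check: " << r.checksum << '\n'
              << "Average: " << r.avg_ns << " nanos Min: " << r.min_ns << " nanos\n";
}
```

Calling report(run_benchmark(iterations, load)) with values read from argv keeps the result dependent on runtime input. Note that this only forces the compiler to produce x; it does not force it to run the loop literally.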

虎瘦雄心在
#3 · 2019-02-23 16:21

The compiler will certainly be "cheating" and removing unnecessary code when compiling with optimization enabled. It actually goes to great lengths to speed up your code, which almost always leads to impressive speed-ups. If it were somehow able to derive a formula that calculates the result in constant time instead of using this loop, it would. A constant factor of 15 is nothing out of the ordinary.

But this does not mean that you should profile un-optimized builds! Indeed, when using languages like C and C++, the performance of un-optimized builds is pretty much completely meaningless. You need not worry about that at all.

Of course, this can interfere with micro-benchmarks such as the one you showed above. Two points on that:

  1. More often than not, such micro-optimizations do not matter either. Prefer profiling your actual program and then removing bottlenecks.
  2. If you actually want such a micro benchmark, make it depend on some runtime input and display the result. That way, the compiler cannot remove the functionality itself, just make it reasonably fast.

Since you seem to be doing that, the code you show has a good chance of being a reasonable micro benchmark. One thing you should watch out for is whether your compiler moves both calls to get_nano_ts() to the same side of the loop. It is allowed to do this, since "run time" does not count as an observable side effect. (The standard does not even mandate that your machine operates at finite speed.) It was argued here that this usually is not a problem, though I cannot really judge whether the answer given is valid or not.
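If the reordering concern matters in practice, one commonly used workaround is a compiler barrier around the timed region. The following is only a sketch and rests on several assumptions: the empty asm volatile statements are a GCC/Clang extension rather than standard C++, the helper names escape, clobber and timed_inner_loop are hypothetical, and std::chrono::steady_clock stands in for get_nano_ts(). It constrains the compiler only, not the CPU.

```cpp
#include <chrono>

// GCC/Clang extension (not standard C++): tells the optimizer that the
// object behind p may be read or written by code it cannot see.
inline void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }

// GCC/Clang extension: the optimizer must assume all memory may have
// changed here, so it should not move memory accesses across this point.
inline void clobber() { asm volatile("" : : : "memory"); }

// Hypothetical helper: runs the question's inner loop for one value of i,
// updates x, and returns the elapsed time in nanoseconds.
long long timed_inner_loop(int i, int load, long long& x) {
    using clock = std::chrono::steady_clock;

    escape(&x);                  // make x visible to the barriers below
    auto start = clock::now();   // START clock
    clobber();                   // keep the loop from moving above the start

    for (int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    clobber();                   // x must be up to date in memory here
    auto end = clock::now();     // STOP clock

    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

The barriers do not stop the optimizer from simplifying the loop itself; they only aim to keep the computation between the two clock reads.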

If your program does not do anything expensive other than the thing you want to benchmark (which, if possible, it should not do anyway), you can also move the time measurement "outside", e.g. with the time command.

祖国的老花朵
#4 · 2019-02-23 16:32

It can be very difficult to benchmark what you think you are measuring. In the case of the inner loop:

for (int j = 0;  j < load;  ++j)
        if (i % 4 == 0)
                x += (i % 4) * (i % 8);
        else    x -= (i % 16) * (i % 32);

A shrewd compiler might be able to see through that and change the code to something like:

 x = load * 174;   // example only

I know that isn't equivalent, but there is some fairly simple expression which can replace that loop.
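To make that concrete: for a fixed i, every pass through the inner loop adds the same amount to x, so the whole loop collapses to one multiplication. The sketch below puts the original loop next to one possible closed form. The function names are hypothetical, and whether a given compiler actually performs this transformation is not guaranteed.

```cpp
// Inner loop exactly as in the question, for a single value of i,
// starting from x = 0.
long long inner_loop(int i, int load) {
    long long x = 0;
    for (int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }
    return x;
}

// Hypothetical closed form of the same computation: constant time in load.
long long inner_closed_form(int i, int load) {
    if (load <= 0) {
        return 0;  // the loop body never runs
    }
    // When i % 4 == 0 the added term (i % 4) * (i % 8) is always 0.
    long long step = (i % 4 == 0)
        ? 0
        : -static_cast<long long>((i % 16) * (i % 32));
    return step * load;
}
```

With an optimizer free to do this, the time between the two clock calls may no longer depend on load at all.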

The way to be sure is to compile with gcc's -S option, at the same optimization level you benchmark with, and look at the assembly code it generates.
