When compiling the benchmark code below with -O3, I was impressed by the difference it made in latency, so I began to wonder whether the compiler is "cheating" by removing code somehow. Is there a way to check for that? Am I safe to benchmark with -O3? Is it realistic to expect 15x gains in speed?
Results without -O3: Average: 239 nanos, Min: 230 nanos (9 million iterations)
Results with -O3: Average: 14 nanos, Min: 12 nanos (9 million iterations)
int iterations = stoi(argv[1]);
int load = stoi(argv[2]);
long long x = 0;

for(int i = 0; i < iterations; i++) {

    long start = get_nano_ts(); // START clock

    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    long end = get_nano_ts(); // STOP clock

    // (omitted for clarity)
}

cout << "My result: " << x << endl;
Note: I am using clock_gettime to measure:
long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000 + ts.tv_nsec;
}
You should always benchmark with optimizations turned on. However, it is important to make sure the things you want to time do not get optimized away by the compiler. One way to do this is to print out the calculation results after the timer has stopped:
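A minimal sketch of that pattern (do_work is a made-up stand-in for the code under test, and the timing here uses std::chrono rather than the question's clock_gettime):

#include <chrono>
#include <iostream>
#include <string>

// Made-up stand-in for whatever you actually want to benchmark.
long long do_work(int load) {
    long long sum = 0;
    for (int j = 0; j < load; j++) {
        sum += (j % 16) * (j % 32);
    }
    return sum;
}

int main(int argc, char* argv[]) {
    int load = (argc > 1) ? std::stoi(argv[1]) : 1000000;

    auto start = std::chrono::steady_clock::now();
    long long result = do_work(load);
    auto end = std::chrono::steady_clock::now();

    auto nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Took " << nanos << " ns" << std::endl;

    // Printing the result after the timer has stopped keeps the compiler
    // from throwing the computation away as unused.
    std::cout << "Result: " << result << std::endl;
    return 0;
}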
When comparing benchmarks for several different algorithms, this can also serve as a check that each algorithm produces the same results.
The compiler will certainly be "cheating" and removing unnecessary code when compiling with optimization enabled. It goes to great lengths to speed up your code, which almost always leads to impressive speed-ups. If it were somehow able to derive a formula that calculates the result in constant time instead of using this loop, it would. A constant factor of 15 is nothing out of the ordinary.
But this does not mean that you should profile un-optimized builds! Indeed, in languages like C and C++, the performance of un-optimized builds is essentially meaningless, so you need not worry about that at all.
Of course, this can interfere with micro-benchmarks such as the one you showed above. Two points on that:

1. Make sure the result of the timed work is actually used afterwards (you do this by printing x at the end); otherwise the compiler is free to drop the computation entirely. Since you seem to be doing that, the code you show has a good chance of being a reasonable micro-benchmark.

2. One thing you should watch out for is whether your compiler moves both calls to get_nano_ts() to the same side of the loop. It is allowed to do this, since "run time" does not count as an observable side effect. (The standard does not even mandate that your machine operate at finite speed.) It has been argued that this usually is not a problem, though I cannot really judge whether that argument holds. If your program does not do anything expensive other than the thing you want to benchmark (which, if possible, it should not do anyway), you can also move the time measurement "outside", e.g. with the time command.
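If you want to guard against the timer calls being moved, one GCC/Clang-specific option (a sketch using the compilers' extended asm syntax; this is not something the standard guarantees and is not part of the answer above) is to put empty asm statements between the timer reads and the work so they act as compiler barriers:

#include <time.h>
#include <iostream>

// Same timestamp helper as in the question.
long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main() {
    long long x = 0;
    long start = get_nano_ts();
    asm volatile("" ::: "memory");           // barrier: the work below cannot be hoisted above this point
    for (int j = 0; j < 1000000; j++) {
        x += (j % 16) * (j % 32);
    }
    asm volatile("" : "+r"(x) : : "memory"); // barrier that also treats x as read and written here
    long end = get_nano_ts();
    std::cout << "x = " << x << ", took " << (end - start) << " ns" << std::endl;
    return 0;
}

The second barrier names x as an operand, which additionally forces the compiler to treat the computed value as used at that point.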
It can be very difficult to benchmark what you think you are measuring. In the case of the inner loop:
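for(int j = 0; j < load; j++) {      // (the inner loop from the code above)
    if (i % 4 == 0) {
        x += (i % 4) * (i % 8);
    } else {
        x -= (i % 16) * (i % 32);
    }
}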
A shrewd compiler might be able to see through that and change the code to something like:
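// A sketch of one transformation the optimizer could legally make
// (not necessarily what any particular compiler actually emits).
// The loop body never depends on j, so for a non-negative load the
// whole inner loop can collapse to a single step:
if (i % 4 == 0) {
    x += (long long)load * ((i % 4) * (i % 8));   // the added term is always 0 here
} else {
    x -= (long long)load * ((i % 16) * (i % 32));
}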
Whether or not the replacement the compiler actually derives is exactly that, the point is that some fairly simple expression can stand in for the entire loop.
The way to be sure is to use the gcc -S compiler option and look at the assembly code it generates.
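For example (assuming the source file is called bench.cpp; the name is only for illustration):

g++ -O3 -S bench.cpp -o bench.s

Adding -fverbose-asm annotates the generated assembly with comments, which makes it easier to see whether the inner loop survived optimization.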