I chose David's answer because he was the only one to present a solution for the difference between the for-loops with no optimization flags. The other answers demonstrate what happens when the optimization flags are set.
Jerry Coffin's answer explained what happens when setting the optimization flags for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one addition per iteration. Nemo's post shows the assembler output. I confirmed this by compiling with the -S flag on my PC, a 2.9 GHz Sandy Bridge (i5-2310) running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.
I was experimenting with for-loops performance when I stumbled upon the following case.
I have the following code that does the same computation in two different ways.
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}
int main()
{
    const uint64_t answer = 500000110500000000;
    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}
Compiling this code with
g++ main.cpp -o output --std=c++11
leads to the following result:
=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
My first question is: why is the second loop running 23% faster than the first?
On the other hand, if I compile the code with
g++ main.cpp -o output --std=c++11 -O1
The results improve a lot,
=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
and the difference in time almost disappears.
But I could not believe my eyes when I set the -O2 flag,
g++ main.cpp -o output --std=c++11 -O2
and got this:
=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, my second question is: What is the compiler doing when I set the -O1 and -O2 flags that leads to this gigantic performance improvement?
I checked Optimize Options - Using the GNU Compiler Collection (GCC), but that did not clarify things.
By the way, I am compiling this code with g++ (GCC) 4.9.1.
EDIT to confirm Basile Starynkevitch's assumption
I edited the code; main now looks like this:
#include <cstdlib> // for std::atoi

int main(int argc, char **argv)
{
    int start = std::atoi(argv[1]);
    int end   = std::atoi(argv[2]);
    int delta = end - start; // iteration count, so A and B sum the same range

    std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
    double elapsed;

    std::printf("=====================================================\n");

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationB(start, delta);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationA(start, end);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
    std::printf("=====================================================\n");

    return 0;
}
These modifications really increased the computation time, both for -O1 and -O2. Both now give me around 620 ms, which confirms that -O2 was really doing some of the computation at compile time. I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320 ms.
Also notice that I have changed the order in which functions A and B are called to test Jerry Coffin's assumption. Compiling this code with no optimization flags still gives me around 2.2 s for B and 2.8 s for A, so I figure that it is not a cache thing. Just to reinforce: I am not talking about optimization in the first case (the one with no flags); I just want to know what makes the second loop run faster than the first.
Processors are complicated. Execution time depends on many things, many of which are outside your control. Just a few possibilities:
a. Your computer probably doesn't have a constant clock speed. The clock speed is usually set rather low to avoid wasting energy / battery life / producing excessive heat. When your program starts running, the OS figures out that power is needed and increases the clock speed. To verify, change the order of the calls: if whichever loop executes second is consistently faster, that may be the reason.
b. The exact execution speed, especially for a tight loop like yours, depends on how instructions are aligned in memory. Some processors may run a loop faster if it is completely contained within one cache line instead of two, or in two cache lines instead of three. Some compilers add nop instructions to align loops on cache-line boundaries to optimise for this; most don't. It is quite possible that one of the loops happened to be aligned better by pure luck and therefore runs faster.
c. The exact execution speed may depend on the exact order in which instructions are dispatched. Slightly different code may run at different speeds because of subtle, processor-dependent differences that are hard for the compiler to account for.
d. There is some evidence that Intel processors may have problems with artificially short loops which may happen only with artificial benchmarks. Your code is quite close to "artificial". There have been cases discussed in other threads where very short loops ran unexpectedly slow, and adding instructions made them run faster.