I have long wondered what is more efficient with regards to making better use of CPU caches (which are known to benefit from locality of reference) - two loops each iterating over the same mathematical set of numbers, each with a different loop body, or having one loop that "concatenates" the two bodies into one, and thus accomplishes identical total result, but all in itself?
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Assuming:
- Cost of
f
andg
each is negligible compared to cost of completing the entire loop containing each f
andg
use most of the cache each by itself, and so the cache will be invalidated by one being called after another (which would be the case with a single-loop-body version)- Intel Core Duo CPU
- C language source code
gcc
compiler, no switches
The set being iterated over is a mathematical set, not a container of numbers in memory like a vector or a list. See the example below.
Please no answers of the "premature optimization is evil" character :-)
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
k += g(i);
}
This seems like something the compiler could optimize for you so instead of trying to figure it out yourself and making it fast, use whatever method makes your code more clear and readable. If you really must know, time both methods for input size and calculation type that your application uses (try the code you have now but repeat your calculations many many times and disable optimization).