Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop.
const int n = 100000;
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on an Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (please refer to my rephrased question at the bottom)

PS: I am not sure if this helps:
Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):
movsd xmm0,mmword ptr [edx+18h]
addsd xmm0,mmword ptr [ecx+20h]
movsd mmword ptr [ecx+20h],xmm0
movsd xmm0,mmword ptr [esi+10h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [edx+20h]
addsd xmm0,mmword ptr [ecx+28h]
movsd mmword ptr [ecx+28h],xmm0
movsd xmm0,mmword ptr [esi+18h]
addsd xmm0,mmword ptr [eax+38h]
Each loop of the double loop example produces this code (the following block is repeated about three times):
addsd xmm0,mmword ptr [eax+28h]
movsd mmword ptr [eax+28h],xmm0
movsd xmm0,mmword ptr [ecx+20h]
addsd xmm0,mmword ptr [eax+30h]
movsd mmword ptr [eax+30h],xmm0
movsd xmm0,mmword ptr [ecx+28h]
addsd xmm0,mmword ptr [eax+38h]
movsd mmword ptr [eax+38h],xmm0
movsd xmm0,mmword ptr [ecx+30h]
addsd xmm0,mmword ptr [eax+40h]
movsd mmword ptr [eax+40h],xmm0
The question turned out to be of no relevance, as the behavior depends heavily on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:
Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?
It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.
PPS: Here is the full code. It uses TBB Tick_Count for higher-resolution timing, which can be disabled by not defining the TBB_TIMING macro:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
#include <cstdio>      // freopen
#include <algorithm>   // max
//#define TBB_TIMING
#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif
using namespace std;
//#define preallocate_memory new_cont
enum { new_cont, new_sep };
double *a1, *b1, *c1, *d1;
void allo(int cont, int n)
{
    switch (cont) {
    case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
    case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}
void ff(int cont)
{
    switch (cont) {
    case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
    case new_cont:
        delete[] a1;
    }
}
double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont, n);
#endif

#ifdef TBB_TIMING
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif

    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }

    double ret;

#ifdef TBB_TIMING
    tick_count t1 = tick_count::now();
    ret = 2.0 * double(n) * double(m) / (t1 - t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0 * double(n) * double(m) / (double)(end - start) * double(CLOCKS_PER_SEC);
#endif

#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}
int main()
{
    freopen("C:\\test.csv", "w", stdout);

    const char *s = " ";
    string na[2] = {"new_cont", "new_sep"};

    cout << "n";
    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif
    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif

    for (long long n = 1L; n < nmax; n = max(n+1, (long long)(n*1.2)))
    {
        const long long m = 10000000 / n;
        cout << n;
        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }

    return 0;
}
(It shows FLOP/s for different values of n.)
The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.
It's not because of different code, but because of caching: RAM is slower than the CPU registers, and a cache memory inside the CPU avoids writing to RAM every time a variable changes. But the cache is not as big as RAM is; hence, it maps only a fraction of it.
The first code alternates between distant memory addresses at each loop iteration, thus continuously requiring the cache to be invalidated.
The second code doesn't alternate: it just flows over adjacent addresses twice. This lets the whole job be completed in the cache, invalidating it only after the second loop starts.
Imagine you are working on a machine where n was just the right value for it only to be possible to hold two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.

Assuming a simple LIFO caching policy, the two-loop version would first cause a and b to be loaded into RAM and then be worked on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.

The combined loop, on the other hand, will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.

You are probably not seeing disk caching in your tests, but you are probably seeing the side effects of some other form of caching.
There seems to be a little confusion/misunderstanding here so I will try to elaborate a little using an example.
Say n = 2 and we are working with bytes. In my scenario we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).

Assuming a fairly dumb caching policy of "if the byte is not in the cache, put it there and get the following byte too while we are at it", you will get a scenario something like this:

With the two separate loops:
- Cache a[0] and a[1], then b[0] and b[1], and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
- Set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
- Repeat for c and d.
- Total cost = (100 + 100 + 1 + 1) * 2 = 404
With the combined loop:
- Cache a[0] and a[1], then b[0] and b[1], and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.
- Eject a[0], a[1], b[0], b[1] from cache, then cache c[0] and c[1], then d[0] and d[1], and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
- Repeat for the second index.
- Total cost = (100 + 100 + 100 + 100) * 2 = 800
This is a classic cache thrash scenario.
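The cost arithmetic above can be checked mechanically. The following is only a minimal C++ sketch of the same toy model (a cache holding two 2-byte lines, a 100-cycle miss, a 1-cycle hit, oldest-line eviction); the simulator structure and names are mine, not part of the original answer. It prints 800 for the combined loop and 404 for the two separate loops.

#include <deque>
#include <iostream>
#include <string>

// Toy cache from the example: 4 bytes total, organized as two 2-byte lines.
// With n = 2, each array (a, b, c, d) occupies exactly one line, so the
// array name can serve as the line identifier.
struct ToyCache {
    std::deque<std::string> lines;   // at most two lines fit
    long long cost = 0;

    void access(const std::string& line) {
        for (const std::string& l : lines)
            if (l == line) { cost += 1; return; }    // hit: 1 cycle
        cost += 100;                                 // miss: fetch the whole line
        lines.push_back(line);
        if (lines.size() > 2) lines.pop_front();     // evict the oldest line
    }
};

int main() {
    const int n = 2;

    ToyCache combined;   // a[i] += b[i]; c[i] += d[i];  in one loop
    for (int i = 0; i < n; ++i) {
        combined.access("a"); combined.access("b");
        combined.access("c"); combined.access("d");
    }

    ToyCache separate;   // two loops, one pair of arrays each
    for (int i = 0; i < n; ++i) { separate.access("a"); separate.access("b"); }
    for (int i = 0; i < n; ++i) { separate.access("c"); separate.access("d"); }

    std::cout << "combined loop cost:  " << combined.cost << "\n";   // 800
    std::cout << "separate loops cost: " << separate.cost << "\n";   // 404
}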
The Original Question
Conclusion:

Case 1 is a classic interpolation problem that happens to be an inefficient one. I also think that this was one of the leading reasons why many machine architectures and developers ended up building and designing multi-core systems with the ability to do multi-threaded applications as well as parallel programming.

Looking at it from this kind of approach, without involving how the hardware, OS and compiler(s) work together to do heap allocations that involve working with RAM, cache, page files, etc., the mathematics at the foundation of these algorithms shows us which of these two is the better solution. We can use an analogy of a Boss, or Summation, that represents a For Loop that has to travel between workers A & B. We can easily see that Case 2 is roughly twice as fast as Case 1, if not a little more, due to the difference in the distance that needs to be traveled and the time taken between the workers. This math lines up almost perfectly with both the benchmark times and the difference in the number of assembly instructions. I will now begin to explain how all of this works below.
Assessing The Problem
The OP's code: the single combined loop, and the amended version with two separate loops, both shown above.
The Consideration
Considering the OP's original question about the two variants of the for loops, and his amended question about the behavior of caches, along with many of the other excellent answers and useful comments, I'd like to try to do something different here by taking a different approach to this situation and problem.
The Approach
Considering the two loops and all of the discussion about caches and page files, I'd like to look at this from a different perspective: one that doesn't involve the cache and page files or the executions to allocate memory; in fact, this approach doesn't even concern the actual hardware or software at all.
The Perspective
After looking at the code for a while, it became quite apparent what the problem is and what is generating it. Let's break this down into an algorithmic problem, look at it from the perspective of mathematical notation, and then apply an analogy to the math problems as well as to the algorithms.
What We Do Know
We know that this loop will run 100,000 times. We also know that a1, b1, c1 & d1 are pointers on a 64-bit architecture. Within C++ on a 32-bit machine all pointers are 4 bytes, and on a 64-bit machine they are 8 bytes in size, since pointers are of a fixed length.

We know that we have 32 bytes to work with in both cases. The only difference is that in the first case we touch 32 bytes, or 2 sets of 2 x 8 bytes, on each iteration, whereas in the second case we touch 16 bytes per iteration in each of the two independent loops. So both loops still total 32 bytes of accesses per iteration overall. With this information, let's go ahead and show the general math, algorithm and analogy of it.

We do know the number of times that the same set or group of operations will have to be performed in both cases. We do know the amount of memory that needs to be touched in both cases. We can assess that the overall workload between both cases will be approximately the same.

What We Don't Know
We do not know how long each case will take unless we set a counter and run a benchmark test. However, the benchmarks were already included in the original question, as well as in some of the answers and comments, and we can see a significant difference between the two; this is the whole reason for this question and for answering it in the first place.
Let's Investigate
It is already apparent that many have done this by looking at the heap allocations, benchmark tests, RAM, cache and page files. Looking at specific data points and specific iteration indexes was also included, and the various conversations about this specific problem have many people starting to question other related things about it. So how do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.
Our Assertions: we will model each for loop as a summation over the range [1,100000], and the work on the arrays as the functions F1(), F2(), f(a), f(b), f(c) and f(d).

The Algorithms:

1st Case: only one summation, but two independent function calls.

2nd Case: two summations, but each has its own function call.
If you noticed, F2() only exists in Sum, whereas Sum1 and Sum2 only contain F1(). This will also be evident later on, when we begin to conclude that there is a sort of optimization happening in the second algorithm.

In the first case, on each of the 100000 iterations, Sum calls f(a), which adds f(b) to itself, then calls f(c), which does the same but adds f(d) to itself. In the second case we have Sum1 and Sum2, and both act the same as if they were the same function being called twice in a row. In this case we can treat Sum1 and Sum2 as just plain old Sum, where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); }, and now this looks like an optimization where we can just consider it to be the same function.

Summary with Analogy
With what we have seen in the second case, it almost appears as if there is an optimization, since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c) & f(d) in either case; it is the difference in the distance that the summation has to travel in the two cases that gives you the difference in execution time.

Think of the for loops as being the summations that do the iterations, and think of them as a Boss giving orders to two people, A & B, whose jobs are to meet C & D respectively, pick up some package from them and return it. In this analogy, the for-loop or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss here is not the mathematical algorithms directly, but the concept of scope and code block within a routine or sub-routine, method, function, translation unit, etc. The first algorithm has one scope where the second algorithm has two consecutive scopes.

Within the first case, on each call slip the Boss goes to A and gives the order, A goes off to fetch B's package, then the Boss goes to C and gives the orders to do the same and receive the package from D, on each iteration.

Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.

Since we are working with an 8-byte pointer and dealing with heap allocation, let's consider the following problem. Let us say that the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases the Boss initially travels from A first, then to B. This analogy isn't to say that these distances are exact; it is just a test-case scenario to show the workings of the algorithms. In many cases, when doing heap allocations and working with the cache and page files, the distances between address locations may not vary that much, or they can vary significantly, depending on the nature of the data types and the array sizes.

The Test Cases:
First Case: On the first iteration the Boss has to go 100 feet to give the order slip to A, and A goes off and does his thing, but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration, and every iteration after that, the Boss has to go back and forth 500 feet between the two.

Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C, because C is 500 feet from A, since this Boss (Summation, For Loop) is being called right after working with A, and then he just waits like he did with A until all of C's order slips are done.

The Difference In Distances Traveled
The Comparison of Arbitrary Values
We can easily see that 600 is far less than 10 million. Now this isn't exact, because we don't know the actual difference in distance between the RAM addresses, or from which cache or page file each call on each iteration will come, due to many other unseen variables; this is just an assessment of the situation to be aware of, looking at it from the worst-case scenario.

So by these numbers it would almost look as if Algorithm One should be 99% slower than Algorithm Two; however, this is only the Boss's part or responsibility of the algorithms, and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the loop. So the Boss's job only accounts for about 15-40% of the total work being done. The bulk of the work, which is done through the workers, has a slightly bigger impact towards keeping the ratio of the speed-rate differences at about 50-70%.

The Observation: The differences between the two algorithms
In this situation it is the structure of the process of the work being done, and it goes to show that Case 2 is more efficient, both from the partial optimization of having a similar function declaration and definition, where only the variables differ by name, and from the fact that the total distance traveled in Case 1 is much farther than in Case 2. We can consider this distance traveled our time factor between the two algorithms. Case 1 has considerably more work to do than Case 2 does. This was also seen in the evidence of the ASM that was shown for both cases. Even with what was already said about these cases, it also doesn't account for the fact that in Case 1 the Boss will have to wait for both A & C to get back before he can go back to A again on the next iteration, and it doesn't account for the fact that if A or B is taking an extremely long time, then both the Boss and the other worker(s) are waiting idle. In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.

The OP's Amended Question(s)
Regarding These Questions
As I have demonstrated, there is an underlying issue even before the hardware and software become involved. Now as for the management of memory and caching along with page files, etc., which all work together in an integrated set of systems between: the Architecture { hardware, firmware, some embedded drivers, kernels and ASM instruction sets }, the OS { file and memory management systems, drivers and the registry }, the Compiler { translation units and optimizations of the source code }, and even the Source Code itself with its set(s) of distinctive algorithms; we can already see that there is a bottleneck happening within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS and programming language, compared to the second algorithm. So there already existed a problem before involving the intrinsics of a modern computer.

The Ending Results
However, this is not to say that these new questions are not of importance, because they are and they do play a role after all. They do impact the procedures and the overall performance, and that is evident in the various graphs and assessments from many who have given their answer(s) and/or comment(s). If you pay attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively, and consider the mathematical notations of the two algorithms in question, you can see that without even the involvement of the computer, Case 2 is approximately 60% faster than Case 1; and when you look at the graphs and charts after these algorithms have been applied to source code, compiled, optimized and executed through the OS to perform operations on the given hardware, you even see a little more degradation between the differences in these algorithms.

Now if the "data" set is fairly small, it may not seem all that bad of a difference at first; but since Case 1 is about 60-70% slower than Case 2, we can look at the growth of this function in terms of the differences in time executions. This approximation is the average difference between these two loops, both algorithmically and in machine operations involving software optimizations and machine instructions. So when the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than Algorithm 2, which is evident when the Boss had to travel back and forth the maximum distance between A & C for every iteration after the first, while in Algorithm 2 the Boss had to travel to A once and then, after being done with A, travel the maximum distance only one time, when going from A to C.

So trying to have the Boss focus on doing two similar things at once and juggling them back and forth, instead of focusing on similar consecutive tasks, is going to make him quite angry by the end of the day because he had to travel and work twice as much. Therefore, do not lose the scope of the situation by letting your Boss get into an interpolated bottleneck, because the Boss's spouse and children wouldn't appreciate it.

Upon further analysis of this, I believe this is (at least partially) caused by data alignment of the four pointers. This will cause some level of cache bank/way conflicts.
If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.
This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.
EDIT : It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.
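One way to check that claim on a given allocator is to print each pointer's offset from a page boundary; when four large blocks are allocated separately, the offsets tend to come out identical. This snippet is only an illustration of mine (assuming 4 KiB pages), not part of the answer's test code:

#include <cstdint>
#include <iostream>

int main() {
    const int n = 100000;
    const std::uintptr_t page = 4096;   // assumed page size

    double *a1 = new double[n];
    double *b1 = new double[n];
    double *c1 = new double[n];
    double *d1 = new double[n];

    // Large blocks are typically backed by fresh pages from the OS, so each
    // separately allocated array tends to start at the same offset in a page.
    std::cout << "a1 offset in page: " << reinterpret_cast<std::uintptr_t>(a1) % page << "\n";
    std::cout << "b1 offset in page: " << reinterpret_cast<std::uintptr_t>(b1) % page << "\n";
    std::cout << "c1 offset in page: " << reinterpret_cast<std::uintptr_t>(c1) % page << "\n";
    std::cout << "d1 offset in page: " << reinterpret_cast<std::uintptr_t>(d1) % page << "\n";

    delete[] a1; delete[] b1; delete[] c1; delete[] d1;
    return 0;
}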
Here's the test code:
Benchmark Results:
EDIT: Results on an actual Core 2 architecture machine:
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:
Observations:
6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.
In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.
In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.
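For reference, "packing the arrays together" can look like the sketch below; the padding constant is an arbitrary choice of mine, only meant to give each array a different offset within a page, and this is not the answer's actual benchmark code.

#include <cstddef>

int main() {
    const int n = 100000;
    const std::size_t pad = 17;   // arbitrary per-array stagger, in doubles

    // One big block instead of four separate allocations; each array then
    // starts at a different offset, so their page offsets no longer coincide.
    double *block = new double[4 * (n + pad)];
    double *a1 = block + 0 * (n + pad);
    double *b1 = block + 1 * (n + pad);
    double *c1 = block + 2 * (n + pad);
    double *d1 = block + 3 * (n + pad);

    for (int j = 0; j < n; ++j) { a1[j] = b1[j] = c1[j] = d1[j] = 1.0; }
    for (int j = 0; j < n; ++j) a1[j] += b1[j];
    for (int j = 0; j < n; ++j) c1[j] += d1[j];

    delete[] block;
    return 0;
}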
As @Stephen Cannon points out in the comments, it is very likely that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/~amplifierxe/pmw_dp/events/partial_address_alias.html
5 Regions - Explanations
Region 1:
This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.
Region 2:
Here, as the data size increases, the amount of relative overhead goes down and the performance "saturates". Here the two-loop version is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)
Region 3:
At this point, the data no longer fits in L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.
Region 4:
The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.
However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.
Region 5:
At this point, nothing fits in cache. So you're bound by memory bandwidth.
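One way to sanity-check the memory-bandwidth claim for region 5 is to convert the loop time into bytes moved: each a1[j] += b1[j] reads two doubles and writes one, roughly 24 bytes per element if write-allocate traffic is ignored. The harness below is only a sketch with sizes I picked to exceed any cache level, not the benchmark used for the graph:

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    // 128 MiB per array, far larger than any cache level.
    const std::size_t n = std::size_t(1) << 24;
    std::vector<double> a(n, 1.0), b(n, 1.0), c(n, 1.0), d(n, 1.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < n; ++j) a[j] += b[j];
    for (std::size_t j = 0; j < n; ++j) c[j] += d[j];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 2.0 * 24.0 * double(n);   // read x, read y, write x; two loops
    std::cout << "approx. effective bandwidth: " << bytes / secs / 1e9 << " GB/s\n";
    return 0;
}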
I cannot replicate the results discussed here.
I don't know if poor benchmark code is to blame, or what, but the two methods are within 10% of each other on my machine using the following code, and one loop is usually just slightly faster than two - as you'd expect.
Array sizes ranged from 2^16 to 2^24, using eight loops. I was careful to initialize the source arrays so the += assignment wasn't asking the FPU to add memory garbage interpreted as a double.

I played around with various schemes, such as putting the assignment of b[j], d[j] to InitToZero[j] inside the loops, and also with using += b[j] = 1 and += d[j] = 1, and I got fairly consistent results.

As you might expect, initializing b and d inside the loop using InitToZero[j] gave the combined approach an advantage, as they were done back-to-back before the assignments to a and c, but still within 10%. Go figure.

Hardware is a Dell XPS 8500 with a generation 3 Core i7 @ 3.4 GHz and 8 GB memory. For 2^16 to 2^24, using eight loops, the cumulative times were 44.987 and 40.965 seconds respectively. Visual C++ 2010, fully optimized.
PS: I changed the loops to count down to zero, and the combined method was marginally faster. Scratching my head. Note the new array sizing and loop counts.
I'm not sure why it was decided that MFLOPS was a relevant metric. I thought the idea was to focus on memory accesses, so I tried to minimize the amount of floating-point computation time. I left in the +=, but I am not sure why.

A straight assignment with no computation would be a cleaner test of memory access time and would create a test that is uniform irrespective of the loop count. Maybe I missed something in the conversation, but it is worth thinking twice about. If the plus is left out of the assignment, the cumulative time is almost identical at 31 seconds each.
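For what it's worth, a minimal sketch of the straight-assignment variant described above (the array size and function names are placeholders of mine, not the code actually benchmarked):

#include <cstddef>
#include <vector>

// Straight assignment instead of "+=": with the add removed, both loop
// arrangements become almost pure memory-copy tests.
void combined(std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, const std::vector<double>& d) {
    for (std::size_t j = 0; j < a.size(); ++j) {
        a[j] = b[j];
        c[j] = d[j];
    }
}

void split(std::vector<double>& a, const std::vector<double>& b,
           std::vector<double>& c, const std::vector<double>& d) {
    for (std::size_t j = 0; j < a.size(); ++j) a[j] = b[j];
    for (std::size_t j = 0; j < c.size(); ++j) c[j] = d[j];
}

int main() {
    const std::size_t n = std::size_t(1) << 20;   // placeholder size
    std::vector<double> a(n), b(n, 1.0), c(n), d(n, 1.0);
    combined(a, b, c, d);
    split(a, b, c, d);
    return 0;
}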