I have always known that the rich abstractions of C++ come with a certain computational overhead, but I was under the impression that this overhead would be close to negligible once the right compiler optimisations were applied. I was curious about the actual magnitude of this overhead, so I wrote a simple test to measure it. The test is a templated function that takes a container, assigns a value to each element of the container, and then sums the values across the container in a separate loop. This process is repeated for a preset number of cycles.
What I found, to my considerable unease, was that the vector implementation took nearly three times as long as the plain array implementation. After working through a wide range of compiler optimisation flags without success, I decided to bite the bullet and examine the assembly directly to find the source of the time penalty. I included some assembly directives that let me pinpoint exactly where the array indexing operations occur, and examined the code in detail. What I found, to my complete confusion, was that the difference between the vector implementation and the array implementation was utterly insignificant. The assembly code can be found here.
This is the command I used to build the binary:
g++ -O3 vectorArrayOp.cpp -o vectorArrayOp
This is the command I used to build the assembly:
g++ -O3 -DTAGASM vectorArrayOp.cpp -S -o vectorArrayOp.s
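Since a basic asm("...") statement is emitted verbatim into the compiler's assembly output, the marker strings appear as literal lines in vectorArrayOp.s (which is also why the tagged build only makes sense with -S; the markers are not valid instructions, so the tagged source cannot be assembled into a binary). For anyone who wants to reproduce the comparison, here is a small helper of the kind I used to pull out just the tagged regions; it is only a sketch and assumes the markers land on lines of their own in the .s file:
#include <fstream>
#include <iostream>
#include <string>

// Prints the assembly between each pre/post marker pair so the loop bodies
// of the array and vector instantiations can be compared directly.
int main(int argc, char **argv)
{
    std::ifstream in(argc > 1 ? argv[1] : "vectorArrayOp.s");
    std::string line;
    bool inside = false;
    while(std::getline(in, line))
    {
        if(line.find("preassign") != std::string::npos ||
           line.find("preadd") != std::string::npos)
        {
            std::cout << "---- " << line << std::endl;
            inside = true;
        }
        else if(line.find("postassign") != std::string::npos ||
                line.find("postadd") != std::string::npos)
        {
            inside = false;
        }
        else if(inside)
        {
            std::cout << line << std::endl;
        }
    }
    return 0;
}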
This is the output I observe from running the binary:
gmurphy@interloper:Reference$ ./vectorArrayOp
Duration 0.027678
Duration 0.090212
The results are unchanged if the computed value is also written to the stdout stream; I removed it for clarity. My system specifications are as follows (I've seen the same results on my AMD machine too):
Linux 3.2.0-32-generic x86_64 GNU/Linux
Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
The code follows; I would appreciate it if someone could give me some insight into why the timings are so different when the assembly is so similar.
#include <vector>
#include <iostream>
#include <sys/time.h>

#ifdef TAGASM
#define ASMTAG(X) asm(X)
#else
#define ASMTAG(X)
#endif

enum { DataSize=1024, NumTests=(1<<16) };

struct ReturnValue
{
    ReturnValue(float _d, int _t) : d(_d), t(_t) {}
    float d;
    int t;
};

template <typename Container, typename Type>
ReturnValue runTest(Container &c, Type value)
{
    int tagValue(0);
    timeval startTime;
    gettimeofday(&startTime, NULL);
    for(int i=0; i<NumTests; i++)
    {
        for(int j=0; j<DataSize; j++)
        {
            ASMTAG("preassign");
            c[j] = value;
            ASMTAG("postassign");
        }
        for(int j=0; j<DataSize; j++)
        {
            ASMTAG("preadd");
            tagValue += c[j];
            ASMTAG("postadd");
        }
    }
    timeval endTime;
    gettimeofday(&endTime, NULL);
    float duration((endTime.tv_sec-startTime.tv_sec)+
                   (endTime.tv_usec-startTime.tv_usec)/1000000.0);
    // tagValue is returned so that the optimising compiler cannot remove the loops
    return ReturnValue(duration, tagValue);
}

int main()
{
    int *arrayData = new int[DataSize];
    std::vector<int> vectorData(DataSize, 0);
    ReturnValue ad = runTest(arrayData, 1);
    ReturnValue vd = runTest(vectorData, 1);
    std::cout << "Duration " << ad.d << std::endl;
    std::cout << "Duration " << vd.d << std::endl;
    delete [] arrayData;
    return 0;
}
Based on these results, this is probably a compiler-specific performance regression in g++ 4.6.
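If it helps anyone looking into this, one follow-up I can think of would be to run the identical test over the vector's storage through a plain pointer, to separate the cost of vector's operator[] abstraction from anything else (allocation placement, aliasing assumptions, and so on). This is only a sketch reusing the runTest template above; &vectorData[0] points at the vector's contiguous storage, and I have not measured this hypothetical third timing myself:
// Sketch of a replacement main(): the same test driven three ways, to see
// whether accessing the vector's storage through a raw pointer closes the gap.
int main()
{
    int *arrayData = new int[DataSize];
    std::vector<int> vectorData(DataSize, 0);
    int *vectorBuffer = &vectorData[0];   // contiguous storage of the vector

    ReturnValue ad = runTest(arrayData, 1);
    ReturnValue vd = runTest(vectorData, 1);
    ReturnValue pd = runTest(vectorBuffer, 1);

    std::cout << "Array   duration " << ad.d << std::endl;
    std::cout << "Vector  duration " << vd.d << std::endl;
    std::cout << "Pointer duration " << pd.d << std::endl;

    delete [] arrayData;
    return 0;
}
If the pointer version matched the array timing, that would point at how g++ 4.6 compiles the operator[] call; if it were as slow as the vector version, the abstraction itself is probably not the culprit.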