I'm looking into OpenMP, partly because my program needs to add very large vectors (millions of elements). However, I see a fairly large difference depending on whether I use std::vector or a raw array, which I cannot explain. I insist that the difference is only in the loop, not in the initialisation, of course.
The time difference I refer to comes only from timing the addition, specifically excluding any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time.
I use g++ 5.x on Ubuntu 16.04.
Edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using raw arrays that are initialized, I get the loop time scaling with the number of threads again. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
I came up with the following, simple test:
#include <iostream>
#include <vector>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a,b,c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++){
            #pragma omp for
            for( i = 0; i < SIZE; i++){
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
And I get, for one thread:
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total.
For two threads, the loop takes 1.2s, 1.23s total.
Now if I use raw arrays:
#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];

    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for( i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
And I get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using the c = a + b overload with vector objects), I get a total real time of ~2.8s. They are not multi-threaded for vector additions.
Now, I thought std::vector has almost no overhead? What is happening here? I'd like to use nice standard library objects.

I cannot find any reference anywhere on a simple example like this.
The observed behaviour is not OpenMP-specific and has to do with the way modern operating systems manage memory. Memory is virtual, meaning that each process has its own virtual address (VA) space and a special translation mechanism is used to map pages of that VA space to frames of physical memory. Consequently, memory allocation is performed in two stages:
- reservation of a range of the VA space - this is what operator new[] does when the allocation is big enough (smaller allocations are handled differently for reasons of efficiency)
- backing that reserved range with physical memory as it gets used

The process is split in two parts since in many cases applications do not really use at once all the memory they reserve, and backing the entire reservation with physical memory might lead to waste (and unlike virtual memory, physical memory is a very limited resource). Therefore, backing reservations with physical memory is performed on demand the very first time the process writes to a region of the allocated memory space. The process is known as faulting the memory region, since on most architectures it involves a soft page fault that triggers the mapping within the OS kernel. Every time your code writes for the first time to a region of memory that is still not backed by physical memory, a soft page fault is triggered and the OS tries to map a physical page. The process is slow, as it involves finding a free page and modification of the process page table. The typical granularity of that process is 4 KiB, unless some kind of large-pages mechanism is in place, e.g., the Transparent Huge Pages mechanism on Linux.
What happens if you read for the first time from a page that has never been written to? Again, a soft page fault occurs, but instead of mapping a frame of physical memory, the Linux kernel maps a special "zero page". The page is mapped in CoW (copy-on-write) mode, which means that when you try to write to it, the mapping to the zero page will be replaced by a mapping to a fresh frame of physical memory.
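If you want to see the on-demand backing for yourself, here is a minimal sketch (my illustration, not part of the original answer) that allocates an array the same way as the benchmark and prints the peak resident set size from getrusage at each stage. The helper name rss_mib and the exact numbers are assumptions of this sketch, and the accounting details can vary between kernels and optimization levels; on Linux, the read-only pass should barely change the resident set, while the write pass should grow it by roughly the array size.

#include <cstddef>
#include <iostream>
#include <sys/resource.h>

// Peak resident set size in MiB (on Linux, ru_maxrss is reported in KiB).
static long rss_mib() {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss / 1024;
}

int main() {
    const std::size_t n = 10000000;     // ~80 MB of doubles, like the benchmark
    double *a = new double[n];          // VA range reserved, not yet backed
    std::cout << "after new:    " << rss_mib() << " MiB\n";

    double sum = 0.0;                   // note: reading indeterminate doubles is
    for (std::size_t i = 0; i < n; ++i) // UB in C++ (see the last answer); this
        sum += a[i];                    // only mirrors what the benchmark does
    std::cout << "after reads:  " << rss_mib() << " MiB (sum = " << sum << ")\n";

    for (std::size_t i = 0; i < n; ++i) // first writes fault in real frames
        a[i] = 1.0;
    std::cout << "after writes: " << rss_mib() << " MiB (a[0] = " << a[0] << ")\n";

    delete[] a;
    return 0;
}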
Now, take a look at the size of the arrays. Each of a, b, and c occupies 80 MB, which exceeds the cache size of most modern CPUs. One execution of the parallel loop thus has to bring 160 MB of data from the main memory and write back 80 MB. Because of how the system cache works, writing to c actually reads it once, unless non-temporal (cache-bypassing) stores are used, therefore 240 MB of data is read and 80 MB of data gets written. Multiplied by 200 outer iterations, this gives 48 GB of data read and 16 GB of data written in total.

The above is not the case when a and b are not initialised, i.e. the case when a and b are simply allocated using operator new[]. Since reads in that case result in access to the zero page, and there is physically only one zero page that easily fits in the CPU cache, no real data has to be brought in from the main memory. Therefore, only 16 GB of data has to be read in and then written back. If non-temporal stores are used, no memory is read at all.

This could easily be proven using LIKWID (or any other tool able to read the CPU hardware counters):
std::vector<double> version:

Version with uninitialised arrays:

Version with uninitialised arrays and non-temporal stores (using Intel's #pragma vector nontemporal):

The disassembly of the two versions provided in your question when using GCC 5.3 shows that the two loops are translated to exactly the same sequence of assembly instructions, sans the different code addresses. The sole reason for the difference in the execution time is the memory access, as explained above. Resizing the vectors initialises them with zeros, which results in a and b being backed by their own physical memory pages. Not initialising a and b when operator new[] is used results in their being backed by the zero page.

Edit: It took me so long to write this that in the meantime Zulan has written a much more technical explanation.
In this case, I think the culprit is -funroll-loops, judging from what I just tested at -O2 with and without this option.
https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Optimize-Options.html#Optimize-Options
-funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.
Meaningful benchmarking is hard
The answer from Xirema has already outlined in detail the difference in the code: std::vector::resize initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.

However, your measurement doesn't include initialization, and the number of repetitions is so high that the loop costs should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time because the data is initialized?
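For reference, a small sketch (my addition, not from the answer) of the three allocation forms discussed here and what they guarantee:

#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 1000;

    double *raw = new double[size];       // elements have indeterminate values
    double *zeroed = new double[size]();  // value-initialized: every element is 0.0
    std::vector<double> v(size);          // vector value-initializes its elements to 0.0

    delete[] raw;
    delete[] zeroed;
    return 0;                             // v cleans up after itself
}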
Minimal example
Let's dig to the core of this with code that dynamically determines whether the memory is initialized or not (based on Xirema's version, but only timing the loop itself).
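The original listing is not reproduced here; the following is a minimal sketch of that idea. Everything in it (the init command-line switch, the use of std::chrono, the pid printout used with page-types further down) is an assumption of this reconstruction rather than the original code:

#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <unistd.h>

int main(int argc, char **argv) {
    const std::size_t size = 10000000;
    const int tries = 200;
    // Pass "init" on the command line to zero the arrays before the timed loop.
    const bool initialize = (argc > 1 && std::strcmp(argv[1], "init") == 0);

    std::cout << "pid: " << getpid() << "\n";   // handy for page-types / /proc

    double *a = new double[size];
    double *b = new double[size];
    double *c = new double[size];

    if (initialize) {
        for (std::size_t i = 0; i < size; ++i) {
            a[i] = 0.0;
            b[i] = 0.0;
            c[i] = 0.0;
        }
    }

    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < tries; ++t)
        for (std::size_t i = 0; i < size; ++i)
            c[i] = a[i] + b[i];
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    std::cout << (initialize ? "initialized:   " : "uninitialized: ")
              << elapsed.count() << " s\n";

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}

Compile with something like g++ -O3 main.cpp and compare ./a.out against ./a.out init.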
Results are consistent: the run over initialized memory is reliably slower, even though the timed loop is identical.
First glance at performance counters
Using the excellent Linux perf tool:

Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level-cache loads and data TLB misses?
Memory is complex
To understand that, we need to understand how memory is allocated. Just because a malloc/new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory - and the physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm) - and the pid we show as output - comes in handy. Looking at the page statistics during a long execution of our little benchmark:

With initialization:
Most of the virtual memory is in a normal mmap,anonymous region - something that is mapped to a physical address.

Without initialization:
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 are mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a and b in the other test) has already been written to, so it has to have its own memory.

0 != 0
Now it may look weird: everything here is zero [1] - why does it matter how it became zero? Whether you memset(0), write a[i] = 0., or use std::vector::resize - everything causes explicit writes to memory, and hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset/resize is to use calloc to explicitly request zeroed memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero-memory.

1: Remember that the double 0.0 has all bits set to zero.
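As an aside, a tiny sketch of the calloc route mentioned above (my illustration, not part of the original answer):

#include <cstdlib>
#include <iostream>

int main() {
    const std::size_t n = 10000000;
    // calloc returns memory that reads as zero; the OS is free to back it
    // lazily with copy-on-write zero pages instead of writing zeros eagerly.
    double *a = static_cast<double*>(std::calloc(n, sizeof(double)));
    if (!a) return 1;
    std::cout << a[0] << '\n';   // prints 0 (all-bits-zero is 0.0 for IEEE doubles)
    std::free(a);
    return 0;
}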
In the end the performance difference really comes only from the loop, but is caused by initialization.
std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from optimization of an abnormal case of uninitialized data.

I have a good hypothesis.
I've written three versions of the code: one using raw double *, one using std::unique_ptr<double[]> objects, and one using std::vector<double>, and compared the runtimes of each of these versions of the code. For my purposes, I've used a single-threaded version of the code to try to simplify the case.

Total Code:
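The full listing is not reproduced here. The following is a compact sketch of the same experiment; the function names time_pointer and time_unique_ptr match the labels referenced further down, but the sizes, the use of std::chrono, and the rest of the structure are my reconstruction (it needs C++14 for std::make_unique):

#include <chrono>
#include <iostream>
#include <memory>
#include <vector>

constexpr std::size_t size = 10000000;
constexpr int tries = 200;

using clock_type = std::chrono::steady_clock;

// Raw pointers: memory is allocated but never initialized.
std::chrono::duration<double> time_pointer() {
    double *a = new double[size];
    double *b = new double[size];
    double *c = new double[size];
    const auto start = clock_type::now();
    for (int t = 0; t < tries; ++t)
        for (std::size_t i = 0; i < size; ++i)
            c[i] = a[i] + b[i];
    const auto stop = clock_type::now();
    delete[] a; delete[] b; delete[] c;
    return stop - start;
}

// std::unique_ptr<double[]>: make_unique value-initializes the elements to zero.
std::chrono::duration<double> time_unique_ptr() {
    auto a = std::make_unique<double[]>(size);
    auto b = std::make_unique<double[]>(size);
    auto c = std::make_unique<double[]>(size);
    const auto start = clock_type::now();
    for (int t = 0; t < tries; ++t)
        for (std::size_t i = 0; i < size; ++i)
            c[i] = a[i] + b[i];
    return clock_type::now() - start;
}

// std::vector<double>: the constructor zero-initializes the elements.
std::chrono::duration<double> time_vector() {
    std::vector<double> a(size), b(size), c(size);
    const auto start = clock_type::now();
    for (int t = 0; t < tries; ++t)
        for (std::size_t i = 0; i < size; ++i)
            c[i] = a[i] + b[i];
    return clock_type::now() - start;
}

int main() {
    std::cout << "pointer:    " << time_pointer().count()    << " s\n"
              << "unique_ptr: " << time_unique_ptr().count() << " s\n"
              << "vector:     " << time_vector().count()     << " s\n";
}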
Test Results:
So all of the STL objects are demonstrably slower than the raw version. Why might this be?
Let's go to the Assembly! (compiled using Godbolt.com, using the snapshot version of GCC 8.x)
There are a few things we can observe to start with. For starters, the std::unique_ptr and std::vector code generate virtually identical assembly code; std::unique_ptr<double[]> merely swaps out new and delete for new[] and delete[]. Since their runtimes are within the margin of error, we'll focus on the std::unique_ptr<double[]> version and compare that to double *.

Starting with .L5 and .L22, the code seems to be identical. The only major differences are some extra pointer arithmetic before the delete[] calls are made in the double * version, and some extra stack cleanup code at the end in .L34 (the std::unique_ptr<double[]> version), which doesn't exist for the double * version. Neither of these seems likely to have a strong impact on the code speed, so we're going to ignore them for now.

The code that's identical appears to be the code directly responsible for the loop. You'll notice that the code which is different (which I'll get to momentarily) doesn't contain any jump statements, which are integral to loops.

So all of the major differences appear to be specific to the initial allocation of the objects in question. This is between time_unique_ptr(): and .L32 for the std::unique_ptr<double[]> version, and between time_pointer(): and .L22 for the double * version.

So what's the difference? Well, they're almost doing the same thing, except for a few lines of code that show up in the std::unique_ptr<double[]> version that don't show up in the double * version:

std::unique_ptr<double[]>:

double *:

Well, would you look at that! Some unexpected calls to memset that aren't part of the double * code! It's quite clear that std::vector<T> and std::unique_ptr<T[]> are contracted to "initialize" the memory they allocate, whereas double * has no such contract.

So this is basically a very, very round-about way of verifying what Shadow observed: when you make no attempt to "zero-fill" the arrays, the compiler will leave the memory uninitialized for double * (saving precious CPU cycles), and will zero-initialize the memory for std::vector<double> and std::unique_ptr<double[]> (costing time initializing everything).

But when you do add zero-fill, the compiler recognizes that it's about to "repeat itself", optimizes out the second zero-fill for std::vector<double> and std::unique_ptr<double[]> (which results in the code not changing) and adds it to the double * version, making it the same as the other two versions. You can confirm this by comparing the new version of the assembly where I've made the following change to the double * version: a plain loop that writes 0.0 into every element of each array before the timed loop. And sure enough, the assembly now has those loops optimized into memset calls, the same as the std::unique_ptr<double[]> version! And the runtime is now comparable.

(Note: the runtime of the pointer version is now slower than the other two! I observed that the first function called, regardless of which one, is always about 200ms-400ms slower. I'm blaming branch prediction. Either way, the speed should be identical in all three code paths now.)

So that's the lesson: std::vector and std::unique_ptr are making your code slightly safer by preventing the undefined behavior you were invoking in the code that used raw pointers. The consequence is that they also make your code slower.

I tested it and found out the following: the vector case had a runtime about 1.8 times longer than the raw array case. But this was only the case when I did not initialize the raw arrays. After adding a simple loop before the time measurement to initialize all entries with 0.0, the raw array case took as long as the vector case.

I took a closer look and did the following: I did not initialize the raw arrays with a plain assignment loop, but did it this way, with an if clause that only reads the elements (the other arrays accordingly); a sketch of both variants is given below.
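Neither snippet survived in the post; based on the description below (no console output, elements only read), they presumably looked roughly like the following fragments, meant to be dropped into the raw-array program from the question before the timed loop:

// "Normal" initialization: an explicit write to every element.
for( i = 0; i < SIZE; i++)
    a[i] = 0.0;

// "Fake" initialization with an if clause: the array is only read, and
// nothing is printed because every element happens to read as zero.
for( i = 0; i < SIZE; i++)
    if( a[i] != 0.0 )
        std::cout << "a[" << i << "] = " << a[i] << std::endl;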
The result was that I got no console output from initializing the arrays, and it was again as fast as without initializing at all, that is, about 1.8 times faster than with the vectors. When I initialized, for example, only a "normally" and the other two arrays with the if clause, I measured a time between the vector runtime and the runtime with all arrays "fake initialized" with the if clause.

Well... that's strange...

Although I cannot explain this behavior to you, I can tell you that there is not really an overhead for std::vector if you use it "normally". This is just a very artificial case.

EDIT:
As qPCR4vir and the OP Napseis pointed out, this might have to do with optimization. As soon as I turned on optimization, the "real init" case was slower by the already mentioned factor of about 1.8. Without optimization it was still about 1.1 times slower.
So I looked at the assembler code, but I did not see any difference in the 'for' loops...
The major thing to notice here is the fact that
The array version has undefined behavior
dcl.init #12 states that if an indeterminate value is produced by an evaluation, the behavior is undefined (apart from a few narrow exceptions for unsigned character types).
And this is exactly what happens in that line:
c[i] = a[i] + b[i];
Both a[i] and b[i] are indeterminate values, since the arrays are default-initialized.

The UB perfectly explains the measuring results (whatever they are).
UPD: In light of @HristoIliev's and @Zulan's answers, I'd like to emphasize the language point of view once more.
For the compiler, the UB of reading uninitialized memory essentially means that it can always assume the memory is initialized, so whatever the OS does is fine with C++, even if the OS has some specific behavior for that case.
Well it turns out that it does - your code is not reading the physical memory and your measurements correspond to that.
One could say that the resulting program does not compute the sum of two arrays - it computes the sum of two more easily accessible mocks, and it is fine with C++ exactly because of the UB. If it did something else, it would still be perfectly fine.
So in the end you have two programs: one adds up two vectors and the other just does something undefined (from the C++ point of view) or something unrelated (from the OS point of view). What is the point of measuring their timings and comparing the results?
Fixing the UB solves the whole problem, but more importantly it validates your measurements and allows you to meaningfully compare the results.
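As a concrete illustration (my sketch, not part of the answer), the raw-array program from the question becomes well-defined if the allocations are changed to value-initialization, so that every element starts as 0.0 before it is read:

#include <iostream>
#include <omp.h>

#define SIZE 10000000
#define TRIES 200

int main(){
    // Value-initialize: every element starts as 0.0, so reading a[i] and b[i]
    // below is well-defined (and the pages are actually backed by physical memory).
    double *a = new double[SIZE]();
    double *b = new double[SIZE]();
    double *c = new double[SIZE]();

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for (t = 0; t < TRIES; t++) {
            #pragma omp for
            for (i = 0; i < SIZE; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}

With that change both versions write to the arrays up front, touch real physical pages, and their timings become meaningfully comparable.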