Why is adding two std::vectors slower than raw arr

2019-06-20 04:41发布

I'm looking around OpenMP, partially because my program need to make additions of very large vectors (millions of elements). However i see a quite large difference if i use std::vector or raw array. Which i cannot explain. I insist that the difference is only on the loop, not the initialisation of course.

The difference in time I refer to, is only timing the addition, especially not to take into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time. I use g++ 5.x on Ubuntu 16.04.

edit: I tested what @Shadow said, it got me thinking, is there something going on with optimization? If i compile with -O2, then, using raw arrays initialized, I get back for loop scaling with number of threads. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimize before the pragma is seen.

I came up with the following, simple test:

#define SIZE 10000000
#define TRIES 200
int main(){

    std::vector<double> a,b,c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
    for( t = 0; t< TRIES; t++){
       #pragma omp for
       for( i = 0; i< SIZE; i++){
        c[i] = a[i] + b[i];
       }
    }
    }

    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    return 0;

}

I compile with

   g++ -O3 -fopenmp  -std=c++11 main.cpp

And get for one threads

>time ./a.out 
 finished in 2.5638
 ./a.out  2.58s user 0.04s system 99% cpu 2.619 total.

For two threads, loop takes 1.2s, for 1.23 total.

Now if I use raw arrays:

 int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];
    double start = omp_get_wtime();
    unsigned long int i,t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
       for( t = 0; t< TRIES; t++)
       {
          #pragma omp for
          for( i = 0; i< SIZE; i++)
          {
             c[i] = a[i] + b[i];
          }
       }
    }

    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;

    return 0;
}

And i get (1 thread):

>time ./a.out
 finished in 1.92901 
  ./a.out  1.92s user 0.01s system 99% cpu 1.939 total   

std::vector is 33% slower!

For two threads:

>time ./a.out 
finished in 1.20061                                                              
./a.out  2.39s user 0.02s system 198% cpu 1.208 total   

As a comparison, with Eigen or Armadillo for exactly the same operation (using c = a+b overload with vector object), I get for total real time ~2.8s. They are not multi-threaded for vector additions.

Now, i thought the std::vector has almost no overhead? What is happening here? I'd like to use nice standard library objects.

I cannot find any reference anywhere on a simple example like this.

6条回答
爱情/是我丢掉的垃圾
2楼-- · 2019-06-20 04:51

The observed behaviour is not OpenMP-specific and has to do with the way modern operating systems manage memory. Memory is virtual, meaning that each process has its own virtual address (VA) space and a special translation mechanism is used to map pages of that VA space to frames of physical memory. Consequently, memory allocation is performed in two stages:

  • reservation of a region within the VA space - this is what operator new[] does when the allocation is big enough (smaller allocations are handled differently for reasons of efficiency)
  • actually backing the region with physical memory upon access to some part of the region

The process is split in two parts since in many cases applications do not really use at once all the memory they reserve and backing the entire reservation with physical memory might lead to waste (and unlike virtual memory, physical one is a very limited resource). Therefore, backing reservations with physical memory is performed on-demand the very first time the process writes to a region of the allocated memory space. The process is known as faulting the memory region since on most architectures it involves a soft page-fault that triggers the mapping within the OS kernel. Every time your code writes for the first time to a region of memory that is still not backed by physical memory, a soft page-fault is triggered and the OS tries to map a physical page. The process is slow as it involves finding a free page and modification on the process page table. The typical granularity of that process is 4 KiB unless some kind of large pages mechanism is in place, e.g., the Transparent Huge Pages mechanism on Linux.

What happens if you read for the first time from a page that has never been written to? Again, a soft page fault occurs, but instead of mapping a frame of physical memory, the Linux kernel maps a special "zero page". The page is mapped in CoW (copy-on-write) mode, which means that when you try to write it, the mapping to the zero page will be replaced by a mapping to a fresh frame of physical memory.

Now, take a look at the size of the arrays. Each of a, b, and c occupies 80 MB, which exceeds the cache size of most modern CPUs. One execution of the parallel loop thus has to bring 160 MB of data from the main memory and write back 80 MB. Because of how system cache works, writing to c actually reads it once, unless non-temporal (cache-bypassing) stores are used, therefore 240 MB of data is read and 80 MB of data gets written. Multiplied by 200 outer iterations, this gives 48 GB of data read and 16 GB of data written in total.

The above is not the case when a and b are not initialised, i.e. the case when a and b are simply allocated using operator new[]. Since reads in those case result in access to the zero page, and there is physically only one zero page that easily fits in the CPU cache, no real data has to be brought in from the main memory. Therefore, only 16 GB of data has to be read in and then written back. If non-temporal stores are used, no memory is read at all.

This could be easily proven using LIKWID (or any other tool able to read the CPU hardware counters):

std::vector<double> version:

$ likwid-perfctr -C 0 -g HA a.out
...
+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     4.4796 |
|        Runtime unhalted [s]       |     5.5242 |
|            Clock [MHz]            |  2850.7207 |
|                CPI                |     1.7292 |
|  Memory read bandwidth [MBytes/s] | 10753.4669 |
|  Memory read data volume [GBytes] |    48.1715 | <---
| Memory write bandwidth [MBytes/s] |  3633.8159 |
| Memory write data volume [GBytes] |    16.2781 |
|    Memory bandwidth [MBytes/s]    | 14387.2828 |
|    Memory data volume [GBytes]    |    64.4496 | <---
+-----------------------------------+------------+

Version with uninitialised arrays:

+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     2.8081 |
|        Runtime unhalted [s]       |     3.4226 |
|            Clock [MHz]            |  2797.2306 |
|                CPI                |     1.0753 |
|  Memory read bandwidth [MBytes/s] |  5696.4294 |
|  Memory read data volume [GBytes] |    15.9961 | <---
| Memory write bandwidth [MBytes/s] |  5703.4571 |
| Memory write data volume [GBytes] |    16.0158 |
|    Memory bandwidth [MBytes/s]    | 11399.8865 |
|    Memory data volume [GBytes]    |    32.0119 | <---
+-----------------------------------+------------+

Version with uninitialised array and non-temporal stores (using Intel's #pragma vector nontemporal):

+-----------------------------------+------------+
|               Metric              |   Core 0   |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     1.5889 |
|        Runtime unhalted [s]       |     1.7397 |
|            Clock [MHz]            |  2530.1640 |
|                CPI                |     0.5465 |
|  Memory read bandwidth [MBytes/s] |   123.4196 |
|  Memory read data volume [GBytes] |     0.1961 | <---
| Memory write bandwidth [MBytes/s] | 10331.2416 |
| Memory write data volume [GBytes] |    16.4152 |
|    Memory bandwidth [MBytes/s]    | 10454.6612 |
|    Memory data volume [GBytes]    |    16.6113 | <---
+-----------------------------------+------------+

The disassembly of the two versions provided in your question when using GCC 5.3 shows that the two loops are translated to exactly the same sequence of assembly instructions sans the different code address. The sole reason for the difference in the execution time is the memory access as explained above. Resizing the vectors initialises them with zeros, which results in a and b being backed up by their own physical memory pages. Not initialising a and b when operator new[] is used results in their backing by the zero page.

Edit: It took me so long to write this that in the mean time Zulan has written a way more technical explanation.

查看更多
The star\"
3楼-- · 2019-06-20 04:54

In this case, i think the culprit is -funroll-loops, from what i just tests in O2 with and without this option.

https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Optimize-Options.html#Optimize-Options

funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

查看更多
Summer. ? 凉城
4楼-- · 2019-06-20 04:56

Meaningful benchmarking is hard

The answer from Xirema has already outlined in detail the difference in the code. std::vector::reserve initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initalization.

However your measurement doesn't include initialization, and the number of repetitions is so high that the loop costs should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time because the data is initialized?

Minimal example

Let's dig to the core of this with a code that dynamically determines whether memory is initialized or not (Based on Xirema's, but only timing the loop itself).

#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>

constexpr size_t size = 10'000'000;

auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];

    if (initialize) {
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    auto start = std::chrono::steady_clock::now();

    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;

    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}

Results are consistent:

./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms

First glance at performance counters

Using the excellent Linux perf tool:

$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50  
pid: 12481
         626ms

 Performance counter stats for './a.out 50':

       101.589.231      LLC-loads                                                   
           105.415      dTLB-misses                                                 

       0,629369979 seconds time elapsed

$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
        1008ms

 Performance counter stats for './a.out 50 0.':

       145.218.903      LLC-loads                                                   
         1.889.286      dTLB-misses                                                 

       1,096923077 seconds time elapsed

Linear scaling with increasing number of repetitions also tells us, that the difference comes from within the loop. But why would initializing the memory cause more last level cache-loads and data TLB misses?

Memory is complex

To understand that, we need to understand how memory is allocated. Just because a malloc / new returns some pointer to virtual memory, doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory - and the physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm - and the pid we show as output comes in handy. Looking at the page statistics during a long execution of our little benchmark:

With initialization

                 flags  page-count       MB  symbolic-flags         long-symbolic-flags
    0x0000000000000804           1        0  __R________M______________________________ referenced,mmap
    0x000000000004082c         392        1  __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
    0x000000000000086c         335        1  __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
    0x0000000000401800       56721      221  ___________Ma_________t___________________ mmap,anonymous,thp
    0x0000000000005868        1807        7  ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x0000000000405868         111        0  ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
    0x000000000000586c           1        0  __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                 total       59368      231

Most of the virtual memory is in a normal mmap,anonymous region - something that is mapped to a physical address.

Without initialization

             flags  page-count       MB  symbolic-flags         long-symbolic-flags
0x0000000001000000        1174        4  ________________________z_________________ zero_page
0x0000000001400000       37888      148  ______________________t_z_________________ thp,zero_page
0x0000000000000800           1        0  ___________M______________________________ mmap
0x000000000004082c         388        1  __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c         347        1  __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800       18907       73  ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868         633        2  ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868          37        0  ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c           1        0  __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total       59376      231

Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 are mapped to a zero page. The data behind a and b is all backed by a single read-only 4kiB page filled with zeros. c (and a, b in the other test) have already been written to, so it has to have it's own memory.

0 != 0

Now it may look weird: everything here is zero1 - why does it matter how it became zero? Whether you memset(0), a[i] = 0., or std::vector::reserve - everything causes explicit writes to memory, hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset / reserve is to use calloc to explicitly request zero'd memory, which is probably backed by a zero_page, but I doubt it is done (or makes a lot of sense). Remember that for new double[size]; or malloc there is no guarantee what kind of memory you get, but that includes the possibility of zero-memory.

1: Remember that the double 0.0 has all bits set to zero.

In the end the performance difference really comes only from the loop, but is caused by initialization. std::vector carries no overhead for the loop. In the benchmark code, raw arrays just benefit from optimization of an abnormal case of uninitialized data.

查看更多
来,给爷笑一个
5楼-- · 2019-06-20 05:00

I have a good hypothesis.

I've written three versions of the code: one using raw double *, one using std::unique_ptr<double[]> objects, and one using std::vector<double>, and compared the runtimes of each of these versions of the code. For my purposes, I've used a single-threaded version of the code to try to simplify the case.

Total Code::

#include<vector>
#include<chrono>
#include<iostream>
#include<memory>
#include<iomanip>

constexpr size_t size = 10'000'000;
constexpr size_t reps = 50;

auto time_vector() {
    auto start = std::chrono::steady_clock::now();
    {
        std::vector<double> a(size);
        std::vector<double> b(size);
        std::vector<double> c(size);

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_pointer() {
    auto start = std::chrono::steady_clock::now();
    {
        double * a = new double[size];
        double * b = new double[size];
        double * c = new double[size];

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }

        delete[] a;
        delete[] b;
        delete[] c;
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

auto time_unique_ptr() {
    auto start = std::chrono::steady_clock::now();
    {
        std::unique_ptr<double[]> a = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> b = std::make_unique<double[]>(size);
        std::unique_ptr<double[]> c = std::make_unique<double[]>(size);

        for (size_t t = 0; t < reps; t++) {
            for (size_t i = 0; i < size; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    return end - start;
}

int main() {
    std::cout << "Vector took         " << std::setw(12) << time_vector().count() << "ns" << std::endl;
    std::cout << "Pointer took        " << std::setw(12) << time_pointer().count() << "ns" << std::endl;
    std::cout << "Unique Pointer took " << std::setw(12) << time_unique_ptr().count() << "ns" << std::endl;
    return 0;
}

Test Results:

Vector took           1442575273ns //Note: the first one executed, regardless of 
    //which function it is, is always slower than expected. I'll talk about that later.
Pointer took           542265103ns
Unique Pointer took   1280087558ns

So all of STL objects are demonstrably slower than the raw version. Why might this be?

Let's go to the Assembly! (compiled using Godbolt.com, using the snapshot version of GCC 8.x)

There's a few things we can observe to start with. For starters, the std::unique_ptr and std::vector code are generating virtually identical assembly code. std::unique_ptr<double[]> swaps out new and delete for new[] and delete[]. Since their runtimes are within margin of error, we'll focus on the std::unique_ptr<double[]> version and compare that to double *.

Starting with .L5 and .L22, the code seems to be identical. The only major differences are an extra pointer arithmetic before the delete[] calls are made in the double * version, and some extra stack cleanup code at the end in .L34 (std::unique_ptr<double[]> version), which doesn't exist for the double * version. Neither of these seem likely to have strong impact on the code speed, so we're going to ignore them for now.

The code that's identical appears to be the code directly responsible for the loop. You'll notice that the code which is different (which I'll get to momentarily) doesn't contain any jump statements, which are integral to loops.

So all of the major differences appear to be specific to the initial allocation of the objects in question. This is between time_unique_ptr(): and .L32 for the std::unique_ptr<double[]> version, and between time_pointer(): and .L22 for the double * version.

So what's the difference? Well, they're almost doing the same thing. Except for a few lines of code that show up in the std::unique_ptr<double[]> version that don't show up in the double * version:

std::unique_ptr<double[]>:

mov     edi, 80000000
mov     r12, rax
call    operator new[](unsigned long)
mov     edx, 80000000
mov     rdi, rax
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rbx, rax
call    memset //!!!
mov     edi, 80000000
call    operator new[](unsigned long)
mov     rdi, rax
mov     edx, 80000000
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rbp, rax
call    memset //!!!
mov     edi, 80000000
call    operator new[](unsigned long)
mov     r14, rbx
xor     esi, esi //Sets register to 0, which is probably used in...
mov     rdi, rax
shr     r14, 3
mov     edx, 80000000
mov     r13d, 10000000
and     r14d, 1
call    memset //!!!

double *:

mov     edi, 80000000
mov     rbp, rax
call    operator new[](unsigned long)
mov     rbx, rax
mov     edi, 80000000
mov     r14, rbx
shr     r14, 3
call    operator new[](unsigned long)
and     r14d, 1
mov     edi, 80000000
mov     r12, rax
sub     r13, r14
call    operator new[](unsigned long)

Well would you look at that! Some unexpected calls to memset that aren't part of the double * code! It's quite clear that std::vector<T> and std::unique_ptr<T[]> are contracted to "initialize" the memory they allocate, whereas double * has no such contract.

So this is basically a very, very round-about way of verifying what Shadow observed: When you make no attempt to "zero-fill" the arrays, the compiler will

  • Do nothing for double * (saving precious CPU cycles), and
  • Do the initialization without prompting for std::vector<double> and std::unique_ptr<double[]> (costing time initializing everything).

But when you do add zero-fill, the compiler recognizes that it's about to "repeat itself", optimizes out the second zero-fill for std::vector<double> and std::unique_ptr<double[]> (which results in the code not changing) and adds it to the double * version, making it the same as the other two versions. You can confirm this by comparing the new version of the assembly where I've made the following change to the double * version:

double * a = new double[size];
for(size_t i = 0; i < size; i++) a[i] = 0;
double * b = new double[size];
for(size_t i = 0; i < size; i++) b[i] = 0;
double * c = new double[size];
for(size_t i = 0; i < size; i++) c[i] = 0;

And sure enough, the assembly now has those loops optimized into memset calls, the same as the std::unique_ptr<double[]> version! And the runtime is now comparable.

(Note: the runtime of the pointer is now slower than the other two! I observed that the first function called, regardless of which one, is always about 200ms-400ms slower. I'm blaming branch prediction. Either way, the speed should be identical in all three code paths now).

So that's the lesson: std::vector and std::unique_ptr are making your code slightly safer by preventing that Undefined Behavior you were invoking in your code that used raw pointers. The consequence is that it's also making your code slower.

查看更多
一夜七次
6楼-- · 2019-06-20 05:07

I tested it and found out the following: The vector case had a runtime about 1.8 times longer than the raw array case. But this was only the case when I did not initialize the raw array. After adding a simple loop before the time measurement to initialize all entries with 0.0 the raw array case took as long as the vector case.

It took a closer look and did the following: I did not initialize the raw arrays like

for (size_t i{0}; i < SIZE; ++i)
    a[i] = 0.0;

but did it this way:

for (size_t i{0}; i < SIZE; ++i)
    if (a[i] != 0.0)
    {
        std::cout << "a was set at position " << i << std::endl;
        a[i] = 0.0;
    }

(the other arrays accordingly).
The result was that I got no console output from initializing the arrays and it was again as fast as without initializing at all, that is about 1.8 faster than with the vectors.

When I initialized for example only a "normal" and the other two vector with the if clause I measured a time between the vector runtime and the runtime with all arrays "fake initialized" with the if clause.

Well... that's strange...

Now, i thougt the std::vector has almost no overhead ? What is happening here ? I'd like to use nice STL objects...

Although I cannot explain you this behavior, I can tell you that there is not really an overhead for std::vector if you use it "normal". This is just a very artificial case.

EDIT:

As qPCR4vir and the OP Napseis pointed out this might have to do with optimization. As soon as I turned on optimization the "real init" case was about the already mentioned factor of 1.8 slower. But without it was still about 1.1 times slower.

So I looked at the assembler code but I did not saw any difference in the 'for' loops...

查看更多
一夜七次
7楼-- · 2019-06-20 05:13

The major thing to notice here is the fact that

The array version has undefined behavior

dcl.init #12 states:

If an indeterminate value is produced by an evaluation, the behavior is undefined

And this is exactly what happens in that line:

c[i] = a[i] + b[i];

Both a[i] and b[i] are indeterminate values since the arrays are default-initialized.

The UB perfectly explains the measuring results (whatever they are).

UPD: In the light of @HristoIliev and @Zulan answers I'd like to emphasize language POV once more.

The UB of reading uninitialized memory for the compiler essentialy means that it can always assume that memory is initialized, so whatever the OS does is fine with C++, even if the OS has some specific behavior for that case.

Well it turns out that it does - your code is not reading the physical memory and your measurements correspond to that.

One could say that the resulting program does not compute the sum of two arrays - it computes the sum of two more easily accessible mocks, and it is fine with C++ exactly because of the UB. If it did something else, it would still be perfectly fine.

So in the end you have two programs: one adds up two vectors and the other just does something undefined (from C++ point of view) or something unrelated (from OS point of view). What is the point of measuring their timings and comparing the results?

Fixing the UB solves the whole problem, but more importantly it validates your measurements and allows you to meaningfully compare the results.

查看更多
登录 后发表回答