C++ Warming std vector

2020-03-24 04:27发布

问题:

Why filling std::vector second time is FASTER? Even if space was reserved from the beggining?

int total = 1000000;

struct BaseClass {
  float m[16];
  int id;

  BaseClass(int _id) { id = _id; }
};

int main() {

  std::vector<BaseClass> ar;
  ar.reserve(total);

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  return 0;
}

online preview: http://coliru.stacked-crooked.com/a/229e4ba47adddb1a

RESULTS:
118
23
21

P.S. I'm asking why it becomes faster if the only reason for slowdown for vector is allocation/reallocation. And we allocated array BEFORE start.

回答1:

The reason that the first run is slower than the other two is that the runtime has not yet gotten the memory pages from the OS.

I instrumented your program to output the number of major and minor page faults the task had taken at the beginning, and after each of the three stages above. (Note: This works on Linux. No idea if it'll work on whatever OS you're on.) Code:

Note: updated to latest, with reserve() moved to the top and wrapped in its own getrusage call.

#include <ctime>
#include <chrono>
#include <iostream>
#include <vector>

#include <sys/time.h>
#include <sys/resource.h>

using namespace std;

int total = 1000000;

struct BaseClass {
  float m[16];
  int id;

  BaseClass(int _id) { id = _id; }
};

int main() {

  std::vector<BaseClass> ar;
  struct rusage r;
  {
    auto t_start = std::chrono::high_resolution_clock::now();
     }

  getrusage(RUSAGE_SELF, &r);
  cout << "minflt: " << r.ru_minflt << " majflt: " << r.ru_majflt << endl;

  ar.reserve(total);

  getrusage(RUSAGE_SELF, &r);
  cout << "minflt: " << r.ru_minflt << " majflt: " << r.ru_majflt << endl;

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  getrusage(RUSAGE_SELF, &r);
  cout << "minflt: " << r.ru_minflt << " majflt: " << r.ru_majflt << endl;

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  getrusage(RUSAGE_SELF, &r);
  cout << "minflt: " << r.ru_minflt << " majflt: " << r.ru_majflt << endl;

  {
    auto t_start = std::chrono::high_resolution_clock::now();
    for (int var = 0; var < total; ++var) {
      ar.emplace_back(var);
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     t_end - t_start).count() << "\n";
    ar.clear();
  }

  getrusage(RUSAGE_SELF, &r);
  cout << "minflt: " << r.ru_minflt << " majflt: " << r.ru_majflt << endl;

  return 0;
}

I then ran it on my box. The result is enlightening:

minflt: 343 majflt: 0
minflt: 367 majflt: 0
48    minflt: 16968 majflt: 0
16
minflt: 16968 majflt: 0
15
minflt: 16968 majflt: 0

Notice that the first measured for-loop incurred over 16,000 minor faults. Those faults are what make the memory available to the application and account for the slower running time. No additional faults happen thereafter. In contrast, the reserve() call itself only incurred 24 minor faults.

In most modern virtual-memory OSes, the OS implements lazy memory allocation, even if the software running on it does not. When the runtime requests additional memory from the OS, the OS makes a note of the request. If the request succeeds, the runtime now has a new range of virtual addresses available to it. (Details vary depending on the API called and the OS, but the essence is the same.) The OS may point the virtual address range to a single zero-filled page marked read-only.

The OS does not necessarily make those pages immediately available to the task. Rather, the OS waits until the task actually tries to write to the memory it's allocated. At that point, the OS allocates a physical page to back the virtual page allocated to the task. That registers as a "minor fault" in UNIX parlance. That process can be expensive.

It's that lazy allocation that your task is measuring.

To prove that, I did an strace of the application as well. The meaningful portion is below.

getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe3aa339000
write(1, "minflt: 328 majflt: 0\n", 22) = 22
mmap(NULL, 68001792, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe3a551c000
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
write(1, "minflt: 352 majflt: 0\n", 22) = 22
write(1, "52\n", 3)                     = 3
getrusage(RUSAGE_SELF, {ru_utime={0, 30000}, ru_stime={0, 20000}, ...}) = 0
write(1, "minflt: 16953 majflt: 0\n", 24) = 24
write(1, "20\n", 3)                     = 3
getrusage(RUSAGE_SELF, {ru_utime={0, 50000}, ru_stime={0, 20000}, ...}) = 0
write(1, "minflt: 16953 majflt: 0\n", 24) = 24
write(1, "15\n", 3)                     = 3
getrusage(RUSAGE_SELF, {ru_utime={0, 70000}, ru_stime={0, 20000}, ...}) = 0
write(1, "minflt: 16953 majflt: 0\n", 24) = 24
munmap(0x7fe3a551c000, 68001792)        = 0
exit_group(0)                           = ?

As you can see, the task allocated memory with an mmap call between the first two getrusage system calls. And yet, that step only incurred 24 minor faults. So, even though C++ was not being lazy, Linux was being lazy about giving the memory to the task.

Specifically, the first mmap call appears to allocate an I/O buffer for the first write mesage. The second mmap call (allocating 68001792 bytes) happens before the second getrusage call. And yet, you can see only 24 additional faults occurred between the two on this run.

The hawk-eyed among you will notice the numbers are slightly different for this run than the numbers I showed above. I've run this executable many times, and the numbers shift slightly each time. But, they're always in the same general ballpark.