Kernel copying CoW pages after child process exit

In Linux, whenever a process is forked, the memory mappings of the parent process are cloned into the child process. In reality, for performance reasons, the pages are set to be copy-on-write -- initially they are shared and, in the event one of the two processes writing on one of them, they will then be cloned (MAP_PRIVATE).

This is a very common mechanism of getting a snapshot of the state of a running program -- you do a fork, and this gives you a (consistent) view of the memory of the process at that point in time.

I did a simple benchmark where I have two components:

A parent process that has a pool of threads writing into an array
A child process that has a pool of threads making a snapshot of the array and unmapping it

Under some circumstances (machine/architecture/memory placement/number of threads/...) I am able to make the copy finish much earlier than the threads write into the array.

However, when the child process exits, in htop I still see most of the CPU time being spent in the kernel, which is consistent to it being used to handle the copy-on-write whenever the parent process writes to a page.

In my understanding, if an anonymous page marked as copy-on-write is mapped by a single process, it should not be copied and instead should be used directly.

How can I be sure that this is indeed time being spent copying the memory?

In case I'm right, how can I avoid this overhead?

The core of the benchmark is below, in modern C++.

Define WITH_FORK to enable the snapshot; leave undefined to disable the child process.

#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <numaif.h>
#include <numa.h>

#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <numeric>
#include <thread>
#include <vector>

#define ARRAY_SIZE 1073741824 // 1GB
#define NUM_WORKERS 28
#define NUM_CHECKPOINTERS 4
#define BATCH_SIZE 2097152 // 2MB

using inttype = uint64_t;
using timepoint = std::chrono::time_point<std::chrono::high_resolution_clock>;

constexpr uint64_t NUM_ELEMS() {
  return ARRAY_SIZE / sizeof(inttype);
}

int main() {

  // allocate array
  std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();
  std::array<inttype, NUM_ELEMS()> & array = *arrayptr;

  // allocate checkpoint space
  std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();
  std::array<inttype, NUM_ELEMS()> & cp = *cpptr;

  // initialize array
  std::fill(array.begin(), array.end(), 123);

#ifdef WITH_FORK
  // spawn checkpointer threads
  int pid = fork();
  if (pid == -1) {
    perror("fork");
    exit(-1);
  }

  // child process -- do checkpoint
  if (pid == 0) {
    std::array<std::thread, NUM_CHECKPOINTERS> cpthreads;
    for (size_t tid = 0; tid < NUM_CHECKPOINTERS; tid++) {
      cpthreads[tid] = std::thread([&, tid] {
        // copy array
        const size_t numBatches = ARRAY_SIZE / BATCH_SIZE;
        for (size_t i = tid; i < numBatches; i += NUM_CHECKPOINTERS) {
          void *src = reinterpret_cast<void*>(
            reinterpret_cast<intptr_t>(array.data()) + i * BATCH_SIZE);
          void *dst = reinterpret_cast<void*>(
            reinterpret_cast<intptr_t>(cp.data()) + i * BATCH_SIZE);
          memcpy(dst, src, BATCH_SIZE);
          munmap(src, BATCH_SIZE);
        }
      });
    }
    for (std::thread& thread : cpthreads) {
      thread.join();
    }
    printf("CP finished successfully! Child exiting.\n");
    exit(0);
  }
#endif  // #ifdef WITH_FORK

  // spawn worker threads
  std::array<std::thread, NUM_WORKERS> threads;
  for (size_t tid = 0; tid < NUM_WORKERS; tid++) {
    threads[tid] = std::thread([&, tid] {
      // write to array
      std::array<inttype, NUM_ELEMS()>::iterator it;
      for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
        *it = tid;
      }
    });
  }

  timepoint tStart = std::chrono::high_resolution_clock::now();

#ifdef WITH_FORK
  // allow reaping child process while workers work
  std::thread childWaitThread = std::thread([&] {
    if (waitpid(pid, nullptr, 0)) {
      perror("waitpid");
    }
    timepoint tChild = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> durationChild = tChild - tStart;
    printf("reunited with child after (s): %lf\n", durationChild.count());
  });
#endif

  // wait for workers to finish
  for (std::thread& thread : threads) {
    thread.join();
  }
  timepoint tEnd = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> duration = tEnd - tStart;
  printf("duration (s): %lf\n", duration.count());

#ifdef WITH_FORK
  childWaitThread.join();
#endif
}

The size of the array is 1GB, which is about 250K pages, where each page is 4KB in size. For this program, the number of page faults that occur due writing to CoW pages can be easily estimated. It can also measured using the Linux perf tool. The new operator initializes the array to zero. So the following line of code:

std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();

will cause about 250K page faults. Similarly, the following line of the code:

std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();

will cause another 250K page faults. All of these page faults are minor, i.e., they can be handled without accessing the disk drive. Allocating two 1GB arrays will not cause any major faults for a system with much more physical memory.

At this point, about 500K page faults have already occurred (there will be other pages faults caused by other memory accesses from the program, of course, but they can be neglected). The execution of std::fill will not cause any minor faults but the virtual pages of the arrays have already been mapped to dedicated physical pages.

The execution of the program then proceeds to forking the child process and creating the worker threads of the parent process. The creation of the child process by itself is sufficient to make a snapshot of the array, so there is really no need to do anything in the child process. In fact, when the child process is forked, the virtual pages of both arrays are marked as copy-on-write. The child process reads from arrayptr and writes to cpptr, which results in additional 250K minor faults. The parent process also writes to arrayptr, which also results in additional 250K minor faults. So making a copy in the child process and unmapping the pages does not improve performance. On the contrary, the number of page faults is doubled and performance is significantly degraded.

You can measure the number of minor and major faults using the following command:

perf stat -r 3 -e minor-faults,major-faults ./binary

This will, by default, count minor and major faults for the whole process tree. The -r 3 option tells perf to repeat the experiment three times and report the average and standard deviation.

I noticed also that the total number of threads is 28 + 4. The optimal number of threads is approximately equal to the total number of online logical cores on your system. If the number of threads is much larger or much smaller than that, performance will be degraded due to the overhead of creating too many threads and switching between them.

Another potential issue may exist in the following loop:

for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
  *it = tid;
}

Different threads may try to write more than once to the same cache line at the same time, resulting in false sharing. This may not be a significant issue depending on the size of the cache line of your processor, the number of threads, and whether all cores are running at the same frequency, so it's hard to say without measuring. A better loop shape would be having the elements of each thread to be contiguous in the array.