My understanding of copy-on-write is that "Everyone has a single, shared copy of the same data until it's written, and then a copy is made".
- Is a shared copy of the same data comprised of a heap and bss segment or only heap?
- Which memory segments will be shared, and is this dependent on the OS?
To better understand, you should eliminate the term segment from your vocabulary. Most systems work on pages, not segments. In 64-bit mode on Intel/AMD processors, segmentation has all but gone away.
You should be asking, "What pages are affected in copy on write."
That would be pages that are writable and shared by multiple processes, at the moment one of those processes writes to one of them.
This can happen after a fork. One way to implement forking is to create a complete copy of the parent process's address space. However, that could be a lot of effort, especially because most of the time one does an exec in the child right after the fork.
An alternative is to have the parent and children share the same memory. That works fine for read-only memory but has obvious problems if multiple processes can write to the same memory.
This can be overcome by having the processes share read/write memory until a process writes to it. At that point, the page becomes unshared for the writing process: the OS allocates a new page frame, maps it into that process's address space, copies the original data to it, and then allows the writing process to continue.
The OS can set whatever "copy on write" policy it wishes, but generally, they all do the same thing (i.e. what makes the most sense).
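To make the fork-then-exec pattern mentioned above concrete, here is a minimal sketch. The function name `run_in_child` is made up for illustration, and it assumes `/bin/sh` exists. The point is that the child touches very little memory before `exec` throws the whole address space away, which is why copying everything up front would be wasted effort.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks, replaces the child with `sh -c <command>`,
 * and returns the child's exit status (or -1 if something failed). */
int run_in_child(const char *command)
{
    assert(command != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: almost nothing is written before exec replaces the image. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);   /* only reached if exec itself failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_in_child("exit 7")` from `main` should hand back the shell's exit status.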
Loosely, for a POSIX-like system (linux, BSD, OSX), there are four areas (what you were calling segments) of interest: `data` (where `int x = 1;` goes), `bss` (where `int y` goes), `sbrk` (this is heap/malloc), and `stack`.
When a `fork` is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of the parent and the child, all the writable pages are marked read-only.
Each shared page also has a reference count that indicates how many processes are sharing the page. Before the fork, the refcount will be 1 and, after, it will be 2.
Now, when either process tries to write to a R/O page, it will get a page fault. The OS will see that this is for "copy on write", will create a private page for the process, copy in the data from the shared page, mark the page as writable for that process and resume it.
It will also bump down the refcount. If the refcount is now [again] 1, the OS will mark the page in the other process as writable and non-shared [this eliminates a second page fault in the other process--a speedup only because at this point the OS knows that the other process should be free to write unmolested again]. This speedup could be OS dependent.
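The visible effect of all this can be sketched in a few lines of C. This only shows the semantics (the child's writes never reach the parent), not the page faults themselves, which happen inside the kernel. The function and variable names are made up for the example, with one variable in each of the four areas.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int in_data = 1;   /* initialized global: data area */
static int in_bss;        /* uninitialized global: bss area, starts as 0 */

/* Hypothetical demo: the child overwrites one variable in each area, then
 * the parent adds up what it still sees. Returns -1 on failure. */
int parent_view_after_child_writes(void)
{
    int on_stack = 3;                          /* stack area */
    int *on_heap = malloc(sizeof *on_heap);    /* heap/malloc area */
    if (on_heap == NULL)
        return -1;
    *on_heap = 4;

    pid_t pid = fork();
    if (pid < 0) {
        free(on_heap);
        return -1;
    }

    if (pid == 0) {
        /* Child: these stores land in pages that were shared by the fork,
         * so the kernel has to give the child its own copies. */
        in_data = 100;
        in_bss = 200;
        on_stack = 300;
        *on_heap = 400;
        _exit(in_data + in_bss + on_stack + *on_heap == 1000 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        free(on_heap);
        return -1;
    }
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

    /* Parent: its own pages were never modified by the child. */
    int sum = in_data + in_bss + on_stack + *on_heap;
    free(on_heap);
    return sum;
}
```

The parent still sees its original values (1, 0, 3 and 4) even though the child wrote to every one of them.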
Actually, the `bss` section gets even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the `bss` area could be gigabytes in size and it will only occupy a single physical page. This single, special, zero page is shared amongst all `bss` sections of all processes, regardless of whether they have any relationship to one another at all.
Thus, a process can read from any page in the area and get what it expects: zero. Only when the process tries to write to such a page does the same copy on write mechanism kick in: the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.
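A rough way to see the `bss` behavior from user space is sketched below. The 64 MiB size and all of the names are arbitrary choices for the example. Whether the reads are really served by one shared zero page is up to the kernel, so the code only checks what a program is entitled to observe: every page reads as zero until it is written, and writing one page leaves the rest alone.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define BIG_BSS_BYTES ((size_t)64 * 1024 * 1024)   /* arbitrary demo size */
static unsigned char big_bss[BIG_BSS_BYTES];        /* uninitialized: bss */

static size_t page_size(void)
{
    long p = sysconf(_SC_PAGESIZE);
    assert(p > 0);
    return (size_t)p;
}

size_t bss_page_count(void)
{
    return BIG_BSS_BYTES / page_size();
}

/* Reads one byte from every page and counts the pages that are not zero. */
size_t bss_nonzero_pages(void)
{
    size_t step = page_size();
    size_t nonzero = 0;
    for (size_t off = 0; off < BIG_BSS_BYTES; off += step)
        if (big_bss[off] != 0)
            nonzero++;
    return nonzero;
}

/* Writes to exactly one page; only that page needs a private frame. */
void bss_touch_page(size_t index)
{
    assert(index < bss_page_count());
    big_bss[index * page_size()] = 1;
}
```

Before any write, `bss_nonzero_pages()` reports zero pages; after `bss_touch_page(5)` it reports one.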
Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.
When an `exec` syscall is done [on the child], the kernel has to undo much of the mapping done during the `fork`: bumping down refcounts, releasing the child's mapping, etc., and restoring the parent's original page protections (i.e. it will no longer be sharing its pages unless it does another `fork`).
Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an `exec` syscall.
When a process does an `exec`, the kernel does the cleanup above, and reads a small portion of the executable file to determine its object format. The dominant format is ELF, but any format that a kernel understands can be used (e.g. OSX uses Mach-O rather than ELF).
For ELF, the executable has a special section that gives a full FS path to what's known as the "ELF interpreter", which is a shared library, and is usually `/lib64/ld-linux-x86-64.so.2` [on x86-64 linux].
The kernel, using an internal form of `mmap`, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".
Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where it comes from. For heap/malloc, this is generally the swap disk [aka paging disk].
Under linux, it's generally the partition of type "linux swap" that was added when the system was installed. When a page that has been written to has to be flushed to disk to free up some physical memory, it gets written there. Note that the page sharing algorithm in the first section still applies.
Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.
So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.
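If you want to see which ELF interpreter a given executable asks for, the path is recorded in the file's `PT_INTERP` program header. Below is a minimal sketch that reads it, assuming a linux system with `<elf.h>` and a 64-bit ELF file. The function name `elf64_interp_path` is made up for the example, and error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Returns 0 and fills `out` with the interpreter path,
 * 1 if the file is a 64-bit ELF with no PT_INTERP (e.g. statically linked),
 * and -1 on any error or if the file is not a 64-bit ELF. */
int elf64_interp_path(const char *path, char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    int rc = -1;
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) == 1 &&
        memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
        eh.e_ident[EI_CLASS] == ELFCLASS64) {
        rc = 1;   /* valid ELF64; no interpreter found yet */
        for (unsigned i = 0; i < eh.e_phnum; i++) {
            Elf64_Phdr ph;
            long where = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);
            if (fseek(f, where, SEEK_SET) != 0 ||
                fread(&ph, sizeof ph, 1, f) != 1) {
                rc = -1;
                break;
            }
            if (ph.p_type != PT_INTERP)
                continue;
            /* The segment holds the NUL-terminated interpreter path. */
            if (ph.p_filesz == 0 || ph.p_filesz > out_len ||
                fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
                fread(out, 1, (size_t)ph.p_filesz, f) != (size_t)ph.p_filesz) {
                rc = -1;
                break;
            }
            out[ph.p_filesz - 1] = '\0';
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

On linux, passing `"/proc/self/exe"` inspects the running program itself; `readelf -l` on the same file shows the same information.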
The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs and the kernel loads that page from the backing store (e.g. the ELF interpreter's file) and changes the mapping to R/O but present.
This occurs for the ELF interpreter, shared libraries, and the executable itself.
The ELF interpreter will now use `mmap` to map `libc` into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any page that has the file as the backing store, like a `data` page], a protection fault occurs, the kernel changes the backing store for the page from the on disk file to a page on the swap disk, adjusts the protections, and resumes the app.
The kernel must also handle the case where the ELF interpreter (e.g.) is trying to write to [say] a `data` page that had never yet been loaded (i.e. it has to load it first and then change the backing store to the swap disk).
The ELF interpreter then uses portions of `libc` to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.
However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using `mmap`, create a mapping for the shared libraries the executable needs (i.e. what you see when you do `ldd executable`).
These mappings to shared libraries and executables can be thought of as "segments".
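The "write to a page that has a file as its backing store" case described above can be imitated from user space with a private file mapping. This is only a sketch of the same idea, not what the ELF interpreter literally does. The temp-file name and the function name are made up, and `_DEFAULT_SOURCE` assumes glibc.

```c
#define _DEFAULT_SOURCE   /* assumption: glibc; exposes mkstemp and pread */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo. Returns 1 if a store through a MAP_PRIVATE mapping
 * left the file itself untouched, 0 if the file changed, -1 on error. */
int private_write_leaves_file_intact(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";   /* made-up temp-file template */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* the file lives on until fd is closed */

    const char original[] = "backing store";
    int result = -1;

    if (write(fd, original, sizeof original) == (ssize_t)sizeof original) {
        char *map = mmap(NULL, sizeof original, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            /* Reading pulls the page in from the file (demand loading). */
            assert(memcmp(map, original, sizeof original) == 0);

            /* Writing detaches this page from the file for this process. */
            map[0] = 'B';

            char on_disk[sizeof original];
            if (pread(fd, on_disk, sizeof on_disk, 0) == (ssize_t)sizeof on_disk)
                result = (memcmp(on_disk, original, sizeof original) == 0 &&
                          map[0] == 'B');
            munmap(map, sizeof original);
        }
    }
    close(fd);
    return result;
}
```

The mapping sees the modified byte while the file on disk keeps its original contents, which is the user-visible side of the kernel switching that page's backing store away from the file.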
There is a symbol jump table that points back to the interpreter in each shared library. But, the ELF interpreter makes minimal changes.
[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol, adjusts the GOT so that it now points directly to the final address for the symbol, and redoes the call, which will now call the real function. On a subsequent call to the same function, it now goes directly there.
This is called "on demand linking".
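As a toy model of the idea (plain C function pointers, not the real GOT/PLT machinery), the sketch below keeps one "slot" that starts out pointing at a stub. The first call goes through the stub, which looks up the real function, patches the slot, and redoes the call. Later calls go straight to the real function. Every name here is invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*fn_t)(int);

/* Stands in for a function that lives in some shared library. */
static int real_square(int x) { return x * x; }

static int resolver_calls;        /* how many times the slow path ran */
static int square_stub(int x);    /* forward declaration for the slot */

/* One "GOT slot": initially points at the stub, not the real function. */
static fn_t got_square = square_stub;

/* Stands in for the interpreter's symbol lookup. */
static fn_t resolve_symbol(const char *name)
{
    resolver_calls++;
    return strcmp(name, "square") == 0 ? real_square : NULL;
}

/* Slow path: resolve, patch the slot, then redo the call. */
static int square_stub(int x)
{
    fn_t target = resolve_symbol("square");
    assert(target != NULL);
    got_square = target;
    return target(x);
}

/* What the application's call site does: jump through the slot. */
int call_square(int x) { return got_square(x); }

int resolver_call_count(void) { return resolver_calls; }
```

Calling `call_square` twice runs the resolver only once; the second call never leaves the fast path.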
A by-product of all this `mmap` activity is that the classical `sbrk` syscall is of much less use. It could eventually collide with one of the shared library memory mappings.
So, a modern `libc` does not rely on it alone [glibc, for one, still uses it for small allocations but switches to `mmap` for large ones, or if the break cannot grow]. When `malloc` needs more memory from the OS this way, it requests it with an anonymous `mmap` and keeps track of which allocations belong to which `mmap` mapping (i.e. if enough memory got freed to comprise an entire mapping, `free` could do a `munmap`).
So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. It seems complex, but it makes `fork` and `exec` go quickly and smoothly, and the extra overhead is incurred only when needed ("on demand").
Thus, instead of a large lurch/delay at the beginning launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.