I was thinking about how the Linux kernel implements system calls and I was wondering if someone could give me a high level view of how sbrk/brk work?
I've reviewed the kernel code, but there is just so much of it and I don't understand it. I was hoping for a summary from someone?
At a very high level, the Linux kernel tracks the memory visible to a process as several "memory areas" (struct vm_area_struct). There is also a structure that represents (again, at a very high level) a process's whole address space (struct mm_struct). Each process (except some kernel threads) has exactly one struct mm_struct, which in turn points to all the struct vm_area_struct instances for the memory it can access.
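To make that relationship concrete, here is a minimal toy model in C. It is only a sketch: `toy_vma`, `toy_mm`, and `toy_find_vma` are hypothetical names invented for illustration, the real kernel structures carry many more fields, and the kernel does not store areas in a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct vm_area_struct: one contiguous range of addresses. */
struct toy_vma {
    unsigned long start; /* first address inside the area */
    unsigned long end;   /* first address past the area */
};

/* Toy stand-in for struct mm_struct: the whole address space of a process. */
struct toy_mm {
    struct toy_vma *areas; /* assumed sorted and non-overlapping */
    size_t count;          /* number of areas */
    unsigned long brk;     /* current end of the heap */
};

/* Return the area that contains addr, or NULL if addr is not mapped. */
const struct toy_vma *toy_find_vma(const struct toy_mm *mm, unsigned long addr)
{
    assert(mm != NULL);
    for (size_t i = 0; i < mm->count; i++) {
        if (addr >= mm->areas[i].start && addr < mm->areas[i].end)
            return &mm->areas[i];
    }
    return NULL;
}
```

An address that falls in none of the areas is simply not part of the process's address space, which is the situation that ends in a segmentation fault.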
The sys_brk system call (found in mm/mmap.c) simply adjusts some of these memory areas (sbrk is a glibc wrapper around brk). It does so by comparing the old value of the brk address (stored in struct mm_struct) with the requested value.
It would be simpler to look at the mmap family of functions first, since brk is a special case of it.
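For comparison, this is roughly what asking for memory through mmap looks like from user space. It is a minimal sketch: `map_anonymous` and `unmap_region` are hypothetical helper names, while `mmap`, `munmap`, and the flags are the standard POSIX/Linux interfaces.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of private, zero-filled memory; return NULL on failure. */
void *map_anonymous(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give the region back to the kernel; returns 0 on success. */
int unmap_region(void *p, size_t len)
{
    return munmap(p, len);
}
```

Unlike the break, a region obtained this way can sit anywhere in the address space and can be released on its own, regardless of what else is still in use.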
You have to understand how virtual memory works, and how an MMU mapping relates to real RAM.
Real RAM is divided into pages, traditionally 4 kB each. Each process has its own MMU mapping, which presents that process with a linear memory space (4 GB on 32-bit Linux). Of course, not all of it is actually allocated. At first it's almost empty; that is, no real page is associated with most addresses.
When the process hits a non-allocated address (whether trying to read, write, or execute it), the MMU generates a fault (similar to an interrupt), and the VM system is invoked. If it decides that some RAM should be there, it picks an unused RAM page and associates it with that address range.
That way, the kernel doesn't care how the process uses memory, and the process doesn't really care how much RAM there is; it will always have the same linear 4 GB of address space.
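The page arithmetic behind this is simple enough to write down. The sketch below assumes the traditional 4 kB page and a 32-bit address; the function names are made up for illustration, and real hardware splits the page number further across several page-table levels.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12u /* 2^12 = 4096 bytes per page */

/* Which page an address falls in. */
uint32_t page_number(uint32_t addr)
{
    return addr >> PAGE_SHIFT_4K;
}

/* Where inside that page the address lands. */
uint32_t page_offset(uint32_t addr)
{
    return addr & ((1u << PAGE_SHIFT_4K) - 1u);
}

/* How many 4 kB pages an address space of the given width contains. */
uint64_t pages_in_space(unsigned bits)
{
    assert(bits >= PAGE_SHIFT_4K && bits <= 63);
    return (uint64_t)1 << (bits - PAGE_SHIFT_4K);
}
```

So a 32-bit address space has about a million possible pages, and the MMU mapping only needs a real RAM page behind the ones the process has actually touched.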
Now, brk/sbrk work at a slightly higher level: in principle, any memory address 'beyond' the mark they set (the program break) is invalid and won't get a RAM page if accessed; the process would be killed instead. The userspace library manages memory allocations within this limit, and only asks the kernel to increase it when needed.
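From user space, moving that mark looks something like this. It is a minimal sketch assuming Linux with glibc; `move_break` is a hypothetical helper, while `sbrk` itself is the real library call (`sbrk(0)` just reports the current break).

```c
#define _DEFAULT_SOURCE /* for sbrk on glibc */
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Move the break by delta bytes and return how far it actually moved
 * (0 if the kernel refused the request). */
intptr_t move_break(intptr_t delta)
{
    char *before = sbrk(0);          /* current break */
    assert(before != (char *)-1);
    if (sbrk(delta) == (void *)-1)
        return 0;                    /* request refused, break unchanged */
    char *after = sbrk(0);
    return (intptr_t)(after - before);
}
```

The C library's allocator may move the break as well, so a real program generally should not mix direct sbrk calls with malloc; this is only meant to show the mechanism.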
But even if a process started by setting brk to the maximum allowed, it wouldn't get real RAM pages allocated until it starts accessing all those memory addresses.
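A small experiment along those lines, as a sketch: grow the break by a large amount but write to only a few pages. `reserve_and_touch` is a hypothetical helper, and the claim that only the touched pages get RAM assumes a typical Linux configuration with demand paging; the code itself does not measure physical memory.

```c
#define _DEFAULT_SOURCE /* for sbrk on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Grow the break by `bytes`, write one byte into each of the first
 * `touch_pages` pages, then shrink the break back.
 * Returns the number of pages written to, or -1 on failure. */
long reserve_and_touch(size_t bytes, size_t touch_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    assert(touch_pages * (size_t)page <= bytes);

    char *base = sbrk((intptr_t)bytes);   /* returns the old break */
    if (base == (char *)-1)
        return -1;

    long touched = 0;
    for (size_t i = 0; i < touch_pages; i++) {
        base[i * (size_t)page] = 1;        /* first access faults the page in */
        touched++;
    }

    if (sbrk(-(intptr_t)bytes) == (void *)-1)
        return -1;
    return touched;
}
```

Calling it with, say, 16 MiB and 4 pages makes the address range valid for the whole 16 MiB, but only the four written pages should ever need backing RAM.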
Well, from a super-high level perspective, the kernel allocates a pageable block of memory, modifies the page tables of the process requesting that block so that the memory is mapped into the process's VA space, then returns the address.
A key concept in how the Linux kernel passes memory to a user process is that the process's available heap (the data segment) grows upward from the bottom. The kernel does not keep track of individual chunks of memory, only one contiguous block. The brk/sbrk system calls expand the amount of memory the process has, but it's up to the process to manage it in usable pieces.
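To show what "managing it in usable pieces" means at its simplest, here is a sketch of a bump allocator built on sbrk. `bump_alloc` and the 16-byte alignment are illustrative choices, not how any particular C library does it; real allocators also keep free lists so that memory can be reused.

```c
#define _DEFAULT_SOURCE /* for sbrk on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define BUMP_ALIGN 16u

/* Hand out n bytes by pushing the break upward; never frees anything.
 * Returns NULL if the kernel refuses to move the break. */
void *bump_alloc(size_t n)
{
    assert(n > 0);
    /* Round the request up to a multiple of the alignment. */
    size_t size = (n + BUMP_ALIGN - 1) & ~(size_t)(BUMP_ALIGN - 1);

    /* Pad so the returned pointer itself is aligned. */
    uintptr_t cur = (uintptr_t)sbrk(0);
    size_t pad = (BUMP_ALIGN - cur % BUMP_ALIGN) % BUMP_ALIGN;

    void *p = sbrk((intptr_t)(pad + size)); /* returns the old break */
    if (p == (void *)-1)
        return NULL;
    return (char *)p + pad;
}
```

The kernel only ever sees the break moving; which bytes belong to which allocation is entirely the process's own bookkeeping.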
A key consequence of this is that unused memory scattered across the process's address space cannot be returned to the operating system for other uses. Only memory at the very end of the data segment can be returned to the operating system, so in-use memory near the end would have to be shifted downward toward the bottom before the freed space could be released. In practice almost no allocators do this. For this reason, it's usually important to do a good job of managing the maximum amount of memory a process uses, because that determines how much memory will be left for other processes.
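A toy calculation makes the point. The sketch below models the heap as a list of blocks from bottom to top and counts how many bytes could be handed back by lowering the break; `toy_block` and `releasable_bytes` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One chunk of the heap, listed from the bottom of the heap to the top. */
struct toy_block {
    size_t size;
    bool in_use;
};

/* Bytes that could be returned by lowering the break: only the run of
 * free blocks at the very top counts, free blocks lower down do not. */
size_t releasable_bytes(const struct toy_block *blocks, size_t count)
{
    assert(blocks != NULL || count == 0);
    size_t total = 0;
    for (size_t i = count; i > 0; i--) {
        if (blocks[i - 1].in_use)
            break;                 /* first in-use block from the top stops us */
        total += blocks[i - 1].size;
    }
    return total;
}
```

With blocks of 100 (free), 50 (in use), 30 (free), and 20 (free) bytes, only the top 50 bytes can be released; the 100 free bytes at the bottom stay with the process. If the topmost block were in use instead, nothing could be released at all.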