What are the alignment requirements for sys_brk?

Posted 2019-09-02 10:04

Question:

I'm using the sys_brk syscall to dynamically allocate memory on the heap. I noticed that when querying the current break location I usually get a value similar to this:

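mov rax, 0x0C    ; syscall number 12 = brk on x86-64 Linux
mov rdi, 0x00    ; 0 is never a valid break, so the kernel just returns the current one
syscall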

results in

rax   0x401000

The value is usually 512-byte aligned. So I would like to ask: are there any alignment requirements on the break value, or can we misalign it any way we want?

Answer 1:

The kernel does track the break with byte granularity, so there is no alignment requirement on the value you pass. But don't use it directly for small allocations if you care at all about performance.


There was some discussion in the comments about the kernel rounding the break to a page boundary, but that's not the case. The implementation of sys_brk uses this (with my comments added so it makes sense out of context):

newbrk = PAGE_ALIGN(brk);     // the syscall arg
oldbrk = PAGE_ALIGN(mm->brk); // the current break
if (oldbrk == newbrk)
    goto set_brk;      // no need to map / unmap any pages, just update mm->brk

This checks if the break moved to a different page, but eventually mm->brk = brk; sets the current break to the exact arg passed to the system call (if it's valid). If the current break was always page aligned, the kernel wouldn't need PAGE_ALIGN() on it.


Of course, memory protection has at least page granularity (and maybe hugepage, if the kernel chooses to use anonymous hugepages for this mapping). So you can access memory out to the end of the page containing the break without faulting. This is why the kernel code is just checking if the break moved to a different page to skip the map / unmap logic, but still updates the actual brk.

AFAIK, nothing will ever use that mapped memory above the break as scratch space, so it's not like memory below the stack pointer that can be clobbered asynchronously.

brk is just a simple memory-management system built into the kernel. System calls are expensive, so if you care about performance you should keep track of things in user-space and only make a system call when you need a new page. Using sys_brk directly for tiny allocations is terrible for performance, especially in kernels with Meltdown + Spectre mitigation enabled (making system calls much more expensive, like tens of thousands of clock cycles + TLB and branch prediction invalidation, instead of hundreds of clock cycles).