Why is Linux kernel ZONE_NORMAL limited to 896 MB?

Published 2020-05-19 04:49

Question:

A newbie question. I'm studying the kernel and am confused by the 896 MB size limit of ZONE_NORMAL. I don't understand why the kernel cannot map all 4 GB of physical memory into kernel space directly. Some documents mention a size constraint coming from the page map, but considering that 4 GB of memory has 2^20 pages and each "struct page" is 4 bytes, mem_map would only be 4 MB. That should not be the issue. I hope you can shed some light on this.
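As a rough sanity check of that estimate (the 4-byte figure is an assumption; sizeof(struct page) on real 32-bit kernels is closer to 32 bytes, depending on version and configuration), the numbers work out as follows:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long phys_mem  = 4ULL << 30;            /* 4 GB of physical memory */
        unsigned long long page_size = 4096;                  /* 4 KB pages              */
        unsigned long long npages    = phys_mem / page_size;  /* 2^20 pages              */

        /* sizeof(struct page) is the assumption here: the question uses 4 bytes,
         * while ~32 bytes is more typical on 32-bit kernels. Either way, mem_map
         * stays small compared to the memory it describes. */
        printf("pages:                        %llu\n", npages);
        printf("mem_map at  4 bytes per page: %llu MB\n", npages * 4  >> 20);  /*  4 */
        printf("mem_map at 32 bytes per page: %llu MB\n", npages * 32 >> 20);  /* 32 */
        return 0;
    }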

Thanks

Answer 1:

To be clear, the kernel can still use all available memory; it just cannot keep all of it permanently mapped into its own address space at once.

In Linux, the memory available from all banks is classified into "nodes". These nodes are used to indicate how much memory each bank has. Memory in each node is divided into "zones". The zones currently defined are ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM.

ZONE_DMA is used by some devices for data transfer and is mapped in the lower physical memory range (up to 16 MB).

Memory in ZONE_NORMAL is mapped by the kernel into the upper part of its linear address space. Most kernel operations can only take place in ZONE_NORMAL, so it is the most performance-critical zone. ZONE_NORMAL covers physical memory from 16 MB to 896 MB.
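To make the "direct mapping" concrete, here is a minimal sketch of how a lowmem physical address translates to a kernel virtual address under the classic x86-32 3/1 split. The PAGE_OFFSET value mirrors the historic default; this is an illustration in the spirit of __va()/__pa(), not the kernel's actual code:

    #include <stdio.h>

    /* Illustrative constant for the classic x86-32 3/1 split (not kernel code). */
    #define PAGE_OFFSET 0xC0000000UL        /* kernel's linear map starts at 3 GB */

    /* Lowmem direct mapping is just a fixed offset, like __va()/__pa(). */
    static unsigned long lowmem_phys_to_virt(unsigned long phys)
    {
        return phys + PAGE_OFFSET;          /* only valid for phys below ~896 MB */
    }

    int main(void)
    {
        printf("phys 0x00000000 -> virt 0x%08lx\n", lowmem_phys_to_virt(0x0UL));
        printf("phys 0x37ffffff -> virt 0x%08lx\n", lowmem_phys_to_virt(0x37ffffffUL));
        return 0;
    }

With these constants, physical 0x00000000 maps to virtual 0xC0000000, and physical 0x37FFFFFF (just under 896 MB) maps to 0xF7FFFFFF, right below the 128 MB region discussed in the next paragraph.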

Why?

Part of the kernel's virtual address space is reserved for other mappings, such as the vmalloc, kmap and fixmap areas. On x86 this reservation is 128 MB. Hence, of the 1 GB of virtual address space the kernel gets in a typical configuration (the 3/1 split), 128 MB cannot be used to map physical memory directly. This leaves a maximum of 896 MB for ZONE_NORMAL. So even if you have 1 GB (or more) of physical RAM, only 896 MB of it can be permanently mapped by the kernel; everything above that boundary goes into ZONE_HIGHMEM and has to be mapped on demand.
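Spelled out with the historic x86-32 defaults (the exact constants are configuration dependent, so treat this as a sketch rather than authoritative values):

    #include <stdio.h>

    int main(void)
    {
        /* Historic x86-32 defaults; real values depend on the kernel config.   */
        unsigned long long page_offset = 0xC0000000ULL;                /* 3/1 split   */
        unsigned long long kernel_va   = 0x100000000ULL - page_offset; /* 1 GB        */
        unsigned long long reserved    = 128ULL << 20;                 /* vmalloc etc. */
        unsigned long long direct_map  = kernel_va - reserved;         /* 0x38000000  */

        printf("kernel virtual address space: %llu MB\n", kernel_va  >> 20);  /* 1024 */
        printf("reserved for other mappings:  %llu MB\n", reserved   >> 20);  /*  128 */
        printf("ZONE_NORMAL (direct map) max: %llu MB\n", direct_map >> 20);  /*  896 */
        return 0;
    }

That 0x38000000 boundary (896 MB) is where high memory begins on such a configuration, assuming enough physical RAM is installed.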

To better understand the subject, I suggest you have a look at Chapter 15 of Linux Device Drivers (pdf).



Answer 2:

The kernel limits itself to 896 megabytes for performance reasons.

More address space for the kernel means less address space available to userspace. With the 3/1 split, the maximum amount of address space a user process can allocate is 3 gigabytes -- of course, due to address-space fragmentation, in practice it seems to start failing around 2.5 gigabytes.

Different splits are available: a 2/2 split gives two gigabytes of address space to the kernel and two gigabytes to userspace, and a 1/3 split gives three gigabytes to the kernel and one gigabyte of address space to userspace. (This Firefox is now consuming 1249 megabytes, so it couldn't fit under one of those 1/3-split kernels.)
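On x86-32 the split is selected at build time through the CONFIG_VMSPLIT_* options, which simply pick where the kernel's share of the address space (PAGE_OFFSET) begins. The values below mirror the usual defaults; check your own kernel's Kconfig, since this is only an illustration:

    #include <stdio.h>

    /* The user/kernel split is just the choice of PAGE_OFFSET. These values
     * mirror the usual x86-32 CONFIG_VMSPLIT_* defaults; not authoritative. */
    struct vmsplit {
        const char        *name;
        unsigned long long page_offset;   /* start of the kernel's address space */
    };

    int main(void)
    {
        const struct vmsplit splits[] = {
            { "3/1 (CONFIG_VMSPLIT_3G)", 0xC0000000ULL },  /* 3 GB user, 1 GB kernel */
            { "2/2 (CONFIG_VMSPLIT_2G)", 0x80000000ULL },  /* 2 GB user, 2 GB kernel */
            { "1/3 (CONFIG_VMSPLIT_1G)", 0x40000000ULL },  /* 1 GB user, 3 GB kernel */
        };

        for (unsigned i = 0; i < sizeof(splits) / sizeof(splits[0]); i++) {
            unsigned long long user   = splits[i].page_offset;
            unsigned long long kernel = 0x100000000ULL - user;
            printf("%-28s user %4llu MB, kernel %4llu MB\n",
                   splits[i].name, user >> 20, kernel >> 20);
        }
        return 0;
    }

A larger kernel share raises the direct-mapping limit accordingly, at the cost of shrinking every process's usable address space.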

There are some kernels (perhaps vendor-only?) that support what is known as the 4:4 split -- four gigabytes of address space for the kernel and four gigabytes of address space for userspace. These are extremely useful for 32-bit systems with 32 or 64 gigabytes of memory -- since a large system probably has many disks, a lot of IO in flight, and needs significant buffering for both block devices and network traffic. However, these 4:4 kernels require flushing the TLB on entering and exiting every system call. Those TLB flushes introduce significant slowdowns on "small" systems and are only worth it on "large" systems where the extra memory can cache enough disk and network resources to improve overall performance.

The other splits don't incur this TLB flush because the page-table entries (and the TLB entries cached from them) carry a user/supervisor permission bit: the kernel pages are always mapped, but are accessible only while the CPU is running in supervisor mode. So entering and leaving the kernel is fast as long as control returns to the same process that entered it; on a context switch to a different process, the TLB of course has to be flushed anyway.



Tags: linux kernel