Thursday, December 13, 2007

Case Study: Linux Memory Management

PageAllocation
The Linux page allocator, in mm/page_alloc.c, is the main memory allocation mechanism in the Linux kernel. It has to deal with allocations from many parts of the Linux kernel, under many different circumstances. Consequently, the Linux page allocator is fairly complex and is easiest to understand in the context of its environment. This wiki article therefore begins with an explanation of exactly what the page allocator needs to do, before going into the details of how things are done.

I am writing this article bit by bit whenever I feel like it. If you feel like writing something, go right ahead - RikvanRiel


Contents

memory allocators
gfp mask
page allocation order
alloc_pages
__alloc_pages
buddy allocator
per-cpu page queues
hot/cold pages
NUMA tradeoffs

memory allocators
Various different parts of the Linux kernel allocate memory, under different circumstances. Most memory allocations happen on behalf of userspace programs; these allocations can use any memory in the system (ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA) and, if free memory is low, can wait for memory to be freed by the pageout code. Page cache and page table allocations can also use any memory in the system and can wait for memory to be freed.

Most kernel level allocations are different and can only use memory that is directly mapped into kernel address space (ZONE_NORMAL and ZONE_DMA). Most, though not all, kernel level allocations can wait for memory to be freed, if free memory is low.

Allocations from interrupt context are different again. They cannot wait for memory to be freed, so if free memory on the system is low at the time of the allocation, the allocation simply fails.
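As a rough illustration, these constraints can be modeled in ordinary userspace C. The names below (MGFP_WAIT, MGFP_USER, zone_allowed and so on) are invented for this sketch and are not the kernel's real gfp flags, although GFP_HIGHUSER, GFP_KERNEL and GFP_ATOMIC play analogous roles in the real allocator:

```c
/* Toy model of the allocation constraints described above.
 * All MGFP_* names are invented for illustration; they are NOT
 * the real kernel gfp flags, though they mirror the same ideas. */

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define MGFP_WAIT    0x1  /* allocation may sleep while pageout frees memory */
#define MGFP_HIGHMEM 0x2  /* allocation may be placed in highmem */

/* Userspace, page cache and page table allocations: any zone, may wait. */
#define MGFP_USER   (MGFP_WAIT | MGFP_HIGHMEM)
/* Typical kernel allocation: directly mapped zones only, may wait. */
#define MGFP_KERNEL (MGFP_WAIT)
/* Interrupt context: directly mapped zones only, must not wait. */
#define MGFP_ATOMIC 0

/* May an allocation with this mask use the given zone? */
int zone_allowed(int gfp, enum zone z)
{
    if (z == ZONE_HIGHMEM)
        return (gfp & MGFP_HIGHMEM) != 0;
    return 1; /* ZONE_NORMAL and ZONE_DMA are always directly mapped */
}

/* May an allocation with this mask wait for the pageout code? */
int may_wait(int gfp)
{
    return (gfp & MGFP_WAIT) != 0;
}
```

In this model an interrupt-context allocation (MGFP_ATOMIC) is simply one that cannot use highmem and cannot wait, which is why it fails outright when free memory is low.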

PageReplacementDesign
This page describes a new page replacement design by Rik van Riel, Lee Schermerhorn and others.

If you spot any conceptual problems or oversights in this design, or have questions on the details, please email or IRC RikvanRiel. Once nobody can find holes in the concepts, they will probably work.


The problem
In brief, the current page replacement mechanism in Linux (up to and including 2.6.24) has two problems:

Sometimes the kernel evicts the wrong pages, resulting in bad performance.
The kernel scans over pages that should not be evicted. This causes increased CPU use on large memory systems (several GB of RAM), and catastrophic CPU use and lock contention on huge systems (>128GB of RAM).

More details on the problem space can be found on ProblemWorkloads and page replacement requirements.


High level overview
Evict pages with minimal scanning. Systems with 1TB of RAM exist; the VM cannot afford to scan pages that should not be evicted.
Most page churn comes from reading large files. Evict those pages first.
Evict other pages when we do not have enough memory for readahead, or when we keep reading in the same file data over and over again.
Filesystem IO is much more efficient than swap IO. Evict file cache pages in preference to swapping, when we can.
Use reference and refault data to balance filesystem cache and anonymous (process) memory.
Optimize the normal case, but also the worst case. The VM must be robust at all times, so the worst case has to work well too.
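The preference for evicting file cache over swapping can be sketched as two separate lists with reclaim trying the file-backed list first. Everything here (toy_lru, reclaim_one) is invented for illustration; the real design keeps per-zone LRU lists with far more state, and falls back to anonymous pages based on reference and refault data rather than only when the file list is empty:

```c
/* Toy sketch of the split-list idea: file-backed and anonymous pages
 * live on separate lists, and reclaim prefers the file list because
 * filesystem IO is much cheaper than swap IO.  Names are invented;
 * this is not kernel code. */

enum page_type { PAGE_FILE, PAGE_ANON };

struct toy_lru {
    int nr_file;  /* pages on the file-backed list */
    int nr_anon;  /* pages on the anonymous list */
};

/* Reclaim one page: take it from the file list when possible,
 * fall back to the anonymous list (i.e. swap) only when we must. */
enum page_type reclaim_one(struct toy_lru *lru)
{
    if (lru->nr_file > 0) {
        lru->nr_file--;
        return PAGE_FILE;
    }
    lru->nr_anon--;
    return PAGE_ANON;
}
```

Keeping the two kinds of pages on separate lists is also what makes "evict with minimal scanning" possible: reclaim never has to walk over anonymous pages just to find the file pages it actually wants to evict.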