This article mainly targets version 2.4 of the kernel.
About the organization of this article:
My goal is a 'guide': to give an overall picture of the Linux memory management subsystem, together with pointers for further in-depth study (code organization, files and main functions, and some reference documents). I chose this approach because, while reading the code, I found that "reading a piece of code is easy, but grasping the overall design from it is very hard." Moreover, when writing kernel code I find that in many cases there is no need to understand every part of the kernel in detail; understanding the interfaces and the overall way things work is often enough. Of course my ability and time are limited, and much of this was written in a hurry under the pressure of preparing lectures :), so omissions and even mistakes are unavoidable; corrections are welcome.
I. The memory hierarchy and x86 memory management hardware (MMU)
A basic understanding of virtual memory, segmentation and paging is assumed here; the emphasis is on points that are conceptual or easy to misunderstand.
The memory hierarchy
Cache -> main memory -> disk
The root cause of the memory hierarchy: the gap between CPU speed and memory speed.
The reason a hierarchy works: the principle of locality.
Linux's tasks:
Reduce the footprint, improve the cache hit rate, and exploit locality; implement virtual memory to satisfy the needs of processes, manage memory allocation effectively, and make the most of limited resources.
Reference documentation:
"Too Little, Too Small" by Rik Van Riel, Nov. 27, 2000.
And all the architecture manuals :)
The role of the MMU: to assist the operating system with memory management by providing hardware support such as virtual-to-physical address translation.
Addresses on the x86:
Logical address: the address that appears in machine instructions, used to address operands; it has the form segment:offset.
Linear address: what a logical address becomes after being processed by the segmentation unit; a 32-bit unsigned integer that can address 4 GB of memory.
Physical address: obtained from the linear address by the page-table lookup; it is placed on the address bus to select the physical memory cell to be accessed.
Linux: avoids the segmentation machinery as far as possible, for portability. By setting every segment base to 0, logical address == linear address.
Segmentation: a segment is identified by a selector and described by a descriptor. The descriptor provides more than just a base address: protection, limit, type and so on. Descriptors are placed in a table (the GDT or an LDT), and the selector can be seen as an index into that table. A segment register holds a selector; when a segment register is loaded, the descriptor's contents are also loaded into an invisible register for fast access. (Figure, p. 40.) Dedicated registers: GDTR (holds the base address of the global descriptor table), LDTR (the LDT of the current process), TR (points to the task state segment of the current process).
How Linux uses segments:
__KERNEL_CS: kernel code segment. Range 0-4 GB. Readable, executable. DPL = 0.
__KERNEL_DS: kernel data segment. Range 0-4 GB. Readable, writable. DPL = 0.
__USER_CS: user code segment. Range 0-4 GB. Readable, executable. DPL = 3.
__USER_DS: user data segment. Range 0-4 GB. Readable, writable. DPL = 3.
TSS (task state segment): stores the hardware context of a process and is used during process switches. (The x86 hardware has some built-in support for the TSS, hence this special segment and its dedicated register.)
default_ldt: in theory each process may use many segments at once and keep them in its own LDT, but Linux rarely uses these x86 facilities; in most cases all processes share this default LDT, which contains only a null descriptor. There are also a few special segments used by the power-management code and elsewhere.
(In 2.2, the LDT and TSS segments of every process lived in the GDT, and the GDT can hold only 8192 entries, so the total number of processes in the system was limited to around 4090. In 2.4 they are no longer kept in the GDT, so this restriction is gone.)
The __USER_CS and __USER_DS segments are shared by all processes running in user mode. Do not confuse this sharing of segments with sharing of process address spaces: although the same segments are used, the address spaces of processes remain separate because they use different page tables.
The x86 paging mechanism: the x86 hardware supports a two-level page table, and models above the Pentium also support the physical address extension (PAE) mode with a three-level page table. "Hardware support" here means a few special registers (CR0-CR4) and the fact that the CPU recognizes certain flags in page-table entries and reacts to accesses accordingly: reading a page whose Present bit is 0, or writing a page whose Read/Write bit is 0, makes the CPU raise a page fault exception, and the Accessed bit is set automatically when a page is accessed.
Linux uses an architecture-independent three-level page table model (as shown in the figure), with a series of macros hiding the details of the various platforms. The page middle directory (PMD) is cleverly 'folded' into the page global directory (PGD): each PMD is treated as a table with a single entry, stored directly in the PGD entry (where a normal PGD entry would hold the first address of a PMD table). In this way the three-level model is adapted to two-level page-table hardware.
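To make the folded three-level interface concrete, here is a small sketch of a page-table walk in the style of the 2.4 code. The pgd/pmd/pte macros are the real architecture-independent interface; the surrounding function is only an illustration of mine, not actual kernel code.

#include <linux/mm.h>
#include <asm/pgtable.h>

/* Find the pte that maps 'address' in the given address space, or NULL.
 * On i386 without PAE the pmd level is folded away: pmd_offset() simply
 * reinterprets the pgd entry as a one-entry pmd table, so the same code
 * runs unchanged on two-level and three-level hardware. */
static pte_t *lookup_pte(struct mm_struct *mm, unsigned long address)
{
        pgd_t *pgd = pgd_offset(mm, address);   /* index the page global directory */
        pmd_t *pmd;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
                return NULL;
        pmd = pmd_offset(pgd, address);         /* folded on plain i386 */
        if (pmd_none(*pmd) || pmd_bad(*pmd))
                return NULL;
        return pte_offset(pmd, address);        /* pointer to the page table entry */
}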
6. TLB
TLB stands for Translation Look-aside Buffer; it caches page-table lookups to speed up address translation. The key point is that whenever the operating system changes page-table contents, it must flush the corresponding TLB entries so that the CPU does not keep using stale translations.
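As a reminder of what that flushing interface looks like, these are the hooks each architecture provides in 2.4 (on i386 they live in asm/pgalloc.h, as noted in section 8 below). The comments are my own summary, and the exact signatures should be treated as approximate.

void flush_tlb_all(void);                 /* flush everything, e.g. after changing kernel mappings */
void flush_tlb_mm(struct mm_struct *mm);  /* flush all user entries of one address space, e.g. on exec */
void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
                                          /* one page, after its pte has changed */
void flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end);
                                          /* a range of pages, e.g. after munmap of a region */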
7. Cache
The cache is basically transparent to programmers, but different usage patterns can lead to very different performance. Linux optimizes the code in many critical places, and many of these optimizations aim to avoid unnecessary cache pollution: for example, placing code that runs only on error paths in the .fixup section, packing the most frequently used data into a single cache line (as in struct task_struct), reducing the footprint of certain functions, and slab coloring in the slab allocator.
We must also know when cached contents become invalid: when a page is newly mapped or remapped to an address, paged out, has its protection changed, on a process switch, and so on; in short, whenever the contents or the meaning of the address that a cache line corresponds to changes. In many cases there is no need to invalidate the whole cache, only the entries for a single address or address range. In fact, Intel does very well in this respect: cache consistency on the x86 is maintained entirely by hardware.
For more information on x86 processors, refer to the Intel manual (Volume 3: Architecture and Programming Manual), available from ftp://download.intel.com/design/penload/manuals/24143004.pdf
8. Linux related implementation
This part of the code is closely tied to the architecture, so most of it lives under the arch subdirectories, with a large number of macro definitions and inline functions in header files. Taking the i386 platform as an example, the main files are:
8.1 page.h
Definitions of the page size and the page mask: PAGE_SIZE, PAGE_SHIFT and PAGE_MASK. Operations on whole pages, such as clear_page (clear a page), copy_page (copy a page) and PAGE_ALIGN (round up to a page boundary). Also the start of the kernel's direct mapping: the famous PAGE_OFFSET :) and the related macros __pa and __va that convert between kernel virtual and physical addresses. virt_to_page obtains the page descriptor for a kernel virtual address: since all physical memory is described by the mem_map array, this macro simply computes where in that array the physical page behind the given address sits. The file also defines a simple macro to check whether a page descriptor is valid, VALID_PAGE(page): if the page lies farther from the beginning of mem_map than the largest physical page number allows, it is not valid. Somewhat oddly, the definitions of the page-table entry types are also placed here: pgd_t, pmd_t, pte_t and the xxx_val macros.
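To show how these macros fit together, here is a simplified rendering of the i386 definitions. It is written from memory of the 2.4 headers, with casts and details omitted, so take it as a sketch rather than the literal source.

/* Directly mapped kernel virtual addresses start at PAGE_OFFSET
 * (0xC0000000 in the usual i386 configuration), so converting between a
 * kernel virtual address and a physical address is a simple subtraction
 * or addition. */
#define __pa(x)             ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x)             ((void *)((unsigned long)(x) + PAGE_OFFSET))

/* Every physical page has one struct page in the mem_map array, so the
 * descriptor for a directly mapped address is found by indexing mem_map
 * with the physical page frame number. */
#define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))

/* A page descriptor is valid only if it lies inside mem_map. */
#define VALID_PAGE(page)    ((page) - mem_map < max_mapnr)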
8.2 pgtable.h, pgtable-2level.h, pgtable-3level.h
As the names suggest, these files handle page tables and provide a series of macros for operating on them. pgtable-2level.h and pgtable-3level.h cover the needs of the two-level and three-level x86 page tables respectively. First of all, the number of entries at each level of the page table differs. Also, in PAE mode addresses exceed 32 bits, so the page-table entry pte_t is 64 bits wide (pmd_t and pgd_t need not change), and some operations on whole entries differ. These fall into several groups: pte/pmd/pgd_ERROR (what is printed on an error; naturally different for 64-bit and 32-bit entries); set_pte / set_pmd / set_pgd (setting an entry); pte_same (comparing two entries); pte_page (from a pte to the position in mem_map); pte_none (is the entry empty?); __mk_pte (constructing a pte). The remaining macros in pgtable.h are not explained one by one; they are fairly intuitive, and their purpose can usually be guessed from the name. Note that the pte_xxx macros take a pte_t as parameter while the ptep_xxx macros take a pte_t *. The 2.4 kernel has made some effort to clean this code up: many previously vague names have become clear, and the division of labour among some functions is more reasonable. Besides the page-table macros, pgtable.h also holds the TLB operations, which is reasonable since they are mostly used while manipulating page tables. The TLB operations here begin with __, i.e. they are for internal use; the real external interface is in pgalloc.h (perhaps because in the SMP version the TLB flush functions differ considerably from the uniprocessor ones, and some are no longer inline functions or macros).
8.3 pgalloc.h
Contains the allocation and release macros/functions for page tables and page-table entries. Worth noting is the use of caches: pgd_quicklist, pmd_quicklist and pte_quicklist. The kernel uses similar techniques in many places to cut down calls to the memory allocator and speed up frequently repeated allocations, e.g. buffer_heads and buffers in the buffer cache, or the most recently used region among the virtual memory areas. The TLB flush interface mentioned above also lives here.
8.4 segment.h
Defines __KERNEL_CS[DS] and __USER_CS[DS].
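The quicklist trick mentioned under pgalloc.h (8.3) is simple enough to sketch. The following is a simplified, uniprocessor-flavoured reconstruction of the idea, not a verbatim copy of the 2.4 code (which keeps the lists per CPU): freed page-table pages are kept on a singly linked list and reused, avoiding a round trip to the buddy system.

static unsigned long *pte_quicklist;       /* head of the list of cached pte pages */
static unsigned long pgtable_cache_size;   /* how many pages are currently cached */

static inline pte_t *get_pte_fast(void)
{
        unsigned long *ret = pte_quicklist;

        if (ret != NULL) {
                pte_quicklist = (unsigned long *)*ret;   /* pop the head of the list */
                ret[0] = 0;                              /* wipe the link word */
                pgtable_cache_size--;
        }
        return (pte_t *)ret;        /* NULL means: fall back to the buddy system */
}

static inline void free_pte_fast(pte_t *pte)
{
        *(unsigned long *)pte = (unsigned long)pte_quicklist;   /* push onto the list */
        pte_quicklist = (unsigned long *)pte;
        pgtable_cache_size++;
}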
Reference: Chapter 2 of "Understanding the Linux Kernel" gives a brief description of the corresponding Linux implementation.
II. Physical memory management
Memory management changed considerably in 2.4. Physical pages are now managed by a zone-based buddy system: memory is divided into zones according to how it can be used, and the memory of each zone is managed by its own buddy system, with free memory monitored independently per zone.
(In fact there is an even higher layer above the zones: NUMA support. NUMA (Non-Uniform Memory Access) is an architecture in which different memory regions may have different access times for each processor in the system, determined by the distance between the memory and the processor. By contrast, the DRAM (dynamic random access memory) in ordinary machines is uniform: every unit is the same, and access cost is the same for the CPU. In NUMA, memory with the same access speed for a given processor is grouped into a node; the main task when supporting this architecture is to minimize communication between nodes, so that the data each processor uses lives, as far as possible, in the node it can access fastest. The 2.4 kernel describes a node with the data structure pg_data_t; each node has its own mem_map array and divides its memory into several zones, each zone again managed by an independent buddy system. NUMA still has many issues to deal with and is not yet very mature, so there is not much more to say about it here.)
Some important data structures are roughly represented as follows:
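Roughly, and omitting many members, the relationship looks like this in C. The field names follow my reading of include/linux/mmzone.h in 2.4, so details may differ; this is an outline, not the exact definitions.

/* One NUMA node; an ordinary PC has exactly one. */
typedef struct pglist_data {
        zone_t              node_zones[MAX_NR_ZONES];  /* ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM */
        struct page        *node_mem_map;              /* the struct page array of this node */
        unsigned long       node_start_paddr;
        struct pglist_data *node_next;
        /* ... */
} pg_data_t;

/* One zone; each zone runs its own buddy system. */
typedef struct zone_struct {
        unsigned long       free_pages;                        /* free pages in this zone */
        unsigned long       pages_min, pages_low, pages_high;  /* watermarks used by kswapd */
        free_area_t         free_area[MAX_ORDER];              /* buddy free lists, one per order */
        struct pglist_data *zone_pgdat;                        /* back pointer to the node */
        /* ... */
} zone_t;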
1. The design of the zone-based buddy system — management of physical pages
The two big problems in memory allocation are efficiency and fragmentation. A good allocator should satisfy allocation requests of various sizes quickly, without wasting a lot of space on fragments. The buddy system is a commonly used and reasonably good algorithm. (Explanation: TODO.) Zones were introduced to distinguish memory with different types of use so that it can be used more effectively.
In 2.4 there are three zones: DMA, NORMAL and HIGHMEM. The first two were in fact already managed by separate buddy systems in 2.2, but 2.2 had no explicit notion of a zone. On the x86 architecture the DMA zone is usually the physical memory below 16 MB, because the DMA controller can only use that range. HIGHMEM is high memory above a certain threshold (usually around 900 MB); the rest is the NORMAL zone. Because of the way Linux is implemented, high memory cannot be used directly by the kernel; if the CONFIG_HIGHMEM option is selected, the kernel uses a special mechanism to access it. (Explanation: TODO.) HIGHMEM is used only for the page cache and for user processes. With memory separated this way it can be used in a more targeted fashion: for example, the memory usable for DMA will not be consumed by user processes, which would leave drivers unable to get enough DMA memory. In addition, each zone independently tracks how much of its memory is free, and when allocating, the system decides which zone is most suitable, taking into account both the caller's requirements and the current system state. Page allocation in 2.4 may interact with the higher-level VM code (depending on how many free pages there are, the kernel may take pages from the buddy system or directly reclaim pages that are already allocated, etc.), so the code is much more complex than in 2.2; understanding it fully requires familiarity with how the whole VM works.
The main interface of the allocator consists of the following functions (mm.h, page_alloc.c):
struct page *alloc_pages(int gfp_mask, unsigned long order): allocates 2^order contiguous pages from a suitable zone, chosen according to gfp_mask, and returns the descriptor of the first page.
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
unsigned long __get_free_pages(int gfp_mask, unsigned long order): like alloc_pages, but returns the kernel virtual address of the first page instead of its descriptor.
#define __get_free_page(gfp_mask) __get_free_pages(gfp_mask, 0)
get_free_page: allocates a page cleared to zero.
__free_page(s) and free_page(s): release one or more pages; the former take a page descriptor as parameter, the latter take an address.
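A small usage sketch of this interface follows. The functions and flags are the real 2.4 names listed above; the surrounding driver-style code and the choice of order and flags are only an illustration of mine.

#include <linux/mm.h>
#include <linux/errno.h>

/* Allocate four physically contiguous pages (order 2) from the DMA zone
 * and free them again.  GFP_KERNEL may sleep, so this must not be called
 * from interrupt context. */
static int example_alloc(void)
{
        unsigned long buf = __get_free_pages(GFP_KERNEL | GFP_DMA, 2);

        if (!buf)
                return -ENOMEM;

        /* ... use the 4 * PAGE_SIZE bytes at 'buf' ... */

        free_pages(buf, 2);
        return 0;
}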
The buddy algorithm itself is described in detail in many textbooks; see also:
http://home.earthlink.net/~jknapka/linux-mm/zonealloc.html
2. The slab allocator — management of contiguous physical regions
An allocator that only hands out whole pages obviously cannot meet all needs. The kernel uses a large number of data structures whose sizes range from a few bytes to tens or hundreds of kilobytes; rounding all of them up to whole pages would be completely unrealistic. The solution in 2.0 was to provide memory areas of sizes 2, 4, 8, 16, ..., up to 131056 bytes. When a new memory area is needed, the kernel requests pages from the buddy system, divides them into areas of one size and hands one out; when all the areas in a page have been freed, the page is returned to the buddy system. Doing things this way is not efficient, and there are several ways to improve on it, listed below:
Different data types can be allocated in different ways to increase efficiency; for example, data structures that need initialization can be kept intact after being freed, so that they need not be initialized again when they are next allocated.
Kernel functions often use the same type of memory area repeatedly; caching the most recently freed objects can speed up allocation and release.
Requests for memory can be classified by frequency: frequently used types get dedicated caches, while rarely used ones can go to general-purpose caches similar to those of 2.0.
With power-of-two rounding, the probability of cache conflicts is high; carefully arranging the starting addresses of memory areas can reduce cache conflicts.
Caching a certain number of objects also reduces the number of calls into the buddy system, saving time and reducing the cache pollution those calls would cause.
The slab allocator introduced in 2.2 embodies these improvements.
Main data structures and interfaces:
kmem_cache_create / kmem_cache_destroy: create / destroy a special-purpose cache.
kmem_cache_grow / kmem_cache_reap: grow / shrink a cache.
kmem_cache_alloc / kmem_cache_free: allocate / free an object from / to a given cache.
kmalloc / kfree: the allocation and release functions of the general-purpose caches.
Related code: slab.c.
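A minimal usage sketch of the special-purpose cache interface follows; struct foo, the cache name and the flag choices are invented for the example.

#include <linux/slab.h>
#include <linux/errno.h>

struct foo { int a, b; };

static kmem_cache_t *foo_cache;

static int foo_init(void)
{
        /* Create a cache of struct foo objects; SLAB_HWCACHE_ALIGN asks for
         * alignment on hardware cache-line boundaries (cf. the remark on
         * cache conflicts above). */
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        return foo_cache ? 0 : -ENOMEM;
}

static void foo_use(void)
{
        struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

        if (f) {
                /* ... use the object ... */
                kmem_cache_free(foo_cache, f);
        }
}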
Related references:
http://www.lisoleg.net/lisoleg/memory/slab.pdf: the paper by the inventor of the slab allocator, a classic that must be read.
The AKA lecture series in 2000 also included a talk on this by an expert; see the AKA homepage: www.aka.org.cn
3. vmalloc / vfree — management of memory that is physically discontiguous but virtually contiguous
These use the kernel page tables. The file is vmalloc.c, and it is relatively simple.
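A tiny usage sketch (the buffer size is arbitrary; vmalloc may sleep, so it cannot be used in interrupt context):

#include <linux/vmalloc.h>
#include <linux/errno.h>

static void *big_buf;

static int example_init(void)
{
        /* 1 MB that is contiguous in kernel virtual address space but may
         * be scattered over physical pages, mapped via the kernel page table. */
        big_buf = vmalloc(1024 * 1024);
        return big_buf ? 0 : -ENOMEM;
}

static void example_exit(void)
{
        vfree(big_buf);
}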
III. The 2.4 kernel VM (work in progress ...)
1. Process address space management
Creation and destruction of the process address space; mm_struct and vm_area_struct; mmap / mprotect / munmap; page fault handling, demand paging, copy-on-write.
Related files:
include/linux/mm.h: definition of struct page, the page flag definitions and the macros that access them; definition of struct vm_area_struct; function prototypes of the mm subsystem.
include/linux/mman.h: constants and macros related to the mmap / mprotect / munmap operations.
memory.c: page fault handling, including copy-on-write and demand paging; also the page-table operations that work on a whole region:
zeromap_page_range: maps an entire range of addresses to ZERO_PAGE.
remap_page_range: maps the given range of pages into another address range.
zap_page_range: releases the user pages and page tables within a given range.
mlock.c: the mlock / munlock system calls; mlock locks pages into physical memory.
mmap.c: the mmap / munmap / brk system calls.
mprotect.c: the mprotect system call.
These last three files all operate on vm_area_structs and contain a lot of similar xxx_fixup code, whose task is to patch up the affected regions and make sure the chain of vm_area_structs stays correct.
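To make the system-call side of this concrete, here is a small user-space example of mmap / mprotect / munmap (plain POSIX usage, not kernel code); the first write to the anonymous mapping is what triggers the demand-paging path described above.

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        size_t len = 4096;

        /* Ask the kernel to create an anonymous, writable vm_area. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* First touch: a page fault is taken and the kernel maps in a
         * zeroed page on demand. */
        strcpy(p, "hello");

        /* Change the protection; the kernel patches up (splits) the
         * affected vm_area_structs, as the xxx_fixup code above does. */
        mprotect(p, len, PROT_READ);
        printf("%s\n", p);

        munmap(p, len);         /* destroy the mapping */
        return 0;
}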
2. Swapping
Purpose of swapping: to let processes use an address space larger than physical memory, and to let more processes run at the same time.
Its tasks: choose which pages to swap out; decide how to store pages in the swap area; decide when to swap them out.
The kswapd kernel thread: woken up once every 10 seconds.
Its task: when the number of free pages falls below a certain threshold, reclaim pages from process address spaces and from the various caches.
Why not simply wait until a memory allocation fails and then call try_to_free_pages to reclaim pages? The reasons:
Some allocations are made during interrupt or exception handling and cannot block; some occur on critical paths and therefore cannot start I/O. If memory ran short on all of these paths, there would be no way left to free any.
kreclaimd reclaims pages from inactive_clean_list; it is woken up from __alloc_pages.
Related files:
mm/swap.c: the various parameters used by kswapd and the functions that manipulate page age.
mm/swapfile.c: operations on swap partitions / swap files.
mm/page_io.c: reading and writing pages that are swapped out.
mm/swap_state.c: swap cache operations: adding, deleting and looking up entries, etc.
mm/vmscan.c: scans the vm_areas of processes and tries to swap pages out (kswapd).
reclaim_page: reclaims a page from inactive_clean_list and puts it on the free list.
After being woken up, kreclaimd calls reclaim_page repeatedly until, for every zone,
zone->free_pages >= zone->pages_low.
page_launder: called from __alloc_pages and try_to_free_pages, usually because there are too few free pages plus inactive_clean pages. Its job is to move pages from inactive_dirty_list to inactive_clean_list: pages that have already been written back to their file or to the swap area (by bdflush) are moved directly; if free pages are genuinely short, it wakes up bdflush and then itself writes back a certain number of dirty pages and waits for them to finish. For the logic behind these queues (active_list, inactive_dirty_list, inactive_clean_list), refer to the document "RFC: design for new VM", available from the Lisoleg documentation collection.
3. Page cache, buffer cache and swap cache
Page cache: caches the contents of files as they are read and written; the unit is a page, and the cached data need not be contiguous on disk.
Buffer cache: caches the contents of disk blocks as they are read and written; a buffer corresponds to a contiguous area on disk, and its size ranges from 512 bytes (one sector) up to a page.
Swap cache: a subset of the page cache, used for pages shared by several processes that have been swapped out to the swap area.
Relationship between Page Cache and Buffer Cache
They are essentially different: the buffer cache caches the contents of disk blocks, while the page cache caches pages of a file. On a write, the buffer cache is used temporarily to get the data onto disk.
bdflush: writes dirty buffers in the buffer cache back to disk. It normally runs only when there are too many dirty buffers, or when more buffers are needed while memory is short. page_launder may also wake it up.
kupdate: runs periodically and writes back those dirty buffers whose write-back interval has expired.
An improvement in 2.4: the page cache and the buffer cache are now coupled. In 2.2, reads of a disk file went through the page cache while writes went directly through the buffer cache, which created a synchronization problem: update_vm_cache() had to be used to update any page cache entry that might exist. In 2.4 the page cache was improved considerably: files can be written directly through the page cache, and the page cache preferentially uses high memory. Moreover, 2.4 introduces a new object, the file address space, which contains the methods used to read and write a full page of data. These methods take care of updating the inode, handling the page cache and using temporary buffers, so the synchronization problem between the page cache and the buffer cache disappears. What used to be a page cache lookup keyed by (inode, offset) becomes a lookup keyed by (address space, offset); the inode member of struct page has been replaced by a mapping member of type address_space. This improvement also makes shared anonymous memory possible (something that was hard to implement in 2.2 and was discussed at length).
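A simplified sketch of the new object follows. Member and method names are as I recall them from the 2.4 headers (include/linux/fs.h); several fields are omitted and details may differ, so treat this as an outline rather than the exact definition.

/* Per-file page cache object. */
struct address_space {
        struct list_head                 clean_pages;    /* cached pages of this file */
        struct list_head                 dirty_pages;
        struct list_head                 locked_pages;
        unsigned long                    nrpages;
        struct address_space_operations *a_ops;          /* how to read/write whole pages */
        struct inode                    *host;           /* owning inode */
        /* ... */
};

struct address_space_operations {
        int (*writepage)(struct page *);
        int (*readpage)(struct file *, struct page *);
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
        /* ... */
};

/* struct page now carries (mapping, index) instead of (inode, offset), so a
 * page cache lookup is keyed by the address_space plus a page index. */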
4. The new virtual memory system draws heavily on experience from FreeBSD and has been adjusted substantially compared with 2.2.
Document: "RFC: Design for new VM" — a must-read.
Because time was short, I have not yet managed to sort out many details of the new VM. This is only a rough first pass; I will keep improving this article and try to explain the issues clearly. After this semester's exams I also hope to provide some detailed source-code annotations. Lisoleg has collected a number of documents and links on memory management; take a look:
http://www.lisoleg.net/lisoleg/memory/index.html