A Tour of the Linux Kernel Source
Alessandro Rubini, rubini@pop.systemy.it
Chinese translation by Zhao Wei, gohigh@sh163net (www.plinux.org)
This chapter tries to explain the Linux source code in an orderly manner, helping the reader achieve a good understanding of how the source code is laid out and how the most relevant unix features are implemented. The goal is to help experienced C programmers who are not familiar with Linux get a grip on the overall Linux design. That is why the chosen entry point for this kernel tour is the kernel's own entry point: system boot (startup).
This material requires a good understanding of C, as well as some familiarity with unix concepts and the PC architecture. However, hardly any real kernel C code appears in this chapter; it points directly at the actual code instead, occasionally illustrating an idea with a simplified, self-contained sketch. The finer points of kernel design belong to other chapters of this guide, while this chapter tends to remain an informal overview.
Any pathname for files referred to in this chapter is relative to the main source-tree directory, usually /usr/src/linux.
Most of the information given here is taken from the source code of Linux release 1.0. Nonetheless, references to later versions are provided at times.
Any paragraph in this tour flagged with a "new" icon highlights changes the kernel has undergone after the 1.0 release. If no such paragraph is present, then no changes occurred up to release 1.0.9-1.1.76.
From time to time, a paragraph like this one will occur in the tour. It is a pointer to the right sources for more information on the subject just covered. Needless to say, the source code itself is the primary source.
Booting the system
When the PC is powered up, the 80x86 processor finds itself in real mode and executes the code at address 0xFFFF0, which corresponds to a ROM-BIOS address. The PC BIOS performs some tests of the system and initializes the interrupt vector at physical address 0. After that it reads the first sector of a bootable device into memory at address 0x7C00 and jumps to it. The boot device is usually the floppy or the hard drive. The preceding description is quite simplified, but it is all that is needed to understand the kernel's initial workings.
The very first part of the Linux kernel is written in 8086 assembly language (boot/bootsect.S). When run, it moves itself to absolute address 0x90000, reads the next 2 kBytes of code from the boot device (boot/setup.S) to address 0x90200, and the rest of the kernel to address 0x10000. The message "Loading..." is displayed during system load. Control is then passed to the code in boot/setup.S, another real-mode assembly program.
The setup portion identifies some features of the host system and the type of VGA board. If requested to, it asks the user to choose the video mode for the console. The whole system is then moved from address 0x10000 to address 0x1000, the processor enters protected mode, and control jumps to the rest of the system (at 0x1000).
The next step is kernel decompression. The code at 0x1000 comes from zBoot/head.S, which initializes the registers and invokes decompress_kernel(), which in turn is made up of zBoot/inflate.c, zBoot/unzip.c and zBoot/misc.c. The decompressed data goes to address 0x100000 (1 meg), and this is the main reason why Linux cannot run with less than 2 megs of RAM. [Decompressing the kernel within one megabyte of memory has since been accomplished; see Memory Savers. Ed.]
Encapsulating the kernel in a gzip file is accomplished by the Makefile and utilities in the zBoot directory. They are interesting files to look at.
Kernel release 1.1.75 moved the boot and zBoot directories down to arch/i386/boot. This change is meant to allow true kernel builds for different architectures, but I will still stick to i386-specific information.
The decompressed code is executed at address 0x1010000 [I may have lost track of the exact physical address here, as I am not very familiar with the corresponding code], where all the 32-bit setup is accomplished: the IDT, GDT and LDT are loaded, the processor and coprocessor are identified, paging is set up, and eventually the routine start_kernel is invoked. The source for the above operations is in boot/head.S, probably the trickiest code in the whole kernel. Note that if any error occurs during any of the preceding steps, the computer will lock up: the operating system cannot deal with errors when it is not yet fully operative.
start_kernel() resides in init/main.c and never returns. Anything from this point on is coded in C, apart from interrupt management and system-call entry and exit (well, most of the macros embed assembly code too).
Spinning the wheel
After dealing with all these tricky questions, start_kernel() initializes all the parts of the kernel, specifically (a simplified sketch of the call order appears after this list):
· Sets the memory bounds and calls paging_init();
· Initializes the traps, IRQ channels and scheduling;
· Parses the command line;
· Allocates a data buffer and other minor parts, if needed;
· Calibrates the delay loop (computes the "BogoMIPS" number);
· Checks whether interrupt 16 works with the coprocessor.
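Here is a minimal userspace sketch of that call order. The function names mirror init/main.c in the 1.0 kernel, but the bodies are stand-ins that only print what the real code does, and the command line is invented:

    #include <stdio.h>

    static void paging_init(void)     { puts("set memory bounds, enable paging"); }
    static void trap_init(void)       { puts("install trap handlers"); }
    static void sched_init(void)      { puts("set up IRQ channels and scheduling"); }
    static void parse_options(char *line) { printf("parse command line: %s\n", line); }
    static void buffer_init(void)     { puts("allocate the buffer cache"); }
    static void calibrate_delay(void) { puts("compute the BogoMIPS value"); }

    int main(void)                    /* stands in for start_kernel() */
    {
        paging_init();
        trap_init();
        sched_init();
        parse_options("root=/dev/hda1");
        buffer_init();
        calibrate_delay();
        /* the real kernel now checks interrupt 16 and the coprocessor,
         * spawns the init process and falls into the idle loop */
        return 0;
    }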
Finally, in order to spawn the initial process, the kernel is ready to move_to_user_mode(), whose code is also in the same source file. Process number 0, the so-called idle task, keeps running in an infinite idle loop.
The initial process then tries to execute /etc/init, /bin/init or /sbin/init.
If none of them succeeds, code is provided to execute "/bin/sh /etc/rc" and fork a root command interpreter (a root shell) on the first terminal. This code dates back to Linux 0.01, when the operating system consisted of the kernel alone and no login process was available. A userspace sketch of this fallback follows.
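This sketch models the fallback logic in user space, under the assumption that argument and environment handling may be simplified away (the real code is in init/main.c):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char *argv[] = { "init", NULL };
        char *envp[] = { NULL };
        char *sh_argv[] = { "/bin/sh", "/etc/rc", NULL };

        /* execve() returns only on failure, so these are tried in order */
        execve("/etc/init", argv, envp);
        execve("/bin/init", argv, envp);
        execve("/sbin/init", argv, envp);

        /* last resort, dating back to Linux 0.01: a shell running /etc/rc */
        execve("/bin/sh", sh_argv, envp);
        perror("no init found");
        return 1;
    }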
After exec()ing the init program from one of the standard places (let us assume we have one), the kernel has no direct control over the program flow. From now on, its role is to provide processes with system calls and to service asynchronous events (such as hardware interrupts). The multitasking environment has been set up, and it is now init that manages multiuser access by fork()ing the system daemons and the login processes.
Since the kernel is in charge of providing services, the tour will proceed by looking at those services (the "system calls") and at the underlying data structures and code organization.
How the kernel sees a process
From the kernel's point of view, a process is merely an entry in the process table; nothing more.
The process table, together with the memory-management tables and the buffer cache, is among the most important data structures in the system. The individual item in the process table is the task_struct structure, quite a huge one, defined in include/linux/sched.h. Both low-level and high-level information is kept in a task_struct, ranging from copies of some hardware registers to the inode of the process's working directory.
The process table is both an array and a doubly linked list, as well as a tree. The physical implementation is a static array of pointers, whose length is the constant NR_TASKS defined in include/linux/tasks.h; each structure resides in a reserved memory page. The list structure is achieved through the pointers next_task and prev_task, while the tree structure is quite complex and will not be discussed here. You may wish to change the default value of NR_TASKS, but be sure that all the affected source files get recompiled. After booting is over, the kernel is always working on behalf of one of the processes, and the global variable current, a pointer to a task_struct item, is used to record the running one. current is only changed by the scheduler in kernel/sched.c. When all processes must be looked at, however, the macro for_each_task is used; it is considerably faster than a sequential scan of the array when the system is lightly loaded. A minimal model of these structures follows.
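This is a minimal userspace model; the names follow include/linux/sched.h and include/linux/tasks.h, but only the linkage fields are kept, and the ring here contains the idle task alone:

    #include <stdio.h>

    #define NR_TASKS 128               /* default length of the static array */

    struct task_struct {
        int pid;                       /* plus dozens of fields in the real one */
        struct task_struct *next_task; /* doubly linked list through all tasks */
        struct task_struct *prev_task;
    };

    /* task 0, the idle task, closes the ring on itself */
    static struct task_struct init_task = { 0, &init_task, &init_task };
    static struct task_struct *task[NR_TASKS] = { &init_task };
    static struct task_struct *current = &init_task;

    /* visit every task except the idle one, like the real macro */
    #define for_each_task(p) \
        for (p = &init_task; (p = p->next_task) != &init_task; )

    int main(void)
    {
        struct task_struct *p;
        for_each_task(p)
            printf("task: pid %d\n", p->pid);
        printf("slot 0 holds pid %d; current is pid %d\n",
               task[0]->pid, current->pid);
        return 0;
    }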
A process always runs in either "user mode" or "kernel mode". The main body of a user program is executed in user mode, and system calls are executed in kernel mode. The stacks used in the two execution modes are different: a conventional stack segment is used in user mode, while a fixed-size stack (one page, owned by the process) is used in kernel mode. The kernel stack page is never swapped out, because it must be available whenever a system call is entered.
Within the kernel, system calls exist as C-language functions, their "official" names being prefixed by "sys_". A system call named, say, burnout invokes the kernel function sys_burnout(), as in the sketch below.
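For instance, a hypothetical call named burnout would appear in the kernel source roughly as follows; asmlinkage is defined away here so the sketch compiles in user space, while in the kernel it affects the calling convention:

    #include <errno.h>
    #define asmlinkage  /* in the kernel: arguments are fetched from the stack */

    /* 'burnout' is a made-up name; only the sys_ prefix convention matters */
    asmlinkage int sys_burnout(int flags)
    {
        (void)flags;
        return -ENOSYS;                /* stub body */
    }

    int main(void)
    {
        return sys_burnout(0) == -ENOSYS ? 0 : 1;
    }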
The system call mechanism is described in chapter 3 of this guide. Looking at for_each_task and SET_LINKS in include/linux/sched.h helps in understanding the list and tree structures in the process table.
Creating and destroying processes
A unix system creates a process through the fork() system call, and process termination is performed either by exit() or by the receipt of a signal. The Linux implementations reside in kernel/fork.c and kernel/exit.c. Forking is easy, so fork.c is short and readily understandable. Its main task is filling in the data structure for the new process. Apart from filling in each field, the relevant steps are (see the sketch after this list):
· Getting a free memory page to hold the task_struct;
· Finding an empty process slot (find_empty_process());
· Getting another free memory page for the kernel_stack_page;
· Copying the father's LDT to the child;
· Duplicating the mmap information of the father.
sys_fork() also manages file descriptors and inodes.
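A condensed userspace model of those steps follows. It keeps the shape of kernel/fork.c, but do_fork is a made-up name here, page allocation becomes malloc(), and the LDT and mmap copies are reduced to a comment:

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_TASKS 128

    struct task_struct { int pid; /* ... many more fields ... */ };
    static struct task_struct *task[NR_TASKS];

    static int find_empty_process(void)   /* look for a free slot in task[] */
    {
        int i;
        for (i = 1; i < NR_TASKS; i++)
            if (!task[i])
                return i;
        return -1;                         /* -EAGAIN in the real kernel */
    }

    static int do_fork(const struct task_struct *father)
    {
        struct task_struct *p;
        int nr = find_empty_process();
        if (nr < 0)
            return -1;
        p = malloc(sizeof(*p));            /* "get a free page" stand-in */
        if (!p)
            return -1;
        *p = *father;                      /* the child starts as a copy */
        p->pid = nr;                       /* simplification: slot number as pid */
        /* the real code also allocates the kernel_stack_page, copies the
         * father's LDT and duplicates his mmap information right here */
        task[nr] = p;
        return p->pid;
    }

    int main(void)
    {
        struct task_struct init_task = { 0 };
        task[0] = &init_task;
        printf("child pid: %d\n", do_fork(&init_task));
        return 0;
    }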
The 1.0 kernel offers some limited support for threads, and the fork() system call shows some hints of it. Kernel threads are a work in progress outside the mainstream kernel.
Exiting from a process is trickier, because the parent process must be notified about any child that exits. Moreover, a process can be made to exit by being kill()ed by another process (these are unix features), so in addition to sys_exit(), exit.c is also the home of sys_kill() and the various flavours of sys_wait().
The code in exit.c is not described here, because it is not that interesting: it deals with a great many details required to leave the system in a consistent state. Beyond that, the POSIX standard is quite demanding about signals, and those must be dealt with there as well.
Executing programs
After a fork, two copies of the same program are running; usually one of them exec()s another program. The exec() system call must locate the binary image of the executable file, load it and run it. The word "load" does not necessarily mean "copy the binary image into memory", because Linux supports demand loading.

The Linux implementation of exec() supports different binary formats. This is accomplished through the linux_binfmt structure, which embeds two pointers to functions: one to load the executable and one to load the library, each binary format providing both functions. The loading of shared libraries is implemented in the same source file as exec() itself, but let us stick to exec().

Unix systems provide the programmer with six flavours of the exec() function. All but one of them can be implemented as library functions, so the Linux kernel implements sys_execve() alone. It performs quite a simple task: it loads the head of the executable file and tries to execute it. If the first two bytes are "#!", the first line of the file is parsed and an interpreter is invoked to execute it; otherwise, the registered binary formats are tried in turn.

The native Linux format is supported directly within fs/exec.c, and the relevant functions are load_aout_binary and load_aout_library. As for the binaries, the function loads an "a.out" executable and ends up either mmap()ing the disk file or calling read_exec(). The former way uses the Linux demand-loading mechanism to fault in program pages as they are accessed, while the latter is used when memory mapping is not supported by the host filesystem (for example the "msdos" filesystem).

A revised msdos filesystem, which supports mmap(), is embedded in late 1.1 kernels. Moreover, the linux_binfmt structure is now a linked list rather than an array, to allow new binary formats to be loaded as kernel modules. Finally, the structure itself has been extended to reach format-related core-dump routines. A sketch of the dispatching structure follows.
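Here is a sketch of that dispatching structure and of the sequential try, modeled in user space. The field names are in the spirit of the real linux_binfmt, but the signatures are simplified, the a.out loader is a stub, and search_binary_handler is borrowed from later kernels as a descriptive name:

    #include <stdio.h>

    struct linux_binprm { char buf[128]; };   /* head of the executable, etc. */

    /* the two dispatch functions each binary format provides */
    struct linux_binfmt {
        struct linux_binfmt *next;            /* linked list in 1.1.x kernels */
        int (*load_binary)(struct linux_binprm *bprm);
        int (*load_shlib)(int fd);
    };

    static int load_aout_binary(struct linux_binprm *bprm)
    {
        (void)bprm;
        return 0;                             /* pretend a.out always matches */
    }

    static struct linux_binfmt aout_format = { NULL, load_aout_binary, NULL };
    static struct linux_binfmt *formats = &aout_format;

    /* try each registered format in turn, as sys_execve() ends up doing */
    static int search_binary_handler(struct linux_binprm *bprm)
    {
        struct linux_binfmt *fmt;
        for (fmt = formats; fmt; fmt = fmt->next)
            if (fmt->load_binary(bprm) == 0)
                return 0;
        return -1;                            /* ENOEXEC in the real kernel */
    }

    int main(void)
    {
        struct linux_binprm bprm = { { 0 } };
        printf("exec result: %d\n", search_binary_handler(&bprm));
        return 0;
    }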
Accessing filesystems
As is well known, the filesystem is the most basic resource in a unix system. It is so basic and ubiquitous that it needs a handier name: I will stick to the standard practice of calling it simply "fs".
I will assume that the reader already knows the basic unix fs ideas: access permissions, inodes, the superblock, mounting and umounting filesystems. Those concepts are well explained in the standard unix literature, so I will not duplicate that effort and will focus on Linux-specific issues only.
While early unices supported a single fs type, whose structure was spread throughout the whole kernel, today's practice is to use a standardized interface between the kernel and the fs, to ease the interchange of data across architectures. Linux itself provides a standardized layer to pass information between the kernel and each fs module. This interface layer is called the VFS, for "virtual filesystem".
The filesystem code is therefore split into two layers: the upper layer is concerned with the management of kernel tables and data structures, while the lower layer is made up of the set of fs-dependent functions, invoked through the VFS data structures.
All the fs-independent material resides in the fs/*.c files. It addresses the following issues:
· Managing the buffer cache (buffer.c);
· Responding to the fcntl() and ioctl() system calls (fcntl.c and ioctl.c);
· Mapping pipes and FIFOs onto inodes and buffers (fifo.c, pipe.c);
· Managing the file and inode tables (file_table.c, inode.c);
· Locking and unlocking files and records (locks.c);
· Mapping names to inodes (namei.c, open.c);
· Implementing the tricky select() function (select.c);
· Providing information (stat.c);
· Mounting and umounting filesystems (super.c);
· exec()ing executables and dumping cores (exec.c);
· Loading the various binary formats (bin_fmt*.c, as outlined above).
The VFS interface consists of a set of relatively high-level operations which are invoked from the fs-independent code and are actually performed by each filesystem type. The most relevant data structures are inode_operations and file_operations, though they are not alone: other structures exist as well. All of them are defined in include/linux/fs.h.
The kernel entry point to the actual file system is the data structure file_system_type. An array of file_system_types is embodied within fs/filesystems.c and is referenced whenever a mount is issued. The function read_super for the relevant fs type is then in charge of filling in a super_block item, which in turn embeds a super_operations structure and a type_sb_info structure. The former provides pointers to the generic fs operations for the current fs type, while the latter embeds specific information for that fs type. A minimal model of this dispatch appears below.
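This userspace model follows fs/filesystems.c and include/linux/fs.h in spirit but is heavily abridged; the super block itself is left opaque:

    #include <stdio.h>
    #include <string.h>

    struct super_block;                    /* filled in by fs-specific code */

    struct file_system_type {
        struct super_block *(*read_super)(struct super_block *sb,
                                          void *data, int silent);
        const char *name;
        int requires_dev;
    };

    static struct super_block *minix_read_super(struct super_block *sb,
                                                void *data, int silent)
    {
        (void)data; (void)silent;
        puts("minix: filling in the super_block");
        return sb;
    }

    static struct file_system_type file_systems[] = {
        { minix_read_super, "minix", 1 },
        /* { ext2_read_super, "ext2", 1 }, and so on */
    };

    /* what mount does, conceptually: find the type by name, call read_super */
    static struct file_system_type *get_fs_type(const char *name)
    {
        size_t i;
        for (i = 0; i < sizeof(file_systems) / sizeof(file_systems[0]); i++)
            if (!strcmp(file_systems[i].name, name))
                return &file_systems[i];
        return NULL;
    }

    int main(void)
    {
        struct file_system_type *type = get_fs_type("minix");
        if (type)
            type->read_super(NULL, NULL, 0);
        return 0;
    }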
The array of filesystem types has been turned into a linked list, to allow new fs types to be loaded as kernel modules. The functions (un-)register_filesystem are coded in fs/super.c.
Quick anatomy of a filesystem type
The task of a filesystem type is to perform the low-level work that maps the relatively high-level VFS operations onto the physical media (disks, network or whatever). The VFS interface is flexible enough to support both conventional unix filesystems and exotic types such as msdos and umsdos.
Each fs type is made up of the following items, in addition to its own directory of source files:
· An entry in the file_systems[] array (fs/filesystems.c);
· The superblock include file (include/linux/type_fs_sb.h);
· The inode include file (include/linux/type_fs_i.h);
· The generic include file (include/linux/type_fs.h);
· Two lines in include/linux/fs.h, as well as entries in the structures super_block and inode.
The fs type's own directory contains all the real code, responsible for inode and data management.
The chapter about procfs in this guide uncovers all the details of the low-level code and the VFS interface for that fs type. After reading that chapter, the source code in fs/procfs should be quite easy to understand. We will now look at the internal workings of the VFS mechanism, using the minix filesystem code as a working example. I chose the minix type because it is small but complete; moreover, every other fs type in Linux derives from it. The ext2 type, the de-facto standard in recent Linux installations, is much more complex than minix, and its exploration is left as an exercise for the smart reader.
When a minix fs is mounted, minix_read_super fills the super_block data structure with data read from the mounted device. The s_op field of that structure then holds a pointer to minix_sops, which is used by the generic filesystem code to dispatch superblock operations.
Chaining the newly mounted fs into the global system tree relies on the following data items (assuming sb is the super_block data structure, while dir_i points to the inode of the mount point; see the sketch after this list):
· sb->s_mounted points to the root-directory inode of the mounted filesystem (MINIX_ROOT_INO);
· dir_i->i_mount holds sb->s_mounted;
· sb->s_covered holds dir_i.
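In code form, the three links look roughly like this model; link_mount is a made-up helper name, and only the fields involved in the chaining are kept:

    #include <stdio.h>

    struct inode;
    struct super_block {
        struct inode *s_mounted;   /* root inode of the mounted fs */
        struct inode *s_covered;   /* inode of the directory we cover */
    };
    struct inode {
        struct inode *i_mount;     /* root of the fs mounted here, if any */
    };

    /* hang a freshly read super block off the mount-point inode */
    static void link_mount(struct super_block *sb, struct inode *root,
                           struct inode *dir_i)
    {
        sb->s_mounted = root;      /* the MINIX_ROOT_INO inode, for minix */
        dir_i->i_mount = root;     /* lookups crossing dir_i continue at root */
        sb->s_covered = dir_i;     /* umount must restore this later */
    }

    int main(void)
    {
        struct super_block sb = { 0, 0 };
        struct inode root = { 0 }, mountpoint = { 0 };
        link_mount(&sb, &root, &mountpoint);
        printf("covered == mountpoint? %d\n", sb.s_covered == &mountpoint);
        return 0;
    }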
Umounting will eventually be performed by do_umount, which in turn invokes minix_put_super.
Whenever a file is accessed, minix_read_inode comes into play; it fills in the system-wide inode data structure with fields taken from minix_inode. The inode->i_op field is filled in according to inode->i_mode, and it is responsible for any further operation on the file. The source for the minix functions just described can be found in fs/minix/inode.c. The mode-based dispatch reads roughly like the sketch below.
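A compilable reduction of that dispatch (set_inode_ops is an invented helper; in the real code the assignments sit inline in minix_read_inode()). The operation tables carry just a label here, while the real ones are full of function pointers:

    #include <stdio.h>
    #include <sys/stat.h>          /* S_ISREG, S_ISDIR, S_ISLNK, mode bits */

    struct inode_operations { const char *label; /* + the function pointers */ };

    static struct inode_operations minix_file_inode_operations    = { "file ops" };
    static struct inode_operations minix_dir_inode_operations     = { "dir ops" };
    static struct inode_operations minix_symlink_inode_operations = { "symlink ops" };

    struct inode { unsigned short i_mode; struct inode_operations *i_op; };

    /* choose the operation set from the file's mode, as the real code does */
    static void set_inode_ops(struct inode *inode)
    {
        if (S_ISREG(inode->i_mode))
            inode->i_op = &minix_file_inode_operations;
        else if (S_ISDIR(inode->i_mode))
            inode->i_op = &minix_dir_inode_operations;
        else if (S_ISLNK(inode->i_mode))
            inode->i_op = &minix_symlink_inode_operations;
    }

    int main(void)
    {
        struct inode ino = { S_IFDIR | 0755, NULL };
        set_inode_ops(&ino);
        printf("%s\n", ino.i_op->label);
        return 0;
    }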
The inode_operations data structure is used to dispatch inode operations to the kernel functions specific to the fs type; the first entry in the structure is a pointer to a file_operations item, which is the data-management equivalent of i_op. The minix fs type allows three instances of inode-operation sets (for directories, for files and for symbolic links) and two instances of file-operation sets (symbolic links do not need one).
Directory operations (minix_readdir alone) reside in fs/minix/dir.c; file operations (read and write) appear in fs/minix/file.c; and symlink operations (reading and following the link) are in fs/minix/symlink.c.
The rest of the minix source directory implements the following tasks:
· bitmap.c manages the allocation and freeing of inodes and blocks (the ext2 fs, by contrast, uses two separate source files for this);
· fsync.c is responsible for the fsync() system call; it manages direct, indirect and double-indirect blocks (I assume you know these terms; this is common unix knowledge);
· namei.c embeds the name-related inode operations, such as the creation and destruction of nodes, renaming, and linking;
· truncate.c performs the truncation of files.
Console Driver
Being the main I/O device on most Linux systems, the console driver deserves some attention. The source code for the console, as well as for the other character drivers, can be found in drivers/char, and we will use this very directory as the reference point when naming files. Console initialization is performed by the tty_init() function in tty_io.c. This function is only concerned with getting the major device numbers and calling the init function for each device set. con_init() is the one related to the console, and it resides in console.c.
During 1.1 development, the initialization of the console changed quite a lot. console_init() has been detached from tty_init() and is called directly by ../../main.c. The virtual consoles are now dynamically allocated, and quite a lot of the code has changed. So I will skip the details of initialization, allocation and so on.
How file operations are dispatched to the console
This section is quite low-level, and you can skip it with confidence.
Needless to say, unix devices are accessed through the filesystem. This section details all the steps from the device file to the actual console functions. The information below is extracted from the 1.1.73 kernel source and may differ slightly from the 1.0 code.
When a device inode is opened, the function chrdev_open() in ../../fs/devices.c gets executed (or blkdev_open(), but I will concentrate on character devices). This function is reached by means of the data structure def_chr_fops, which in turn is referenced by chrdev_inode_operations, used by all the filesystem types (see the section about filesystems above).
chrdev_open takes care of specifying the device operations: it substitutes the device-specific file_operations table into the current file structure and calls the device-specific open() function. The device-specific tables are kept in the array chrdevs[], indexed by the major device number, in the same ../../fs/devices.c. A minimal model of this dispatch follows.
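A sketch of that double dispatch, with the device registration done inline for brevity; major number 4 is the ttys, as in the real kernel:

    #include <stdio.h>

    #define MAX_CHRDEV 32

    struct file_operations { int (*open)(void); /* + read, write, ioctl... */ };
    struct file { struct file_operations *f_op; };

    /* one slot per major number, filled by register_chrdev()-style code */
    static struct file_operations *chrdevs[MAX_CHRDEV];

    /* the heart of chrdev_open(): swap in the device table, call its open */
    static int chrdev_open(int major, struct file *filp)
    {
        if (major < 0 || major >= MAX_CHRDEV || !chrdevs[major])
            return -1;                 /* -ENODEV in the real kernel */
        filp->f_op = chrdevs[major];
        return filp->f_op->open ? filp->f_op->open() : 0;
    }

    static int tty_open(void) { puts("tty_open()"); return 0; }
    static struct file_operations tty_fops = { tty_open };

    int main(void)
    {
        struct file filp = { 0 };
        chrdevs[4] = &tty_fops;        /* major 4: the tty devices */
        return chrdev_open(4, &filp);
    }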
If the device is a tty one (aren't we aiming at the console?), we come to the tty drivers, whose functions are in tty_io.c, indexed by tty_fops. Thus, tty_open() calls init_dev(), which allocates any data structures needed by the device, based on the minor device number.
The minor device number is also used to retrieve the actual driver for the device, which has been registered through tty_register_driver(). The driver, in turn, is yet another data structure used to dispatch operations, just like file_ops; it is concerned with writing to and controlling the device. The last data structure used in managing a tty is the line discipline, described later. The line discipline for the console (and for any other tty device) is set by initialize_tty_struct(), which is invoked by init_dev.
Everything we touched in this section is device-independent. The only console-specific detail is that console.c registers its own driver during con_init(). The line discipline, on the contrary, is independent of the device.
The tty_driver data structure is fully described in <linux/tty_driver.h>. The above information was obtained from the 1.1.73 source code; it may well be different in your kernel ("information is subject to change without notice").
Writing to the console
When a console device is written to, the function con_write is invoked. This function manages all the control characters and escape sequences used to provide applications with complete screen management. The escape sequences implemented are those of the vt102 terminal; this means that your environment should say TERM=vt102 when you are telnetting to a non-Linux host. The best choice for local activities, however, is TERM=console, because the Linux console offers a superset of vt102 functionality. con_write() is thus mostly made up of a big switch statement, used to handle a finite-state automaton that interprets escape sequences one character at a time. In normal mode, the character being printed is written directly to video memory, using the current attribute. Within console.c, all the fields of the structure describing a virtual console are made accessible through macros, so any reference to (for example) attr really refers to the field in the data structure vc_cons[currcons], as long as currcons is the number of the console being referred to. A toy version of the automaton appears below.
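In this toy version, do_con_write_char is an invented name, the state set is cut down to three states, and "writing to video memory" becomes putchar():

    #include <stdio.h>

    /* a small subset of the states of the escape-sequence automaton */
    enum { ESnormal, ESesc, ESsquare };

    /* one step of the automaton: act on one character, return the next state */
    static int do_con_write_char(int state, unsigned char c)
    {
        switch (state) {
        case ESnormal:
            if (c == 27)               /* ESC starts a sequence */
                return ESesc;
            putchar(c);                /* stands in for writing video memory */
            return ESnormal;
        case ESesc:
            return (c == '[') ? ESsquare : ESnormal;
        case ESsquare:
            /* the real code collects numeric parameters and dispatches
             * vt102 commands; here we just swallow one final character */
            return ESnormal;
        }
        return ESnormal;
    }

    int main(void)
    {
        const unsigned char msg[] = "hi\033[mthere\n";
        const unsigned char *p;
        int state = ESnormal;
        for (p = msg; *p; p++)
            state = do_con_write_char(state, *p);
        return 0;
    }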
Actually, vc_cons in newer kernels is no longer an array of data structures; it is now an array of pointers whose contents are kmalloc()ed. The use of macros greatly simplified this change, because much of the code did not need to be rewritten.
The actual mapping and unmapping of the console memory to and from the screen is performed by the functions set_scrmem() (which copies data from the console buffer to video memory) and get_scrmem() (which copies the data back to the console buffer). The private buffer of the current console is physically mapped onto the actual video RAM, in order to minimize the number of data transfers. This means that get_scrmem() and set_scrmem() are static to console.c and are called only during a console switch.
Reading the console
Reading the console is accomplished through the line discipline. The default (and unique) line discipline in Linux is called tty_ldisc_N_TTY. The line discipline is what "disciplines the line" of input data. It is another function table (we are used to this approach by now, aren't we?), which is concerned with reading the device. With the help of the termios flags, the line discipline controls input from the tty: raw, cbreak and cooked modes; select(); ioctl(); and so on.
The read function in the line discipline is called read_chan(); it reads the tty buffer regardless of where the data came from, because the arrival of characters through a tty is managed by asynchronous hardware interrupts.
The line discipline N_TTY is also to be found in tty_io.c, though later kernels use a separate n_tty.c source file.
The lowest level of console input is part of keyboard management, and it is therefore handled in keyboard.c, in the function keyboard_interrupt().
Keyboard management
Keyboard management is quite simply a nightmare. It is confined to the file keyboard.c, which is full of hexadecimal numbers representing the keycodes of keyboards from different manufacturers.
I will not dig into keyboard.c, because it contains no information of relevance to the kernel researcher.
For those who are really interested in the Linux keyboard, the best approach to keyboard.c is from the last line upward: the lowest-level details occur mainly in the first half of the file.
Switching the current console
The current console is switched by invoking the function change_console(), which resides in tty_io.c and is called by both keyboard.c and vt.c (the former switches the console in response to keypresses, the latter when a program requests the switch by issuing an ioctl() call).
The actual switching process is performed in two steps, and the function complete_change_console() takes care of the second part. Splitting the switch is meant to complete the task after a possible handshake with the process controlling the tty we are leaving. If the console is not under process control, change_console() calls complete_change_console() by itself. Process intervention is needed to successfully switch from a graphic console to a text one, or from a text console to a graphic one; the X server (for example) is the controlling process of its own graphic console.

The selection mechanism
"Selection" is the clipping (CUT) function of the Linux text console. This trick is mainly processed by the user-level process, which can be explained with specific examples of Selection or GPM. The user-level program uses the IOCTL () notified the kernel on the console to highlight a area of the display screen. The selected text is then copied to a selection buffer. The buffer is a static entity in Console.c. Paste Text Operation is done in the TTY input queue through "Hand". The entire selection mechanism is protected by #ifdef, so users can disable it to save thousands of bytes during kernel configuration.
Selection is a very low-level facility, and its workings are hidden from any other kernel activity. This means that most of the #ifdefs simply deal with removing the highlight before the screen is modified in any way.
Newer kernels feature improved selection code, and the mouse pointer can be highlighted independently of the selected text (1.1.32 and later). Moreover, from release 1.1.73 onward, a dynamic buffer is used for the selected text rather than a static one, making the kernel 4 kB smaller.
ioctl()ling the device
The ioctl() system call is the entry point for user processes to control the behaviour of device files. ioctl management is spawned by ../../fs/ioctl.c, where the real sys_ioctl() resides. The standard ioctl requests are performed right there, other file-related requests are processed by file_ioctl() (in the same source file), while any other request is dispatched to the device-specific ioctl() function. A reduced model of this dispatch follows.
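In this reduced model, the request numbers and the console handler body are stand-ins, and error codes are collapsed to -1:

    #include <stdio.h>

    struct file;
    struct file_operations {
        int (*ioctl)(struct file *filp, unsigned cmd, unsigned long arg);
    };
    struct file { struct file_operations *f_op; };

    #define FIONBIO 0x5421             /* one of the generic requests */

    /* the dispatch order of sys_ioctl(), reduced to its shape */
    static int sys_ioctl(struct file *filp, unsigned cmd, unsigned long arg)
    {
        switch (cmd) {
        case FIONBIO:                  /* handled right here, fs/ioctl.c style */
            puts("generic: set the non-blocking flag");
            return 0;
        default:
            if (filp->f_op && filp->f_op->ioctl)
                return filp->f_op->ioctl(filp, cmd, arg);  /* device-specific */
            return -1;                 /* -EINVAL in the real kernel */
        }
    }

    static int vt_ioctl(struct file *filp, unsigned cmd, unsigned long arg)
    {
        (void)filp; (void)arg;
        printf("console driver handles request 0x%x\n", cmd);
        return 0;
    }

    static struct file_operations con_fops = { vt_ioctl };

    int main(void)
    {
        struct file console = { &con_fops };
        sys_ioctl(&console, FIONBIO, 1);
        sys_ioctl(&console, 0x5600, 0);   /* e.g. a VT_* request */
        return 0;
    }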
The ioctl material for the console device resides in vt.c, because the console driver dispatches most ioctl requests to vt_ioctl().
The above information refers to the 1.1.7x kernels. The 1.0 kernel has no "driver" table, and vt_ioctl() is pointed to directly by the file_operations table.
The ioctl material is, indeed, quite confusing. Some requests are device-related, while others are related to the line discipline. I will try to summarise things for the 1.0 and the 1.1.7x kernels; anything in between is a mix of the two approaches.
The 1.1.7x series takes the following approach: tty_ioctl.c implements only line-discipline requests (namely n_tty_ioctl(), the only n_tty function outside n_tty.c), while the file_operations field points to tty_ioctl() in tty_io.c. If the request number is not resolved by tty_ioctl(), it is passed along to tty->driver.ioctl or, if that fails, to tty->ldisc.ioctl. Driver-related material for the console can be found in vt.c, while the line-discipline material is in tty_ioctl.c.
In the 1.0 kernel, tty_ioctl() resides in tty_ioctl.c and is pointed to by the generic tty file_operations. Unresolved requests are passed along to the device-specific ioctl function or to the line-discipline code, in a fashion similar to 1.1.7x.
Note that in both cases the TIOCLINUX request is handled in device-independent code. This implies that the console selection can be set by ioctl()ing any tty (set_selection() always operates on the foreground console), and this is a security hole. It is also a good reason to switch to a newer kernel, where the hole has been closed by allowing only the superuser to handle the selection. A variety of requests can be issued to the console device, and the best way to learn about them is to browse the source file vt.c.
Copyright (C) 1994 Alessandro Rubini, rubini@pop.systemy.it