Towards Linux 2.6
Contents:
Linux 2.6 highlights
The new scheduler
Kernel preemption
Improved threading model and NPTL support
Virtual memory changes
Porting drivers to Linux 2.6
Memory management changes
The work queue interface
Changes to interrupt routines
The unified device model
Resources
About the author
A glimpse into the workings of the next new kernel
Level: Intermediate

Anand K Santhanam (Asanthan@in.ibm.com)
Software Engineer, IBM Global Services

November 2003
The soon-to-be-released stable kernel supports more processor types and improves reliability and scalability, which should push Linux into an even wider range of applications. This article focuses on the most significant changes, large and small, and provides code samples along the way.
The Linux kernel has come a long way since Linus Torvalds released version 0.1 in 1991 with little more than a basic scheduler, IPC (inter-process communication), and memory management. Today it is a practical alternative operating system that competes strongly in the market: more and more government agencies and IT giants are moving to Linux, and it now runs almost everywhere, from the smallest embedded devices to the S/390, from wristwatches to large enterprise servers.

Linux 2.6 is the next major release in the Linux development cycle. It includes powerful features aimed at improving the performance of high-end enterprise servers as well as support for an ever-growing range of embedded devices (for more on how Linux 2.6 serves large, small, and multiprocessor systems, see the link to Joseph Pranevich's "The Wonderful World of Linux 2.6" in Resources). This article analyzes the most important features of Linux 2.6 and then discusses in more depth the changes that driver developers will care about.

Linux 2.6 highlights

Whether for enterprise servers or for embedded systems, Linux 2.6 is a big step forward. For high-end machines, the new features target performance, scalability, throughput, and SMP support on NUMA machines. For the embedded world, new architectures and processor types have been added, including support for MMU-less systems that have no hardware-assisted memory management. And to satisfy desktop users, a new set of audio and multimedia drivers has been added. In this article we analyze some of the most interesting features of Linux 2.6, but many other changes are worth noting, including enhanced core dumps, fast user-space mutexes, an improved I/O subsystem, and more; we cannot discuss them all here. Some are summarized in the sidebar, and links to the rest appear in the Resources section.

The new scheduler

The 2.6 Linux kernel uses a new scheduler algorithm, developed by Ingo Molnar and known as the O(1) scheduler, that performs extremely well under heavy load and scales well as processors are added.

In the 2.4 scheduler, the timeslice recalculation algorithm required that all runnable processes exhaust their timeslices before new ones could be computed. On a system with many processors, most of the processors could sit idle while processes that had used up their timeslices waited for the recalculation (to get new timeslices), which hurts SMP efficiency. Worse, idle processors would start executing waiting processes whose own processors were busy (if those processes had not yet exhausted their timeslices), causing processes to "bounce" between CPUs. When the bouncing process is a high-priority or interactive process, overall system performance suffers.

The new scheduler addresses these problems by distributing timeslices on a per-CPU basis and eliminating the global synchronization and recalculation loop. It uses two priority arrays, the active array and the expired array, that are accessed through pointers. The active array contains all tasks affine to a CPU that still have timeslice left; the expired array contains a sorted list of all tasks whose timeslices have expired.
When every task in the active array has used up its timeslice, the two array pointers are simply swapped: the expired array (which now holds the tasks ready to run) becomes the active array, and the empty old active array becomes the new expired array. A 64-bit bitmap indexes the arrays, making it easy to find the highest-priority runnable task.
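To see why the bitmap makes picking the next task a constant-time operation, here is a conceptual sketch. It is not the kernel's actual code: the 64 priority levels, the structure, and the function names are simplifications chosen only for illustration.

#include <stdint.h>

#define NUM_PRIO 64                        /* simplified: one bit per priority level */

struct prio_array {
        uint64_t bitmap;                   /* bit n set => level n has runnable tasks */
        /* struct list_head queue[NUM_PRIO];  per-priority run lists (omitted here) */
};

/* Return the highest-priority non-empty level (lower number = higher priority),
 * or -1 if nothing is runnable. A single find-first-set operation does the work,
 * regardless of how many tasks are queued. */
static int find_next_prio(const struct prio_array *a)
{
        if (!a->bitmap)
                return -1;
        return __builtin_ffsll(a->bitmap) - 1;
}

The real scheduler keeps two such arrays (active and expired) per CPU and swaps them when the active one empties, as described above.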
The new scheduler also does away with the big global runqueue_lock. Instead it maintains a run queue and lock per processor, so two processes on different processors can sleep, wake up, and context-switch fully in parallel. The recalculation loop (which recomputed timeslices for every process) and the goodness loop are gone, and wakeup() and schedule() now run in O(1) time. The main benefits of the new scheduler are:

SMP efficiency: if there is work to do, every processor has work to do.
Wait fairness: no process goes without CPU time for very long, and no process hogs an unfair amount of CPU time while others wait.
SMP process affinity: a process stays mapped to one CPU rather than bouncing between CPUs.
Priority handling: unimportant tasks get lower priority (and vice versa).
Load balancing: the scheduler reduces the priority of tasks that exceed the processor's load capacity.
Interactive performance: with the new scheduler, even under very heavy load, long pauses before the system responds to a mouse click or a keystroke no longer occur.

Kernel preemption

The kernel preemption patch was merged during the 2.5 series and carries over into 2.6. It significantly reduces latency for user-interactive applications, multimedia applications, and the like, and the feature is especially useful for real-time and embedded systems. The 2.5 kernel preemption work was done by Robert Love.

In earlier kernels (including 2.4), a task running in kernel mode (including a user task that entered kernel mode through a system call) could not be rescheduled until it voluntarily released the CPU. In kernel 2.6 the kernel is preemptible: a kernel task can be preempted so that an important user application can keep running. The biggest gain is much better user interactivity; the user perceives faster responses to mouse clicks and keystrokes.

Of course, not every kernel code path can be preempted. Critical sections of kernel code can be locked against preemption. Locking guarantees that per-CPU data structures and state are always protected from preemption. The following code fragment illustrates the per-CPU data-structure problem (on an SMP system):

Listing 1. Code with a kernel-preemption problem
int arr[NR_CPUS];
arr[smp_processor_id()] = i;
/* kernel preemption could happen here */
j = arr[smp_processor_id()];    /* i and j are not equal as
                                   smp_processor_id() may not be the same */
In this code, if kernel preemption happened at the indicated point, the task could be rescheduled onto another processor when it next runs: smp_processor_id() would then return a different value. Situations like this must be protected by locking.

FPU state is another case where the CPU must be protected from preemption. When the kernel executes floating-point instructions, FPU state is not saved. If preemption occurred there, then because of rescheduling the FPU state could be completely different from what it was before the preemption point, so FPU code must always be locked against kernel preemption.

Locking is implemented by disabling preemption for the critical section and re-enabling it afterwards. The 2.6 kernel provides the following calls to disable and enable preemption:
preempt_enable() - decrements the preemption counter
preempt_disable() - increments the preemption counter
get_cpu() - calls preempt_disable() followed by smp_processor_id()
put_cpu() - re-enables preemption

Using these calls, Listing 1 can be rewritten as follows:

Listing 2. Code protected against preemption
int cpu, arr[NR_CPUS];

arr[get_cpu()] = i;             /* disable preemption */
j = arr[smp_processor_id()];
/* do some critical stuff here */
put_cpu();                      /* re-enable preemption */
Note that preempt_disable()/preempt_enable() calls nest: preempt_disable() can be called n times, and preemption is re-enabled only when the nth preempt_enable() is reached. Preemption is also disabled implicitly while a spinlock is held. For example, a call to spin_lock_irqsave() implicitly disables preemption by calling preempt_disable(), and spin_unlock_irqrestore() re-enables it by calling preempt_enable().

Improved threading model and NPTL support

A great deal of work went into improving threading performance during the 2.5 series. The reworked threading model in 2.6 was also done by Ingo Molnar. It is based on a 1:1 threading model (one kernel thread for each user thread) and includes in-kernel support for the new Native POSIX Threading Library (NPTL), developed by Molnar together with Ulrich Drepper.
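As a quick user-space illustration of the 1:1 model (this example is not from the original article; it assumes an NPTL-based glibc and calls gettid through syscall() because the C library of that era had no wrapper for it), every pthread below is backed by its own kernel thread and therefore reports its own kernel thread ID, while getpid() stays the same for all of them:

#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg)
{
        /* SYS_gettid returns the kernel thread backing this user thread */
        printf("thread %ld: pid=%ld tid=%ld\n", (long)arg,
               (long)getpid(), (long)syscall(SYS_gettid));
        return NULL;
}

int main(void)
{
        pthread_t t[4];
        long i;

        for (i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        return 0;
}

Built with -lpthread on a 2.6 kernel with NPTL, each line shows a distinct tid and a shared pid, which is exactly the one-kernel-thread-per-user-thread mapping described above.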
Other noteworthy changes in the 2.6 kernel
File systems: The ext2/ext3 file systems have been improved, including support for extended attributes and POSIX access control lists. The NTFS driver has been rewritten and now supports SMP (it is re-entrant safe), cluster sizes larger than 4KB, and more. 2.6 also supports IBM's JFS (Journaling File System) and SGI's XFS.

Audio: For desktop users, one of the most anticipated features is the new Linux audio architecture, ALSA (Advanced Linux Sound Architecture), which replaces the flawed old OSS (Open Sound System). The new sound architecture supports USB audio and MIDI devices, full duplex operation, and more. Playing MP3s and other audio files on the desktop will never be the same!

Buses: The SCSI/IDE subsystems have been extensively rewritten; some of the drivers are still in testing or being finalized.

Power management: Support has been added for ACPI (Advanced Configuration and Power Interface), used for CPU frequency scaling (running the CPU at different clock speeds depending on load) and for software suspend (still experimental).

Networking and IPSec: Support for IPSec (IP Security) has been added to the kernel, along with related RFCs such as IP payload compression. The old in-kernel HTTP server, kHTTPd, has been removed. The IPSec features use the kernel's new cryptographic API, which provides popular algorithms such as MD4, MD5, DES, and others. Support for the new NFSv4 (Network File System) client/server has also been added.

User interface: The 2.6 kernel has a rewritten framebuffer/console layer, which means the various user-space framebuffer tools such as fbset need updating. The human-interface layer has also gained support for nearly every input and accessibility device, from touch screens to devices for the blind to all kinds of mice.

Returning to the threading model: thread operations are now much faster, and the 2.6 kernel can handle an arbitrary number of threads, with PIDs of up to 2 billion (on IA32). Another change is the introduction of the TLS (Thread Local Storage) system call, which allows one or more GDT (Global Descriptor Table) entries to be allocated and used as thread registers. The GDT is per-CPU, with one entry per running thread. This enables the 1:1 threading model without an artificial limit on thread creation (since a new kernel thread is created for each user thread); the 2.4 kernel could support only 8,192 threads per processor.

The clone system call has been extended to optimize thread creation. If the CLONE_PARENT_SETTID flag is set, the kernel stores the thread ID at a given memory location; if the CLONE_CHILD_CLEARTID flag is set, the kernel clears that memory location when the thread exits. This helps user-level memory management recognize memory blocks that are no longer in use. Signal-safe loading of the thread register has also been integrated into this system call. When pthread_join is needed, a futex (fast user-space mutex) keyed on the thread ID is handled by the kernel. (To learn more about futexes, see Resources.)

POSIX signal handling is now done in kernel space. A signal is delivered to one available thread in the process; a fatal signal terminates the entire process. Stop and continue signals also affect the whole process, which makes job control of multithreaded processes possible. A variant of the exit system call, exit_group(), has been introduced to terminate an entire process together with all of its threads.
In addition, exit handling has been sped up by an O(1) algorithm, so that a process with thousands of threads can exit within about two seconds (the same operation took roughly 15 minutes on a 2.4 kernel). The proc file system has also been modified to report only the initial thread rather than every thread, which keeps /proc reporting from slowing down. The kernel guarantees that the initial thread does not exit before all the other threads have exited.

Virtual memory changes

On the virtual memory side, the new kernel merges Rik van Riel's r-map (reverse mapping) technology, which significantly improves virtual memory performance under certain kinds of load. To understand reverse mapping, let us first briefly review some basics of the Linux virtual memory system.

The Linux kernel works in a virtual memory mode: each virtual page corresponds to a physical page of system memory. Address translation between virtual and physical pages is done by the hardware page tables. For a given virtual page, a page-table lookup either yields the corresponding physical page or indicates that the page is not present (a page fault). But this virtual-to-physical mapping is not always one-to-one: multiple virtual pages (pages shared by different processes) may point to the same physical page, in which case each sharing process's page table has an entry mapping to that physical page. When the kernel wants to free such a physical page, things get complicated, because it must walk every process's page tables to find all references to that page; only when the reference count drops to zero can the page be released, since the kernel has no other way of knowing whether the page is still referenced. This makes virtual memory very slow under high load.

The reverse-mapping patch solves this by adding a data structure called pte_chain to struct page. The pte_chain is a simple linked list of the PTEs that map the page, making it possible to go from a page back to every PTE that references it, so releasing a page becomes straightforward. There is a pointer cost in this scheme, however: every struct page in the system needs an extra field for the pte_chain. On a machine with 256MB of memory there are 64K physical pages, so 64K * sizeof(struct pte_chain) of memory has to be set aside for these structures, which is a considerable amount. Several techniques address this, including removing the wait_queue_head_t field (used for exclusive access to the page) from struct page; because this wait queue is rarely used, the rmap patch implements a much smaller queue and finds the right wait queue through a hash. Even so, rmap's performance, especially on high-load, high-end systems, is a significant improvement over the 2.4 virtual memory system.

Porting drivers to Linux 2.6

The 2.6 kernel brings a whole series of changes that matter to driver developers. This section focuses on the important aspects of porting drivers from the 2.4 kernel to 2.6.

First, relative to 2.4, the kernel build system has been improved for faster builds, and better graphical configuration tools have been added: make xconfig (which requires the Qt library) and make gconfig (which requires the GTK library). Some highlights of the 2.6 build system:
The architecture zImage and the modules are built automatically when you run make
make -jN runs a parallel build
make is no longer verbose by default (verbosity can be restored with KBUILD_VERBOSE=1 or make V=1)
make subdir/ builds all files in subdir/ and its subdirectories
make help lists the available make targets
make dep is no longer needed at any stage

The kernel module loader was also completely reimplemented during 2.5, which means the module build mechanism is very different from 2.4. A new set of module utilities is needed to load and unload modules (their download location is given in Resources), and old 2.4 makefiles no longer work with 2.6.

The new kernel module loader was developed by Rusty Russell. It uses the kernel build mechanism to produce a .ko (kernel object) module file instead of a plain .o file. The kernel build system first compiles the module and then links it with vermagic.o, which creates a special section in the module object recording the compiler version, the kernel version, whether kernel preemption is enabled, and so on.

Now let us look at an example of how the new build system compiles and loads a simple module. The module is the usual "hello world", and its code is almost identical to 2.4 module code; only module_init and module_exit replace init_module and cleanup_module (modules have been able to use this mechanism since kernel 2.4.10). The module is named hello.c, and its Makefile looks like this:

Listing 3. Example driver Makefile

KERNEL_SRC = /usr/src/linux
SUBDIR = $(KERNEL_SRC)/drivers/char/hello/
all: modules

obj-m := module.o
module-objs := hello.o
EXTRA_CFLAGS = -DDEBUG=1

modules:
	$(MAKE) -C $(KERNEL_SRC) SUBDIRS=$(SUBDIR) modules
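The article does not show hello.c itself; the following is only a minimal sketch of what a 2.6-style hello.c might look like (the printk messages are arbitrary):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

/* module_init()/module_exit() replace the old init_module()/cleanup_module() */
static int __init hello_init(void)
{
        printk(KERN_INFO "Hello, world\n");
        return 0;
}

static void __exit hello_exit(void)
{
        printk(KERN_INFO "Goodbye, world\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");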
The Makefile above uses the kernel build mechanism to build the module. The compiled module is named module.ko and is obtained by compiling hello.c and linking it with vermagic. KERNEL_SRC points to the kernel source directory, SUBDIR to the directory containing the module, and EXTRA_CFLAGS lists any extra compile-time flags that should be passed.

Once the new module (module.ko) has been built, it can be loaded and unloaded with the new module utilities; the 2.4 module utilities cannot load or unload 2.6 modules. The new module loading tools try hard to avoid unloading a module while a device is still using it, and only unload it after confirming that nothing is using it anymore. One cause of such conflicts in 2.4 was that the module usage count was manipulated by the module code itself (via MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT). In 2.6, the module no longer has to increment and decrement its own reference count; this is done outside the module code. Any code that wants to reference a module must call try_module_get(&module) and may only use the module if the call succeeds; the call fails if the module is in the middle of being unloaded. Correspondingly, a reference to the module is released with module_put().

Memory management changes

During 2.5 development, memory pools were added to guarantee that certain memory allocations can always be satisfied without interruption. The idea is to pre-allocate a pool of memory and keep it in reserve until it is really needed. A memory pool is created with mempool_create() (include the header linux/mempool.h):

mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
                          mempool_free_t *free_fn, void *pool_data);

Here min_nr is the number of objects to pre-allocate, and alloc_fn and free_fn point to the pool's standard object allocation and deallocation routines. Their types are:

typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

pool_data is a pointer handed to the allocation and deallocation functions, and gfp_mask is the allocation mask; the allocation function sleeps only if the __GFP_WAIT flag is specified. Objects are allocated from and returned to the pool with:

void *mempool_alloc(mempool_t *pool, int gfp_mask);
void mempool_free(void *element, mempool_t *pool);

mempool_alloc() allocates an object; if the pool's underlying allocator cannot provide memory, the pre-allocated reserve is used. The pool itself is released with mempool_destroy().
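Putting these calls together, here is a hedged sketch of how a driver might back a memory pool with a slab cache. The io_request structure, the cache name "ioreq", MIN_IOREQS, and the helper functions are hypothetical examples, not part of the original article:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

#define MIN_IOREQS 16                             /* objects to keep pre-allocated */

struct io_request { int opcode; void *buf; };     /* hypothetical object type */

static kmem_cache_t *ioreq_cache;
static mempool_t *ioreq_pool;

/* mempool_alloc_t: hand out objects from the slab cache passed as pool_data */
static void *ioreq_alloc(int gfp_mask, void *pool_data)
{
        return kmem_cache_alloc((kmem_cache_t *)pool_data, gfp_mask);
}

/* mempool_free_t: return objects to the same slab cache */
static void ioreq_free(void *element, void *pool_data)
{
        kmem_cache_free((kmem_cache_t *)pool_data, element);
}

static int __init ioreq_pool_init(void)
{
        ioreq_cache = kmem_cache_create("ioreq", sizeof(struct io_request),
                                        0, 0, NULL, NULL);
        if (!ioreq_cache)
                return -ENOMEM;

        ioreq_pool = mempool_create(MIN_IOREQS, ioreq_alloc, ioreq_free,
                                    ioreq_cache);
        if (!ioreq_pool) {
                kmem_cache_destroy(ioreq_cache);
                return -ENOMEM;
        }
        return 0;
}

static void ioreq_pool_exit(void)
{
        mempool_destroy(ioreq_pool);
        kmem_cache_destroy(ioreq_cache);
}

Allocation then looks like mempool_alloc(ioreq_pool, GFP_KERNEL), and the object goes back with mempool_free(req, ioreq_pool); if the slab allocator cannot satisfy the request, the reserve of MIN_IOREQS pre-allocated objects is used instead.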
Besides memory pools, 2.5 also introduced new GFP flags for regular memory allocation:

__GFP_REPEAT - tells the page allocator to try hard, repeating the attempt if the allocation fails.
__GFP_NOFAIL - the allocation is not allowed to fail; since the caller may be put to sleep, the allocation can take a long time to complete, but the caller's request will eventually be satisfied.
__GFP_NORETRY - the allocation is never retried; failure is reported to the caller immediately.

Apart from the allocation changes, the remap_page_range() call, used to map pages into user space, has been slightly modified. Relative to 2.4 it now takes one more argument: a pointer to the virtual memory area (VMA) comes first, followed by the four usual arguments (start, end, size, and protection flags).

The work queue interface

The work queue interface was introduced during 2.5 development to replace the task queue interface used for scheduling kernel tasks. Each work queue has a dedicated worker thread, and all tasks from the queue run in process context (so they may sleep). Drivers can create and use their own work queues or use a work queue provided by the kernel. A work queue is created with:

struct workqueue_struct *create_workqueue(const char *name);

where name is the name of the work queue.

Work queue tasks can be set up at compile time or at run time. A task has to be packed into a structure of type work_struct, which is initialized at compile time with:

DECLARE_WORK(name, void (*function)(void *), void *data);

where name is the name of the work_struct, function is the routine called when the task is scheduled, and data is a pointer passed to that function. At run time, the equivalent is:

INIT_WORK(struct work_struct *work, void (*function)(void *), void *data);

A work queue task (of type work_struct) is added to a work queue with the following calls:

int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, struct work_struct *work,
                       unsigned long delay);

The delay in queue_delayed_work() guarantees that at least that minimum delay passes before the queued task actually runs. Tasks in a work queue are executed by the associated worker thread at some unpredictable time (depending on load, interrupts, and so on), or after the delay. A task that has been waiting in a work queue indefinitely and has not yet run can be cancelled with:

int cancel_delayed_work(struct work_struct *work);

If the task is already running when the cancel call returns, it will run to completion, but it will not be queued again.

To flush all tasks in a work queue, use:

void flush_workqueue(struct workqueue_struct *queue);

and to destroy a work queue:

void destroy_workqueue(struct workqueue_struct *queue);

Not every driver needs its own work queue; a driver can use the default work queue provided by the kernel. Because this queue is shared by many drivers, tasks may take longer to start executing, so the work functions should keep their own delays to a minimum or avoid them entirely. Note that the default queue is available to all drivers, but only GPL-licensed drivers can use the custom work queues:

int schedule_work(struct work_struct *work); - adds a task to the default work queue
int schedule_delayed_work(struct work_struct *work, unsigned long delay); - adds a task to the default work queue with a delay

When a module is unloaded it should call flush_scheduled_work(), which waits until all of its pending tasks in the default queue have executed.
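The following sketch pulls the work queue calls above together; the queue name "mydrv", the handler, and the two work items are hypothetical examples, not from the original article:

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/param.h>
#include <linux/workqueue.h>

static struct workqueue_struct *mydrv_wq;

/* Runs in process context in the queue's worker thread, so it may sleep */
static void mydrv_handler(void *data)
{
        printk(KERN_INFO "mydrv: handling event %ld\n", (long)data);
}

/* Compile-time initialization with DECLARE_WORK(name, function, data) */
static DECLARE_WORK(mydrv_work, mydrv_handler, (void *)1);
static DECLARE_WORK(mydrv_late_work, mydrv_handler, (void *)2);

static int mydrv_start(void)
{
        mydrv_wq = create_workqueue("mydrv");
        if (!mydrv_wq)
                return -ENOMEM;

        queue_work(mydrv_wq, &mydrv_work);                  /* run as soon as possible   */
        queue_delayed_work(mydrv_wq, &mydrv_late_work, HZ); /* at least HZ jiffies later */
        return 0;
}

static void mydrv_stop(void)
{
        cancel_delayed_work(&mydrv_late_work);  /* drop it if it has not run yet     */
        flush_workqueue(mydrv_wq);              /* wait for anything already queued  */
        destroy_workqueue(mydrv_wq);
}

A driver that is content with the shared default queue would skip create_workqueue() and simply call schedule_work()/schedule_delayed_work(), followed by flush_scheduled_work() on unload.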
Changes to interrupt routines

Interrupt handlers underwent many changes during 2.5, but most of them do not affect the average driver developer. A few important changes do, however.

Interrupt handler functions now return a value of type irqreturn_t. This change, introduced by Linus, means that an interrupt handler tells the generic IRQ layer whether the interrupt was really meant for it. It was made to catch spurious interrupts (especially on shared PCI lines) that keep arriving because, say, a driver accidentally enabled an interrupt bit or the hardware is broken, situations that drivers previously could do nothing about. In 2.6, a driver returns IRQ_HANDLED if the interrupt came from its device and IRQ_NONE if it did not. This helps the kernel's IRQ layer clearly identify which driver is handling a particular interrupt. If interrupts keep arriving on a line and no registered handler claims them (that is, all drivers return IRQ_NONE), the kernel starts ignoring interrupts from that line. By default a driver's IRQ routine should return IRQ_HANDLED; returning IRQ_NONE while the driver was actually handling the interrupt indicates a bug. A new-style interrupt handler looks something like this:

Listing 4. Pseudo-code for a 2.6 interrupt handler
irqreturn_t irq_handler(...) {
        ...
        if (!(my_interrupt))
                return IRQ_NONE;        /* not our interrupt */
        ...
        return IRQ_HANDLED;             /* return by default */
}
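For context, here is a hedged sketch of how such a handler might look in full and how it is registered with request_irq() in 2.6. The mydev structure, the my_hw_raised_irq() status check, and the choice of the 2.6-era SA_SHIRQ shared-interrupt flag are illustrative assumptions, not part of the original article:

#include <linux/interrupt.h>

struct mydev {
        void *regs;           /* hypothetical register mapping */
        int irq;              /* interrupt line assigned to the device */
};

/* Hypothetical: read the device's status register to see if it raised the IRQ */
static int my_hw_raised_irq(struct mydev *dev)
{
        return 1;
}

static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct mydev *dev = dev_id;

        if (!my_hw_raised_irq(dev))
                return IRQ_NONE;        /* not ours, e.g. another device on a shared line */

        /* acknowledge and service the device here */
        return IRQ_HANDLED;
}

static int mydev_setup_irq(struct mydev *dev)
{
        /* SA_SHIRQ marks the line as shareable; dev is passed back as dev_id */
        return request_irq(dev->irq, mydev_interrupt, SA_SHIRQ, "mydev", dev);
}

static void mydev_teardown_irq(struct mydev *dev)
{
        free_irq(dev->irq, dev);
}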
Note that cli(), sti(), save_flags(), and restore_flags() are deprecated. Instead, local_save_flags() and local_irq_disable() should be used; they disable all interrupts locally (on the current processor only). Disabling interrupts on all processors at once is no longer possible.

The unified device model

Another of the most significant changes during 2.5 development is the creation of a unified device model. By maintaining a set of data structures, the device model represents nearly every device and bus in the system. The benefits include better power management and simpler device management, and the model keeps track of the following information:
which devices are present in the system and what power state each device is in
which bus a device is attached to and which driver controls it
the bus topology of the system: which devices are attached to which buses and how buses are interconnected (for example, a USB bus hanging off a PCI bus)
the device classes present in the system (classes include disks, partitions, and so on)

Other driver-related developments in 2.5 include the removal of malloc.h: any code that includes <linux/malloc.h> must include <linux/slab.h> instead.
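As a rough illustration of how a driver plugs into the unified device model, the sketch below registers a hypothetical driver with the driver core on the platform bus. The names are invented, and real drivers normally go through bus-specific wrappers such as pci_register_driver() or usb_register() rather than calling driver_register() directly:

#include <linux/device.h>
#include <linux/init.h>
#include <linux/module.h>

static int hypo_probe(struct device *dev)
{
        /* bind the driver to the device; return 0 on success */
        return 0;
}

static int hypo_remove(struct device *dev)
{
        return 0;
}

static struct device_driver hypo_driver = {
        .name   = "hypo",
        .bus    = &platform_bus_type,    /* hang the driver off the platform bus */
        .probe  = hypo_probe,
        .remove = hypo_remove,
};

static int __init hypo_init(void)
{
        return driver_register(&hypo_driver);
}

static void __exit hypo_exit(void)
{
        driver_unregister(&hypo_driver);
}

module_init(hypo_init);
module_exit(hypo_exit);
MODULE_LICENSE("GPL");

Once registered, the driver and any devices bound to it show up under /sys, which is the user-visible face of the data structures the device model maintains.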
Resources

Read how to upgrade to the 2.6 kernel at KernelTrap.
Joseph Pranevich's "The Wonderful World of Linux 2.6" takes a deeper look at 2.6's support for embedded, large, and multiprocessor systems, as well as the new kernel's networking and security features; it is well worth reading.
Dave Jones's overview of what to expect in 2.6 also looks at the features and changes coming in the new kernel.
Will the future bring Linux 2.6 or Linux 3.0? See the discussion on the Linux kernel mailing list.
Kernel Traffic reports weekly highlights from the Linux kernel mailing list.
Daily and weekly coverage of the kernel and the mailing lists can be found at Linux Weekly News (kernel news has its own page).
Check the status of the current kernel at Kernelnewbies, Rik van Riel's site for kernel newbies.
Don't miss the must-read Linux Weekly News series on porting drivers to 2.5.
Want to learn more about the O(1) scheduler? Visit Ingo Molnar's page.
Robert Love maintains a set of scheduler utilities called schedutils.
Learn about the design of the Native POSIX Threading Library (NPTL).
The r-map patch is provided by Rik van Riel.
To load and unload newly built modules you will need the new module utilities, available from kernel.org.
Bert Hubert has written documentation about futexes and a HOWTO on IPSec.
To get to know the new Linux audio architecture, visit the Advanced Linux Sound Architecture (ALSA) web site.
Linux Weekly News has an article about the work queue interface.
On developerWorks, see "Linux system development on an embedded device", written jointly by Anand K Santhanam and Vishal Kulkarni.
Other related developerWorks articles cover developing embedded Linux applications and compiling the 2.4 Linux kernel.
You can find more resources for Linux developers in the developerWorks Linux zone.
About the author

Anand K Santhanam holds a bachelor's degree in Computer Science from Madras University, India. He has been with IBM Global Services (Software Lab), India, since July 1999. He is a member of IBM's Linux group, where he works mainly on ARM-Linux, device drivers, and power management for embedded systems. His other areas of interest are operating system internals and networking. You can contact him at Asanthan@in.ibm.com.