Linux 2.6 Scheduling System Analysis - Progress on 2.4

xiaoxiao2021-03-06  69

Linux 2.6 scheduling system analysis

content:

1. Foreword 2. New data structure Runqueue3. Improved Task_Struct4. New runtime tablet performance 5. Optimized priority calculation method 6. The process average waiting time SLEEP_AVG7. More accurate interactive process is preferred. Scheduler 9. The scheduler is supported by the kernel. Distributor related load balance 11. Scheduling 12 under NUMA structure. The real-time performance of the scheduler 13. Postscript: Look from the scheduler LINUX Development Reference About the author

related information:

Linux 2.4 Scheduling System Analysis DeveloperWorks Toolbox Subscription

In the Linux area:

Tutorial Tools & Product Codes & Component Articles

Progress in 2.4, Yangshazhou (Pubb@163.net) University of Defense Science and Technology School, April 2004

This paper analyzes the principles and implementation details of Linux 2.6 scheduling systems from Linux 2.6 scheduling systems, and analyzes and evaluate the load balance, NUMA structure, and real-time performance related to the scheduling system. At the end of the article, the author has made his own views from the development and implementation of the dispatching system.

1. Foreword Linux market is very broad, from desktop workstations to low-end servers, it is a strong competitor of any commercial operating system. At present, Linux is fully entered into the field of embedded systems and high-end server systems, but its technical defect limits its competitiveness: lack support for real-time tasks, multi-process machine scalable. In the 2.4 core, one of the key reasons for these two weaknesses is the defects in the scheduler design. 2.6 Scheduling System From the beginning of design, focus on better meetings and multiprocessors in parallel, and basically achieve its design goals. The main designer, the legendary character inyno molnar is summarized as the following:

Inherit and carry forward the feature of the 2.4 version of the schema:

The high-performance equity sharing of scheduling / wake-up at interactive job priority light-load conditions Based on the new features of the priority scheduling high CPU usage SMP efficient affinity real-time scheduling and CPU binding and other scheduled means:

O (1) Scheduling algorithm, the scheduler overhead is constant (independent of the current system load), the real-time performance is better and highly scalable, the lock size is greatly reduced to reduce the new design SMP affinity method optimization computational intensive batch batch job The scheduler works under the scheduling overload condition The multi-slide process of the scheduler is first improved by the parent process. In the 2.5.x test version, the development of the new scheduler has been widely concerned, and it has been proved that it does make system performance. improve. In this paper, the principle and implementation details of 2.6 scheduling systems are analyzed from 2.6 to 2.4 for the improvement of 2.6 for 2.4. 2.6 Scheduler design is quite complex, there are many places that need to continue research, especially the settings of each scheduling parameter, and as the core version is upgraded, it may continue to be revised. 2. New data structure Runqueue We know that in the 2.4 kernel, the ready process queue is a global data structure, and all the operations of the scheduler will cause the system to wait for the overall spin lock, making the ready queue become A obvious bottleneck. 2.4 Ready queue is a simple two-way linked list with Runqueue_Head, in 2.6, ready queue defines a complex data structure Struct Runqueue, and especially the critical is that every CPU will maintain a self Ready queue, - this will greatly reduce competition. O (1) Many key technologies in algorithms are related to Runqueue, so we start with the Runqueue structure from the analysis of the scheduler. 1) PRIO_ARRAY_T * ACTIVE, * EXPIRED, ARRAYS [2] The most critical data structure in Runqueue. Each CPU's ready queue is divided into two parts by the time tablets, and accessed through the Active pointer and Expired pointer, Active points to the time slice, the currently scheduled ready process, expired points to the time slice. Ready process. Each type of ready process represents the structure of a struct prio_array: struct prio_array {int NR_ACTIVE; / * The number of processes in this process * /

Struct List_head Queue [MAX_PRIO];

/ * See * / with a Hash table with priority.

Unsigned long bitmap [bitmap_size];

/ * Accelerate the bitmaps accessed above the Hash table, see * /

}

Figure 1: Active, Expired array example

Task in the figure is not a Task_Struct structure pointer, but task_struct :: run_list, this is a tip, see the explanation of Run_List below. In the version 2.4 version, the process of finding the best candidate processes is performed in the scheduler schedule (), each schedule must be performed (to call Goodness () in the for loop), this lookup process is currently The number of ready processes is related, so the time consuming to find the O (n) level, n is the number of current ready process. Because of this, the execution time of the scheduling action is related to the current system load, which is unable to give a limit, which is contrary to the real-time requirements. In the new O (1) scheduling, this lookup process is decomposed into n step, and the time spent on each step is O (1) level. The PRIO_ARRAY contains a ready queue array, an index of an array is a priority of the process (a total of 140, see the description of the "static_prio" attribute), the same priority process is placed in the corresponding array element Lin table Queue. Directly gives the first item in the tonight queue Active as a candidate process when the ready queue Active is given (see "scheduler"), and the priority calculation process is distributed to the execution of each process (see " Optimized priority calculation method "). In order to accelerate the linked list of the presence ready process, 2.6 The core has established a bitmap group to correspond to each priority linked list. If the priority chain is not empty, the corresponding position is 1, otherwise 0. The core also requires each architecture to construct a SCHED_FIND_FIRST_bit () function to perform this search operation, quickly locate the first non-empty ready process linked list. This algorithm that will be concentrated in the process is used to ensure the time limit of the scheduler operation, and the practice of retaining more information in memory has also accelerated the positioning process of the candidate process. This change is simple and efficient, one of the highlights in the 2.6 core. The Arrays binary array is a container of two types of ready queues, and Active and Expired points to one of them. Once the process in ACTIVE is transferred to Expired, set up new initial time slice; and when Active is empty, it means that the time slice of all current processes is complete, at this time, Active and Expired make a call, restart the next round of time film decreasing process (see "Scheduler"). Recall the 2.4 scheduling system, the calculation of the process time piece is time-consuming, in the early nuclear version, once the time is exhausted, the time slice is recalculated in the clock interrupt, and later in order to improve efficiency, reduce clock interruption processing Time, 2.4 Scheduling system has a one-time reconciliation in the scheduler after all ready-to-process time films. This is another O (n) level process. In order to ensure that the scheduler execution time of O (1), the time chip calculation of 2.6 is performed separately when the time slice of each process is performed separately, and the rotation of the time sheet is complete by the simple call described above (see "scheduler"). This is another highlight of the 2.6 scheduling system. 2) Spinlock_t lock runqueue's spin lock, still should be locked when you need to operate Runqueue, but this lock operation only affects the ready queue on a CPU, so the probability of competition should be much smaller. 3) Task_t * Curr This CPU is running the process.

4) Tast_t * idle points to the IDLE process of this CPU, equivalent to the role of init_tasks [this_cpu ()] in 2.4. 5) INT BEST_EXPIRED_PRIO Record the highest priority in the Expired Ready Process Group (Minimum value). This variable is saved when the process enters the Expired queue (Schedule_Tick ()), and it is used to see "Expired_TimeStamp" explanation). 6) Unsigned long expired_timestamp When the new round of time filtration begins, this variable records the time that the earliest process done the time film event (Jiffies's absolute value, in the schedule_tick (), it is used to characterize The longest wait time for the ready process in the Expired. Its use is embodied in Expired_starving (RQ) macro. It has been mentioned above, and two ready queues, Active, and Expired are maintained on each CPU. Under normal circumstances, the process of ending of the time slice should be transferred from the Active queue into the Expired queue (Schedule_Tick ()), but if the process is an interactive process, the scheduler will keep it on the Active queue to improve its response speed. . This measures should not allow other readyways waiting for a long time, that is, if the process in the Expired queue has been waiting for a long enough, even the interactive process should be transferred to the Expired queue, emptying the Active. This threshold is embodied on Expired_Starving (RQ): Both the following two conditions are satisfied in the pre-eXpired_TimeStamp and Starvation_Limit, if the following two conditions are met, expired_starving () returns true:

(Current absolute time - expired_timestamp)> = (Starvation_limit * total number of all ready processes in the queue 1), that is, at least one process in the Expired queue has been waiting for a long time; the static priority of the process is running than Expired The highest priority in the queue is low (Best_expired_prio, the value is large), of course, it should be emptied to switch to Expired as soon as possible. 7) STRUCT MM_STRUCT * PREV_MM Save the process to be scheduled after the scheduled process (called prev) Active_mm structural pointer. Since the Active_MM in 2.6 is released after the process switch is completed (MMDrop ()), the prev's Active_MM item may be NULL, so it is necessary to preserve in Runqueue. 8) Unsigned long NR_Running The number of ready processes on this CPU is the sum of the number of processes in both Active and Expired queues, which is an important parameter for this CPU load condition (see "Demand-related load balancing"). 9) Unsigned long nr_switches records the number of processes that have occurred since the scheduler running on this CPU. 10) Unsigned long NR_UnInterruptible Record This CPU is still in the number of Task_uninterruptible status, and the load information is related. 11) Atomic_T NR_IOWAIT Record the number of processes that are in the sleep state because of Waiting IO. 12) Unsigned long timestamp_last_tick This ready queue has recently scheduled the event time, which will be used when load balancing (see "Delivery-related load balancing"). 13) int prev_cpu_load [nr_cpus] records the load state on each CPU when load balancing (the NR_Running value in the ready queue) is analyzed to analyze the load condition (see "The" Scheduler-related load balancing "). 14) Atomic_t * node_nr_running; int prev_node_load [max_numnodes] These two attributes are only valid in the NUMA architecture, record the number of ready processes on each NUMA node and the load when the last load balance operation (see "the scheduling under NUMA structure "). 15) Task_t * migration_thread points to the migration process of this CPU. Each CPU has a core thread for executing a process migration operation (see "Delivery Balance related to the scheduler"). 16) struct list_head migration_queue requires a list of migrations (see "Delivery-related load balancing").

Most scheduling system code structure implements code, including the definition of the Runqueue structure, in the [kernel / Sched.c] file, the purpose of doing this is to centralize all scheduling systems, easy to update and replace. Unless otherwise specified, the codes and functions of this article are located in [kernel / Sched.c]. 3. The improved Task_Struct version 2.6 core still characterizes the process with task_struct, although the thread is optimized, but the core of the thread is still the same as the process. With the improvement of the scheduler, the content of Task_struct has also improved, and the interactive process is prioritized, and the new features such as internal nuclear seasity support are reflected in Task_struct. In Task_Struct, some attributes are new, and some attributes have changed, and some attributes are just a name. 1) The state of the State process is still represented by State. Different, the state constants in 2.6 are redefined, with easy-to-place operation: / * Selected [include / linux / sched.h] * / # define task_running 0

#define task_interruptible 1

#define task_uninterruptible 2

#define task_stopped 4

#define task_zombie 8

#define task_dead 16

The newly added Task_dead refers to a process that has already exited and does not require a parent process. 2) Time of the TimeStamp process (the unit is nanosecond, see below). Includes the following categories:

Wake-up time (in activate_task ()); Switching time (Schedule ()); SCHEDULE ()); SchEDule ()); load balancing related assignment (see "Conversion Balance" ). From this value and the difference in the current time, you can obtain "Time long", "Running Duration", etc., "" Optimized Priority Calculation Method ", etc.

The time of the two time unit systems is based on Nanosecond, but this numeric grains are fine, most of the core applications can only achieve its absolute value, and they do not know its precision. Time-related core applications typically perform around clock interrupts, in Linux 2.6, the system clock interrupts once every 1 millisecond (clock frequency, with Hz macro, is defined as 1000, that is, 1000 times per second, - 2.4 is defined in 100) Many applications are still along the 100 clock frequency), this time unit is called a JIFFIE. Many core applications are Jiffies as a time unit, such as a run time piece of the process. The conversion formula between JIFFIES and absolute time is as follows: Nanosecond = Jiffies * 1000000 core with two macros to complete two interchanges of two time units: jiffies_to_ns (), ns_to_jiffies (), many time macros There are two forms, such as ns_max_sleep_avg and MAX_SLEP_AVG. 3) PRIO priority, equivalent to the calculation result of Goodness () in 2.4, values ​​between 0 ~ Max_PRIO-1 (Max_PRIO is defined as 140), where 0 ~ Max_RT_PRIO-1 (MAX_RT_PRIO is defined as 100) belongs to the real-time process range Max_RT_PRIO ~ MX_PRIO-1 belongs to a non-real-time process. The larger value, the smaller the process priority. 2.6 In 2.6, the dynamic priority is no longer in the scheduler, but independently calculates, and stores in the TASK_STRUCT of the process, and then automatically sort by the Priority_Array structure described above. PRIO calculations and many factors are discussed in detail in "Optimized Priority Calculation Method". 4) static_Prio Static priority, the same meaning is the same as the Nice value of 2.4, but is converted to the same value zone as the PRIO. The nice value is used in the traditional Linux, the larger between -20 to 19, the larger the value, the smaller the priority of the process. NICE is user-maintainable, but only affects the priority of non-real-time processes. 2.6 The NICE value is no longer stored in the kernel, and it is static_prio. The size of the process initial chip is only determined in the static priority of the process, whether it is the real-time process or a non-real-time process, but the Static_PRIO of the real-time process does not participate in priority calculations. The relationship between Nice and Static_PRIO is as follows: static_prio = max_RT_PRIO NICE 20 kernel defines two macros to complete this conversion: PRIO_TO_NICE (), nice_to_prio (). 5) Activated means that the process is read from what reason, this reason will affect the calculation of the scheduling priority. ActiVated has four values:

-1, the process is woken up from the Task_uninterruptible state; 0, the default value, the process is in the case; 1, the process is awake from the task_interruptible state, and is not in the interrupt context; The context. Activated initial value is 0, modified in two places, one is in schedule (), is restored to 0, and the other is activate_task (), this function is called by try_to_wake_up () function, used to activate sleep processes: If it is interrupt Activate_task () called by the service program, that is, the process is activated by interrupt, then the process is most likely to interactively, therefore, set Activated = 2; otherwise, activated = 1. If the process is awakened from the Task_uninterruptible state, activated = -1 (in the try_to_wake_up () function). The specific meaning and use of Activated variables "Optimized priority calculation mode". 6) The average wait time of the SLEEP_AVG process (in nanosecond), at 0 to NS_MAX_SLEP_AVG, the value is 0, which is equivalent to the difference between the process waiting time and the runtime. The meaning of Sleep_avg represents is relatively abundant, which can be used to evaluate the "interaction level" of the process, and can be used to indicate urgency of the process needs to operate. This value is a key factor of dynamic priority calculation, the larger the SLEEP_AVG, the higher the process priority (less than the value). The change process of the SLEP_AVG will be analyzed in detail in the following "Process Average Wait Time Sleep_avg". 7) Interactive_credit This variable records the "interaction level" of this process, and values ​​between -credit_limit to CRedit_Limit 1. When the process is created, the initial value is 0, and then adds 1 minus 1 according to different conditions, once more than credit_limit (may be equal to CREDIT_LIMIT 1), it will not fall again, indicating that the process has passed "interactive" Testing is considered to be an interactive process. Interactive_credit Specific Change Process will be described in detail in "More Accurate Interactive Process Priority". 8) NVCSW / NIVCSW / CNVCSW / CNIVCSW process switch count. 9) Time_SLICE process time slice balance, equivalent to 2.4 counter, but no longer directly affects the dynamic priority of the process. The behavior of TIME_SLICE is specifically analyzed in the "New Run Time Film Performance". 10) first_time_slice 0 or 1, indicating whether it is the first time the time slice (the process that is just created). This variable is used to determine whether the remaining time slice should be returned to the parent process at the end of the process (see "new runtime").

11) RUN_LIST mentioned above, the priority array PRIO_ARRAY structure arranges all the processes under each priority in order, but in fact, each element in the actual array is the list_head structure, which is in its chain of the header. It is also list_head, where link is a Run_List member in task_struct. This is a small trick saving, accelerating access: the scheduler finds the corresponding RUN_LIST in PRIO_ARRAY, then find the corresponding Task_Struct (see enqueue_task (), dequeue_task (), and list.h through the fixed offset of Run_LIST in Task_Struct (see enqueue_task (), demue_task () Operational operation). 12) Array Record the current CPU's active ready queue (Runqueue :: Active). 13) THREAD_INFO's current process, two structural members and scheduling relationships are closely: preempt_count: Non-negative counter, greater than 0, indicate that the core should not be preempted; Flags: One TiF_Need_Resched bit, quite The NEED_RESCHED attribute in 2.4, if the process in the current run is 1, then start the scheduler as soon as possible. In 2.4, each process is located at the top of the process core stack (low site), the kernel can easily access the current process Task_struct by the stack register ESP. In 2.6, there is still a need to frequently access the data structure called CURRENT, but now, the top of the process core stack is saved in the thread_info attribute instead of the complete task_struct. The advantage of doing this is to save only the most critical, the most frequent running environment (still two page size), and save the task_struct most of the thread_info :: Task pointer to be convenient expansion. Thread_info allocation method and access method is exactly the same as the task_struct in 2.4, and the current current requires this to access: / * Extreme [include / asm-i386 / current.h] * /

Static inline struct task_struct * get_current (void)

{

Return Current_thread_info () -> Task;

}

#define current get_current ()

Where Current_Thread_info () is defined as:

/ * Extreme [include / asm-i386 / thread_info.h] * /

Static Inline strunt thread_info * current_thread_info (void)

{

Struct thread_info * ti;

__ASM __ ("ANDL %% ESP,% 0;" = R ":" 0 "(~ 8191ul);

Return Ti;

}

Figure 2: Current now

4. In the new runtime. Time_slice Although the same meaning is owned and counters, the performance behavior in the kernel has been large, and the next three aspects discusses the new runtime card performance: 1) Time_slice reference value and counter similar, the process of the process of time slice and process Static priority (in 2.4 is a nice value), use the following formula: Min_TimeSlice (MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1 - (P) -> Static_USER_PRIO-1))

After the value of each macro, the results are shown in the figure:

It can be seen that the core maps the priority of 100 to 139 to the time slice of 200 ms ~ 10ms, the larger the priority value, the smaller the time slice allocated. Compared with the default time sheet of the process in 2.4, when the Nice is 0, 2.6 is 100 ms greater than 2.4 60ms.

The average time tablet core definition process of the process The average time slice of the process of the process is the Nice value of 0, and the calculation is calculated according to the above formula is about 102 ms. This value will participate in priority calculations as a reference value of the process runtime. 2) Time_SLICE changes process Time_SLICE value represents the remaining size of the runtime tablet of the process, and the parent process is scheduled when the process is created, and during the running process, once it belongs to 0, press the STATIC_PRIO value to re-impart the reference value, and Request scheduling. The decrementing and reset reset of the time sheet is made in the clock interrupt (SCHED_TICK ()), in addition to this, the change of the Time_SLICE value is mainly in the process of creating the process and the process exit: a) The process is created and 2.4, in order to prevent the process Fork to steal the time slice, and the child process is not assigned its own time film, but the remaining time slice of parent process with the parent process. That is, after the Fork ends, both of the two time slips and the time sheets of the original parent process. b) When the process exits the process exits (SCHED_EXIT ()), determine if you have never reassigned over time according to the value of first_time_slice, if yes, return your remaining time slice to the parent process (guaranteed not exceeding Max_TimeSlice). This action makes the process will not be punished due to the creation of a short period process (and "reward" is "reward" without creating sub-processes). If the process has already used the time slice from the parent process, it is not necessary to return (this is not considered in 2.4). 3) Time_slice's impact on the scheduling in 2.4, the remaining time film of the process is the maximum factor affecting the dynamic priority except for the NICE value, and the number of sleeps will be continuously superimposed, thereby calculating the priority Big big, the scheduler is using this way to reflect the priority strategy for the interactive process. However, in fact, the number of sleeps does not indicate that the process is interactive, which can only explain it is IO intensive, so this method is very accurate, sometimes because of the database application that will frequently access the disk as an interactive process. However, it causes a true user terminal response sluggish. 2.6 The scheduler is divided into two categories of the ready process with the time consumption of time, respectively, with different ready queues, respectively, with respect to the latter with absolute scheduling priority - only when the Active process time Delivery, the Expired process has the opportunity to run. However, when picking up processes in Active, the scheduler no longer reserves the process remaining time card as a factor affecting scheduling priority, and to meet the requirements of kernel deprivation, the time slice is too long, the non-real-time interactive process will be artificially Divided into several sections (each section is called a run granularity, the definition is shown) running, after each section runs, it is deprived from the CPU, placed on the end of the corresponding Active Ready queue, and has the same priority The process provides an opportunity to run. This operation is done after schedule_tick (). At this time, even if the process has not been completed, as long as the process meets the following four conditions, it will be enforced to deprive from the CPU, and the next dispatch is waiting for the next dispatch:

The process is currently in the Active ready queue; the process is an interactive process (Task_interactive () returns true, see "More precise interactive process priority", Nice is greater than 12, the macro returns to constant fake); the process has been depleted The time slice (the time sheet reference value minus the remaining time slice) is exactly the integer multiple of the operating granularity; the remaining time slice is not less than the definition of the granularity operating granularity, the granular TIMESLICE_GRANTY is defined as the number of Sleep_AVG and the system total CPU number . Because Sleep_avg actually represents the difference between the non-run time and runtime of the process, the relationship is closely related to the interaction level, so the definition of running granularity illustrates the following two scheduling policies:

The higher the degree of process, the smaller the running particle size, which is allowed by the operation characteristics of the interactive process; the CPU-Bound process should not be separated by Cache, the maximum number of CPUs, the total number of granularities The bigger it. 5. Optimized priority calculation methods In the 2.4 core, the priority calculation and the selection of candidate processes are in the scheduler, and it is not possible to ensure the execution time of the scheduler, which is mentioned in front of the Runqueue data structure. 2.6 The candidate process in the kernel is to select directly from the priority queue number of priority queues in algorithms, and the priority calculation is dispersed. This section is divided into two parts to describe this new priority computing method, part of the priority calculation process, part of the priority calculation (and processes). 1) The priority calculation process dynamic priority is mainly completed by the Effect_PRIO () function, which is fairly simple, and the priority of the non-real-time process is only determined by two factors of the Sleep_avg value of static priority (static_prio) and processes. And the priority of the real-time process is actually set in setscheduler (see "Real-time performance of the scheduling system" in Setscheduler (), just below the non-real-time process, and once the setting will no longer change. In contrast, 2.4's Goodness () function is even more complex, it is not considered for the cost of CPU Cache failure overhead and memory switching. 2.6 Dynamic priority algorithm is on the Sleep_AVG variable, in Effective_PRIO (), the range of Sleep_avg is 0 ~ max_sleep_avg, which turns into -max_bonus / 2 ~ max_bonus / 2 Bonus / 2 after conversion of the following formula: (ns_to_jiffies ((p) -> SLEEP_AVG) * MAX_BONUS / MAX_SLEEP_AVG) - MAX_BONUS / 2

As shown below:

Use this Bonus to decrease the static priority to get the dynamic priority of the process (and limit between max_rt_prio and max_prio), the smaller the Bonus, the larger the dynamic priority value, the lower the priority. That is, the larger the SLEEP_AVG, the higher the priority. MAX_BONUS is defined as max_user_prio * prio_bonus_ratio / 100, that is, the effect of sleep_avg on dynamic priority is only within 1/4 interval (± 5) of the user priority area (100 ~ 140) of the static priority, relatively , Static priority, that is, the NICE value specified by the user is much larger than the priority calculation. This is also a place in 2.6 scheduling system, and the scheduler tends to morely design the execution priority of the user's own design process. Sleep_avg reflects two strategies of the scheduling system: the interactive process priority and the fair sharing of the time-time system, in the next section, we have to analyze. 2) The calculation of the priority calculation is no longer centralized when the scheduler selection candidate process, as long as the process state changes, the core is possible to calculate the dynamic priority of the process: a) Creating a process in Wake_up_forked_Process ( In, the child process inherits the dynamic priority of the parent process and adds to the ready queue where the parent process is located. If the parent process is not in any ready queue (for example, it is an IDLE process), then the priority of the child process is calculated via the Effect_PRIO () function, and the child process is placed in the appropriate queue according to the calculation result. b) Wake-up Sleep Process Call RECALC_TASK_PRIO () Sets the dynamic priority of the process you wake up in the sleep state, and then place it in the corresponding ready queue according to the priority. c) Scheduling to the process of waking up from the Task_Interruptible state In fact, the scheduler has selected a candidate process, but considering this type of process is likely to be an interactive process, so it is still called RECALC_TASK_PRIO () pair. The priority of the process is corrected (see "Process Average Waiting Time Sleep_avg"), the correction result will be reflected in the next dispatch. d) The process is deprived of the CPU due to time tablets (starting from the clock interrupt) (starting by the clock), the process may be deprived of CPU due to the two reasons, one is time to exhaust, one is too long segment. These two cases will call Effect_PRIO () recalculate priority, re-enter the team. e) Other timings These other opportunities include IDLE process initialization (INIT_IDLE ()), load balancing (), for details, "Scheduler-related load balancing" and modify the NICE value (set_user_nice (), modify the scheduling policy (setscheduler ()) Wait for the case where the priority is required. It can be seen from the discussion, and the calculation process of the dynamic priority in 2.6 is carried out during the operation of each process, avoiding the problem of time-consuming process time when the ready process in which 2.4 systems is modeled, so that the response time of the process cannot be expected. At the same time, the factors affecting the dynamic priority are reflected in the SLEP_AVG variable.

6. The Sleep_AVG value of the process average waiting time SLEEP_AVG process is the key to determining the dynamic priority of the process. It is also the key to the level of progress. It is the most complicated part of the 2.6 scheduling system, which can be said that the performance improvement of the 2.6 scheduling system, A large part should be attributed to the design of Sleep_avg. In this section, we will specifically analyze the changes in Sleep_avg and its impact on scheduling. There are four places in the kernel to modify the Sleep_AVG: The sleep process is awakened (Activate_task () calls the recal_task_prio () function), the process of the task_interruptible status is wake up first scheduled to (Schedule () call RECALC_TASK_PRIO () ), The process is deprived from the CPU (in the Schedule () function), the process creation and process exits, where Recalc_task_prio () is the highest in which the complexity is highest, it passes the waiting time (or waiting in sleep, or It is the impact of the priority in the ready queue to reset the priority. 1) The sleep process is awakened at this time at this time, Activate_task () is called RECALC_TASK_PRIO () to calculate the time of the precedence of the sleep waiting time as the parameter. In Recalc_Task_PRIO (), Sleep_AVG may have four assignments, and is ultimately limited to NS_MAX_SLEP_AVG: a) unchanged from the Task_uninterruptible status status (activated == - 1), the degree of interaction is not high (! High_credit (p)) The user process (P-> mm! = Null)), if its sleep_avg is not less than Interactive_Sleep (P), its SLEEP_AVG will not change due to this waiting. b) Interactive_sleep (p) is awakened from the Task_uninterruptible status status (activated == - 1), the degree of interaction is not high (p-> mm! = null), if its sleep_avg is not To achieve interactive_sleep (p), but if the sleep time sleep_time is reached, its SLEEP_AVG is equal to Interactive_Sleep (P). c) max_sleep_avg-avg_timeslice user process (P-> mm! = null), if it is not awake from Task_uninterruptible sleep (P-> ActiVated! = - 1), and the time waiting time (sleep_time) has exceeded interactive_sleep (p), its SLEEP_AVG is set to jiffies_to_ns (max_sleep_avg-avg_timeslice). d) Sleep_avg sleep_time If you do not satisfy all of the cases, the SLEEP_TIME is superimposed on the Sleep_AVG.

At this time, SLEEP_TIME is corrected twice: i. According to the value of SLEEP_AVG, the larger the SLEEP_AVG, the smaller the corrected SLEEP_TIME: SLEEP_TIME = SLEEP_TIME * MAX_BONUS * (1-NS_TO_JIFFIES (Sleep_avg) / max_sleep_AVG)

Ii. If the process is very low (Low_credit () returns true, see "More Accurate Interactive Process Priority"), then limit the Sleep_Time to the reference time sheet of the process, as a priority for the process of CPU-Bound Level punishment. In general, the longer the waiting time is, the sleep_avg should be larger, but the core excludes two cases:

The process waken from the Task_uninterruptible status, especially those who have a longer sleep time, it is very likely to wait for some kind of resource to sleep, they should not be subject to the priority reward of waiting time, its sleep_avg is limited to interactive_sleep (P) Within the range (or no more than too much) (after the meaning of Interactive_Sleep (P), "This restriction is invalid for the process that has been identified as an interactive process. Not from the Task_uniterruptible status, if the waiting time is too long, it should not be rewarded by the waiting time.

Interactive_sleep (p) and NS_TO_JIFFIES (Interactive_Sleep (P)) =

Task_nice (p) * max_sleep_avg / 40

(Interactive_Delta 1 Max_bonus / 2) * max_sleep_avg / max_bonus -1

This macro is proportional to the process nice value, the lower the priority, allowing it to longer in the "interactive sleep" state. 2) The process of the Task_Interruptible status is woken up, and after the schedule is sent to the scheduler to pick up the candidate process, if it is found to be waken from Task_Interruptible, the first time is scheduled to (ActiVated> 0), the scheduler will According to it, it is called the recalc_task_prio () re-adjusted the process's SLEEP_AVG. The process of recalc_task_prio () adjustment is exactly the same as the "sleep process is awake", and it is different. At this time, it is not a process for sleeping time, but the process is waiting for scheduling on the ready queue. time. Also, if the process is not awakened (activated == 1), sleep_time will also be constrained (Delta = Delta * ON_Runqueue_Weight / 100), because the process is not the possibility of interactive processes. From the above analysis of RECALC_TASK_PRIO (), the Sleep_Time reduction generally means that the priority will decrease accordingly, so this reward indicating that the scheduler is further reducing the response time of the process, especially interactive processes. 3) Before being switched, Sleep_avg is the "average" wait time of the process, and recalc_task_prio () calculates waiting time. In Schedule (), the Sleep_avg that is switched down, the Sleep_AVG that is subtracted will be subtracted. Time run_time (not less than 0), this is a "average" manifestation: the longer the Sleep_AVG, the more the process is more and more, the longer the Sleep_AVG, the less the process is not easy to schedule it. . Run_time can be represented by the difference between the current time of the system and the process TimeSTAMP (the time of the last scheduled run ", but cannot exceed NS_MAX_SLEP_AVG. For the interactive process (hight_credit (p) is true, see "More Accurate Interactive Process Priority"), Run_Time also adjusts according to the value of Sleep_AVG: Run_Time = Run_Time / (Sleep_AVG * MAX_BONUS / MAX_SLEEP_AVG)

The result of this adjustment is that the RUN_TIME of the interactive process is smaller than the actual run time, the larger the SLEEP_AVG, the more Run_TIME reduces, so the process of being switched is finally calculated, the greater the Sleep_AVG, the dynamic priority is also changed. Big. The interactive process can take more opportunities to be performed. 4) After Wake_up_forked_process (), the Sleep_avg of the parent process is multiplied by Parent_Penalty / 100, while the Sleep_AVG of the child process is multiplied by Child_Penalty / 100. In fact, Parent_Penalty is 100, Child_Penalty is equal to 95, that is, the Sleep_avg of the parent process will not change, and the Sleep_avg that the child process inherits from the parent procedure will reduce 5%, so the last priority of the child process will be slightly smaller than the father process. Low (but the child process will still be placed on the same ready queue as the parent process, where the position is in the parent process - "preamble" said "child process first on the parent process"). 5) When the process exits, a process ends running. If its interaction is lower than the parent process (smaller in Sleep_AVG), the core will adjust the Sleep_AVG of its parent process in SCHED_EXIT (), and adjust the formula as follows (indicated by Child_Sleep_AVG) Sleep_avg of the child process: sleep_avg = sleep_avg * EXIT_WEIGHT / (EXIT_WEIGHT 1) Child_Sleep_AVG / (exit_weight 1)

Where EXIT_WEIGHT is equal to 3, so the Sleep_avg of the parent process will reduce the 1/4 of the Sleep_AVG, and then compensate for 1/4 of the sub-process Sleep_AVG, the priority will also decline, the degree of interaction of the child process is more different from the parent process. The more prioritized punishment of priority. Using the process average waiting time to measure the priority of the process, the wait time of all processes of the macro the same static priority is consistent with the ratio of the ratio of the runtime, which reflects the fairness of the CPU to share the CPU. On the other hand, the SLEEP_AVG is a measure of the degree of interactive process. 7. More accurate interactive process priority interactive process priority strategy is widely criticized in the 2.6 core, which has been greatly improved in the 2.6 core, and in general, the internal nuclear has four priorities for interactive processes. : 1) Sleep_AVG has analyzed the impact of Sleep_AVG on process priorities in detail. It can be seen that the interactive process is more than sleeping, and their Sleep_avg will be more powerful, so the priority is calculated. Level will also be higher accordingly. 2) Interactive_credit system introduces an Interactive_credit process property (see "Improved task_struct"), is used to characterize whether the process is an interactive process: as long as intective_credit exceeds the crredit_limit threshold (high_credit () returns true), the process It is considered to be an interactive process. Interactive_credit initial value is 0, in both cases it will add 1, both in the RECALC_TASK_PRIO () function:

User process (P-> mm! = Null), if not waken from Task_uninterruptible sleep (P-> ActiVated! = - 1), and wait for time (including waiting in sleep and waiting in the ready queue,) More than a certain limit (sleep_time> interactive_sleep (p)); except above, if the SLEEP_AVG is adjusted after the Sleep_Time, if it is greater than NS_MAX_SLEP_AVG. Either case, once intective_credit exceeds (greater than) Credit_Limit, it no longer increases, so interactive_credit maximum is CREDIT_LIMIT 1. Interactive_credit's decrement occurs in the Schedule () function. After the scheduler is switched with the SLEEP_AVG of the process being switched, if the Sleep_AVG is less than or equal to 0, and intective_credit is between -credit_limit and credit_limit (-100 <= intective_credit <= 100), INTERACTIVE_CREDIT minus 1. It can be seen that the interactive_credit minimum is -101, and once it reaches the maximum value of CREDIT_LIMIT 1, it will not be subtracted again - it will remain on the high value of CRedit_limit 1. That is to say, only the process is sleeping multiple times, and the time of sleep is long enough (longer than the time, longer than "interactive sleep" time), the process is likely to be listed as an interactive process; and once it is considered an interactive process Others will always be treated with an interactive process. The interactive process using the high_credit () standard assertion is mainly rewards in priority calculations in two sites: When the process is scheduled from the CPU, if it is an interactive process, it participates in the priority calculation running time. Small than the actual run time, to achieve a higher priority (see "Process Average Waiting Time Sleep_AVG"); Sleep Time in the Task_uninterruptible state will also be superimposed on Sleep_AVG, giving priority rewards (see " Process Average Waiting Time SLEEP_AVG "); 3) Task_interactive () The core is another interactive process priority mechanism without using high_credit () this cumulative method, where use task_interactive () macro (Boolean): PRIO < = static_prio-delta (p)

When the process time is exhausted, if the macro returns true (and when the Expired queue is not waiting for too long, see "New Data Structure Runqueue" "Expired_TimeStamp" section), then the process does not enter the Expired queue but retain it in Active In the queue in order to schedule this interactive process as soon as possible. Dynamic priority is calculated in Effective_Prio () when scheduling into this process: PRIO = static_prio-bonus (Sleep_avg) indicates that Bonus is a function of Sleep_avg, see "Priority Calculation Process"), and Delta (P ) The NICE value of the process can be expressed as Delta (Nice). The relationship between Bonus and Sleep_avg has been shown in the "Priority Calculation Process" section. The relationship between Delta and Nice is shown in the figure below: The range of nice is -20 ~ 19, Delta (P) converts it to -5 ~ 5 plus an Interactive_Delta constant range: Task_nice (p) * max_bonus / 40 interactive_delta, where Interactive_Delta is equal to 2. After the conversion, Task_INteractive (P) becomes "Delta (Nice) is not greater than Bonus (Sleep_avg). Represents Sleep_AVG as Jiffies, and substithere, delta <= Bonus (Sleep_avg) can be represented as: Nice / 4 2 <= Bonus, -5 <= Bonus <= 5

It can be seen from it that Nice is greater than 12, this inequality is constant, that is, the process will never be treated as an interactive process; the higher the process static priority, it is necessary to be used as an interactive process. The lower the upper limit of the SLEEP_AVG, that is, the static priority process is more opportunities to get this reward. 4) The reward for ready waiting time is most likely to be interactive because it is often in the Task_Interruptible status, so this class process will wait for the scheduled scheduling on the ready queue after waking up in hibernation. level. This work is in the scheduler (SCHEDULE ()) after selecting this type of process, and considering that the interactive process is usually woken up in the interrupt, the core also records this information, and wakes up without interrupt The process implements a reward constraint (see "Process Average Waiting Time Sleep_avg"). 8. After the scheduler has the above preparation, we can now look at the main proxidation of the scheduler. Compared with the 2.4 scheduler, 2.6 schedule () functions should be more simple, reduce the lock operation, the priority calculation is also available outside the scheduler. In order to reduce the process in the CPU, the process of reinterping the process that is switched in 2.4 will be omitted on another CPU. The basic process of the scheduler can still be summarized as the same five steps:

Clean up the current running process (Prev) Select the next input process (Next) Setting up the new process Running environment Perform process context switching 2.6 scheduler workflow retains a lot of 2.4 systems, process switching details Also true with 2.4 (starting from context_switch ()). In order not to analyze the dispatcher with the 2.4 system, we analyze the workflow of the scheduler in accordance with the impact of the scheduler on the various data structures, focus on the different parts of the 2.4 scheduler, the same or similar part of the same or similar part of the reader It is easy to understand with the technical analysis above. At the same time, the 2.6 scheduler has also increased support for load balancing and internal nuclear sealing operation, as content is independent, we also put it in separate chapters. 1) The associated lock is mainly because the ready queue is distributed to each CPU, and only two locks in the 2.6 scheduler: Ready queue lock Runqueue :: Lock, the global core lock kernel_flag. The operation of the ready queue is guaranteed to operate uniqueness, the meaning of the core lock is the same in 2.4: the scheduler will unlock the core lock before performing the handover (Release_kernel_lock ()), restore the lock after the scheduling (Reacquire_kernel_lock ()). The lock state of the process is still saved in the task_struct :: lock_depth property. Because there is no global locked operation in the scheduler, the 2.6 scheduler itself runs hard to exist. 2) The prev scheduler mainly affects the two attributes of the Prev process: Sleep_AVG minus the runtime of this process (see "Switching process" in the "Process Average Waiting Time Sleep_AVG"; TimeStamp Update to Current Time, Record The time to be switched, used to calculate the process waiting time. After the prev is switched, even if the Sleep_avg is modified, it does not change the position in the ready queue, which will always participate in the schedule until a state change (such as sleep). 3) NEXT When introducing the Runqueue data structure in the previous, we have analyzed the functions of Active / Expired, sequencing the ready process queue, and the 2.6 scheduler has three possibilities for the positioning of the candidate process:

The Active Readiness Process is the highest priority and waiting for the process; there is no ready process in Runqueue, then start load balancing from other CPUs transfer process, then select (see "Detail-related load balancing") If there is still no ready process, set this CPU's IDLE process to a candidate. After picking out next, if the next is wake up from Task_Interruptible sleep, the first time is scheduled to (ActiVated> 0), the scheduler will re-adjust the priority of the process according to the next time the next queue is waiting. In the new location in the enclosure queue, see "Process Average Waiting Time Sleep_AVG"). In addition to the update of Sleep_AVG and PRIO, NEXT's TimeSTAMP is also updated to the current time, which is used to calculate the running time when it is switched next time. 4) The external environment hereby environment refers to the assignment of the scheduler to the environment other than the participating scheduling and the environment other than the ready queue, mainly including the switching counting process and the CPU state update (QSCTR). 9. The schedulat is supported by the kernel sessage in 2.4 system, any process running in the core state, only when it calls Schedule () active abandon control, the scheduler will have the opportunity to choose other processes, so we say that Linux 2.4 kernel It is not to be ruined. Lack of this support, the core cannot guarantee the timely response of the real-time task, so it is not possible to meet the requirements of the real-time system (even soft). 2.6 The kernel has realized the rules, and any code segment without lock protection may be interrupted, its implementation, for scheduling technology, mainly the time to increase the scheduler operation. We know that in the 2.4 kernel, the scheduler has two start-up modes: active and passive, where the passive mode start scheduler can only be in control from the core state, so there is a characteristic of the kernel. 2.6, the startup method of the scheduler can still be divided into two types of active and passive, and the conditions of passive start scheduler are widening a lot. Its modification is mainly in Entry.s: ... RET_FROM_EXCEPTION: # 进口 中 中

Preempt_Stop # is interpreted as a CLI, the closing interrupt, that is, from the abnormality, it is not allowed to preemptively

RET_FROM_INTR: # 入 入 入

GET_THREAD_INFO (% EBP) # Take Task_Struct's Thread_info information

MOVL EFLAGS (% ESP),% EAX

MOVB CS (% ESP),% Al

Testl $ (VM_MASK | 3),% EAX

JZ RESUME_KERNEL # "Return to user state" and "return" in the core state "

Entry (Resume_USERSPACE)

CLI

MOVL TI_FLAGS (% EBP),% ECX

Andl $ _TIF_WORK_MASK,% ECX # (_ TIF_NOTIFY_RESUME | _TIF_SIGPENDING

# | _TIF_NEED_RESCHED)

JNE WORK_PENDING

JMP Restore_all

Entry (resume_kernel)

CMPL $ 0, TI_PRE_COUNT (% EBP)

JNZ restore_all # If preempt_count is not 0, it is not allowed to preemptively

NEED_RESCHED:

MOVL TI_FLAGS (% EBP),% ECX

TestB $ _TIF_NEED_RESCHED,% cljz restore_all # If there is no need_resched bit, no scheduling

Testl $ IF_MASK, EFLAGS (% ESP)

JZ RESTORE_ALL # If it is interrupted, it is not allowed.

Movl $ preempt_active, ti_pre_count (% EBP) #PreeMpt_count is set to Preempt_Active, and the notification scheduler is currently scheduling at once.

# 调

STI

Call Schedule

MOVL $ 0, TI_PRE_COUNT (% EBP) #preempt_count clear 0

CLI

JMP NEED_RESCHED

......

Work_pending: # This is also a Resched entry when returning from the system call

Testb $ _TIF_NEED_RESCHED,% CL

JZ Work_notifysig # does not need to schedule, then it is definitely because there is a signal that needs to be handled to enter Work_pending

Work_resched:

Call Schedule

CLI

MOVL TI_FLAGS (% EBP),% ECX

Andl $ _TIF_WORK_MASK,% ECX

JZ RESTORE_ALL # No Work is done, no need for resched

Testb $ _TIF_NEED_RESCHED,% CL

JNZ Work_Resched # or needs to be scheduled, or there is a signal to process

Work_notifysig:

......

Now, whether it is returning to a user state or returning a core state, it is possible to check the status of NEED_RESCHED. When it returns the core state, as long as preempt_count is 0, that is, the current process is currently allowed to call Schedule () according to the NEED_RESCHED status. In the core, because at least the clock interrupt is constantly occurring, so as long as the process sets the NEED_RESCHED flag of the current process, the current process is likely to be preemptive, regardless of whether it is willing to give up the CPU, even the core process is true.

In addition to the core application actively call the scheduler in addition to the core application, the core is still working in the following three timings:

Returns from interrupt or system calls; process re-enable preempt_schedule ()); actively enter sleep (such as wait_event_interruptible () interface) 10. The load balance related to the scheduler is switched in the 2.4 core, and if there is also a CPU idle, or the process priority running on the CPU is lower than yourself, then P will be scheduled to the CPU to run, core It is this way to achieve the balance of the load. Simple is the biggest advantage of this load balancing mode, but its disadvantages are also more obvious: the process migration is more frequent, and the interactive process (or high priority process) may continue "jumping" between the CPU. Even in the SMP environment, the process migration is also cost, and the use of the system has shown that this load balancing method is greater than that, solving this "SMP affinity" problem is one of the targets of 2.6 system design. 2.6 Scheduling system uses a relatively concentrated load balancing scheme to be "push" and "pull" two types of operations: 1) "pull" When a CPU load is too light and the other CPU load is heavy, the system will be from overloaded The "pull" process is over, this "pull" load balancing operation is implemented in the LOAD_BALANCE () function. Load_balance () has two modes, which are used for the current CPU is not idle and idle. We call it "busy balance" and "idle balance": a) Busy balance whether or not the current CPU is busy or free, clock interruption (Rebalance_Tick () function) Every time (busy_rebalance_tick) will start a load_balance () balanced load, which is called "busy balance". Linux 2.6 tends to do a lot of limitations as much as possible, so many restrictions on whether it should be "pulled": the most busy CPU load exceeds 25% of the current CPU load; the current CPU load Take the current real load and the larger value of the load of the load during the last load balance, smooth load valve; the load of each CPU takes a smaller value of the current real load and the load on the load balance, smooth load peak; After the source and destination two ready queues, then confirm that the primary queue load is not reduced, otherwise the load balancing action is canceled; the following three types of processes in the source queue participate in the load case, but do not do actual migration:

The process that is running is not allowed to migrate to the CPU of this CPU (according to the CPU_ALLOWED Property) The time of the CPU in the CPU is located (Runqueue :: TimeStamp_LAST_TICK, the value in the clock interrupt) and the process are switched down (Task_struct: The difference is less than a certain threshold (cache_decay_ticks nanosecond value), - the process is still active, the information in cache is not cool enough.

Load History Information In order to avoid competition, the scheduler will save the load of the entire CPU load (the number of ready process) in the corresponding element of the Prev_CPU_LOAD array of this CPU ready queue, which will refer to this when calculating the current load A historical information. After finding the busiest CPU (source CPU), determine the number of processes that need to be migrated as half of the source CPU load and the difference in this CPU load (all of the above historical information is smoothed), then follow the Expired queue to the Active queue, from low The priority process is migrated in the order of high priority processes. However, the process that actually performs migration is often less than the program migration process, because the process of the three types of "do not do active migration" does not participate in migration. b) Load balance in idle balance, there are two call time: in the scheduler, the Ready queue of this CPU is empty; in the clock interrupt, this CPU's ready queue is empty, and the current absolute time (Jiffies value) Is the multiple of idle_rebalance_tick (that is, it is performed once every idle_rebalance_tick). At this time, Load_balance () is relatively simple: looking for the current real load maximum CPU (Runqueue :: NR_Running the most), migrating one of the "most suitable" (see below) to the current CPU. The standard and "busy balance" candidate process is similar, but because the idle balance is only "pull" one process, the action is much smaller, and the frequency is relatively high (idle_rebalance_tick is 200 times the busy_rebalance_tick), so There is no historical situation and load difference considering the load, and the candidate migration process does not consider the activity of Cache.

The problem in calculating the busiest CPU algorithm is actually possible to become a balance source of the CPU load should be more than the current CPU load, so the initial value of MAX_LOAD in the Find_busiest_queue () function is NR_Running, and the LOAD is at least 1 So a little bit less. c) The specific action of the PULL_TASK () "pull" process is implemented in this function. After the process migrates from the source Reader to the purposeful queue, PULL_TASK () updates the TimeStamp property of the process so that it can continue to explain that the process relative to the switching time of the CPU. If the priority of the pulled process is higher than the process of running on this CPU, the NEED_RESCHED bit of the current process is waiting to be scheduled. 2) "Push" a) Migration_Thread () corresponds to "pull", and 2.6 load balancing system has a "push" process, the main body of "push" is a core process called Migration_Thread (). This process is automatically loaded (one each CPU) when the system is started, and set it to the real-time process of SCHED_FIFO, then check if there is a request to wait for processing in Runqueue :: migration_queue, if not, sleep in Task_Interruptible until waking up After checking again. MIGRATION_QUEUE is only added in SET_CPU_ALLOWED (), when the process (such as closing a CPU via APM) Call SET_CPU_ALLOWED () change the currently available CPU, so that a process is not suitable for continuing to run on the current CPU, constructing a migration request data Structure MIGRATION_REQ_T, put it in the migration_queue of the CPU ready queue where the process is located, then wakes up the migration of the ready queue (recorded in the Runqueue :: Migration_Thread property), migrating the process to the appropriate CPU (see "new data Structure Runqueue "). In the current implementation, the selection and load of the destination CPU are independent, but "any_online_cpu (req-> task-> cpus_allowed", which is the first allowed CPU in the CPU number order. So, with the LOAD_Balance () and the scheduler, the load balancing strategy is closely related, MIGRATION_THREAD () should be said to be just an interface such as a CPU binding and the function of CPU power management. b) Move_task_away () actual migration The action is implemented in the move_task_away () function. After the process enters the destination Ready queue, its TimeStamp is updated to the TimeStamp_Last_Tick of the destination CPU ready queue, indicating that the process is starting (on the destination CPU) waiting . Because the operation of "push" is to read the remote written locally (opposite to PULL_TASK ()), therefore, when the schedule of the CPU is started, it is necessary to synchronize with the remote operation, and it is also possible to pass the IPI (Inter-Processor Interrupt). Notify the destination CPU, all of these operations are implemented in the resched_task () function.

Two Runqueue's lock synchronization will involve two ready queues on the migration process, usually need to lock both the ready queue before the operation, in order to avoid dead locks, the kernel provides a set of locks. Double_rq_lock () / double_rq_unlock () function. This set of functions do not operate IRQ, so the action of the switch interrupt needs to do itself. This set of functions is used in Move_Task_away (), and PULL_TASK () is Double_lock_balance (), but the principle is the same as Double_RQ_Lock () / Double_RQ_UNLOCK (). 11. The scheduling under the Numa structure is in the Linux scheduler. The main difference between NUMA and SMP is in that the CPU under NUMA is organized into one node. Different architectures, the number of CPUs included in each node is different, for example, under the I386 platform of 2.6, the NUMAQ structure can be configured on each node, and the Summit structure can be configured 32 CPUs. The NUMA structure is officially reflected in the Linux kernel from 2.6, before this, Linux supports NUMA using the existing "discontiguous memory" architecture. In addition to special processing on memory allocation, the previous kernel is equivalent to SMP in the scheduling system. 2.6 The scheduler except for a single CPU load, the load of each node under NUMA is considered. The NUMA structure has two special processing in the new kernel. One is equal to each NUMA node when doing load balancing, and the other is to select a new program in the system (DO_EXECVE ()) Select from the Lightweight Node Executing the CPU: 1) Balance_node () balance between nodes as part of the Rebalance_Tick () function before loading_balance () (at this time "The work set of Load_balance () is the CPU within the node, that is, NUMA is not simply Balance the full-system CPU load, but first balance the node load, the re-balance node is loaded into "busy balance" and "idle balance" two steps, the execution interval is idle_node_rebalance_tick (current implementation is IDLE_REBALANCE_TICK 5 BUSY_NODE_REBALANCE_TICK (only twice the busy_node_rebalance_tick). Balance_Node () first invokes to find the busiest node in the system, then perform loading_balance () in the CPU collection composed of this node and this CPU. The algorithm for finding the busiest node involves several data structures:

Node_nr_running [max_numnodes], records the number of ready processes on each node with the node number, which is the real-time load on that node. This array is a global data structure that needs access to the Atomic Series function. Runqueue :: prev_node_load [max_numnodes], the load condition when the system recorded in the system recorded in the reader data structure, which corrects the following formula: Current load = last load / 2 10 * Current real-time load / The number of Node CPU uses this calculation method to smooth load, or the number of node CPUs can be considered inconsistent. Node_threshold, load weight, defined as 125, the load of the most busy node selected must exceed 125/100 of the current node load, that is, the load difference exceeds 25%. 2) SCHED_BALANCE_EXEC () When the execve () system call loads another program to put it run, the core will look for the Lights Lights in the Lights Lights Lighter (SCHED_BEST_CPU ()), then call SCHED_MIGRATE_TASK () Migrate this process to the selected CPU. This operation is implemented by DO_EXECVE () calls sched_balance_exec (). The selection criteria for the SCHED_BEST_CPU () are as follows:

If the current CPU ready process does not exceed 2, do not migrate; when calculating node load, use (10 * current real-time load / node CPU number) algorithm, does not consider the history of the load; calculate the load of the CPU within the node The actual number of ready-to-use processes acts as a load indicator, and does not consider the history of the load. And "Busy Balance" and "Idle Balance" use different load evaluation criteria, SCHED_BALANCE_EXEC () uses a different (simpler) evaluation criteria with Balance_Node (). SCHED_MIGRATE_TASK () Borrowed the Migration_Thread service process to complete the migration, the process's CPU_ALLOWED is temporarily set to only on the destination CPU, and the migration_thread will re-resemble the CPU_allowed attribute after migrating the process to the destination CPU. 12. Real-time performance of the scheduler 1) 2.6 For real-time applications 2.6 kernel scheduling systems have two new features to real-time applications: kernel seasants and O (1) scheduling, these two points guarantee the real-time process can be expected The response is obtained in time. This "limited time response" is in line with the requirements of Soft RealTime, and there is a distance from the "Hard Realtime) of" immediate response ". And, 2.6 scheduling system still does not provide deprivation of other resources other than CPUs, and therefore, its real-time is not fundamentally changed. 2) Priority of the real-time process 2.4 In the system, the priority of the real-time process is represented by the RT_Priority property, which is different from the non-real-time process. 2.6 Introducing a dynamic priority attribute outside of the static priority, and uses it to simultaneously represent the priority of the real-time process and the non-real-time process. From the above analysis we see that the static priority of the process is the basis of the initial time of the process, and the dynamic priority determines the actual scheduling priority of the process. Whether it is a real-time process or a non-real-time process, static priority is set and changed by set_user_nice (), and the default value is 120 (max_prio-20), that is, the time slice and non-real time process in real time in one range. Inside. There are two places in real-time processes and non-real-time processes: scheduling policy policy (SCHED_RR or SCHED_FIFO) and dynamic priority PRIO (less than max_user_rt_prio), actually use the latter as the inspection criteria. The dynamic priority of the real-time process is set (equivalent to rt_priority) in setscheduler (), and does not change with the process of the process, so the real-time process is always sorted in strict accordance with the priority of the settings, this point and non-real-time process dynamic priority meaning different. It can be considered that the static priority of the real-time process is only used to calculate the time slice, and the dynamic priority is equivalent to a static priority. 3) Sched_r and SCHED_FIFO in Real-Time Scheduling 2.4 There are no changes in 2.6. The two types of real-time processes are kept in the Active ready queue, just because the 2.6 kernel is sewer, real-time process (especially core level Real-time processes can react more quickly to changes in the environment (such as higher priority processes).

13. Postscript: Look in Linux from the scheduler In recent years, Linux has shown increasing interest and competitiveness for desktop systems, low-end servers, high-end servers, and embedded systems. For a "market style" open development The model's operating system can be made to do this is a miracle. However, from the implementation of the scheduling system, I feel that Linux is still on the desktop system. It still maintains the characteristics of "self-interest" in the early years, that is, the development of the developers of free software, to a large extent come from Change the current situation of existing systems to "not easy". Despite all kinds of motives and power, Linux exhibits strong competition with commercial operating systems such as Windows, but from the perspective of developers, this desire and free software development feature is contradictory. In an interview, INGO MONAR said that he designed O (1) scheduling algorithm, basically from personal creativity, not referring to the market and the existing scheduling algorithm in the field of research. As can be seen from the scheduler, 2.6 scheduling system considers a lot of details, but there is no clear main line in general, and it is impossible (or unintentional) to perform performance analysis of the O (1) model. From the development of 2.6, we can also see that the weights related to various schedules have been fine-tuning in different versions, which can be considered that the performance optimization of the 2.6 scheduling system is mainly measured. This is a typical Linux development model - full of passion, lack of planning. For Linux markets, the most urgent, most active needs are embedded systems, but at least from the dispatching system, 2.6 does not have great efforts in this area, maybe developers don't feel more and interest. . It can be sure, although Linux is very hot in the market, its development is still not a market being driven. This may affect Linux competitiveness, but maybe it can also maintain the vitality of Linux development. On March 1 this year (2004), the famous open source network security project FREES / WAN announced that the development is mainly due to the need for developers' intention and user demand. For network security systems, users think more about system functions, powerful, rather than their predictive advanceability, therefore, FREES / WAN new version of Opportunistic Encryption (OE) is not attracted to a sufficient number of users test. In view of this, investors have stopped funding for FREES / WAN projects, which provides powerful network security support for open source systems, may transfer to ground again. So far, I have not heard Linux's development depends on a report of a commercial fund. Therefore, relatively speaking, Linux has developed more freely and casualties, and promoting Linux people and developing Linux basically operates independently. Linux The beneficiary and Linux developers have not been closely combined. This may be a blessing for Linux, not a disaster.

转载请注明原文地址:https://www.9cbs.com/read-106573.html

New Post(0)