Linux 2.6 Added VSYSCALL System Service Call Mechanism (ZT)

xiaoxiao2021-03-06 47


Like the Windows system service call mechanism, Linux maintains a single jump table for all kernel-mode system calls (sys_call_table @ arch/i386/kernel/entry.S). On Windows, the analogous table (KeServiceDescriptorTable @ ntos/ke/kernldat.c) is further subdivided by function into four parts, used by the kernel, the Win32 subsystem, and so on. The Linux system service table, for reasons of openness, compatibility, and portability, is more stable and conservative.

The following is program code:

// sys_call_table @ arch/i386/kernel/entry.S
.data
ENTRY(sys_call_table)
    .long sys_restart_syscall    /* 0 - old "setup()" system call, used for restarting */
    .long sys_exit
    ...
    .long sys_request_key
    .long sys_keyctl

syscall_table_size=(.-sys_call_table)
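The idea of a jump table indexed by service number can be sketched in user-space C. This is only an illustration of the dispatch pattern, not kernel code: the demo_* handlers, the table name, and the single-argument signature are all invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type, standing in for sys_exit and friends. */
typedef long (*service_fn)(long arg);

static long demo_restart(long arg) { (void)arg; return 0; }
static long demo_double(long arg)  { return 2 * arg; }
static long demo_negate(long arg)  { return -arg; }

/* The jump table: the service number is simply the array index. */
static const service_fn demo_call_table[] = {
    demo_restart,   /* 0 */
    demo_double,    /* 1 */
    demo_negate,    /* 2 */
};

/* Counterpart of syscall_table_size, expressed as an entry count. */
#define DEMO_NR_CALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* Range-check the number, then make an indirect call through the table. */
long demo_dispatch(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_CALLS)
        return -ENOSYS;             /* unknown service number */
    assert(demo_call_table[nr] != NULL);
    return demo_call_table[nr](arg);
}
```

For instance, demo_dispatch(1, 21) goes through slot 1 and returns 42, while an out-of-range number yields -ENOSYS without touching the table.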

The following is program code:

// KSERVICE_TABLE_DESCRIPTOR @ ntos/ke/ke.h
#define NUMBER_SERVICE_TABLES 4

typedef struct _KSERVICE_TABLE_DESCRIPTOR {
    PULONG Base;
    PULONG Count;
    ULONG Limit;
    PUCHAR Number;
} KSERVICE_TABLE_DESCRIPTOR, *PKSERVICE_TABLE_DESCRIPTOR;

// KeServiceDescriptorTable @ ntos/ke/kernldat.c
KSERVICE_TABLE_DESCRIPTOR KeServiceDescriptorTable[NUMBER_SERVICE_TABLES];

On the client API side, Linux and Windows NT/2K both use the traditional software-interrupt approach: Linux uses int 0x80, Windows uses int 0x2e. On the glibc side this is wrapped in a series of macro definitions, _syscall0 through _syscall6, one per argument count, such as

The following is program code:

#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
    : "=a" (__res) \
    : "0" (__NR_##name),"b" ((long)(arg1))); \
__syscall_return(type,__res); \
}
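On a current Linux system the same "call by number" idea is available without writing inline assembly, through glibc's generic syscall() wrapper. The sketch below is not the _syscall1 macro itself; raw_getpid is a name made up for the example, and which entry instruction is used underneath is left to the C library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Issue getpid by its system call number instead of via the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should match what the ordinary getpid() library function reports for the same process.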

When the system boots, the service routine for interrupt 0x80 is installed; it dispatches calls through the sys_call_table jump table. For example:

The following is program code:

static void __init set_system_gate(unsigned int n, void *addr)
{
    _set_gate(idt_table+n,15,3,addr,__KERNEL_CS); // system gate (type 15), callable from ring 3
}

#define SYSCALL_VECTOR 0x80

asmlinkage int system_call(void);

void __init trap_init(void)
{
    ...
    set_system_gate(SYSCALL_VECTOR,&system_call);
    ...
}

The trap_init function (arch/i386/kernel/traps.c), which initializes the interrupt handling routines, is called by the system initialization function start_kernel (init/main.c). Calls to SYSCALL_VECTOR (0x80) are actually handled by the system_call function (arch/i386/kernel/entry.S).
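To make the arguments 15 and 3 concrete, here is a small sketch of how a 32-bit x86 gate descriptor is packed from a handler address, a code segment selector, a gate type, and a privilege level. It follows the descriptor layout in the Intel manuals; the struct and function names are invented for the example, and the kernel's real _set_gate does this with inline assembly rather than C.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves of an 8-byte IDT gate descriptor. */
struct gate_desc {
    uint32_t low;
    uint32_t high;
};

/*
 * Pack a gate descriptor.
 *   low  = selector in bits 31..16, handler offset bits 15..0
 *   high = handler offset bits 31..16, present bit (0x8000),
 *          DPL in bits 14..13, gate type in bits 11..8
 */
struct gate_desc pack_gate(uint32_t handler, uint16_t selector,
                           unsigned type, unsigned dpl)
{
    struct gate_desc g;

    assert(type <= 15 && dpl <= 3);
    g.low  = ((uint32_t)selector << 16) | (handler & 0xffffu);
    g.high = (handler & 0xffff0000u) | 0x8000u |
             ((uint32_t)dpl << 13) | ((uint32_t)type << 8);
    return g;
}
```

With type 15 and DPL 3 the low 16 bits of the high word come out as 0xef00: present, callable from ring 3, 32-bit trap gate. A DPL of 0 would make an int 0x80 issued from user mode fault instead of entering the handler.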

Windows user-mode DLLs such as NTDLL are implemented in a similar way. Because many articles already cover this, it is not repeated here. Interested readers can refer to the book Inside Win2K and to articles such as "Windows System Service Calling Mechanism".

However, this interrupt-based system service call mechanism carries a significant performance penalty on Intel P4 CPUs. According to measured results, a P3 850 is nearly twice as fast as a P4 2G on interrupt-mode system calls, and high-end CPUs such as the Xeon perform even worse.

Intel P6 VS P7 System Call Performance

This is why, starting with Windows XP/2003, Microsoft quietly switched its system calls from int 0x2e to the dedicated CPU instructions SYSENTER (Intel) and SYSCALL (AMD). On Win2003, for example, the system call stubs in NTDLL no longer use int 0x2e directly; instead they call system call code at a fixed address:

The following is program code:

0:001> u ntdll!ZwSuspendProcess
ntdll!NtSuspendProcess:
77f335bb b806010000 mov eax,0x106
77f335c0 ba0003fe7f mov edx,0x7ffe0300
77f335c5 ffd2       call edx
77f335c7 c20400     ret 0x4

0:001> u 0x7ffe0300
SharedUserData!SystemCallStub:
7ffe0300 8bd4 mov edx,esp
7ffe0302 0f34 sysenter
7ffe0304 c3   ret

0:001> u 7ffe0314
SharedUserData!SystemCallStub+0x14:
7ffe0314 8bd4 mov edx,esp
7ffe0316 0f05 syscall
7ffe0318 c3   ret

On the Intel x86 architecture, the SYSENTER/SYSEXIT instructions were added to the instruction set with the PII, specifically for switching from user mode (Ring 1-3) to kernel mode (Ring 0). Unlike an ordinary interrupt, which looks up its target through the IDT, or a CALL/JMP, which names its target address explicitly, these instructions read the segment selectors and address offsets of the target code and stack directly from CPU model-specific registers (MSRs). It is therefore enough to set up those MSRs once at boot; after that, the SYSENTER instruction can be used directly, as in the code above. Because SYSENTER/SYSEXIT switch between Ring 0 and Ring 3, that is, between two predefined stable states, the series of state transitions and checks performed during interrupt handling is unnecessary, which makes the switch more efficient. The SYSCALL instruction on AMD chips works on basically the same principle.

To adapt to this change and make system calls more efficient, the Linux kernel added support for SYSENTER/SYSEXIT-based system service calls starting with 2.5.53. The new code in sysenter.c (arch/i386/kernel/) decides dynamically whether to enable this support, based on whether the boot CPU supports the SYSENTER/SYSEXIT instructions.
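The same capability check can be tried from user space. The sketch below reads CPUID leaf 1 and tests EDX bit 11, the SEP flag. It assumes GCC or Clang, whose <cpuid.h> header provides __get_cpuid; the function name cpu_reports_sep is made up for the example, and on non-x86 targets it simply reports 0.

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

#define SEP_BIT 11  /* CPUID leaf 1, EDX bit 11: SYSENTER/SYSEXIT present */

/* Return 1 if the CPU advertises SYSENTER/SYSEXIT, 0 otherwise. */
int cpu_reports_sep(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not available. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> SEP_BIT) & 1u);
#else
    return 0;   /* not an x86 target, so no CPUID to ask */
#endif
}
```

This only reports what CPUID advertises. Whether the operating system has actually programmed the SYSENTER MSRs is a separate matter that user space cannot check this way.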

The following is program code:

#define X86_FEATURE_SEP (0*32+11) /* SYSENTER/SYSEXIT */

static int __init sysenter_setup(void)
{
    void *page = (void *)get_zeroed_page(GFP_ATOMIC);

    __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);

    /* No SYSENTER support: fall back to the int 0x80 stub. */
    if (!boot_cpu_has(X86_FEATURE_SEP)) {
        memcpy(page,
               &vsyscall_int80_start,
               &vsyscall_int80_end - &vsyscall_int80_start);
        return 0;
    }

    memcpy(page,
           &vsyscall_sysenter_start,
           &vsyscall_sysenter_end - &vsyscall_sysenter_start);

    on_each_cpu(enable_sep_cpu, NULL, 1, 1);
    return 0;
}

__initcall(sysenter_setup);

When the kernel boots, the sysenter_setup function (arch/i386/kernel/sysenter.c) allocates a read-only, executable memory page and, depending on what the CPU supports, copies either the int 0x80-based or the SYSENTER-based system service call code into that page. In the SYSENTER case it then calls on_each_cpu to enable SYSENTER/SYSEXIT support on every CPU.

The following is program code:

#define MSR_IA32_SYSENTER_CS  0x174
#define MSR_IA32_SYSENTER_ESP 0x175
#define MSR_IA32_SYSENTER_EIP 0x176

extern asmlinkage void sysenter_entry(void);

void enable_sep_cpu(void *info)
{
    int cpu = get_cpu();
    struct tss_struct *tss = &per_cpu(init_tss, cpu);

    tss->ss1 = __KERNEL_CS;
    tss->esp1 = sizeof(struct tss_struct) + (unsigned long) tss;
    wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
    wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
    put_cpu();
}

As can be seen, the enable_sep_cpu function (arch/i386/kernel/sysenter.c) sets the relevant MSR values on each CPU, so that when the SYSENTER instruction is executed, the CPU switches directly to the predefined kernel-mode code and stack (segment selector plus offset address). For details on SYSENTER/SYSEXIT and the MSRs, see Volume 3 of the Intel IA-32 developer's manual, the System Programming Guide, section 4.8.7 and Appendix B.

The following is program code:

.text
.globl __kernel_vsyscall
.type __kernel_vsyscall,@function
__kernel_vsyscall:
.LSTART_vsyscall:
    push %ecx
.Lpush_ecx:
    push %edx
.Lpush_edx:
    push %ebp
.Lenter_kernel:
    movl %esp,%ebp
    sysenter
    ...

The code above, from vsyscall-sysenter.S (arch/i386/kernel/), is what actually performs the call. It is very similar to the Win2003 implementation shown earlier.

Unlike Win2003, Linux has to deal with system calls being interrupted. While a system call is executing, it may be waiting on some resource or otherwise blocked, and it can be interrupted before it completes and forced to return -EINTR. The system call itself has not failed in this case, so some automatic retry mechanism should be provided. This is why vsyscall-sysenter.S contains the following handling code after the sysenter instruction.

The following is program code:

.Lenter_kernel:
    movl %esp,%ebp
    sysenter

    /* 7: align return point with nop's to make disassembly easier */
    .space 7,0x90

    /* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
    jmp .Lenter_kernel
    /* 16: System call normal return point is here! */

    .globl SYSENTER_RETURN    /* Symbol used by entry.S. */
SYSENTER_RETURN:
    pop %ebp
.Lpop_ebp:
    pop %edx
.Lpop_edx:
    pop %ecx
.Lpop_ecx:
    ret
.LEND_vsyscall:
    .size __kernel_vsyscall,.-.LSTART_vsyscall

After the sysenter instruction there are therefore two return points: the jmp .Lenter_kernel instruction is the return point when the call was interrupted, and SYSENTER_RETURN is the return point when the call completed normally. The Linux kernel solves the automatic retry problem by checking the outcome in the actual call path and adjusting the return address EIP accordingly.
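The return-address adjustment can be modeled in a few lines of C. This is a toy model, not the kernel's signal code: the register struct, the function name, and the DEMO_ prefix are invented, and 512 is used as the kernel-internal "restart this call" code on the assumption that it matches ERESTARTSYS. The key number is 2: the short jmp .Lenter_kernel is two bytes long, the same length as int $0x80, so backing EIP up by two lands on an instruction that re-enters the kernel on either path.

```c
#include <assert.h>

/* Assumed kernel-internal "please restart" result code (ERESTARTSYS). */
#define DEMO_ERESTARTSYS 512

/* Toy stand-in for the saved user-mode register state. */
struct demo_regs {
    long eax;           /* result of the system call on exit */
    long orig_eax;      /* system call number saved at entry */
    unsigned long eip;  /* user-mode address execution will resume at */
};

/*
 * If the call asked to be restarted, reload the call number and back the
 * return address up by two bytes so the entry instruction runs again.
 * Returns 1 when a restart was arranged, 0 for a normal return.
 */
int demo_fixup_return(struct demo_regs *regs)
{
    assert(regs != 0);
    if (regs->eax != -DEMO_ERESTARTSYS)
        return 0;
    regs->eax = regs->orig_eax;
    regs->eip -= 2;     /* from SYSENTER_RETURN back to the 2-byte jmp */
    return 1;
}
```

Taking the byte offsets from the listing's comments (restart point at 14, normal return at 16) and a stub that starts at 0xffffe400, the normal return address is 0xffffe410 and a restarted call resumes at 0xffffe40e.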

Linus explained this problem in a mailing list discussion. For details on mechanisms such as system call interruption and restart, see Section 4.5 of the book "The Linux Kernel".

However, for some reason the Linux 2.6 kernel still does not switch the _syscall0 family of functions over to the new call method. Instead, alongside the existing mechanism, it adds an extension called vsyscall for system service calls that need to be especially efficient. In practice, this mechanism takes a virtual memory page at a fixed address, exposes it directly, and lets user mode access it. That is the purpose of the __set_fixmap call in the sysenter_setup function shown earlier.

The following is program code:

#define __FIXADDR_TOP 0xfffff000
#define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP)

#define PAGE_SHIFT 12

#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))

void __set_fixmap (enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
    unsigned long address = __fix_to_virt(idx);

    if (idx >= __end_of_fixed_addresses) {
        BUG();
        return;
    }
    set_pte_pfn(address, phys >> PAGE_SHIFT, flags);
}

enum fixed_addresses {
    FIX_HOLE,
    FIX_VSYSCALL,
    ...
};

static int __init sysenter_setup(void)
{
    ...
    __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);
    ...
}

As can be seen, __set_fixmap takes one of the virtual addresses counted down from 0xfffff000, each reserved for a fixed purpose such as vsyscall, and maps physical memory to it; the function code is then placed there. The memory management module gives FIX_VSYSCALL special treatment.
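The address arithmetic is easy to check by hand or with a few lines of C. The sketch below redoes the __fix_to_virt computation with explicit 32-bit types so that it gives the i386 result on any host; the DEMO_ names are stand-ins for the kernel's own macros and enum.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FIXADDR_TOP 0xfffff000u   /* mirrors __FIXADDR_TOP */
#define DEMO_PAGE_SHIFT  12            /* 4 KiB pages */

/* First two fixmap slots, in the order given in the listing. */
enum demo_fixed_addresses {
    DEMO_FIX_HOLE,      /* index 0 */
    DEMO_FIX_VSYSCALL   /* index 1 */
};

/* Same formula as __fix_to_virt: one page further down per index. */
uint32_t demo_fix_to_virt(unsigned idx)
{
    assert(idx < (1u << 20));   /* keep the shift within 32 bits */
    return DEMO_FIXADDR_TOP - ((uint32_t)idx << DEMO_PAGE_SHIFT);
}
```

With FIX_VSYSCALL at index 1 this gives 0xfffff000 - 0x1000 = 0xffffe000 as the base of the user-visible page.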

The following is program code:

/*
 * This is the range that is readable by user mode, and things
 * acting like user mode such as get_user_pages.
 */
#define FIXADDR_USER_START (__fix_to_virt(FIX_VSYSCALL))
#define FIXADDR_USER_END (FIXADDR_USER_START + PAGE_SIZE)

int in_gate_area(struct task_struct *task, unsigned long addr)
{
#ifdef AT_SYSINFO_EHDR
    if ((addr >= FIXADDR_USER_START) && (addr

The book "The Linux Kernel" gives an example:

The following is program code:

#include <stdio.h>

int pid;

int main() {
    __asm__(
        "movl $20, %eax    \n"    /* 20 is the getpid system call number */
        "call 0xffffe400   \n"    /* entry point inside the vsyscall page */
        "movl %eax, pid    \n"
    );
    printf("pid is %d\n", pid);
    return 0;
}
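Hard-coding the address ties the example to one kernel layout. A more portable way to locate the kernel-provided page is to ask for AT_SYSINFO_EHDR, the auxiliary vector entry named in the in_gate_area listing. The sketch below assumes glibc's getauxval from <sys/auxv.h> and assumes the mapped object starts with an ELF header, which holds on the systems where this entry is provided; the function names are invented for the example.

```c
#include <assert.h>
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Address of the kernel-provided page, or 0 if none is advertised. */
unsigned long vsyscall_page_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Return 1 if a page is advertised and begins with the ELF magic bytes. */
int vsyscall_page_is_elf(void)
{
    unsigned long base = vsyscall_page_base();

    if (base == 0)
        return 0;   /* nothing mapped for this process */
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

A program that really wants to call into the page would go on to parse that ELF image for the exported symbol rather than assume a fixed offset such as 0x400.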

According to its tests, system service calls made through this mechanism show nearly a twofold performance improvement.

The following is quoted:

An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.
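A rough version of such a measurement can be sketched as follows. It is not the experiment quoted above: it compares the C library's gettimeofday(), which on current systems may be served from the kernel-provided page without entering the kernel, against the same call forced through syscall() by number. The function names are made up, the loop is a naive microbenchmark, and no particular figures are implied.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost of gettimeofday() through the C library, in ns per call. */
double avg_ns_libc(long iters)
{
    struct timeval tv;
    double start;
    long i;

    if (iters <= 0)
        return 0.0;
    start = now_ns();
    for (i = 0; i < iters; i++)
        gettimeofday(&tv, NULL);
    return (now_ns() - start) / (double)iters;
}

/* Average cost of the same call forced through a real kernel entry. */
double avg_ns_raw(long iters)
{
#ifdef SYS_gettimeofday
    struct timeval tv;
    double start;
    long i;

    if (iters <= 0)
        return 0.0;
    start = now_ns();
    for (i = 0; i < iters; i++)
        syscall(SYS_gettimeofday, &tv, NULL);
    return (now_ns() - start) / (double)iters;
#else
    (void)iters;
    return 0.0;     /* this architecture has no gettimeofday call number */
#endif
}
```

Calling both with the same iteration count, for example one million, and printing the two averages gives a feel for how much of the cost is the kernel entry itself on a given machine.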

When reposting, please credit the original address: https://www.9cbs.com/read-65650.html
