Linux network code guide V0.2
Author: yawl
Many people who study the Linux source are most interested in the networking part (mainly the src/linux/net, src/linux/include/net and src/linux/include/linux directories). Even after reading a great deal about TCP/IP principles, it is hard to form a concrete picture without reading the source. One problem with analyzing this part of the code is that there is a lot of code and very little documentation. The purpose of this article is to outline a framework that lets readers understand how TCP/IP works in the kernel. Much of the code analysis published so far is based on the 2.0 kernel, and many functions have been renamed in newer kernels, which is a particular obstacle for beginners. This article uses the 2.4.0-test9 code as its example, so it should be easier to follow with the source at hand.
In truth, the only part of the network code I have analyzed in depth is the firewall. My understanding of much of the rest is only partial, so corrections are welcome.
I recommend building a project in Source Insight (www.soucedyn.com) while reading the code; it makes the work much more efficient. I have tried some other tools, but for analyzing a large body of code none is more convenient.
2 Main text
The ISO seven-layer model is familiar to everyone, although for the Internet the four-layer model is a better fit. In both models, network protocols are arranged as a hierarchy. In the Linux kernel code it is hard to draw strictly clean layer boundaries because, apart from a few kernel threads, the whole kernel is effectively a single process. A so-called "network layer" is therefore just a set of related functions, and most layers hand work to one another through ordinary function calls.
Logically, the network code can reasonably be divided as follows:
. BSD socket layer: handles the BSD socket operations. Each socket is represented in the kernel by a struct socket. The files in this part include: /net/socket.c /net/protocols.c etc.
. INET socket layer: a BSD socket is an interface that can serve many network protocols. When it is used for TCP/IP, that is, when a socket of the AF_INET family is created, some additional parameters have to be kept, hence the struct sock structure. The files are mainly: /net/ipv4/protocol.c /net/ipv4/af_inet.c /net/core/sock.c etc.
. TCP/UDP layer: handles the transport-layer operations. The transport layer is represented by two structures, struct inet_protocol and struct proto. The files are mainly: /net/ipv4/udp.c /net/core/datagram.c /net/ipv4/tcp.c /net/ipv4/tcp_input.c /net/ipv4/tcp_output.c /net/ipv4/tcp_minisocks.c /net/ipv4/tcp_timer.c etc.
. IP layer: handles the network-layer operations. The network layer is represented by the struct packet_type structure. The files are mainly: /net/ipv4/ip_forward.c ip_fragment.c ip_input.c ip_output.c etc.
. Data link layer and drivers: each network device is represented by a struct net_device, and the generic processing is in dev.c. The drivers are in the /drivers/net directory.
There are many other files as well, such as those for the firewall and routing.
Now for a table. The whole article is an explanation of this table (if my prose bores you, skip it and work through the code with the table yourself). When I first read the network code, I liked a passage in "Linux Kernel Internals" that follows a packet sent by process A to process B across the network and describes in detail its journey through the network stack. I think this helps the reader see the whole forest rather than single trees, so this article follows the same structure.
  ^
  |  sys_read                 fs/read_write.c
  |  sock_read                net/socket.c
  |  sock_recvmsg             net/socket.c
  |  inet_recvmsg             net/ipv4/af_inet.c
  |  udp_recvmsg              net/ipv4/udp.c
  |  skb_recv_datagram        net/core/datagram.c
  |  -------------------------------------------
  |  sock_queue_rcv_skb       include/net/sock.h
  |  udp_queue_rcv_skb        net/ipv4/udp.c
  |  udp_rcv                  net/ipv4/udp.c
  |  ip_local_deliver_finish  net/ipv4/ip_input.c
  |  ip_local_deliver         net/ipv4/ip_input.c
  |  ip_rcv                   net/ipv4/ip_input.c
  |  net_rx_action            net/core/dev.c
  |  -------------------------------------------
  |  netif_rx                 net/core/dev.c
  |  el3_rx                   drivers/net/3c509.c
  |  el3_interrupt            drivers/net/3c509.c

  ==========================

  |  sys_write                fs/read_write.c
  |  sock_write               net/socket.c
  |  sock_sendmsg             net/socket.c
  |  inet_sendmsg             net/ipv4/af_inet.c
  |  udp_sendmsg              net/ipv4/udp.c
  |  ip_build_xmit            net/ipv4/ip_output.c
  |  output_maybe_reroute     net/ipv4/ip_output.c
  |  ip_output                net/ipv4/ip_output.c
  |  ip_finish_output         net/ipv4/ip_output.c
  |  dev_queue_xmit           net/core/dev.c
  |  -------------------------------------------
  |  el3_start_xmit           drivers/net/3c509.c
  v

We assume the following environment: two hosts are connected through the Internet. One runs process A and the other runs process B. Process A sends a message, say "hello", to process B, and B receives it. TCP processing is very complicated in its own right, so to keep the narrative simple we use UDP as the example.
2.1 Create a socket
Before any data is sent, a socket has to be created. The programs on both sides make this call:
    ...
    int sockfd;
    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    ...
This is a system call, so it enters the kernel through interrupt 0x80 and invokes the corresponding kernel function. To find the kernel function for a system call, you can usually just prefix the name with "sys_"; for fork, for instance, the function is sys_fork. The socket-related calls are somewhat special, however: they all enter the kernel through a single entry point, sys_socketcall, which then calls the specific function (sys_socket, sys_bind and so on) selected by its parameter.
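The single-entry-point idea can be sketched in userspace as follows. This is only an illustration of the multiplexing pattern: the call numbers, the `mock_` functions and the descriptor counter are all hypothetical stand-ins, not the kernel's symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock of a socketcall-style multiplexer. Every name here is a
 * hypothetical stand-in chosen for the sketch. */
enum mock_call { MOCK_SOCKET = 1, MOCK_BIND = 2 };

static long next_fd = 3;            /* pretend 0-2 are stdin/stdout/stderr */

static long mock_socket(int family, int type, int protocol)
{
    (void)family; (void)type; (void)protocol;
    return next_fd++;               /* hand back a fresh descriptor */
}

static long mock_bind(int fd, const void *addr, int addrlen)
{
    (void)addr; (void)addrlen;
    return fd >= 3 ? 0 : -1;        /* reject descriptors below 3 */
}

/* Single entry point: unpack the argument array and fan out by call number. */
long mock_socketcall(int call, unsigned long *args)
{
    assert(args != NULL);
    switch (call) {
    case MOCK_SOCKET:
        return mock_socket((int)args[0], (int)args[1], (int)args[2]);
    case MOCK_BIND:
        return mock_bind((int)args[0], (const void *)args[1], (int)args[2]);
    default:
        return -1;                  /* unknown call number */
    }
}
```

Calling `mock_socketcall(MOCK_SOCKET, args)` with a three-element argument array plays the part of the socket() call above; adding a new operation only means adding one more case.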
sys_socket calls sock_create to produce a struct socket (see include/linux/net.h); every socket has one of these in the kernel. After initializing some general members of the structure (for example, allocating an inode and setting the type field from the second argument), it dispatches on the family argument with this statement:
    ...
    net_families[family]->create(sock, protocol);
    ...
The first argument in our program is AF_INET, so this function pointer points to inet_create(). (net_families is an array that holds information about each network protocol family; the families are registered with sock_register.)
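The register-then-dispatch pattern described above might look like this in a small userspace mock. The table size, the struct layouts and all `mock_` names are assumptions made for the sketch; only the overall shape follows the text.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NPROTO  32             /* table size chosen for the sketch */
#define MOCK_AF_INET 2              /* AF_INET is 2 on Linux */

struct mock_socket { int type; int family; };

struct mock_family {
    int family;
    int (*create)(struct mock_socket *sock, int protocol);
};

static const struct mock_family *families[MOCK_NPROTO];

/* Plays the role of sock_register(): remember the family's create hook. */
int mock_sock_register(const struct mock_family *f)
{
    assert(f != NULL);
    if (f->family < 0 || f->family >= MOCK_NPROTO)
        return -1;
    families[f->family] = f;
    return 0;
}

/* Plays the role of the dispatch inside sock_create(). */
int mock_sock_create(int family, int type, int protocol,
                     struct mock_socket *sock)
{
    if (family < 0 || family >= MOCK_NPROTO || families[family] == NULL)
        return -1;                  /* no such protocol family registered */
    sock->type = type;
    return families[family]->create(sock, protocol);
}

static int mock_inet_create(struct mock_socket *sock, int protocol)
{
    (void)protocol;
    sock->family = MOCK_AF_INET;
    return 0;
}

const struct mock_family mock_inet_family = { MOCK_AF_INET, mock_inet_create };
```

Until `mock_sock_register(&mock_inet_family)` has run, `mock_sock_create(MOCK_AF_INET, ...)` fails, which mirrors why a protocol family must be registered before sockets of that family can be created.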
The most important information behind a struct socket is kept in a struct sock, which is used all over the network code. I recommend printing out its definition along with those of other common structures (such as struct sk_buff). inet_create allocates memory for this structure and initializes it differently depending on the socket type (the second argument of the socket call):
    ...
    if (sk->prot->init)
        sk->prot->init(sk);
    ...
If the type is SOCK_STREAM, tcp_v4_init_sock is called. A SOCK_DGRAM socket has no additional initialization, and with that the socket call is essentially finished.
One more thing is worth noting: after inet_create() returns, sock_map_fd is called to allocate a file descriptor and a struct file for the socket. This is what lets the application layer treat a socket like a file.
At first some of this flow can be hard to follow, mainly because what these function pointers actually point to changes with the socket type.
2.2 Sending data
When process A wants to send data, the program calls the following statement (sendto follows a similar path and is omitted here):
    ...
    write(sockfd, "hello", strlen("hello"));
    ...
The kernel function behind write is sys_write. It first looks up the struct file for the file descriptor. If the file exists (the file pointer is non-NULL) and is writable (file->f_mode & FMODE_WRITE), it calls the file's write operation:
    ...
    if (file->f_op && (write = file->f_op->write) != NULL)
        ret = write(file, buf, count, &file->f_pos);
    ...
Here f_op is a pointer to a struct file_operations. sock_map_fd points it at socket_file_ops, which is defined as follows (/net/socket.c):
    static struct file_operations socket_file_ops = {
        llseek:   sock_lseek,
        read:     sock_read,
        write:    sock_write,
        poll:     sock_poll,
        ioctl:    sock_ioctl,
        mmap:     sock_mmap,
        open:     sock_no_open,   /* special open code to disallow open via /proc */
        release:  sock_close,
        fasync:   sock_fasync,
        readv:    sock_readv,
        writev:   sock_writev
    };
At this point the write function pointer clearly points to sock_write. Following it, we find that this function wraps the character buffer in a struct msghdr and finally calls sock_sendmsg.
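The path from a generic write through the operations table to a message-style send can be mimicked in userspace. In this sketch `struct mock_msghdr` is a trimmed-down stand-in for struct msghdr, the `sent` buffer stands in for the network, and all `mock_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

struct mock_msghdr {                 /* trimmed-down stand-in for struct msghdr */
    struct iovec *msg_iov;
    size_t msg_iovlen;
};

struct mock_file;
struct mock_file_ops {
    ssize_t (*write)(struct mock_file *f, const char *buf, size_t count);
};
struct mock_file {
    const struct mock_file_ops *f_op;
    char sent[64];                   /* where the mock "transmits" to */
    size_t sent_len;
};

/* Stand-in for sock_sendmsg(): gather the iovec into the file's buffer. */
static ssize_t mock_sock_sendmsg(struct mock_file *f, struct mock_msghdr *msg)
{
    size_t i, total = 0;

    for (i = 0; i < msg->msg_iovlen; i++) {
        size_t n = msg->msg_iov[i].iov_len;
        if (total + n > sizeof(f->sent))
            return -1;               /* would overflow the mock buffer */
        memcpy(f->sent + total, msg->msg_iov[i].iov_base, n);
        total += n;
    }
    f->sent_len = total;
    return (ssize_t)total;
}

/* Stand-in for sock_write(): wrap the flat buffer in a one-element iovec. */
static ssize_t mock_sock_write(struct mock_file *f, const char *buf, size_t count)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = count };
    struct mock_msghdr msg = { &iov, 1 };

    return mock_sock_sendmsg(f, &msg);
}

const struct mock_file_ops mock_socket_file_ops = { mock_sock_write };

/* Stand-in for sys_write(): dispatch through whatever f_op points at. */
ssize_t mock_sys_write(struct mock_file *f, const char *buf, size_t count)
{
    if (f->f_op == NULL || f->f_op->write == NULL)
        return -1;
    return f->f_op->write(f, buf, count);
}
```

The generic entry point never needs to know it is dealing with a socket; swapping the table behind `f_op` is what turns a file write into a network send.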
I am not sure what scm_send in sock_sendmsg does (SCM stands for socket-level control messages), but it is not critical here. The statement to notice is:
    ...
    sock->ops->sendmsg(sock, msg, size, &scm);
    ...
This is yet another function pointer. sock->ops is initialized in inet_create(); since ours is a UDP socket, sock->ops points to inet_dgram_ops (that is, sock->ops = &inet_dgram_ops;), which is defined in net/ipv4/af_inet.c:
    struct proto_ops inet_dgram_ops = {
        family:      PF_INET,
        release:     inet_release,
        bind:        inet_bind,
        connect:     inet_dgram_connect,
        socketpair:  sock_no_socketpair,
        accept:      sock_no_accept,
        getname:     inet_getname,
        poll:        datagram_poll,
        ioctl:       inet_ioctl,
        listen:      sock_no_listen,
        shutdown:    inet_shutdown,
        setsockopt:  inet_setsockopt,
        getsockopt:  inet_getsockopt,
        sendmsg:     inet_sendmsg,
        recvmsg:     inet_recvmsg,
        mmap:        sock_no_mmap,
    };
So we need to look at inet_sendmsg(), which immediately calls another function through a function pointer:
    ...
    sk->prot->sendmsg(sk, msg, size);
    ...
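The two levels of indirection here (a family-level table on the socket, then a protocol-level table on the sock) can be sketched as below. The structures are heavily simplified and every `mock_` name is an assumption; the point is only the shape of the chain.

```c
#include <assert.h>
#include <stddef.h>

struct mock_sock;
struct mock_proto {                      /* transport level, like struct proto */
    const char *name;
    int (*sendmsg)(struct mock_sock *sk, const char *data, int len);
};
struct mock_sock { const struct mock_proto *prot; int bytes_queued; };

struct mock_socket;
struct mock_proto_ops {                  /* family level, like struct proto_ops */
    int (*sendmsg)(struct mock_socket *sock, const char *data, int len);
};
struct mock_socket { const struct mock_proto_ops *ops; struct mock_sock *sk; };

static int mock_udp_sendmsg(struct mock_sock *sk, const char *data, int len)
{
    (void)data;
    sk->bytes_queued += len;             /* pretend the datagram was queued */
    return len;
}

/* Family-level hook: it only forwards to the transport protocol's hook. */
static int mock_inet_sendmsg(struct mock_socket *sock, const char *data, int len)
{
    struct mock_sock *sk = sock->sk;

    assert(sk != NULL && sk->prot != NULL);
    return sk->prot->sendmsg(sk, data, len);
}

const struct mock_proto mock_udp_prot = { "UDP", mock_udp_sendmsg };
const struct mock_proto_ops mock_inet_dgram_ops = { mock_inet_sendmsg };

/* Roughly the choice made at socket creation time for a datagram socket. */
void mock_setup_dgram(struct mock_socket *sock, struct mock_sock *sk)
{
    sk->prot = &mock_udp_prot;
    sk->bytes_queued = 0;
    sock->ops = &mock_inet_dgram_ops;
    sock->sk = sk;
}
```

Once `mock_setup_dgram` has filled in both tables, a call through `sock->ops->sendmsg` lands in `mock_udp_sendmsg` without either caller naming UDP explicitly.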
Again we have to find out where the pointer leads. How do you track down the concrete definition in a case like this? My usual approach is as follows. In the example above, sk is a struct sock, and its definition (include/net/sock.h) shows that prot is a struct proto. Next, look for every instance of that structure in the source tree (operations such as jump-to-definition and find-references are very convenient in Source Insight ^_^). You will quickly find udp_prot, tcp_prot, raw_prot and so on. Guessing that udp_prot is the one, search the source for references to it, and this turns up in inet_create:
    ...
    prot = &udp_prot;
    ...
In fact, a careful read of inet_create would have revealed this earlier, but I was not that careful :).
We continue into udp_sendmsg. The main job of this function is to fill in the UDP header (source port, destination port and so on) and then call ip_route_output to look up the route. Then:
    ...
    ip_build_xmit(sk,
                  (sk->no_check == UDP_CSUM_NOXMIT ?
                   udp_getfrag_nosum : udp_getfrag),
                  &ufh, ulen, &ipc, rt, msg->msg_flags);
    ...
ip_build_xmit does the important work of creating the sk_buff and adding the IP header to the packet. Near its end is this statement:
    ...
    NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, output_maybe_reroute);
    ...
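To make "filling in the UDP header" concrete, here is a minimal userspace sketch that lays out the four 16-bit header fields in front of a payload. It is not the kernel's code: `struct mock_udphdr` and `mock_udp_build` are names invented for the example, and the checksum is simply left at zero, which UDP over IPv4 permits.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

struct mock_udphdr {                 /* the four 16-bit fields of a UDP header */
    uint16_t source;
    uint16_t dest;
    uint16_t len;                    /* header plus payload, in bytes */
    uint16_t check;
};

/* Write header + payload into buf; returns the datagram length, 0 on error. */
size_t mock_udp_build(uint8_t *buf, size_t bufsize,
                      uint16_t sport, uint16_t dport,
                      const void *payload, size_t plen)
{
    struct mock_udphdr uh;
    size_t total = sizeof(uh) + plen;

    if (total > bufsize || total > 0xFFFF)
        return 0;
    uh.source = htons(sport);        /* all fields travel in network byte order */
    uh.dest   = htons(dport);
    uh.len    = htons((uint16_t)total);
    uh.check  = 0;                   /* 0 means "no checksum" for UDP over IPv4 */
    memcpy(buf, &uh, sizeof(uh));
    memcpy(buf + sizeof(uh), payload, plen);
    return total;
}
```

For the "hello" message the result is a 13-byte datagram: 8 bytes of header followed by the 5 payload bytes.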
Simply put, when no firewall code intervenes, you can read this as a direct call to output_maybe_reroute (for details see "Getting Started with the Kernel Firewall Netfilter" in issue 14 of Green Alliance). output_maybe_reroute contains a single statement:
    return skb->dst->output(skb);
Using the same search method as before (this one really is hard to find), we learn that this pointer is set in ip_route_output (hint: ip_route_output_slow contains rth->u.dst.output = ip_output;). The role of ip_route_output is to look up the route and record the result in skb->dst.
So we move on to ip_output, which goes straight on to ip_finish_output. Every network device, such as a network card, is represented in the kernel by a net_device structure. ip_finish_output picks up the output device (which was also set up in ip_route_output) and passes it as a parameter to the functions registered at Netfilter's NF_IP_POST_ROUTING hook. When they finish, ip_finish_output2 is called, and it in turn calls:
    ...
    hh->hh_output(skb);
    ...
To cut a long story short, what actually gets called is dev_queue_xmit. At this point the TCP/IP layers have finished their work and data-link-layer processing begins.
After some checks, the call that actually matters is:
    ...
    dev->hard_start_xmit(skb, dev);
    ...
This function is defined in the network card's driver, and each kind of card handles it differently. My card is the fairly common 3C509 (its driver is 3c509.c). When the card is probed (el3_probe), the driver sets:
    ...
    dev->hard_start_xmit = &el3_start_xmit;
    ...
What follows is a series of I/O operations that actually put the packet on the wire, and with that the send path is complete.
Along the way I have glossed over many details and skipped error handling, blocking, fragmentation and so on entirely, describing only the ideal path. The purpose of this essay is to give everyone a rough picture; in reality every step involves quite complex processing (the TCP part especially).
2.3 Receiving data
When data arrives at the network card, a hardware interrupt is raised and the handler in the card's driver is called. For my 3C509 card the handler is el3_interrupt. (The IRQ number is bound by request_irq when the card is initialized.) The interrupt handler must first perform some I/O operations to read the data in (using the inw port-read function); once a frame has been received successfully, it goes on to call el3_rx(dev).
el3_rx packs the received data into a struct sk_buff and hands it from the driver to the generic handler netif_rx (dev.c). To keep the CPU efficient, upper-layer processing is triggered by a soft interrupt. The important job of netif_rx is to put the incoming sk_buff on a waiting queue and set the soft-interrupt flag; it can then return with an easy mind and wait for the next packet:
    ...
    __skb_queue_tail(&queue->input_pkt_queue, skb);
    __cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
    ...
In the 2.2 kernel this mechanism was called the "bottom half" (bh). The internal implementation is broadly similar, and the purpose is the same: to return from the interrupt as quickly as possible.
After a while, the CPU reschedules for some reason (for example, a process's time slice runs out). The scheduling function, schedule(), checks whether a soft interrupt is pending and, if so, runs the corresponding handler:
    ...
    if (softirq_active(this_cpu) & softirq_mask(this_cpu))
        goto handle_softirq;
    handle_softirq_back:
    ...
    ...
    handle_softirq:
        do_softirq();
        goto handle_softirq_back;
    ...
During system initialization, specifically in net_dev_init, the handler for this soft interrupt is set to net_rx_action:
    ...
    open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
    ...
The next time the scheduler runs, the system checks whether a NET_RX_SOFTIRQ soft interrupt has been raised and, if so, calls net_rx_action.
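The hand-off described above (queue the buffer and raise a flag in interrupt context, do the real work later) can be imitated with a small userspace mock. The handler signature is simplified compared with the kernel's, the queue is a bare singly linked list, and all `mock_` names and the softirq numbering are assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { MOCK_NET_TX_SOFTIRQ, MOCK_NET_RX_SOFTIRQ, MOCK_NR_SOFTIRQS };

struct mock_skb { struct mock_skb *next; int len; };

static struct mock_skb *queue_head, *queue_tail;   /* input queue stand-in */
static unsigned int pending;                       /* one bit per softirq */
static void (*softirq_vec[MOCK_NR_SOFTIRQS])(void);
static int bytes_delivered;

/* Register a handler for one softirq number (done once at init time). */
void mock_open_softirq(int nr, void (*action)(void))
{
    assert(nr >= 0 && nr < MOCK_NR_SOFTIRQS);
    softirq_vec[nr] = action;
}

/* Interrupt-side half: append to the queue, raise the flag, return at once. */
void mock_netif_rx(struct mock_skb *skb)
{
    skb->next = NULL;
    if (queue_tail)
        queue_tail->next = skb;
    else
        queue_head = skb;
    queue_tail = skb;
    pending |= 1u << MOCK_NET_RX_SOFTIRQ;
}

/* Deferred half: drain everything that was queued in the meantime. */
void mock_net_rx_action(void)
{
    while (queue_head) {
        struct mock_skb *skb = queue_head;
        queue_head = skb->next;
        bytes_delivered += skb->len;
    }
    queue_tail = NULL;
}

/* Run the handler of every raised softirq; returns how many handlers ran. */
int mock_do_softirq(void)
{
    unsigned int active = pending;
    int nr, ran = 0;

    pending = 0;
    for (nr = 0; nr < MOCK_NR_SOFTIRQS; nr++) {
        if ((active & (1u << nr)) && softirq_vec[nr]) {
            softirq_vec[nr]();
            ran++;
        }
    }
    return ran;
}

int mock_bytes_delivered(void) { return bytes_delivered; }
```

Two calls to `mock_netif_rx` followed by one `mock_do_softirq` show the batching effect: the receive action runs once and drains both buffers.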
net_rx_action plays the role of the net_bh function in version 2.2. The kernel has two global variables in which network-layer protocols are registered: a linked list, ptype_all, and an array, ptype_base[16]. Between them they record every layer-3 protocol (in terms of the OSI seven-layer model) known to the kernel. Each network-layer receiver is represented by a struct packet_type, which is registered in either ptype_all or ptype_base. Only a packet_type whose type field is ETH_P_ALL is registered on the ptype_all list; any other, such as ip_packet_type, is placed in the matching slot of the ptype_base[16] array. The difference between the two is that a handler registered as ETH_P_ALL receives packets of every type, while the others receive only packets of their own type. skb->protocol is assigned in el3_rx; it is the upper-layer protocol identifier extracted from the Ethernet frame header. In our example the value is ETH_P_IP, so net_rx_action selects the IP layer's receive function, and ip_packet_type shows that this function is ip_rcv().
Before pt_prev->func (which here points to ip_rcv) is called, there is an atomic_inc(&skb->users) operation (in the 2.2 kernel this was an skb_clone; the principle is similar) whose purpose is to increase the reference count of the sk_buff. When the network-layer receive function has finished with the sk_buff, or discards it for some reason (a firewall, say), it calls kfree_skb. kfree_skb first checks whether anything else still needs the buffer. If nothing does, the memory is really released (__kfree_skb); otherwise the counter is simply decremented by one.
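A compact userspace model of the two registries may help. It assumes the bucket is chosen from the low four bits of the protocol number, keeps protocol values in host byte order for simplicity, and models the reference count as a plain integer; the `mock_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ETH_P_ALL 0x0003        /* same values as ETH_P_ALL/IP/ARP */
#define MOCK_ETH_P_IP  0x0800
#define MOCK_ETH_P_ARP 0x0806

struct mock_skb { uint16_t protocol; int users; };

struct mock_packet_type {
    uint16_t type;
    int (*func)(struct mock_skb *skb);
    struct mock_packet_type *next;
};

static struct mock_packet_type *ptype_all;       /* sees every packet */
static struct mock_packet_type *ptype_base[16];  /* hashed by protocol */

void mock_dev_add_pack(struct mock_packet_type *pt)
{
    struct mock_packet_type **slot =
        pt->type == MOCK_ETH_P_ALL ? &ptype_all : &ptype_base[pt->type & 15];

    pt->next = *slot;                /* push onto the chosen list */
    *slot = pt;
}

/* Hand the packet to every matching handler; returns how many saw it. */
int mock_deliver(struct mock_skb *skb)
{
    struct mock_packet_type *pt;
    int n = 0;

    for (pt = ptype_all; pt; pt = pt->next) {
        skb->users++;                /* take a reference for this handler */
        pt->func(skb);
        n++;
    }
    for (pt = ptype_base[skb->protocol & 15]; pt; pt = pt->next) {
        if (pt->type != skb->protocol)
            continue;                /* same bucket, different protocol */
        skb->users++;
        pt->func(skb);
        n++;
    }
    return n;
}

int mock_ip_seen, mock_tap_seen;

/* Each handler drops its reference when done, as kfree_skb would. */
static int mock_ip_rcv(struct mock_skb *skb)  { mock_ip_seen++;  skb->users--; return 0; }
static int mock_tap_rcv(struct mock_skb *skb) { mock_tap_seen++; skb->users--; return 0; }

struct mock_packet_type mock_ip_packet_type  = { MOCK_ETH_P_IP,  mock_ip_rcv,  NULL };
struct mock_packet_type mock_tap_packet_type = { MOCK_ETH_P_ALL, mock_tap_rcv, NULL };
```

With only `mock_ip_packet_type` registered, an ARP packet reaches no handler; once the catch-all `mock_tap_packet_type` is added, it sees IP and ARP packets alike, which is how a sniffer-style receiver differs from a protocol receiver.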
Now let's look at ip_rcv (net/ipv4/ip_input.c). What this function does is clear: it first checks the validity of the packet (version number, length, checksum and so on) and, if the packet is valid, proceeds to the next stage. In the 2.4 kernel, to accommodate the firewall code flexibly, the original ip_rcv was split in two, and its second half became a separate function, ip_rcv_finish. Part of ip_rcv_finish deals with packets that carry IP options such as source routing. Apart from that, its job is to look up the route with ip_route_input and record the result in skb->dst. A received packet is one of two kinds: destined for a local process (and so passed up to the upper-layer protocol) or to be forwarded (when the host acts as a gateway), and the two need different handlers. Local packets go to ip_local_deliver (/net/ipv4/ip_input.c); the rest go to ip_forward (/net/ipv4/ip_forward.c). The function pointer skb->dst->input steers each packet down the right path.
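The kind of validity check mentioned above (version, header length, checksum, total length) can be written out in a few lines of portable C. This is a sketch of the checks in general, not a copy of the kernel function; `mock_ip_checksum` and `mock_ip_header_ok` are names made up for the example, and the checksum is the standard ones'-complement Internet checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement Internet checksum over len bytes. len is assumed to be
 * even, which always holds for an IP header (a multiple of 4 bytes). */
uint16_t mock_ip_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    while (sum >> 16)                          /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the IPv4 header passes the basic sanity checks, else 0. */
int mock_ip_header_ok(const uint8_t *pkt, size_t pktlen)
{
    size_t ihl, tot_len;

    if (pktlen < 20)
        return 0;                              /* shorter than a minimal header */
    if ((pkt[0] >> 4) != 4)
        return 0;                              /* version must be 4 */
    ihl = (size_t)(pkt[0] & 0x0F) * 4;         /* header length in bytes */
    if (ihl < 20 || ihl > pktlen)
        return 0;
    if (mock_ip_checksum(pkt, ihl) != 0)
        return 0;                              /* a valid header sums to zero */
    tot_len = (size_t)pkt[2] << 8 | pkt[3];
    if (tot_len < ihl || tot_len > pktlen)
        return 0;                              /* truncated or inconsistent */
    return 1;
}
```

Running the checksum over a header whose checksum field is already correct yields zero, which is why the check is a single comparison; running it with the field zeroed yields the value to store.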
In our example it is ip_local_deliver that gets called. An incoming packet may well be a fragment, in which case the fragments must be reassembled before being handed to the upper-layer protocol; this is the first thing ip_local_deliver does. If reassembly succeeds (the returned sk_buff is not NULL), processing continues (for the reassembly algorithm in detail, see "Analysis of IP Fragment Reassembly and Common Fragment Attacks" in issue 13 of Green Alliance). Here, once again, the code is split in two by Netfilter. As before, we go straight to the second half, ip_local_deliver_finish (/net/ipv4/ip_input.c). The transport-layer handlers (TCP, UDP, RAW and so on) are registered in inet_protos (via inet_add_protocol). ip_local_deliver_finish calls the appropriate handler according to the upper-layer protocol recorded in the IP header (iph->protocol). Since we are using UDP for simplicity, ipprot->handler is in fact udp_rcv.
As mentioned earlier, every socket an application creates has a struct socket/struct sock in the kernel. udp_rcv first finds the packet's sock in the kernel via udp_v4_lookup and then calls udp_queue_rcv_skb (/net/ipv4/udp.c). That immediately calls sock_queue_rcv_skb, which puts the sk_buff on the socket's receive queue and then notifies the upper layer that data has arrived:
    ...
    skb_set_owner_r(skb, sk);
    skb_queue_tail(&sk->receive_queue, skb);
    if (!sk->dead)
        sk->data_ready(sk, skb->len);
    return 0;
    ...
sk->data_ready is set when the sock structure is initialized (sock_init_data):
    ...
    sk->data_ready = sock_def_readable;
    ...
Now let us look from the top down. To receive the datagram, process B calls:
    ...
    read(sockfd, buff, sizeof(buff));
    ...
The kernel function for this system call is sys_read (fs/read_write.c). Its handling is similar to that of write and is not repeated here. udp_recvmsg calls skb_recv_datagram; if the data has not yet arrived and the socket is in blocking mode, the process is suspended (checking signal_pending(current) along the way) until data_ready notifies it that it can continue (wake_up_interruptible(sk->sleep);).
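For readers who want to exercise both paths from user space, the following sketch plays process A and process B inside one program over the loopback interface. It assumes a POSIX system with loopback available; the function name `udp_loopback_roundtrip` is made up for the example. Note that the sender calls connect() first, because write() on a UDP socket needs a default destination.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends msg from one UDP socket to another over loopback. Copies what was
 * received into out and returns its length, or -1 on error. */
ssize_t udp_loopback_roundtrip(const char *msg, char *out, size_t outsize)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t n = -1;
    int b = socket(AF_INET, SOCK_DGRAM, 0);   /* "process B", the receiver */
    int a = socket(AF_INET, SOCK_DGRAM, 0);   /* "process A", the sender */

    if (a < 0 || b < 0)
        goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* let the kernel pick a free port */
    if (bind(b, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;
    if (getsockname(b, (struct sockaddr *)&addr, &alen) < 0)
        goto done;                            /* learn which port was chosen */
    if (connect(a, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;                            /* gives write() a destination */
    if (write(a, msg, strlen(msg)) < 0)
        goto done;
    n = read(b, out, outsize);                /* blocks until the datagram arrives */
done:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return n;
}
```

Calling `udp_loopback_roundtrip("hello", buf, sizeof buf)` drives the write path on one socket and the blocking read path on the other, the same two system calls traced in this article.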
2.4 sk_buff
A great deal of the network code manipulates sk_buff. This article has avoided the subject, but any careful analysis has to face it: packets travel through the stack as sk_buffs, and it is fair to call it the most important data structure in the network code. For a detailed treatment I recommend Alan Cox's "Network Buffers and Memory Management", published in Linux Journal in October 1996. Below is a picture from Phrack 55-12. It shows only a tiny part of sk_buff, but it is very useful, especially when you can never remember whether skb_put moves forward or backward :)
  ---     ----------------- head
   ^     |                 |
   |     |                 |
   |     |-----------------| data   ---    ---
   |     |                 |         ^      |
  true   |                 |         |      v  skb_pull
  size   |                 |        len
   |     |                 |         |
   |     |                 |         v
   |     |-----------------| tail   ---    ---
   |     |                 |                |
   |     |                 |                v  skb_put
   v     |                 |
  ---     ----------------- end
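The pointer movements in the picture can be tried out with a toy buffer. The sketch below is a userspace imitation with a fixed 128-byte array; the `mock_skb_*` functions mirror the usual meaning of reserve, put, push and pull but are not the kernel's implementations.

```c
#include <assert.h>
#include <stddef.h>

struct mock_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;                /* bytes between data and tail */
    unsigned char buf[128];
};

void mock_skb_init(struct mock_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->len = 0;
}

/* Leave headroom in an empty buffer for headers that are added later. */
void mock_skb_reserve(struct mock_skb *skb, unsigned int n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Grow the data area at the tail; returns where the new bytes start. */
unsigned char *mock_skb_put(struct mock_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Grow the data area at the front, e.g. to prepend a header. */
unsigned char *mock_skb_push(struct mock_skb *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Shrink the data area from the front, e.g. to strip a header. */
unsigned char *mock_skb_pull(struct mock_skb *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A typical send-side sequence is reserve (room for headers), put (payload), then push once per header going down the stack; the receive side uses pull to strip each header going up. No payload bytes move in any of these steps, only the pointers do.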
Linux network layer efficiency: Pointers are used heavily in the Linux network-layer code, the aim being to avoid resource-hungry operations such as copying data. The data portion of a packet is copied only twice: once from the network card into kernel memory, and once from kernel memory into user memory. A few days ago I read that, among attempts to improve sniffer capture efficiency, Turbo Packet (a kernel patch) shares a region of memory between kernel mode and user mode, which saves one copy of the data and improves efficiency further.
3 Postscript: This piece was finished in a last-minute rush, and looking at the poor writing in it I really feel I have let the readers down ~~ If I have time I will rewrite this part; in fact that has always been my wish :)
Publisher: Crystal From: Linux Technical Support Site