Reassembling HTML Web Pages from Captured Packets
Parsing the protocol headers of a captured packet yields the port number, and the port number identifies the corresponding application process. The data can then be recovered from the captured packets according to that process's message format; this is the data reassembly technique. The following describes how to recover HTML web pages from captured TCP packets that carry HTTP.
Any application that communicates over a network must have its own communication format: the receiver must be able to interpret the data the sender transmits, and when the receiver returns data, the sender must be able to interpret it and act on it. The browser is no exception.
HTTP has two types of messages: the request message from the client to the server, and the response message from the server to the client.
A typical HTTP request message looks like this:
GET /dirabc/docu1.html HTTP/1.1 // request line
Connection: close // this line and the lines below are header lines
User-Agent: Mozilla/4.0
Accept: text/html, image/gif, image/jpeg
Accept-Language: en
// a blank line here
The first field of the request line is the method. "Method" is a term borrowed from object-oriented technology, meaning an operation on the requested object, so these methods are in effect commands. The method here is GET, which requests a web page and is the most common method. Other common methods are HEAD (request only the headers of the page, not the whole page) and POST (send an attached entity to the server, for example a record to be added to a database). The second field of the request line is the URL of the requested object. Because the host name was already given when the TCP connection was established, only the file name (including its path) on that host is written here. The third field states that HTTP version 1.1 is used.
The header lines follow. "Connection: close" tells the server that the browser wants the TCP connection closed after the requested object has been transmitted (the alternative is to keep the TCP connection open and continue to request other objects over it). The next line, User-Agent, identifies the browser type; "Mozilla/4.0" is version 4.0 of the Netscape browser. The next lines tell the server which document types the browser is prepared to accept and that the preferred language is English (en). With GET, no entity body follows the headers.
When the user fills in a form online, the POST method is used. The information the user typed is placed in the entity body and sent to the HTTP server together with the request line and header lines.
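The structure described above (request line, header lines, blank line, optional entity body) can be illustrated with a minimal Python sketch. The function name `parse_http_request` is hypothetical, and the sketch assumes the whole request is already available as one byte string with CRLF line endings; it is not a complete HTTP parser.

```python
def parse_http_request(raw: bytes):
    """Split a raw HTTP request into request-line fields, headers, and body."""
    # The blank line (CRLF CRLF) separates the header block from the entity body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Request line: method, URL, version, separated by single spaces.
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, url, version, headers, body


request = (
    b"GET /dirabc/docu1.html HTTP/1.1\r\n"
    b"Connection: close\r\n"
    b"User-Agent: Mozilla/4.0\r\n"
    b"Accept: text/html, image/gif, image/jpeg\r\n"
    b"Accept-Language: en\r\n"
    b"\r\n"
)
method, url, version, headers, body = parse_http_request(request)
print(method, url, version)
print(headers["connection"], repr(body))
```

Header names are lowercased here because HTTP header names are case-insensitive; for a GET request the returned body is empty, matching the description above.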
A typical HTTP response message looks like this:
HTTP/1.1 200 OK // status line
Connection: close // this line and the 5 lines below are header lines
Date: Thu, 06 Aug 2003 12:00:15 GMT
Server: Apache/1.30 (UNIX)
Last-Modified: Mon, 22 Jun 2003 09:23:24 GMT
Content-Length: 8765 // file length in bytes
Content-Type: text/html
// a blank line here
... // data begins
In the status line, 200 is the status code and OK is the reason phrase, indicating that everything is normal.
There are 41 status codes; some common ones are:
301 (Moved Permanently: the page has moved; the new URL is given in the Location header of the response message);
400 (Bad Request: the server cannot understand the request message);
404 (Not Found: the requested object does not exist on the server);
505 (HTTP Version Not Supported: the server does not support the requested HTTP version).
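As a rough illustration of how the status line and header lines of a response can be separated from the data, here is a small Python sketch. The function name `parse_http_response` and the sample body are assumptions made for the example; the sketch expects the complete response as one byte string.

```python
def parse_http_response(raw: bytes):
    """Split a raw HTTP response into status-line fields, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: version, numeric status code, reason phrase.
    version, code, phrase = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), phrase, headers, body


response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Date: Thu, 06 Aug 2003 12:00:15 GMT\r\n"
    b"Content-Length: 5\r\n"   # sample value chosen to match the sample body
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"hello"
)
version, code, phrase, headers, body = parse_http_response(response)
print(code, phrase, headers["content-type"], len(body))
```

Comparing `len(body)` with the Content-Length header is one simple way to check whether the data portion has been captured in full.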
Hypertext Markup Language (HTML) is the standard language for authoring web pages. HTML marks up a document with tags: for example, a web document is written between the <html> and </html> tags, and <head> defines the header of the document.
[Figure 8]
When a page is browsed, the browser first sends an HTTP request message to the server, and the server answers with an HTTP response message. A page is typically about 10 KB, while an IP packet on Ethernet carries at most 1500 bytes; after the IP header and TCP header are subtracted, only 1460 bytes are left for page content. Clearly, many packets are needed to transfer one page. TCP divides the data into blocks of a size it considers suitable for transmission: the sender transmits the data piece by piece, and the receiver puts the pieces back together. TCP settles the sequence number and acknowledgment number of the first data block during the three-way handshake. Because a TCP connection is fully bidirectional, data can flow in both directions at the same time, and the two directions are numbered independently. Therefore, while one page is being transmitted and the client sends no further data, the acknowledgment number in the server's packets does not change, and each packet's sequence number equals the previous packet's sequence number plus the previous packet's data length.
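The arithmetic in this paragraph can be worked through with a short Python sketch. It assumes 20-byte IP and TCP headers without options, and the starting sequence number 1000 is an arbitrary illustrative value; `segment_plan` is a hypothetical helper name.

```python
MTU = 1500          # largest IP packet, in bytes
IP_HEADER = 20      # assumed: IP header without options
TCP_HEADER = 20     # assumed: TCP header without options
MSS = MTU - IP_HEADER - TCP_HEADER   # bytes of page content per packet


def segment_plan(page_size, first_seq):
    """List (sequence number, data length) for each packet carrying the page."""
    plan = []
    seq = first_seq
    remaining = page_size
    while remaining > 0:
        length = min(MSS, remaining)
        plan.append((seq, length))
        # Next sequence number = this sequence number + this data length
        # (sequence numbers are 32-bit and wrap around).
        seq = (seq + length) % 2**32
        remaining -= length
    return plan


plan = segment_plan(10 * 1024, first_seq=1000)
print(MSS, len(plan))
for seq, length in plan:
    print(seq, length)
```

Under these assumptions a 10 KB page needs 8 packets: seven full packets of 1460 bytes and a final packet carrying the remaining 20 bytes.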
In summary, the acknowledgment number identifies which TCP packets were used to transfer a given page. First pick any TCP packet that belongs to the page and read its ACK number. Then search the received packets for all packets carrying the same ACK number; a match means the packet was used to transfer the same page. Finally, process these packets in order of their SEQ (sequence) numbers and save the data portion of each, which restores the page content. Two things need attention in this process: whether all packets were captured, and whether any packet was corrupted in transmission.
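The procedure summarized above can be sketched in Python as follows. The `Segment` type and the function names are hypothetical; the sketch assumes the TCP fields have already been extracted from the capture, treats a packet with an already-seen sequence number as an exact retransmission, and ignores sequence number wraparound.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    seq: int        # TCP sequence number
    ack: int        # TCP acknowledgment number
    payload: bytes  # TCP data portion


def group_by_ack(segments):
    """Group data-carrying packets by ACK number (one group per transferred object)."""
    groups = defaultdict(list)
    for seg in segments:
        if seg.payload:               # skip pure ACK packets that carry no data
            groups[seg.ack].append(seg)
    return groups


def reassemble(segments):
    """Concatenate payloads in SEQ order; return (data, complete)."""
    if not segments:
        return b"", True
    ordered = sorted(segments, key=lambda s: s.seq)
    data = bytearray()
    expected = ordered[0].seq
    complete = True
    for seg in ordered:
        if seg.seq < expected:        # retransmitted packet, already saved
            continue
        if seg.seq > expected:        # a packet is missing from the capture
            complete = False
        data += seg.payload
        expected = seg.seq + len(seg.payload)
    return bytes(data), complete


captured = [
    Segment(1011, 500, b"</html>"),
    Segment(1000, 500, b"<html>"),
    Segment(7000, 900, b"other"),     # belongs to a different transfer
    Segment(1006, 500, b"hello"),
    Segment(1006, 500, b"hello"),     # retransmission
]
page, complete = reassemble(group_by_ack(captured)[500])
print(page, complete)
```

The `complete` flag addresses the first caution in the text (whether all packets were captured); detecting corrupted packets would additionally require verifying checksums, which this sketch does not do.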
Implementation:
1. How to get the file name and file type of the file to recover.
Few web pages consist of a single plain HTML file. Most pages include many image files (JPEG, GIF, and so on), Flash animations (SWF), and some sites add background music or other file types. These files cannot all be saved under the same name, so the file name and file type must be obtained before a file is saved.
Before the server sends any data, the client first asks for a file, and the file is named in the request line of the HTTP request message. For example, in GET /dirabc/docu1.html HTTP/1.1, docu1.html is the requested file: docu1 is the file name and .html is the file type extension. So to learn the name and type of a file, it is enough to find the request message that asked for it. The ACK number of the client's request message equals the SEQ number of the first packet the server sends when it transmits the file. To recover (reassemble) a file, take the ACK number of any packet known to carry the file, search all received packets, and collect every packet with the same ACK number. The first of these packets is the first packet of the file, and its SEQ number equals the ACK number of the request message. This locates the request message, from which the file name and file type are read.
2. How to reassemble the packets.
Once all the packets that belong to the file have been found, reassembling the file is straightforward. Besides file content, the first packet carries something more important: the status line and header lines (see the HTTP response message above). The status line tells whether it is worth continuing the recovery; for example, a 404 Not Found status means the requested file was not sent, even though packets with the same ACK number exist. The header lines also reveal the file type, which can be used to tell the user how to open the recovered file.
After the status line and header lines are removed from the first packet, what remains is file content. (Note: when the header block is long and the packets are short, the header lines may extend into the second packet.) Save this content, then append the content of each following packet to the end of what has been saved until all packets have been processed. The appended data must be exact: in HTML or other text files one extra or missing space does little harm, but an image or animation file with one extra or missing byte often displays as garbage. When the last packet has been appended, save the file; recovery is complete.
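Steps 1 and 2 can be sketched together in Python. All names here (`Segment`, `find_request`, `file_name_and_type`, `split_response`) are hypothetical, the fallback name `index.html` for a URL ending in `/` is an assumption, and the sample packets are made up for illustration. The headers are stripped from the concatenated stream rather than from the first packet alone, so the case where the header lines extend into the second packet is handled the same way.

```python
import posixpath
from typing import NamedTuple


class Segment(NamedTuple):
    seq: int        # TCP sequence number
    ack: int        # TCP acknowledgment number
    payload: bytes  # TCP data portion


def find_request(client_segments, first_response_seq):
    """Find the request whose ACK equals the SEQ of the first response packet."""
    for seg in client_segments:
        if seg.payload and seg.ack == first_response_seq:
            return seg
    return None


def file_name_and_type(request_payload):
    """Read the file name and extension from the request line."""
    request_line = request_payload.split(b"\r\n", 1)[0].decode("iso-8859-1")
    _method, url, _version = request_line.split(" ", 2)
    path = url.split("?", 1)[0]                       # drop any query string
    name = posixpath.basename(path) or "index.html"   # assumed default name
    return name, posixpath.splitext(name)[1]


def split_response(stream):
    """Strip the status line and header lines; return (status code, file content)."""
    head, sep, body = stream.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("header block is incomplete")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return int(status_line.split(" ", 2)[1]), body


request = Segment(100, 5000, b"GET /dirabc/docu1.html HTTP/1.1\r\nConnection: close\r\n\r\n")
resp_ack = request.seq + len(request.payload)
part1 = b"HTTP/1.1 200 OK\r\nContent-Ty"     # header lines continue in the next packet
part2 = b"pe: text/html\r\n\r\n<html>hi"
part3 = b"</html>"
response = [
    Segment(5000 + len(part1) + len(part2), resp_ack, part3),
    Segment(5000, resp_ack, part1),
    Segment(5000 + len(part1), resp_ack, part2),
]

ordered = sorted(response, key=lambda s: s.seq)
found = find_request([request], ordered[0].seq)
name, ext = file_name_and_type(found.payload)
status, body = split_response(b"".join(s.payload for s in ordered))
print(name, ext, status, body)
```

When the status code is 200, the body would then be written to a file called `name` opened in binary mode (`"wb"`), so that image and animation files are saved byte for byte.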
3. Reassembling the entire HTML page.
An HTML page refers to its pictures and animations through tag attributes such as src="http://www.swau.edu.cn/Image/Picture.gif", and pages that use frames contain attributes such as SRC="Login.html". The request message the browser sends to the server asks only for the current page itself. Every other object that the page references through src needs a request message of its own. This is why, on some pages, an image or another linked object fails to display while the rest of the content is still shown. (If all the content of a page were fetched by one request message, any error would invalidate the entire server response.)
Select the HTML page to be reassembled. Using the HTML tags, find the file name and file type extension of every src reference, then search the captured request messages for a request for each src target. If one exists, reassemble the server's response to that request. Place all the reassembled src files in the same directory, and change every src attribute in the source HTML page to point to the local reassembled file. The reassembly of the whole HTML page is then complete.
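The final step, listing the src references and pointing them at the local files, might look like the following Python sketch. It uses a regular expression on quoted src attributes as a simplification (unquoted attributes are not handled), and the function names and the `missing.gif` entry are assumptions for the example.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches src="..." or src='...' in any letter case.
SRC_ATTR = re.compile(r'(\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def local_name(url):
    """File name (with extension) that a src URL refers to."""
    return posixpath.basename(urlsplit(url).path)


def list_sources(html):
    """Every src value on the page, in document order."""
    return [m.group(3) for m in SRC_ATTR.finditer(html)]


def point_to_local(html, recovered):
    """Rewrite src attributes whose target file was recovered from the capture."""
    def repl(match):
        name = local_name(match.group(3))
        if name in recovered:
            return f"{match.group(1)}{match.group(2)}{name}{match.group(2)}"
        return match.group(0)     # not recovered: leave the reference unchanged
    return SRC_ATTR.sub(repl, html)


page = (
    '<img src="http://www.swau.edu.cn/Image/Picture.gif">'
    '<frame SRC="Login.html">'
    '<img src="http://www.swau.edu.cn/Image/missing.gif">'
)
recovered = {"Picture.gif", "Login.html"}   # files reassembled into the page's directory
print(list_sources(page))
print(point_to_local(page, recovered))
```

References whose files could not be reassembled are left untouched, so the saved page still points at the original server for those objects.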