Source: e800.com.cn
Network spider basic principles

"Network spider" (Web Spider) is a very vivid name. If the Internet is viewed as a spider web, then the Spider is a spider crawling back and forth across it. A web spider finds web pages through their link addresses: starting from one page (usually the home page), it reads that page's content, extracts the other link addresses it contains, and then uses those addresses to reach the next pages, repeating this loop until every page of the website has been crawled. If the entire Internet is treated as one website, a web spider can in principle use the same method to crawl every page on the Internet.

For search engines, however, crawling every page on the Internet is nearly impossible. According to currently published figures, even the search engine with the largest index covers only about 40% of the web. One reason is the bottleneck in crawling technology: there is no way to traverse all pages, and many pages cannot be reached from the links of other pages. Another reason lies in storage and processing: if the average page size is taken as 20 KB (including images), 10 billion pages amount to 100 × 2000 GB of data, roughly 200 TB. Even if that much could be stored, downloading it is still a problem: with each machine downloading at 20 KB per second, about 340 machines would have to download continuously for a whole year to fetch all the pages. At the same time, because the data volume is so large, search efficiency would also suffer. For these reasons, many search engines' spiders crawl only the pages judged to be important, and the main basis for judging importance during crawling is a page's link depth.

When crawling pages, a web spider generally adopts one of two strategies: breadth-first or depth-first (shown in the figure below). Breadth-first means the spider first fetches all pages linked from the starting page, then selects one of those linked pages and fetches all pages linked from it. This is the most common approach, because it lets the spider crawl in parallel and thus improves crawling speed. Depth-first means the spider starts from the starting page and follows one chain of links to its end, then moves on to the next starting page and follows another chain. An advantage of this method is that the spider is easier to design. The difference between the two strategies is clearer in the figure below.

For less important websites, some spiders also set a limit on the number of layers they will visit. In the figure above, for example, A is the starting page and belongs to layer 0; B, C, D, E and F belong to layer 1; G and H belong to layer 2; and I belongs to layer 3. If the spider's access-depth limit is set to 2, page I will never be visited. This is also why some pages of a website can be found through a search engine while others cannot. For website designers, a flat site structure therefore helps search engines crawl more of the site's pages.
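The storage and download figures above are easy to verify. The short Python calculation below simply re-runs the article's own arithmetic with its own assumptions (10 billion pages, 20 KB per page, one machine downloading 20 KB per second); depending on rounding, the machine count comes out near 320, in the same ballpark as the article's figure of 340.

# Back-of-envelope check of the storage and download estimate above.
pages = 10_000_000_000            # 10 billion web pages (the article's assumption)
page_size_kb = 20                 # average page size including images, in KB
machine_speed_kb_s = 20           # download speed of one machine, KB per second

total_kb = pages * page_size_kb
print("total size in GB:", total_kb / 1_000_000)          # 200,000 GB, i.e. 100 x 2000 GB

seconds_per_year = 365 * 24 * 3600
kb_per_machine_year = machine_speed_kb_s * seconds_per_year
print("machines for one year:", round(total_kb / kb_per_machine_year))   # about 317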
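To tie the crawling loop, the breadth-first strategy and the layer limit together, here is a minimal sketch of such a spider in Python, using only the standard library. The starting URL, the layer limit of 2 and the crude regular expression for extracting links are illustrative assumptions rather than anything prescribed by the text, and a real spider would also need robots.txt handling, politeness delays and better error handling.

import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl(start_url, max_depth=2):
    """Breadth-first crawl, limited to max_depth layers and to the starting site."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])       # (url, layer); the start page is layer 0
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                       # unreachable page: skip it
        print(f"layer {depth}: {url}")
        if depth >= max_depth:
            continue                       # pages beyond the layer limit are not expanded
        # Very rough link extraction; a real spider would use an HTML parser.
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == start_host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

# crawl("http://www.example.com/")   # hypothetical starting page

Swapping the queue's popleft() for pop() turns the same loop into a depth-first crawl, which is part of why the depth-first variant is often considered simpler to implement.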
When accessing a website's pages, web spiders frequently run into encrypted data and page-permission problems; some pages can only be accessed with member privileges. Of course, the owner of a website can use a protocol to tell spiders not to crawl certain content (this will be introduced in the next subsection). But some sites that sell reports want search engines to be able to find those reports without letting searchers read them in full for free, so they have to give the spider a corresponding username and password. With the permissions it has been granted, the spider can crawl these pages and make them searchable; when a searcher clicks through to view such a page, the searcher is likewise asked to pass the corresponding permission check.
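As a sketch of how a spider might use the username and password it has been given, the example below assumes the site protects its report pages with HTTP Basic authentication; the URL and credentials are placeholders, and a site that uses form-based login or another scheme would need a different approach.

from urllib.request import (HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm,
                            build_opener)

def fetch_protected(url, username, password):
    """Fetch a permission-protected page using credentials granted by the site owner."""
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, username, password)
    opener = build_opener(HTTPBasicAuthHandler(password_mgr))
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="ignore")

# Placeholder values; real credentials would come from the website owner.
# html = fetch_protected("http://reports.example.com/report1.html", "spider", "secret")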