Network Spider Principles (reposted from http://www.sysinfo.cn/show.php?id=35)

With the rise of the search economy, people have begun to pay attention to the performance, technology and daily traffic of the world's major search engines. Companies decide whether to place advertisements based on a search engine's reputation and daily traffic; ordinary netizens choose which engine to use based on its performance and technology; scholars take representative search engines as research objects; and website operators care most about how to let more netizens learn of their sites through this channel and so gain higher traffic and popularity. Among these channels, the search engine has become an important, and free, means of publicity. On one hand, search engines actively collect web pages from across the network and index them in the background; on the other hand, in order to expose more of their content, major websites have begun to make significant adjustments to their structure, including flat site design, turning dynamic pages into static pages, site maps, and so on. These seemingly casual moves show what an important role search engines now play in the way we use the network, and the rise of the search engine has also created new professions built around it.

In fact, the rise of the search engine economy demonstrates the enormous business opportunity contained in the network. Without search, the network would be nothing but a mass of disorganized data, a gold mine waiting to be tapped.

Search engines have always focused on improving the user experience, and that experience is reflected in three aspects: accuracy, completeness and speed. In professional terms: precision, recall and search speed (i.e. the time a search takes). Search speed is the easiest of the three to achieve, because once a search completes within about one second visitors can hardly tell the difference, not to mention the influence of network speed. Evaluation of search engines therefore concentrates on the first two: accuracy and completeness. The "accuracy" of a Chinese search engine requires that the returned results be highly relevant to the search terms, which depends on word-segmentation technology and ranking technology (see the author's related articles [1][2]). The "completeness" of a Chinese search engine requires that no results be missed and that the newest pages can be found, which requires a powerful web-page collector, generally called a "network spider" (web spider), also known as a "web robot".

There are many articles on search engine technology, but most of them discuss how to evaluate the importance of web pages; few study the network spider itself. Spider technology is not an especially deep technology, but building a powerful spider is far from easy. Now that disk capacity is no longer the bottleneck, search engines keep expanding the number of pages they cover. The largest search engine, Google (http://www.google.com), has grown since 2002 to nearly 4 billion web pages; the Yahoo search engine (http://search.yahoo.com/) claims about 4.5 billion web pages; China's Chinese-language search engine Baidu (http://www.baidu.com) has grown from 70 million pages two years ago to more than 200 million today. The total number of web pages on the Internet is estimated to exceed 10 billion, and it is still growing rapidly every year. An excellent search engine therefore requires continuous optimization of its spider algorithms to improve performance.
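To make the two measures concrete, here is a minimal sketch in Python of how precision and recall could be computed for a single query; the result sets are made-up values, purely for illustration:

# Minimal sketch: precision and recall for one query.
# "retrieved" and "relevant" are hypothetical result sets, not real data.
retrieved = {"page1", "page2", "page3", "page4"}   # pages the engine returned
relevant = {"page2", "page3", "page5"}             # pages actually relevant to the query

hits = retrieved & relevant                        # relevant pages that were actually returned
precision = len(hits) / len(retrieved)             # "accuracy": share of returned pages that are relevant
recall = len(hits) / len(relevant)                 # "completeness": share of relevant pages that were found

print("precision = %.2f, recall = %.2f" % (precision, recall))   # 0.50, 0.67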

Some people may ask: why does a search engine need a web spider to grab all the pages of a website in advance, instead of fetching whatever the searcher needs after a keyword is entered? This is really a question of efficiency. A search engine cannot examine every page on the network at query time; it must grab the pages beforehand and build an index organized by keyword, so that each search result is looked up directly in the engine's index database and then returned to the visitor. For the overall system architecture of a search engine, see reference [3]; this article mainly introduces the technology related to the network spider.

"Web spider" is a very vivid name. If the Internet is imagined as a spider web, then the spider is the creature crawling across it. The web spider finds pages through their link addresses: starting from one page of a website (usually the home page), it reads the page's content, finds the other link addresses in it, and uses those addresses to look for the next pages, looping in this way until every page of the site has been grabbed. If the entire Internet is regarded as one website, the spider can use the same principle to grab all the pages on the Internet.

For a search engine, grabbing every page on the Internet is practically impossible. According to currently published figures, even the engines with the largest capacity only cover roughly 40% of all web pages. One reason is the bottleneck of crawling technology itself: there is no way to traverse all pages, and many pages cannot be reached from the links of other pages. Another reason lies in storage and processing: if the average page size is taken as 20K (including pictures), the capacity of 10 billion pages is 100 × 2000G bytes (about 200,000 GB); even if they could be stored, downloading is a problem (at a download rate of 20K per second per machine, roughly 340 machines would have to download continuously for a whole year to fetch all the pages). At the same time, because the data volume is so large, search efficiency would also suffer. For these reasons, many search engine spiders only grab the pages judged to be important, and the main basis for judging importance during crawling is a page's link depth.

When grabbing pages, spiders generally follow one of two strategies: breadth-first or depth-first. Breadth-first means the spider first grabs all the pages linked from the starting page, then chooses one of those linked pages and grabs all the pages linked from it. This is the most common approach, because it allows the spider to crawl in parallel and so improves crawling speed. Depth-first means the spider follows one chain of links from the starting page to its end, then moves on to the next starting page and follows another chain; its advantage is that the spider is easier to design. The difference between the two strategies is easiest to see with an example. Some spiders also set a limit on the number of layers they will visit for less important websites. Suppose A is the starting page and belongs to layer 0; B, C, D, E and F are linked from it and belong to layer 1; G and H belong to layer 2; and I belongs to layer 3.
If the spider's access-layer limit is set to 2, page I will not be visited at all. This is why some pages of a website can be found on search engines while others cannot. For website designers, a flat site structure helps search engines grab more of the site's pages.
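As an illustration of the breadth-first strategy combined with a layer limit, here is a minimal sketch in Python; fetch() and extract_links() are deliberately crude stand-ins for a real spider's downloading and parsing code, and the start URL is hypothetical:

# Minimal sketch of a breadth-first spider with a crawl-depth (layer) limit.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    # Download one page and return its HTML as text; failures are simply skipped.
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return ""

def extract_links(base_url, html):
    # Very rough link extraction with a regular expression.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def crawl_bfs(start_url, max_depth=2):
    # Visit pages layer by layer; pages deeper than max_depth are never fetched.
    seen = {start_url}
    queue = deque([(start_url, 0)])            # (url, layer); the start page is layer 0
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        print("layer %d: %s" % (depth, url))
        if depth >= max_depth:                 # with max_depth=2, a layer-3 page like I is skipped
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# Hypothetical starting point:
# crawl_bfs("http://www.example.com/", max_depth=2)

Because breadth-first crawling works through a simple queue, the pending links can easily be handed to several downloaders at once, which is why the strategy lends itself to parallel crawling.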

When visiting a website's pages, spiders also frequently run into encrypted data and access-controlled pages; some pages can only be reached with member privileges. Of course, the owner of a website can use a convention to keep spiders from grabbing the site at all (introduced below). But some websites, for example sites that sell reports, want search engines to be able to find their reports without letting searchers view them freely; in that case the site needs to give the spider a corresponding username and password. With these credentials the spider can grab the protected pages and make them searchable, while a searcher who clicks through to view such a page must still pass the corresponding permission check.

A spider's visits differ from ordinary access in that it needs to grab a large number of pages; if this is not controlled well, it can overload the website's server. In April this year, Taobao (http://www.taobao.com) was affected in just this way when the spider of the Yahoo search engine crawled it so heavily that its servers became unstable. Does this mean websites have no way to communicate with spiders? In fact there are many ways for a website and a spider to communicate: on one hand the site administrator can learn where a spider came from and what it did, and on the other hand the site can tell the spider which pages should not be grabbed and which pages should be updated.

Every spider has its own name and announces its identity to the website when crawling: each page request it sends contains a field identifying the spider. For example, Google's spider is identified as Googlebot, Baidu's as BaiduSpider, and Yahoo's as Inktomi Slurp. If the website keeps access logs, the administrator can see which search engines' spiders have visited, when they came, and how much data they read; and if a spider causes problems, the administrator can use this identity to contact its owner. (The original article shows an excerpt of the search engine access log of Blog China, http://www.blogchina.com, at this point.)

When a spider enters a website, it usually first visits a special text file, robots.txt, which normally sits in the root directory of the web server, for example: http://www.blogchina.com/robots.txt. Through robots.txt the site administrator can declare which directories spiders may not access, or which directories a particular spider may not access. For instance, on some sites the directories containing executable files and the temporary-file directories should not be searched by search engines, and the administrator can declare them off-limits. The robots.txt syntax is very simple; for example, to place no restriction on the site at all, the following two lines are enough:

User-agent: *
Disallow:

Of course, robots.txt is only a convention. If a spider does not follow it, the administrator cannot use it to keep the spider away from certain pages, but spiders generally do respect the convention, and administrators also have other ways of refusing spiders access to particular pages.

When downloading a page, a spider also looks at the page's HTML code; in the head portion of the code there may be meta tags aimed at robots. Through these tags the page can tell the spider whether it needs to be grabbed, and whether the links in the page should be followed. For example, a tag such as <meta name="robots" content="noindex,follow"> means the page itself should not be indexed, but the links in the page should still be tracked. Readers interested in the detailed syntax of robots.txt and the robots meta tag can see reference [4].
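As a sketch of how a spider might both identify itself and honor robots.txt, the following Python fragment uses the standard library's robotparser module; the spider name "MySpider/1.0" and the example URLs are made-up values for illustration:

# Minimal sketch: identify the spider by User-Agent and respect robots.txt.
from urllib import robotparser
from urllib.request import Request, urlopen

SPIDER_NAME = "MySpider/1.0"        # reported to every site, just as Googlebot or BaiduSpider are

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                           # fetch and parse the site's robots.txt

url = "http://www.example.com/private/report.html"
if rp.can_fetch(SPIDER_NAME, url):  # does robots.txt allow this spider to grab the page?
    req = Request(url, headers={"User-Agent": SPIDER_NAME})
    html = urlopen(req).read()
else:
    print("robots.txt forbids crawling", url)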
Nowadays a new website naturally wants search engines to grab it as completely as possible, because that lets more visitors find the site through search engines.

To have a site grabbed more completely, the webmaster can create a website map, i.e. a Site Map. Many spiders treat a sitemap.htm file as the entry point for crawling a site: the administrator puts links to all of the site's internal pages in this file, so that the spider can easily grab the whole site, avoid missing pages, and also reduce the load on the server.

What the search engine builds its index from are, in the end, text files. For the spider, the grabbed pages come in all kinds of formats, including HTML, pictures, DOC, PDF, multimedia, dynamic pages and others. After these files are grabbed, the text information in them has to be extracted. Extracting it accurately matters in two ways: it affects the search accuracy of the engine, and it also affects whether the spider can correctly follow the other links.

For documents such as DOC and PDF, which are produced by software from professional vendors, the vendors provide corresponding text-extraction interfaces, and the spider only needs to call these plug-in interfaces to easily extract the text and other related information from the document. HTML documents are different. HTML has its own syntax, using various tags to express fonts, colors, positions and other layout details. Filtering out these tags is not difficult, because they follow definite rules and the corresponding information can be read from each kind of tag. However, while recognizing this information, the spider also needs to record a great deal of layout information, such as the font size of a piece of text, whether it is a title, and whether it is one of the page's keywords, all of which helps in calculating the importance of each word in the page. At the same time, an HTML page contains, besides the title and body, many advertisement links and common channel (navigation) links that have nothing to do with the body text; when extracting the page content, these useless links must also be filtered out. For example, suppose a website has a "Product Introduction" channel whose link appears in the navigation bar of every page on the site: if the navigation links are not filtered, a search for "Product Introduction" will hit every page of the site and undoubtedly produce a pile of junk results. Filtering these invalid links requires studying a large number of page-structure patterns, extracting what is common to them and filtering it uniformly; important or unusual websites may still need individual handling, which demands a certain extensibility from the spider.

For multimedia files, pictures and the like, the content is generally judged from the anchor text of the links pointing to them (i.e. the link text) and from related file comments. For example, if a link whose text reads "Zhang Manyu photo" points to a picture in BMP format, the spider knows that the content of the picture is a photo of Zhang Manyu, so that searches for "Zhang Manyu" and "photo" can both find it. In addition, many multimedia files carry file properties, and taking these properties into account gives a better understanding of the file's content.
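A minimal sketch of this kind of extraction, using Python's built-in HTML parser, is given below; the sample page is invented, and the layout recording and navigation-link filtering that a real spider needs are omitted:

# Minimal sketch: strip HTML tags, keeping the title, the visible text and the
# anchor text of each link (anchor text is what lets the spider describe
# pictures and multimedia files).
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []                # (href, anchor text) pairs
        self._in_title = False
        self._current_href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._current_href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_href:
            self.links.append((self._current_href, " ".join(self._anchor_text)))
            self._current_href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._current_href is not None:
            self._anchor_text.append(data.strip())
        self.text_parts.append(data.strip())

extractor = PageExtractor()
extractor.feed('<html><head><title>Demo page</title></head><body>'
               '<p>Some body text.</p>'
               '<a href="photo.bmp">Zhang Manyu photo</a></body></html>')
print(extractor.title)     # Demo page
print(extractor.links)     # [('photo.bmp', 'Zhang Manyu photo')]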
Dynamic pages have always been a problem for spiders. A dynamic page, as opposed to a static one, is generated by a program. The benefit is that the page style can be changed quickly and uniformly and server space can be saved, but it also causes some trouble for the spider. As development languages keep multiplying, there are more and more kinds of dynamic pages, such as ASP, JSP and PHP. Pages of these types are still relatively easy for a spider to handle. What spiders find much harder are pages generated by scripting languages such as VBScript and JavaScript; to handle these pages well, the spider needs its own script interpreter.
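To illustrate the difficulty, here is a tiny hypothetical page whose link is assembled by JavaScript at run time (getProductId() is an invented script function); a spider that only scans the static HTML source, like the crude link extraction in the earlier sketch, finds nothing to follow:

# Minimal illustration: a link built by JavaScript at run time is invisible
# to a spider that only reads the static HTML source.
import re

js_page = """<html><body>
<script type="text/javascript">
  var target = 'product.php?id=' + getProductId();   // URL assembled only at run time
  document.write('<a href=' + target + '>product</a>');
</script>
</body></html>"""

print(re.findall(r'href="([^"#]+)"', js_page))   # [] -- nothing static for the spider to follow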
