Source: e800.com.cn
Web spiders need to crawl web pages, and this differs from ordinary access: if the crawling is not controlled well, it can overload the website's server. In April this year, the servers of Taobao (http://www.taobao.com/) were overloaded by the spider of the Yahoo search engine.

Does that mean websites cannot communicate with web spiders? In fact, there are many ways for a website and a web spider to communicate. On one hand, the website administrator can learn where a spider comes from and what it is doing; on the other hand, the website can tell the spider which pages should not be crawled and which pages should be updated.

Every web spider has its own name and identifies itself to the website when crawling. When a spider fetches a page, it sends an HTTP request, and a field of this request (the User-Agent field) identifies the spider. For example, the identifier of Google's spider is Googlebot, Baidu's spider is identified as baiduspider, and Yahoo's spider is identified as Inktomi Slurp. If access logging is enabled on the website, the administrator can know which search engines' spiders visited, when they came, and how much data they read. If the administrator finds a problem with a spider, they can contact its owner through this identity. Below is a search-engine access log of Blogchina (http://www.blogchina.com/) for May 15, 2004:

When a web spider enters a website, it usually first visits a special text file, robots.txt, which is generally placed under the root directory of the web server, for example http://www.blogchina.com/robots.txt. Through robots.txt, the website administrator can define which directories spiders may not access, or which directories may not be accessed by particular spiders. For example, a site's executable-file directories and temporary-file directories are often things the site does not want searched by search engines, so the administrator can define them as off-limits directories.

The robots.txt syntax is simple. For example, to place no restrictions on any directory, the following two lines suffice:

    User-agent: *
    Disallow:

Of course, robots.txt is only a convention: if a web spider does not follow it, the administrator cannot use it to prevent the spider from accessing certain pages. Most web spiders do follow the protocol, however, and administrators can also refuse spiders access to some pages by other means.

When a web spider downloads a page, it examines the page's HTML code, which may contain robots meta tags. Through these tags, the site can tell the spider whether this page needs to be crawled and whether the links in this page need to be followed. For example, <meta name="robots" content="noindex,follow"> means that this page should not be indexed, but the links in the page should still be followed. For the syntax of robots.txt and of the meta tags, interested readers can see [4].

Nowadays, new websites generally hope that search engines will crawl them as thoroughly as possible, because that lets more visitors find the site through search engines. To have the site crawled more completely, the administrator can create a website map, i.e. a Site Map. Many web spiders treat the sitemap.htm file as the entry point for crawling a site: the administrator can put links to all of the site's pages in this file, so the spider can easily crawl the entire website without missing pages, which also reduces the load on the website's server. The sketches below illustrate these mechanisms.
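As an illustration of how a spider identifies itself, the following Python sketch sends a request whose User-Agent field carries the spider's name. The spider name "ExampleSpider" and its contact URL are hypothetical, made up for the example:

    import urllib.request

    # A spider announces its identity through the User-Agent field of
    # each HTTP request. "ExampleSpider" and the contact URL in it are
    # hypothetical; real spiders use names like Googlebot or baiduspider.
    request = urllib.request.Request(
        "http://www.blogchina.com/",
        headers={"User-Agent": "ExampleSpider/1.0 (+http://example.com/spider.html)"},
    )
    with urllib.request.urlopen(request) as response:
        page = response.read()
        print(len(page), "bytes fetched")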
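On the server side, the administrator can tally which spiders visited by scanning the access log for these identifiers. A minimal sketch, assuming the common Apache combined log format in which the User-Agent string is the last quoted field of each line (the log path is hypothetical):

    import re
    from collections import Counter

    # Spider identifiers named in the article.
    SPIDERS = ["Googlebot", "baiduspider", "Slurp"]

    def count_spider_hits(log_path):
        # Tally requests per spider. Assumes the Apache combined log
        # format, where the User-Agent is the last quoted field.
        hits = Counter()
        with open(log_path) as log:
            for line in log:
                match = re.search(r'"([^"]*)"\s*$', line)
                if match is None:
                    continue
                agent = match.group(1).lower()
                for spider in SPIDERS:
                    if spider.lower() in agent:
                        hits[spider] += 1
        return hits

    print(count_spider_hits("access.log"))  # hypothetical log path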
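For the robots.txt rules themselves, a site that wants to keep its script directory and temporary-file directory away from all spiders could use something like the following (the directory names are illustrative):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/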
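On the spider's side, following the convention means checking each URL against the site's robots.txt before fetching it. A minimal sketch using Python's standard urllib.robotparser, with the article's example site and a hypothetical spider name:

    from urllib.robotparser import RobotFileParser

    # Load and parse the site's robots.txt (URL from the article's example).
    parser = RobotFileParser()
    parser.set_url("http://www.blogchina.com/robots.txt")
    parser.read()

    # A polite spider checks each URL against the rules before fetching it.
    for url in ["http://www.blogchina.com/",
                "http://www.blogchina.com/tmp/page.html"]:
        allowed = parser.can_fetch("ExampleSpider", url)
        print("allowed" if allowed else "blocked", url)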
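Finally, the site map is just an ordinary page of links. A sketch of generating a simple sitemap.htm from a list of page URLs (the URLs are illustrative):

    # The page URLs here are illustrative.
    pages = [
        "http://example.com/index.html",
        "http://example.com/news/1.html",
        "http://example.com/news/2.html",
    ]

    # Write a plain HTML page linking to every page of the site, so a
    # spider that starts from sitemap.htm can reach the whole website.
    with open("sitemap.htm", "w") as f:
        f.write("<html><body>\n")
        for url in pages:
            f.write('<a href="%s">%s</a><br>\n' % (url, url))
        f.write("</body></html>\n")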