Search engines run their own "search robots" (robots), which continuously crawl the web by following links (generally HTTP and SRC links) and use the pages they fetch to build the engines' databases. Website managers and content providers sometimes have content that they do not want robots to crawl and publish. Two mechanisms address this problem: the robots.txt file and the Robots Meta tag.
First, robots.txt
1, what is robots.txt? robots.txt is a plain text file in which a site declares the parts that it does not want robots to access. In this way, some or all of the site's content can be kept out of search engines, or a search engine can be restricted to the specified content only. When a search robot visits a site, it first checks whether robots.txt exists in the site's root directory. If the file is found, the robot determines the scope of its access from the file's contents. If the file does not exist, the robot simply crawls along the links. robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
For example: http://www.w3.org/robots.txt
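Because the file always lives at the root of the site, a robot can derive its location from any page URL. The following is a minimal sketch of that step; the function name robots_txt_url and the sample page path are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host of the page and
    replaces the rest with the fixed, lowercase path "/robots.txt".
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is only an example input.
print(robots_txt_url("http://www.w3.org/TR/html4/index.html"))
```

A real robot would then fetch this URL before requesting any other page on the site.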
2, the syntax of robots.txt
The "robots.txt" file contains one or more records, which are separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the following format:
"<field>:<optionalspace><value><optionalspace>".
In this file, # can be used for comments, following the same convention as in UNIX. A record typically starts with one or more User-agent lines, followed by several Disallow lines. The details are as follows:
User-agent:
The value of this field is the name of the search engine robot. If the "robots.txt" file contains multiple User-agent records, multiple robots are subject to the protocol; the file must contain at least one User-agent record. If the value is *, the record applies to any robot. The "robots.txt" file may contain only one "User-agent: *" record.
Disallow:
The value of this field is a URL that should not be accessed. It can be a complete path or a partial one; any URL that starts with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" blocks robot access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows the robot to access /help.html but not /help/index.html.
An empty Disallow record means that all parts of the site may be accessed. The "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
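The prefix rule for Disallow values can be checked with the robots.txt parser in Python's standard library (urllib.robotparser). This is only a sketch: the robot name ExampleBot and the helper make_parser are assumptions made for illustration, and the rules are parsed from strings rather than fetched from a site.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


without_slash = make_parser("User-agent: *\nDisallow: /help\n")
with_slash = make_parser("User-agent: *\nDisallow: /help/\n")
empty_rule = make_parser("User-agent: *\nDisallow:\n")

AGENT = "ExampleBot"  # hypothetical robot name

for path in ("/help.html", "/help/index.html"):
    print(
        path,
        "| /help:", without_slash.can_fetch(AGENT, path),
        "| /help/:", with_slash.can_fetch(AGENT, path),
        "| empty:", empty_rule.can_fetch(AGENT, path),
    )
```

Here can_fetch returns False when a path starts with a Disallow value that applies to the robot, and True otherwise.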
Here are some basic uses of robots.txt:

Prohibit all search engines from accessing any part of the site:
    User-agent: *
    Disallow: /

Allow all robots full access (or create an empty "/robots.txt" file):
    User-agent: *
    Disallow:

Prohibit all search engines from accessing certain parts of the site (the cgi-bin, tmp, and private directories in this example):
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

Prohibit one search engine from accessing the site (BadBot in this example):
    User-agent: BadBot
    Disallow: /

Allow only one search engine to access the site (WebCrawler in this example):
    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /
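Records for different robots can be combined in one file. The sketch below merges two of the usages above (shutting out BadBot and keeping three directories private for everyone else) and checks the result with Python's urllib.robotparser. The combined file, the page paths, and the choice of Googlebot as the "other" robot are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: one record for BadBot, one catch-all record.
ROBOTS_TXT = """\
# BadBot may not access anything.
User-agent: BadBot
Disallow: /

# Every other robot must stay out of these three directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("BadBot", "/index.html"),
    ("Googlebot", "/index.html"),
    ("Googlebot", "/tmp/cache.html"),
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:10} {path:18} {verdict}")
```

A robot that has its own record uses that record; any other robot falls back to the "User-agent: *" record.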
3, common search engine robot names

Robot name        Search engine
Baiduspider       http://www.baidu.com
Scooter           http://www.altavista.com
ia_archiver       http://www.alexa.com
Googlebot         http://www.google.com
FAST-WebCrawler   http://www.alltheweb.com
Slurp             http://www.inktomi.com
MSNBot            http://search.msn.com
Second, Robots Meta tag
1, what is the Robots Meta tag? The robots.txt file mainly restricts search engine access to an entire site or directory, while the Robots Meta tag applies mainly to one specific page. Like other Meta tags (such as language, description, and keywords), the Robots Meta tag is placed in the <head> of the page, where it tells search engine robots how to crawl the page's content.
2, how to write the Robots Meta tag
The Robots Meta tag is not case-sensitive. name="robots" addresses all search engines; for a specific search engine it can be written as, for example, name="baiduspider". The content attribute has four directive options: index, noindex, follow, and nofollow, separated by commas. The index directive tells the search robot to crawl the page; the follow directive tells it that it may continue crawling along the links on the page. The default value of the Robots Meta tag is index, follow, except for Inktomi, whose default is index, nofollow. This gives four combinations:
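<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Among them, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">;
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.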
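A robot that honors the Robots Meta tag has to read these directives out of the page. The following is a minimal sketch using Python's standard html.parser module. The class RobotsMetaParser and the function robots_meta are hypothetical names, the defaults follow the index, follow default described above, and the shorthand values all and none are treated as equivalent to "index, follow" and "noindex, nofollow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical parser: collects index/follow flags from Robots Meta tags."""

    def __init__(self, robot_name: str = "robots"):
        super().__init__()
        self.robot_name = robot_name.lower()
        # Assumed defaults when no directive is present: index, follow.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names, but not values.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != self.robot_name:
            return
        for directive in attributes.get("content", "").lower().split(","):
            directive = directive.strip()
            if directive in ("index", "all"):
                self.index = True
            if directive in ("noindex", "none"):
                self.index = False
            if directive in ("follow", "all"):
                self.follow = True
            if directive in ("nofollow", "none"):
                self.follow = False


def robots_meta(html: str, robot_name: str = "robots") -> tuple[bool, bool]:
    """Return (index, follow) for the given page source."""
    parser = RobotsMetaParser(robot_name)
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head></html>'
print(robots_meta(page))
```

Passing a different robot_name, such as "baiduspider", makes the sketch look for a tag addressed to that specific robot instead of the general name="robots" tag.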
Note that both robots.txt and the Robots Meta tag are only conventions for restricting how search engine robots crawl a site; they depend on the robots' cooperation, and not every robot complies. At present, the vast majority of search engine robots obey robots.txt rules. Support for the Robots Meta tag is less widespread but is gradually increasing. Google, for example, supports it fully and has added an "archive" directive that controls whether Google keeps a snapshot of the page. For example:
<META NAME="googlebot" CONTENT="index,follow,noarchive">
means that the robot may crawl the page and follow the links on it, but Google does not keep a snapshot of the page.