Web Search Help - How to Prevent Search Engines from Indexing Your Site [From Baidu]



Preventing search engines from indexing your site

What is a robots.txt file? Search engines automatically access web pages on the Internet through a program called a robot (also known as a spider). You can create a plain text file named robots.txt on your website and declare in it which parts of the site you do not want robots to visit. In this way, part or all of the site's content can be kept out of search engines, or you can specify that search engines include only certain content.
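To make the idea concrete, the short Python sketch below writes such a plain text file. It is only an illustration: the public_html directory and the /private/ path are assumptions chosen here, and the directive syntax is explained in the sections that follow.

    # Sketch: robots.txt is just a plain text file placed at the website's root.
    # "public_html" and "/private/" are illustrative assumptions, not fixed names.
    from pathlib import Path

    web_root = Path("public_html")
    web_root.mkdir(exist_ok=True)

    # Ask all robots (*) not to visit a hypothetical /private/ section of the site.
    rules = "User-agent: *\nDisallow: /private/\n"
    (web_root / "robots.txt").write_text(rules, encoding="utf-8")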


Where does the robots.txt file go? The robots.txt file must be placed in the root directory of the website. When a robot visits a website (for example http://www.abc.com), it first checks whether the file http://www.abc.com/robots.txt exists. If the robot finds this file, it determines the scope of its access according to the contents of the file.

Website URL                  Corresponding robots.txt URL
http://www.w3.org/           http://www.w3.org/robots.txt
http://www.w3.org:80/        http://www.w3.org:80/robots.txt
http://www.w3.org:1234/      http://www.w3.org:1234/robots.txt
http://w3.org/               http://w3.org/robots.txt
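The mapping in the table above is mechanical: a robot keeps the scheme and the host (including any port) of the site URL and requests /robots.txt from it. Below is a minimal Python sketch of that rule; the helper name robots_txt_url is chosen here for illustration only.

    # Sketch: derive the robots.txt URL a robot would request for a given site URL.
    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(site_url):
        parts = urlsplit(site_url)
        # Keep scheme and network location (host plus optional port); set the path to /robots.txt.
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    for site in ("http://www.w3.org/",
                 "http://www.w3.org:80/",
                 "http://www.w3.org:1234/",
                 "http://w3.org/"):
        print(site, "->", robots_txt_url(site))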


The format "robots.txt" file of the robots.txt file contains one or more records that are separated by the space line (with Cr, Cr / NL, or NL as the end value), each record format is as follows: ": . In this document, you can use # to annotate, specific usage methods, and practices in UNIX. Records in this file typically start with a row or multi-line user-agent, and there are several Disallow lines, the details are as follows: user-agent: The value of this item is used to describe the name of the search engine Robot, in "Robots.txt" In the file, if there are multiple User-Agent records, multiple Robot will be subject to the limit of the protocol, at least one User-Agent record is at least one User-Agent record. If the value of this item is *, the protocol is valid for any machine, in the "Robots.txt" file, "user-agent: *" records can only have one. Disallow: The value of this item is used to describe a URL that does not want to be accessed, which can be a complete path, or some, any URL starting with Disallow is not accessed by Robot. For example, "disallow: / help" does not allow search engine access to /Help.html and /Help/index.html, and "Disallow: / Help /" allows Robot to access /Help.html, not access / help / index .html. Any disallow record is empty, indicating that all parts of the site allow access, in the "/ ROBOTS.TXT" file, at least one DisliW record. If "/Robots.txt" is an empty file, the site is open for all search engines Robot. Robots.txt file usage example

Examples of robots.txt usage

Example 1. Block all search engine robots from every part of the site:

    User-agent: *
    Disallow: /

Example 2. Allow all robots full access (or simply create an empty "/robots.txt" file):

    User-agent: *
    Disallow:

Example 3. Block access by a single search engine robot:

    User-agent: BadBot
    Disallow: /

Example 4. Allow access by a single search engine robot and block all others:

    User-agent: Baiduspider
    Disallow:

    User-agent: *
    Disallow: /

Example 5. A simple configuration. In this example, the site restricts access to three directories, so search engine robots will not visit them. Note that each directory must be declared on its own line; do not write "Disallow: /cgi-bin/ /tmp/". Also note that * has a special meaning in the User-agent field, representing "any robot", so records such as "Disallow: /tmp/*" or "Disallow: *.gif" cannot appear in this file.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /~joe/
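As a quick cross-check, Example 4 can be fed to the same urllib.robotparser module from the Python standard library. This is only a sketch; the example.com URLs and the "SomeOtherBot" name are assumptions used for illustration, not part of the examples above.

    # Sketch: verify Example 4 - Baiduspider may crawl, every other robot is blocked.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: Baiduspider",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("Baiduspider", "http://example.com/index.html"))   # True
    print(rp.can_fetch("SomeOtherBot", "http://example.com/index.html"))  # False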

References for the robots.txt file. For the robots.txt file and more specific settings, see the following links:

· Web Server Administrator's Guide to the Robots Exclusion Protocol
· HTML Author's Guide to the Robots Exclusion Protocol
· The original 1994 protocol description, as currently deployed
· The revised Internet-Draft specification, which is not yet completed or implemented

Reposted article; please credit the original source: https://www.9cbs.com/read-130996.html
