Robots.txt guide
When a search engine visits a website, it first checks whether a plain text file called robots.txt exists. The robots.txt file is used to define the search engine's access scope on the site, that is, to tell the search engine which files on the site it is allowed to retrieve (download). This is the "Robots Exclusion Standard" seen everywhere on the Internet, referred to below as RES.
The format of the robots.txt file: the robots.txt file has a special format. It consists of records, and the records are separated by blank lines. Each record consists of two fields:
1) A User-agent line;
2) One or more Disallow lines.
The record format is: "<Field>:<optional space><value>".
These two fields are explained in more detail below.
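For instance, a minimal robots.txt consisting of two such records, separated by a blank line, might look like this (the robot name "ExampleBot" and the directory are placeholders, not values taken from any real site):

User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:

The first record applies only to the robot named ExampleBot; the second applies to every other robot.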
User-agent:
The User-agent line specifies the name of the search engine robot. Taking Google's crawler Googlebot as an example, the line would be: User-agent: Googlebot
A robots.txt file must contain at least one User-agent record. If there are multiple User-agent records, then multiple robots will be bound by the RES rules. Of course, if you want to address all robots at once, just use the wildcard "*", i.e.: User-agent: *
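As an illustration (the directory names below are made up), a file with multiple User-agent records gives each named robot its own group of rules, while the wildcard record covers everyone else:

User-agent: Googlebot
Disallow: /logs/

User-agent: *
Disallow: /logs/
Disallow: /tmp/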
Disallow (access denial declaration):
In the robots.txt file, the second field of each record is a Disallow: line. These Disallow lines declare the files and/or directories on the site that should not be accessed. For example, "Disallow: email.htm" declares that access to this file is denied, forbidding spiders from downloading the email.htm file on the website. "Disallow: /cgi-bin/" declares that access to the cgi-bin directory is denied, refusing spiders entry to that directory and its subdirectories. A Disallow declaration also works as a prefix match, much like a wildcard. In the example above, "Disallow: /cgi-bin/" denies access to the cgi-bin directory and its subdirectories, while "Disallow: /bob" denies search engine access to both /bob.html and /bob/index.html (that is, neither a file named bob nor a directory named bob may be accessed by the search engine). If a Disallow record is left empty, it means that every part of the site is open to the search engine.
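As a sketch of the prefix matching described above (the paths are only examples):

User-agent: *
Disallow: /bob
Disallow: /cgi-bin/
Disallow: email.htm

Here the first Disallow blocks /bob.html as well as everything under /bob/, the second blocks the cgi-bin directory and its subdirectories, and the third blocks the single email.htm file. A record whose Disallow value were left empty would instead open the whole site to that robot.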
Spaces & Comments
In a robots.txt file, anything starting with "#" is treated as a comment, following the same convention as in UNIX. But pay attention to two issues:
1) The RES standard allows comments to appear at the end of a directive line, but not all spiders support this format. For example, not every spider will correctly interpret the directive "Disallow: bob #comment"; some spiders will misread the Disallow value as "bob#comment". The safest approach is to put each comment on a line of its own (see the example after this list).
2) The RES standard allows whitespace in a directive line, as in " Disallow: bob #comment", but we do not recommend doing this.
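For example, a comment placed on its own line, as recommended above (the rule itself is only illustrative):

# Keep all spiders out of the temporary directory
User-agent: *
Disallow: /tmp/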
Creating the robots.txt file:
It should be noted that the robots.txt plain text file should be created in UNIX text mode (UNIX line endings). Any good text editor can generally provide a UNIX mode, and your FTP client software should also be able to do this conversion for you. If you try to generate your robots.txt plain text file with an HTML editor that has no plain text editing mode, you will just be wasting your effort.
Extensions to the RES standard:
Although some extensions to the standard have been proposed, such as an Allow line or robot version control, they have not yet been formally approved by the RES working group.
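Purely for illustration, an Allow line under these proposed extensions might be written as below; since the extension is not part of the approved standard, not every spider can be expected to honor it (the directories are hypothetical):

User-agent: *
Disallow: /photos/
Allow: /photos/public/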
Appendix I. robots.txt usage examples:
Use the wildcard "*" to set access rules for all robots.
User-agent: *
Disallow:
Indicates: Allows all search engines to access all the content under the website.
User-agent: *
Disallow: /
Indicates that all search engines are prohibited from accessing all web pages under the website.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
(Note that the second Disallow must start on a new line after the first)
Indicates: all search engines are prohibited from entering the website's cgi-bin and images directories and all of their subdirectories. Note that each directory must be declared separately.
User-agent: Roverdog
Disallow: /
Indicates: the robot Roverdog is prohibited from accessing any file on the website.
User-agent: Googlebot
Disallow: cheese.htm
Indicates: Google's Googlebot is prohibited from accessing the cheese.htm file under its website.
These are simple settings. For more complex settings, see the robots.txt files of some large sites such as CNN or LookSmart (www.cnn.com/robots.txt, www.looksmart.com/robots.txt).
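To give a rough idea of what a more complex file can look like, several records may be combined in one robots.txt; every robot and directory name below is invented for illustration and is not taken from CNN or LookSmart:

# Block an archiving robot from the whole site
User-agent: ArchiveBot
Disallow: /

# Keep a specific robot out of a few areas only
User-agent: NewsSpider
Disallow: /drafts/
Disallow: /cgi-bin/

# Default rules for all other robots
User-agent: *
Disallow: /cgi-bin/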
Appendix II. Related Robots.txt Article Reference:
1. Robots.txt FAQ analysis
2. Use of the Robots META tag
3. Robots.txt checker