Baidu and the Use of the robots.txt File in the Website Root Directory (1)

xiaoxiao · 2021-03-06

Today I noticed that Baidu's search result pages carry a disclaimer link. Opening it reveals a list of clauses of the form "XXX assumes no legal responsibility". Clause 6 reads: "If any website does not want to be indexed by Baidu Online Network Technology (Beijing) Co., Ltd., it should notify the service website or Baidu, or mark its pages according to the robots exclusion (Spider) protocol; otherwise, Baidu's search engine will treat it as an indexable website." From this you can learn something about how search engines stand with respect to the law: unless a website clearly expresses refusal, the default is acceptance. No wonder the Internet is so open. It has the feel of European and American movies, where a woman who does not clearly refuse is taken to accept; the experience with real women is that anything short of clear acceptance is refusal.

How, then, do you keep search engines from indexing a site?

1. What is the robots.txt file?
Search engines automatically visit web pages on the Internet through a program called a robot (also known as a spider). You can create a plain-text file named robots.txt on your website and declare in it the parts of the site that you do not want robots to visit. In this way, part or all of the site's content can be kept out of search engine indexes, or you can specify that search engines index only designated content.

2. Where does the robots.txt file go?
The robots.txt file must be placed in the root directory of the website. When a robot visits a website (for example, http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists. If the robot finds this file, it determines the scope of its access according to the file's contents.

Website URL and the corresponding robots.txt URL:
http://www.w3.org/ -> http://www.w3.org/robots.txt
http://www.w3.org:80/ -> http://www.w3.org:80/robots.txt

Some live robots.txt files:
http://www.cn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt

3. robots.txt file format
The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the following format:

"<field>:<optionalspace><value><optionalspace>"

Comments can be added in this file using "#", following the same conventions as in UNIX. A record typically starts with one or more User-agent lines followed by several Disallow lines. In detail:

User-agent:
The value of this field names the search-engine robot the record applies to. If "robots.txt" contains several User-agent records, several robots are bound by the protocol; there must be at least one User-agent record. If the value is set to *, the record applies to any robot, and there can be only one "User-agent: *" record in the file.

Disallow:
The value of this field describes a URL that should not be visited. It can be a complete path or a prefix; any URL beginning with the Disallow value will not be visited by the robot. For example, "Disallow: /help" forbids access to both /help.html and /help/index.html, while "Disallow: /help/" lets the robot visit /help.html but not /help/index.html. An empty Disallow record means that every part of the site may be visited; "/robots.txt" must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search-engine robots.

4. robots.txt usage examples
Some basic uses of robots.txt:

Forbid all search engines from visiting any part of the website:
User-agent: *
Disallow: /

Allow all robots full access:
User-agent: *
Disallow:
(or simply create an empty "/robots.txt" file)

Forbid all search engines from visiting several parts of the website (the cgi-bin, tmp, and private directories in this example):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Forbid one particular search engine (BadBot in this example):
User-agent: BadBot
Disallow: /

Allow only one search engine (WebCrawler in this example):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

A robots.txt syntax checker, robotcheck.cgi, was available at www.searchengineworld.com.
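Rules like these can also be checked programmatically. The sketch below uses Python's standard urllib.robotparser against the BadBot and directory examples above; the host example.com and the bot names are illustrative:

```python
# Test robots.txt rules with Python's standard urllib.robotparser.
# RULES mirrors the examples above; example.com is a placeholder host.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# BadBot is shut out of the whole site; every other robot may fetch
# anything outside /cgi-bin/, /tmp/ and /private/.
print(rp.can_fetch("BadBot", "http://example.com/index.html"))   # False
print(rp.can_fetch("GoodBot", "http://example.com/index.html"))  # True
print(rp.can_fetch("GoodBot", "http://example.com/tmp/x.html"))  # False
```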

5. Robots Meta tags

1. What is the Robots Meta tag?
The robots.txt file mainly restricts a search engine's access to a whole site or directory, while the Robots Meta tag is aimed at one specific page. Like other Meta tags (language, description, keywords, and so on), the Robots Meta tag is placed in the <head> of the page and tells search-engine robots how to crawl that page's content. A typical page head looks like this (the original highlights the Robots line in bold):

<html>
<head>
<title>Time Marketing -- Network Marketing Professional Portal</title>
<meta name="Robots" content="index,follow">
<meta name="keywords" content="Marketing…">
<meta name="description" content="Time Marketing Network is…">
<link rel="stylesheet" href="/public/css.css" type="text/css">
</head>
…

2. Robots Meta tag syntax:

The Robots Meta tag is not case-sensitive. name="Robots" addresses all search engines; it can be written as name="BaiduSpider" (or another robot's name) to address one specific search engine. The content part takes four directive options: index, noindex, follow, nofollow, separated by commas. The index directive tells the search robot that the page may be indexed; the follow directive tells it that it may continue crawling along the links on the page. The default values of the Robots Meta tag are index and follow, except for Inktomi, for which the default is index, nofollow.

This gives four combinations:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Among them,
<meta name="robots" content="index,follow"> can be written as
<meta name="robots" content="all">;
<meta name="robots" content="noindex,nofollow"> can be written as
<meta name="robots" content="none">.

It should be noted that the robots.txt and Robots Meta tag restrictions on search-engine robots crawling site content described above are only conventions: they require the cooperation of the search-engine robots, and not every robot complies.

At present, the vast majority of search-engine robots comply with the robots.txt rules. Support for the Robots Meta tag is less common, but it is gradually increasing; the famous search engine Google, for example, fully supports it, and Google has also added a directive, archive, that controls whether Google keeps a snapshot of a page. For example:

<meta name="googlebot" content="index,follow,noarchive">

tells Google to index the page and follow the links on it, but not to keep a snapshot of the page on Google.
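On the page side, the directives in a Robots Meta tag can be pulled out with Python's standard html.parser; the class name RobotsMetaParser and the sample page below are illustrative:

```python
# Extract Robots META directives (index/noindex/follow/nofollow/noarchive)
# from a page head using Python's standard html.parser.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # name="robots" addresses all robots; a robot name such as
        # "googlebot" addresses one specific search engine.
        if a.get("name", "").lower() in ("robots", "googlebot"):
            self.directives = [p.strip().lower()
                               for p in a.get("content", "").split(",")]

page = '<head><meta name="googlebot" content="index,follow,noarchive"></head>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['index', 'follow', 'noarchive']
```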

Author Blog:

http://blog.9cbs.net/davidullua/


Please credit the original address when reposting: https://www.9cbs.com/read-118951.html
