Today I noticed a row of disclaimers on Baidu's search results page. Clicking through leads to a pile of "XXX assumes no legal responsibility" statements. Article 6 reads: "If a website does not wish to be indexed by Baidu Online Network Technology (Beijing) Co., Ltd., it should notify the service website or Baidu, or add the appropriate exclusion tags to its pages according to the Robots Exclusion Protocol; otherwise Baidu's search engine will treat it as an indexable website." From this you can learn the rule of the search engine world: a website that does not clearly express refusal is taken to accept by default. No wonder the Internet is so open.
It reminds me of the impression left by European and American movies: an open woman is one who gives no clear refusal, while a reserved woman is one who gives no clear acceptance.
How to prevent search engines from indexing your site
I. What is a robots.txt file?
Search engines automatically access web pages on the Internet through a program called a robot (also known as a spider).
You can create a plain text file named robots.txt in your website. In this file you declare the parts of the site that you do not want robots to access, so that part or all of the site's content is excluded from search engines, or you can specify that search engines index only the designated content.
II. Where should the robots.txt file be placed?
The robots.txt file should be placed in the root directory of the website. When a robot visits a website (say, http://www.abc.com), it first checks whether http://www.abc.com/robots.txt exists. If the robot finds this file, it determines the scope of its access according to the file's contents.
Website URL              Corresponding robots.txt URL
http://www.w3.org/       http://www.w3.org/robots.txt
http://www.w3.org:80/    http://www.w3.org:80/robots.txt

Here are the robots.txt files of some well-known sites:
http://www.cn.com/robots.txt
http://www.google.com/robots.txt
http://www.ibm.com/robots.txt
http://www.sun.com/robots.txt
http://www.eachnet.com/robots.txt
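The URL mapping in the table above can be sketched in code. The following Python fragment (standard library only; the W3C URLs are the same ones used in the table) derives a site's robots.txt URL from any page URL:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    """Return the robots.txt URL for a site: scheme://host[:port]/robots.txt."""
    parts = urlsplit(site_url)
    # Keep the scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org/"))             # http://www.w3.org/robots.txt
print(robots_url("http://www.w3.org:80/a/b.html"))  # http://www.w3.org:80/robots.txt
```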
III. robots.txt file format
The "robots.txt" file contains one or more records separated by blank lines (terminated by CR, CR/NL, or NL). Each record has the following format:

"<field>:<optionalspace><value><optionalspace>"
In this file, # can be used for comments, following the same convention as in UNIX. Records typically start with one or more User-agent lines, followed by several Disallow lines. The details are as follows:
User-agent: The value of this field names the search engine robot the record applies to. If a "robots.txt" file contains multiple User-agent records, multiple robots are constrained by the protocol, so the file must contain at least one User-agent record. If the value of this field is *, the record applies to every robot, and a "robots.txt" file may contain only one "User-agent: *" record.

Disallow: The value of this field describes a URL that should not be visited. It can be a complete path or a prefix; any URL beginning with the Disallow value will not be accessed by the robot. For example, "Disallow: /help" forbids access to both /help.html and /help/index.html, whereas "Disallow: /help/" allows the robot to access /help.html but not /help/index.html. An empty Disallow value means that all parts of the site may be visited, and a "/robots.txt" file must contain at least one Disallow record. If "/robots.txt" is an empty file, the site is open to all search engine robots.
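The prefix-matching behaviour of Disallow can be observed with Python's standard urllib.robotparser module. A sketch (the www.abc.com URLs are placeholders, matching the example host used earlier):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" matches any URL whose path begins with /help.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help",
])
print(rp.can_fetch("*", "http://www.abc.com/help.html"))        # False
print(rp.can_fetch("*", "http://www.abc.com/help/index.html"))  # False

# "Disallow: /help/" matches only URLs inside the /help/ directory.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help/",
])
print(rp.can_fetch("*", "http://www.abc.com/help.html"))        # True
print(rp.can_fetch("*", "http://www.abc.com/help/index.html"))  # False
```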
IV. robots.txt usage examples
Here are some basic uses of robots.txt:
- Disallow all search engines from accessing any part of the site:

  User-agent: *
  Disallow: /

- Allow all robots full access:

  User-agent: *
  Disallow:

  (Alternatively, simply create an empty "/robots.txt" file.)

- Disallow all search engines from accessing certain sections of the site (the cgi-bin, tmp, and private directories in the example below):

  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /private/

- Disallow a particular search engine (BadBot in the example below):

  User-agent: BadBot
  Disallow: /

- Allow only a particular search engine (WebCrawler in the example below):

  User-agent: WebCrawler
  Disallow:

  User-agent: *
  Disallow: /
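The last example above ("allow only WebCrawler") can be verified with Python's standard urllib.robotparser module. A sketch (www.example.com is a placeholder host):

```python
from urllib.robotparser import RobotFileParser

# Records equivalent to the "allow only WebCrawler" example above.
rp = RobotFileParser()
rp.parse([
    "User-agent: WebCrawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])
print(rp.can_fetch("WebCrawler", "http://www.example.com/index.html"))  # True
print(rp.can_fetch("BadBot", "http://www.example.com/index.html"))      # False
```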
The following gadget checks the validity of robots.txt files:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
V. The Robots Meta tag
1. What is a Robots Meta tag?
The robots.txt file mainly restricts a search engine's access to an entire site or directory, whereas the Robots Meta tag targets one specific page. Like other Meta tags (such as those specifying the language, page description, or keywords), the Robots Meta tag is placed in the <head> section of the page; it tells search engine robots how to crawl that page's content. Its form looks like this:

<html>
<head>
<title>...</title>
<meta name="robots" content="index,follow">
...
</head>
<body>
...
</body>
</html>
2. Robots Meta tag directives:
The Robots Meta tag is case-insensitive. name="robots" addresses all search engines; to address one specific engine, write its robot name instead, e.g. name="BaiduSpider". The content attribute takes four directive options: index, noindex, follow, nofollow, separated by commas.
The index directive tells the search robot that it may index the page;
the follow directive tells the search robot that it may continue crawling along the links on the page.
The default for the Robots Meta tag is index, follow, except for Inktomi, for which the default is index, nofollow.
In this way, there are four combinations:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Among them, <meta name="robots" content="index,follow"> can be written as <meta name="robots" content="all">, and <meta name="robots" content="noindex,nofollow"> can be written as <meta name="robots" content="none">.
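As a sketch of how a crawler might read these tags, the following Python fragment uses the standard html.parser module to extract the directives from a page's Robots Meta tag. The class name RobotsMetaParser and the sample page are hypothetical:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots"> tags, case-insensitively."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values are kept as-is.
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                self.directives += [part.strip().lower()
                                    for part in d.get("content", "").split(",")]

page = '<html><head><meta name="ROBOTS" content="NOINDEX, follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'follow']
```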
It should be noted that the robots.txt file and Robots Meta tag described above restrict search engine robots from crawling site content only by convention: they require the robots' cooperation, and not every robot complies.
At present, the vast majority of search engine robots comply with the robots.txt rules. Support for the Robots Meta tag is less widespread, though it is gradually increasing; the famous search engine Google fully supports it, and Google has also added a directive, "archive", which controls whether Google keeps a cached snapshot of the page. For example:
<meta name="googlebot" content="index,follow,noarchive">

means: index this page and follow the links on it, but do not keep a cached snapshot of the page on Google.