The Relationship Between Web Robots and Your Homepage

xiaoxiao2021-03-06 94

The Internet keeps getting more exciting, and the World Wide Web grows more popular by the day. Publishing company information online and doing e-commerce have evolved from a fashion into common practice. As a webmaster, you may be well versed in HTML, JavaScript, Java, and ActiveX, but do you know what a Web Robot is? Do you know how Web Robots relate to the homepage you designed?

Wanderers on the Internet: Web Robots

Sometimes you will unexpectedly find your homepage indexed in a search engine, even though you have never had any contact with it. This is the work of a Web Robot. Web Robots are programs that traverse the hypertext structure of the Web, following URLs from page to page and retrieving the content of network sites. These programs are sometimes called "spiders", "Web wanderers", "Web worms", or "Web crawlers". Well-known search engine sites on the Internet run dedicated Web Robot programs to collect information, such as Lycos, Webcrawler, and Altavista, as do Chinese search engine sites such as Net Nerat, Netease, and GoYoyo.

A Web Robot is like an uninvited guest. Whether you care or not, it stays loyal to its duties and tirelessly roams the World Wide Web. It will of course visit your homepage too, retrieve its content, and generate records in the format it needs. Perhaps there is some content you are happy for the world to know, and other content you do not want to be seen or indexed. Must you let robots run wild across your homepage, or can you direct and control where Web Robots go? The answer is yes. Once you have read this article, you can put up road signs like a traffic officer, telling Web Robots how to retrieve your homepage, which parts may be retrieved, and which may not be accessed.

In fact, Web Robots can understand what you say

Don't assume that Web Robots roam without organization or control. Much Web Robot software offers site administrators and web content authors two ways to restrict where robots go:

1. Robots Exclusion Protocol

A site administrator can create a specially formatted file on the site to indicate which parts of the site robots may access. This file is placed in the root of the site, i.e. http://.../robots.txt.

2. Robots Meta Tag

A web page author can use a dedicated HTML meta tag to indicate whether a page may be indexed, analyzed, or have its links followed.

These methods work with most Web Robots. Whether they are actually honored depends on each robot's developers, so they are not guaranteed to work for every robot. If you urgently need to protect your content, consider additional protection such as passwords.
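Because the first method depends on the file sitting at the root of the site, the address a robot requests can be derived from any page URL. The following is a minimal Python sketch of that step; the function name `robots_txt_url` and the page path are illustrative assumptions, not part of any particular robot.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path, query and fragment are
    # dropped because the file always lives at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on the example site used in this article.
print(robots_txt_url("http://www.sti.net.cn/some/dir/page.html"))
# prints http://www.sti.net.cn/robots.txt
```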

Using the Robots Exclusion Protocol

When a robot visits a Web site, such as http://www.sti.net.cn/, it first checks for the file http://www.sti.net.cn/robots.txt. If the file exists, the robot analyzes it according to a record format like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

to determine whether it should retrieve the site's files. These records are intended only for Web Robots, and ordinary visitors using a browser will generally never see this file, so do not get fanciful and add HTML statements or insincere greetings such as "How do you do? Where are you from?". A site can have only one "/robots.txt" file, and every letter of the file name must be lowercase. Each "Disallow" line in a record names a URL that you do not want robots to access. Each URL must be on its own line; a combined line such as "Disallow: /cgi-bin/ /tmp/" is not allowed. Blank lines must not appear inside a record either, because a blank line marks the boundary between records.

The User-agent line gives the name of the robot or other agent. In the User-agent line, "*" has a special meaning: all robots.
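The record rules described above (one URL per Disallow line, a blank line ending a record) can be illustrated with a small parser. This is only a sketch in Python under simplifying assumptions: the function name `parse_records` and the dictionary layout are made up for this example, and real robots may handle more fields and edge cases.

```python
def parse_records(text):
    """Split robots.txt text into records.

    Each record is a dict holding the user agents it names and the URL
    prefixes those agents should not access.
    """
    records = []
    current = None  # record being filled, or None between records
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop a trailing comment
        field, sep, value = line.partition(":")
        if not sep:
            continue  # comment-only or malformed line: ignore it
        if current is None:
            current = {"agents": [], "disallow": []}
            records.append(current)
        field = field.strip().lower()
        if field == "user-agent":
            current["agents"].append(value.strip())
        elif field == "disallow":
            # One URL prefix per line; an empty value disallows nothing.
            current["disallow"].append(value.strip())
    return records


# The sample record from this section.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

for record in parse_records(SAMPLE):
    print(record["agents"], record["disallow"])
# prints ['*'] ['/cgi-bin/', '/tmp/', '/~joe/']
```

Keeping each record as its own dictionary mirrors the file layout: a robot would pick the record whose User-agent matches its name, falling back to the "*" record.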

Below are several examples of robots.txt:

Exclude all robots from the entire server:

User-agent: *
Disallow: /

Allow all robots to access the entire site:

User-agent: *
Disallow:

Or create an empty "/robots.txt" file.

Allow all robots to access only part of the server:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Exclude a specific robot:

User-agent: Badbot
Disallow: /

Allow only one robot to visit:

User-agent: Webcrawler
Disallow:

User-agent: *
Disallow: /
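One way to check what such a file permits is Python's standard `urllib.robotparser` module. The sketch below feeds it the "only one robot" rules from memory and asks whether two robots may fetch a page. The helper name `allowed` and the page URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules from the "only one robot" example: Webcrawler may go anywhere,
# every other robot is excluded from the whole site.
RULES = """\
User-agent: Webcrawler
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent, url, rules=RULES):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse from memory, no network access
    return parser.can_fetch(agent, url)


PAGE = "http://www.sti.net.cn/index.html"  # hypothetical page on the example site
print(allowed("Webcrawler", PAGE))  # True
print(allowed("Badbot", PAGE))      # False
```

The same helper can be pointed at any of the other examples by passing a different `rules` string.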

Finally, here is the robots.txt from the http://www.w3.org/ site:

# For use by search.w3.org
User-agent: w3crobot/1
Disallow:

User-agent: *
Disallow: /member/ # This is restricted to W3C Members only
Disallow: /team/ # This is restricted to W3C Team only
Disallow: /Tands/Member # This is restricted to W3C Members only
Disallow: /Tands/Team # This is restricted to W3C Team only
Disallow: /project
Disallow: /systems
Disallow: /web
Disallow: /team

Using the Robots Meta Tag

The Robots Meta Tag lets HTML page authors indicate whether a page may be indexed and whether its links may be followed to find more files. Currently only some robots implement this feature.

The format of the Robots Meta Tag is:

<meta name="robots" content="noindex,nofollow">

Like any other meta tag, it should be placed in the HEAD section of the HTML file:

<html>
<head>
<meta name="robots" content="noindex,nofollow">
<title>...</title>
</head>
<body>
...

The directives in a Robots Meta Tag are separated by commas. The available directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive indicates whether an indexing robot may index the page; the FOLLOW directive indicates whether the robot may follow the links on the page. The defaults are INDEX and FOLLOW. For example:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

A good Web site administrator should take robot management into account, so that robots serve the homepage without harming the security of its pages.
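As a rough illustration of how a cooperating robot might read these directives, the Python sketch below uses the standard `html.parser` module to find a robots meta tag and work out the INDEX and FOLLOW settings, starting from the defaults. The class name `RobotsMetaParser` and the helper `robots_meta` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.index = True   # default: INDEX
        self.follow = True  # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # Directives are comma separated and case-insensitive.
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive == "noindex":
                self.index = False
            elif directive == "index":
                self.index = True
            elif directive == "nofollow":
                self.follow = False
            elif directive == "follow":
                self.follow = True


def robots_meta(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = ('<html><head><meta name="robots" content="noindex,follow">'
        '<title>...</title></head><body>...</body></html>')
print(robots_meta(page))  # (False, True)
```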