The Internet keeps getting more exciting, and the popularity of the WWW is at its height. Publishing company information and doing e-commerce on the Internet have gone from fashionable to commonplace. As a webmaster, you may be familiar with HTML, JavaScript, Java, and ActiveX, but do you know what a Web Robot is? Do you know how Web Robots relate to the homepage you have built?
Wanderers on the Internet --- Web Robots
Sometimes you find, to your surprise, that your homepage has been indexed by a search engine even though you never had any contact with it. This is the work of a Web Robot. Web Robots are programs that traverse the hypertext structure of the Internet's many URLs and recursively retrieve the content of web sites. These programs are sometimes called "spiders", "web wanderers", "web worms", or "web crawlers". Well-known search engine sites run dedicated Web Robot programs to gather information, for example Lycos, Webcrawler, and Altavista, as well as Chinese search engine sites such as Net Nerat, Netease, and GoYoyo.
A Web Robot is like an uninvited guest: whether you care or not, it faithfully does its job, tirelessly roaming the World Wide Web. Of course it will also visit your homepage, retrieve its content, and generate records in the format it needs. Some of your content you may be happy for the world to know, but other content you may not want discovered and indexed. Must you simply let robots run rampant over your homepage, or can you direct and control where they go? The answer is of course yes. Once you have read this article, you can set up road signs like a traffic officer, telling Web Robots how to retrieve your homepage, which parts may be retrieved, and which may not be accessed.
In fact, Web Robots can understand what you tell them
Don't assume that Web Robots roam without organization or control. Many Web Robot programs give site administrators and web content authors two ways to limit where robots go:
1. Robots Exclusion Protocol
The site administrator can create a specially formatted file on the site that indicates which parts of the site robots may access. This file is placed in the root of the site, i.e. http://.../robots.txt.
2. Robots Meta Tag
A web page can use a dedicated HTML meta tag to indicate whether the page may be indexed, analyzed, or have its links followed.
These methods work for most Web Robots. Whether a given robot honors them depends on its developers, so they are not guaranteed to work with every robot. If you urgently need to protect your content, consider other protection methods such as password protection.
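For the first method, the location of the file follows directly from the site address. The sketch below, which assumes Python's standard urllib.parse module and a hypothetical helper name robots_txt_url, shows one way a robot might derive that location from any page URL on the site. The page path used in the example is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: the file always sits at the site root,
    # so the page's own path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# "/docs/page.html" is a hypothetical path used only for illustration.
print(robots_txt_url("http://www.sti.net.cn/docs/page.html"))
# prints http://www.sti.net.cn/robots.txt
```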
Using the Robots Exclusion Protocol
When a robot visits a web site, such as http://www.sti.net.cn/, it first checks for the file http://www.sti.net.cn/robots.txt. If this file exists, the robot analyzes it according to a record format like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
The robot uses these records to determine whether it should retrieve the site's files. The records are meant only for Web Robots; ordinary browsers never see this file, so do not get carried away and add HTML markup or insincere greetings like "How do you do? Where are you from?" to it. A site can have only one "/robots.txt" file, and every letter of the file name must be lowercase. Each "Disallow" line in a record names a URL you do not want robots to access. Each URL must be on its own line; a form such as "Disallow: /cgi-bin/ /tmp/" is not allowed. Blank lines must not appear inside a record, because a blank line separates one record from the next.
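The formatting rules above can be checked mechanically before a file is published. The following is a minimal sketch, not a complete validator: the function name lint_robots_txt is hypothetical, and it assumes that only the two fields described in this article (User-agent and Disallow) are expected, so any other field name, including a misspelled one, is reported.

```python
def lint_robots_txt(text):
    """Return a list of (line_number, message) for suspicious lines."""
    known_fields = ("user-agent", "disallow")  # only the fields this article covers
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue  # blank lines separate records and are not errors here
        field, colon, value = line.partition(":")
        if not colon:
            problems.append((number, "missing ':'"))
            continue
        field = field.strip().lower()
        if field not in known_fields:
            problems.append((number, "unknown field %r" % field))
        elif field == "disallow" and len(value.split()) > 1:
            # Each Disallow line may name only one URL prefix.
            problems.append((number, "more than one path on a Disallow line"))
    return problems


sample = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\nDislow: /private/\n"
for number, message in lint_robots_txt(sample):
    print(number, message)
```

Run on the sample, the sketch flags line 2 for listing two paths and line 3 for the misspelled field name.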
The User-agent line gives the name of the robot or other agent. In the User-agent line, '*' has a special meaning: all robots.
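To see how a well-behaved robot might interpret such a record, the sketch below feeds the sample record to urllib.robotparser from Python's standard library. The robot name "ExampleBot" and the page paths are assumptions made for illustration; a real robot would fetch the file over the network rather than use a string.

```python
from urllib.robotparser import RobotFileParser

RECORD = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RECORD)
# "ExampleBot" and these paths are hypothetical.
for path in ("/index.html", "/cgi-bin/search", "/~joe/notes.html"):
    url = "http://www.sti.net.cn" + path
    print(path, parser.can_fetch("ExampleBot", url))
# /index.html True
# /cgi-bin/search False
# /~joe/notes.html False
```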
Below are several examples of robots.txt:
Keep all robots out of the entire server:
User-agent: *
Disallow: /
Allow all Robots to access the entire site:
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.
Allow all robots to access only part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Exclude one specific robot:
User-agent: Badbot
Disallow: /
Allow only one robot to visit:
User-agent: Webcrawler
Disallow:

User-agent: *
Disallow: /
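The last example relies on a robot picking the record that matches its own name before falling back to the '*' record. A rough check of that behavior, again assuming Python's urllib.robotparser and a hypothetical second robot called "OtherBot", might look like this:

```python
from urllib.robotparser import RobotFileParser

ONE_ROBOT_ONLY = """\
User-agent: Webcrawler
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, url):
    """Report whether the named agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.sti.net.cn/index.html"
print(allowed(ONE_ROBOT_ONLY, "Webcrawler", page))  # True: its own record has an empty Disallow
print(allowed(ONE_ROBOT_ONLY, "OtherBot", page))    # False: falls back to the '*' record
```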
Finally, here is the robots.txt from the http://www.w3.org/ site:
# For use by search.w3.org
User-agent: w3crobot/1
Disallow:

User-agent: *
Disallow: /member/ # This is restricted to W3C Members only
Disallow: /team/ # This is restricted to W3C Team only
Disallow: /Tands/Member # This is restricted to W3C Members only
Disallow: /Tands/Team # This is restricted to W3C Team only
Disallow: /project
Disallow: /systems
Disallow: /web
Disallow: /team
Using the Robots Meta Tag
The Robots Meta Tag lets HTML page authors indicate whether a page may be indexed and whether its links may be followed to find more files. Currently only some robots implement this feature.
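The format of the Robots Meta Tag is:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Like other meta tags, it should be placed in the HEAD section of the HTML file:
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
...
</HEAD>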
Directives in the Robots Meta Tag are separated by commas. The available directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive indicates whether an indexing robot may index the page; the FOLLOW directive indicates whether the robot may follow the links on the page. The default is INDEX and FOLLOW. For example:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
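As an illustration of how a robot might read these directives, the sketch below uses Python's standard html.parser module. The class name RobotsMetaParser and the function robots_directives are hypothetical, and the sketch handles only the INDEX and FOLLOW directives described here, starting from the stated default of INDEX and FOLLOW.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect INDEX/FOLLOW permissions from a robots meta tag."""

    def __init__(self):
        super().__init__()
        self.index = True   # default: INDEX
        self.follow = True  # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().upper()
            if directive in ("INDEX", "NOINDEX"):
                self.index = directive == "INDEX"
            elif directive in ("FOLLOW", "NOFOLLOW"):
                self.follow = directive == "FOLLOW"


def robots_directives(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.index, parser.follow


page = '<HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></HEAD>'
print(robots_directives(page))  # (False, True)
```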
A good web site administrator should take robot management into account, so that robots serve the homepage without compromising the security of its pages.