Evading the tracking of search engines


Author: Xiaofeng moon · Date: 2004-05-11

Why do the opposite?

If you are a webmaster, you probably spend a lot of effort getting your site listed and ranked in search engines. But sometimes the opposite happens: you never submitted your site to any search engine, yet you discover, inexplicably, that it can already be found there. Perhaps you are happy for the home page to be public, but there is some content you would rather not have crawled and indexed. Perhaps you put pages behind user authentication, but that alone does not keep them out of search results: once a page has been indexed, anyone can reach it through the search engine without ever entering a password, and simple encryption is often easily broken. Put everything in a database instead? That consumes valuable hosting space and is impractical for a simple site. So what can you do? After all, a search engine is not a thief breaking into your house whom you can simply lock out. How do you refuse a search engine?

Exploring how search engines work

First, we need to understand how a search engine works. A web search engine consists of three main parts: the robot, which is the key to full-text retrieval, the index database, and the query service. Any page that the web robot discovers gets an entry in the search engine's index database, and through the query client anyone can then find that page. So the key below is to study this web robot; the principles of the index database and the query service will not be analyzed here.

A web robot is a program that visits large numbers of Internet URLs, extracts the URL links contained in each page, and recursively retrieves the entire content of a web site. These programs are also called "spiders", "web wanderers", "web worms", or web crawlers. Every search engine runs its own web robot programs to collect this information, and a high-performance robot can search the Internet automatically. A typical robot works by looking at a page and extracting the relevant keywords and page information, such as the title shown in the browser's title bar and vocabulary that is frequently used in searches. It then follows all the links on that page and keeps looking for related information, pushing outward until the links are exhausted. To browse the whole Internet quickly, a robot is usually implemented with multiple threads: it indexes the pages reachable from one URL, and for each new URL link it finds it starts a new thread that indexes from that new starting point. The collected information is then indexed so that users can search it. You might wonder whether this turns into an endless loop. Of course robots also need to rest: a web robot is dispatched periodically and only runs for a limited period of time, which is why a freshly published page does not show up in the search engine's index right away. At this point the basic working principle of a web search engine should be clear. Commanding this robot, telling it what it may and may not look at, is the next task.
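To make the crawling loop just described concrete, here is a minimal, single-threaded sketch in Python: it fetches a page, records its title, and then follows every link it finds, breadth-first. The seed URL, the page limit, and the title-only "index" are simplifying assumptions for illustration; a real search-engine robot runs many such loops in parallel threads and extracts far more than titles.

# A minimal sketch of the crawl-and-follow-links loop described above.
# The seed URL and page limit are placeholders, not a real deployment.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href link from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, record its title, follow its links."""
    seen, queue, index = set(), deque([seed]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                          # unreachable page: skip it
        parser = LinkAndTitleParser()
        parser.feed(html)
        index[url] = parser.title.strip()     # the "index" here is just URL -> title
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)        # keep following until exhausted or capped
    return index

if __name__ == "__main__":
    # http://www.yoursite.com/ is the article's placeholder address.
    for page, title in crawl("http://www.yoursite.com/").items():
        print(title, "->", page)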

Evading the tracking of search engines

Fortunately, the developers of search engines have also left webmasters and page authors some ways to restrict what the web robot does:

When a robot visits a website (for example http://www.yoursite.com), it behaves like an unfamiliar visitor at the door of a large house: it first checks whether the owner agrees to let it in at all. If not, it quietly walks away; if so, it checks which rooms the owner allows it to enter. Concretely, the robot first checks whether the file http://www.yoursite.com/robots.txt exists. If this file is not found, the robot simply walks in and looks for whatever information it wants. If the robot does find the file, it determines the scope of its access according to the file's contents. Of course, if the file is empty, that is equivalent to it not existing, and the robot acts freely. Remember that the robots.txt file must be placed in the root directory of the website. Records in robots.txt typically start with one or more User-agent lines, followed by a number of Disallow lines, described in detail below:
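If you want to see this check from the robot's side, here is a small sketch using Python's standard urllib.robotparser; the address is just the placeholder site used above, so treat it as an illustration rather than a working deployment.

# Sketch of the check a well-behaved robot performs before crawling a site.
# http://www.yoursite.com is the article's placeholder address.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")
rp.read()    # fetch and parse the site's robots.txt (a missing file means no restrictions)

# Ask whether a given robot may fetch a given URL before actually fetching it.
print(rp.can_fetch("*", "http://www.yoursite.com/index.html"))
print(rp.can_fetch("BadBot", "http://www.yoursite.com/hacker/index.html"))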

User-agent:

This value names the search engine robot that the record applies to; different search engines use differently named robots. If there are multiple User-agent records in robots.txt, each named robot is restricted by this protocol, so if you want to restrict robots at all, the file must contain at least one User-agent record. If the value of this field is set to *, the record applies to any robot, and a "User-agent: *" record may appear only once in robots.txt.

Disallow:

This value specifies a URL prefix that the robot must not visit. The value may be a complete path or just the beginning of one: any URL starting with the Disallow value will not be accessed by the robot. For example, "Disallow: /hacker" forbids the search engine from accessing both /hacker.html and /hacker/index.html, whereas "Disallow: /hacker/" lets the robot access /hacker.html but not /hacker/index.html. An empty Disallow record, written as just "Disallow:", means that all content of the website may be accessed; a robots.txt file must contain at least one Disallow record.
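The /hacker example above is easy to verify offline with Python's standard urllib.robotparser; the robot name BadBot and the /hacker paths are simply the ones used in this article's examples.

# Checking the prefix rule described above without touching the network.
from urllib.robotparser import RobotFileParser

bare_prefix = RobotFileParser()
bare_prefix.parse([
    "User-agent: *",
    "Disallow: /hacker",        # bare prefix: blocks /hacker.html and /hacker/...
])
print(bare_prefix.can_fetch("BadBot", "/hacker.html"))        # False
print(bare_prefix.can_fetch("BadBot", "/hacker/index.html"))  # False

directory_only = RobotFileParser()
directory_only.parse([
    "User-agent: *",
    "Disallow: /hacker/",       # directory prefix: /hacker.html itself stays reachable
])
print(directory_only.can_fetch("BadBot", "/hacker.html"))        # True
print(directory_only.can_fetch("BadBot", "/hacker/index.html"))  # False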

Below are some examples of robots.txt. Just save the corresponding lines into a file named robots.txt, upload it to the location described above, and the search engine's crawling will be restricted accordingly:

Example 1. Forbid all search engines from accessing any part of the website:

User-agent: *
Disallow: /

Example 2. Allow all robots full access:

User-agent: *
Disallow:

Example 3. Forbid one particular search engine from accessing the site:

User-agent: BadBot
Disallow: /

Example 4. Allow only one particular search engine to access the site:

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /

Example 5. A simple example:

In this example, the website restricts search engine access to three directories, i.e. the search engine will not visit these three directories. Note that each directory must be declared on its own Disallow line; they cannot be combined into a single line such as "Disallow: /cgi-bin/ /bbs/". Also, * has its special meaning of "any robot" only after User-agent:, so records like "Disallow: /bbs/*" or "Disallow: *.gif" must not appear in this file.

User-agent: *
Disallow: /cgi-bin/
Disallow: /bbs/
Disallow: /~private/

Conclusion: Is everything settled once this file is in place? Will the search engine no longer find the pages we have restricted? Not immediately. As mentioned at the beginning of the article, the web robot is dispatched periodically, so pages that are already recorded in the index database only disappear the next time the database is updated. A faster way is to go to the search engine and request removal of your pages, although even that can take a few days. If a page is really important, you can also move it to a different directory or rename the file.

For pages you want to keep confidential, also make sure no other public page links to them. As explained earlier, the web robot works by starting from all the links on a page and continuing to look for related information.

At this point you may feel your confidential pages are safe. But think again: robots.txt is a plain text file that anyone can download over HTTP or FTP, so a person with bad intentions can use it to find clues about what you are hiding. The solution is to use Disallow only on directories, keep the confidential pages inside those directories, and give them hard-to-guess file names. Do not use a name like index.html, which is as easy to guess as a trivial password; with a file name like d3gey32.html, your page is far safer.
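To see just how exposed the file is, a couple of lines of Python are enough for anyone to read it (the address is, again, the article's placeholder site):

# robots.txt is served as plain text, so anyone on the Internet can read it.
from urllib.request import urlopen

print(urlopen("http://www.yoursite.com/robots.txt").read().decode("utf-8"))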

Finally, add password verification on top of all this as a last layer of insurance, and you can stop worrying.

