Web Robots: Blessing or Disaster?
Martijn Koster, NEXOR
April 1995 [1997: Updated Links and Addresses]
Translated by Codehunter
Summary
Robots have been operating on the World Wide Web for more than a year (as of 1995). In that time they have performed useful tasks, but they have also placed a heavy burden on the network. This article examines the advantages and disadvantages of using robots for resource discovery, and discusses and compares some emerging alternative discovery strategies. It concludes that robots will continue to be widely used for some time, but that as the Web grows they will become increasingly inefficient and problematic.
Introduction
Over the past few years the World Wide Web [1] has become enormously popular, and it is now the principal platform for publishing information on the Internet. As the number of sites and documents keeps growing, browsing Web resources by manually following links becomes impractical simply because of the effort involved, let alone an efficient way of exploring what is available.
This problem has prompted experiments with automated browsing by "robots". A web robot is a program that traverses the Web's hypertext structure: it retrieves a document and then recursively retrieves the documents that it links to. These programs are sometimes called "spiders", "web wanderers", or "web worms". The names may be appealing, but they can be misleading: "spider" and "wanderer" give the impression that the program itself moves from machine to machine, and "worm" suggests that it replicates itself, like the infamous Internet worm [2]. In reality a robot is a single software system that simply retrieves information from remote sites using the standard Web protocols.
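To make the definition concrete, the following is a minimal sketch of such a robot in Python (not any of the programs cited in this article), using only the standard library: it fetches a page, extracts the hyperlinks, and visits them breadth-first, with an arbitrary cap on the number of pages so that it terminates.

# Minimal sketch of a web robot: fetch a page, extract links, recurse breadth-first.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue                      # only parse HTML documents
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                              # unreachable host or broken link
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))   # resolve and drop #fragment
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen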
Uses of Robots
Robots can perform a number of useful tasks.
Statistical Analysis
The first robot [3] was deployed to discover and count web servers. Other statistics can be gathered as well: the average number of documents per server, the proportion of particular file types, the average size of a web page, the degree of interconnectedness of the Web, and so on.
Maintenance
A major difficulty in maintaining a hypertext structure is that references to other pages can become "dead links" when those pages are moved or removed, and there is no central mechanism that proactively notifies the maintainers of the referring documents so that they can update their links. Some servers, such as the CERN HTTPD, log failed requests caused by dead links together with the referring document, so that the broken links can be traced and fixed by hand, but this is not very practical. In reality authors only discover that their documents contain bad links when they notice it themselves or, in many cases, when a user informs them by e-mail.
A robot that verifies references, such as MOMspider [4], can help authors locate these dead links, and so helps maintain the hypertext structure. Robots can also help maintain the content itself, for example by checking HTML [5] compliance, conformance to style guidelines, regular updates, and so on, but this is not common practice. Arguably this kind of functionality should be integrated into HTML authoring environments, so that the checks can be repeated whenever a document changes and any problems resolved immediately.
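As an illustration of this kind of reference verification (a sketch in the spirit of such tools, not MOMspider's actual code), the function below issues a HEAD request for each link found on a page and reports the ones that fail; the function name and parameters are illustrative.

# Sketch of a dead-link check: HEAD each link, collect the failures.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_links(links):
    dead = []
    for href in links:
        try:
            req = Request(href, method="HEAD")
            urlopen(req, timeout=10)
        except HTTPError as e:
            dead.append((href, e.code))          # e.g. 404 Not Found
        except URLError as e:
            dead.append((href, str(e.reason)))   # host unreachable, DNS failure, ...
    return dead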
Mirroring
Mirroring is a popular technique for maintaining FTP archives. A mirror copies an entire directory tree by FTP and then regularly re-fetches the documents that have changed. Mirroring allows load sharing, provides redundancy against host failures, speeds up access, reduces network traffic, and supports off-line browsing.
Robots can be used to mirror web sites, since no mature mirroring tools exist for the Web yet. Several robots can retrieve a subtree of pages and store them locally, but they cannot easily detect and re-fetch only the pages that have changed. A second problem is that the copied pages need to be rewritten: links to pages that have also been mirrored may need to be changed to point to the local copies, while links to pages that have not been mirrored need to be expanded into absolute URLs. The need for such tools could also be reduced by mature caching servers [6], which can update documents selectively, guarantee that a cached document is up to date, and are largely self-maintaining. Either way, we expect proper mirroring tools to be developed in due course.
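The link-rewriting step can be illustrated with a small sketch (an assumption about how such a tool might work, not a description of any existing mirroring robot): links that resolve inside the mirrored subtree become relative local paths, and links outside it are left as absolute URLs.

# Sketch of link rewriting for a local mirror.
import posixpath
from urllib.parse import urljoin, urlparse

def rewrite_link(page_url, href, mirror_root):
    """Rewrite one href found on page_url for a mirror rooted at mirror_root."""
    target = urljoin(page_url, href)                 # expand to an absolute URL
    if not target.startswith(mirror_root):
        return target                                # not mirrored: keep absolute
    # Both URLs are inside the mirror: emit a link relative to the page's directory.
    page_dir = posixpath.dirname(urlparse(page_url).path) or "/"
    return posixpath.relpath(urlparse(target).path, page_dir)

# Example: a link from /pub/docs/a.html to /pub/img/b.gif becomes "../img/b.gif".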
Resource Discovery
Perhaps the most exciting application of robots is resource discovery. Wherever people cannot cope with the sheer volume of information, it is attractive to let computers do the work; several robots index large parts of the Web and provide a search engine for accessing the resulting database.
This means that instead of relying on browsing alone, a Web user can combine browsing and searching to locate information; even if the database does not contain the exact item the user needs, it is likely to contain pages that refer to it.
The second advantage is that these databases can be updated automatically, so that dead links are detected and removed. This contrasts sharply with manually maintained document collections. The use of robots for resource discovery is discussed further below.
Combined Uses
A single robot can serve several purposes; for example, the RBSE Spider [7] performs statistical analysis of the documents it retrieves as well as providing a resource-discovery database. Unfortunately such combined uses are rare.
Operational Costs and Dangers
Using robots comes at a price, especially when they operate over the wider Internet. In this section we will see that robots place expensive demands on the Web, and can even be dangerous.
Network Resources and Server Load
Robots require considerable bandwidth. First, a robot operates continuously over prolonged periods, often months. To speed up execution, many robots issue requests in parallel, resulting in a sustained high use of bandwidth. Even remote parts of the network can feel the strain when a robot makes a large number of retrievals in a short time ("rapid fire"). This can leave other users with insufficient bandwidth, especially over low-bandwidth links, since the Internet has no mature protocols for load balancing.
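The simplest defence against rapid fire is to issue one request at a time and pause between retrievals. A minimal sketch, with an arbitrary delay value:

# Sketch of polite, sequential retrieval with a pause between requests.
import time
from urllib.request import urlopen

def polite_fetch(urls, delay_seconds=60):
    for url in urls:
        with urlopen(url, timeout=10) as resp:
            yield url, resp.read()
        time.sleep(delay_seconds)    # let the server and the network recover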
Traditionally the Internet has been perceived as "free", because individual users do not pay for its operation. This perception is wearing thin, as corporate users in particular do feel the direct cost of network traffic. A company may decide that the service it offers to (potential) customers is worth that cost, but automated transfers by robots bring it no such benefit.
Besides placing demands on the network, robots also place demands on servers. Depending on the frequency with which documents are requested, the load can be considerable, degrading the service the server provides to other users; this is completely unacceptable when the host is also used for other purposes. As a test, the author ran a simulation of repeated retrievals from his own server, which runs the Plexus server on a Sun 4/330. Within a few minutes the machine slowed to a crawl and could do nothing else, and that was simply the result of consecutive retrievals. In the very week this article was written, a robot visited the author's site with rapid-fire requests; after 170 consecutive retrievals the server crashed.
These facts show that rapid fire should be avoided. Unfortunately even some popular browsers (such as Netscape) ignore this, for example when retrieving inline images. The Web's transfer protocol, HTTP [8], has been shown to be inefficient for this kind of transfer [9], and new protocols are being designed to remedy this [10].
Updating Overhead
We mentioned that robot-generated databases can be updated automatically. Unfortunately the Web has no effective mechanism for change control: there is no single request that determines which of a set of URLs has been moved, deleted, or changed. (Translator's note: HTTP/1.0 has since added requests for retrieving a document's last modification time.)
Bad Implementations
Freshly written robots in particular can place a heavy strain on hosts and networks. Even when the protocol exchanges and URLs a robot sends are correct, and the robot handles the protocol properly (including advanced features such as redirection), unforeseen problems still occur.
The author has observed the same robot visiting his server repeatedly. Although in some cases this was caused by different people running the robot against the site (rather than a single local operator), in other cases it was clearly due to sloppy implementation. Repeated retrievals typically occur when the robot keeps no history of the addresses it has already visited (which is unforgivable), or when it fails to recognize that syntactically different URLs are equivalent, for example when several DNS names point to the same IP address, or when URLs are not normalized, so that "foo/bar/../baz.html" and "foo/baz.html" are treated as different documents.
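Keeping a history of visited addresses only works if URLs are first normalized. The sketch below shows one plausible normalization step (lower-casing the host, dropping the default port, collapsing ".." segments); detecting different host names that resolve to the same IP address would additionally require a DNS lookup.

# Sketch of URL normalization so syntactic variants are not fetched twice.
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port != 80:           # keep only non-default ports
        host = f"{host}:{parts.port}"
    path = posixpath.normpath(parts.path or "/")  # collapses "bar/../baz" and "//"
    if parts.path.endswith("/") and path != "/":
        path += "/"                               # normpath strips trailing slashes
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# canonicalize("HTTP://Example.COM:80/foo/bar/../baz.html")
#   -> "http://example.com/foo/baz.html"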
Some robots retrieve document types that they cannot handle, such as GIF images and PostScript files, only to ignore them afterwards.
A further danger is that some parts of the Web come close to being infinite. Consider, for example, a script that returns a page containing a link to a page one level deeper: starting at "/cgi-bin/pit/", it links to "/cgi-bin/pit/a/", then "/cgi-bin/pit/a/a/", and so on. Such URL spaces can trap a robot that stumbles into them; they are commonly called "black holes". The standard for robot exclusion, discussed below, offers some protection against this.
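A robot cannot recognize every black hole, but simple heuristics help. The following sketch rejects URLs whose path is suspiciously deep or repetitive; the thresholds are arbitrary illustrations, not part of any standard.

# Sketch of a heuristic defence against unbounded, script-generated URL spaces.
from urllib.parse import urlsplit

def looks_like_black_hole(url, max_depth=8, max_repeats=3):
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # The /cgi-bin/pit/a/a/a/... pattern: the same segment repeated many times.
    for segment in set(segments):
        if segments.count(segment) > max_repeats:
            return True
    return False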
Problems with Cataloguing
There is no denying that robot-generated resource-discovery databases are popular: users rely on them to locate resources. However, there is some debate about how well suited robots actually are to resource discovery on the Web.
The Web contains a vast amount of information, and that information is highly dynamic.
One measure of the effectiveness of an information-retrieval approach is "recall": the fraction of all relevant documents that are actually found. Brian Pinkerton [15] argues that recall is adequate in Internet indexing systems, in the sense that finding enough relevant documents is not the problem. However, if the complete set of information available on the Internet is taken as the baseline, a robot-generated database cannot achieve high recall, because the amount of data is enormous and it changes constantly. In practice, as the Web grows, a robot database's coverage of the full picture can only get worse.
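For reference, recall (and precision, which is discussed further below) can be written out explicitly; here retrieved and relevant are taken to be sets of document identifiers.

# Recall and precision as used in the text.
def recall(retrieved, relevant):
    """Fraction of all relevant documents that the robot actually found."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved)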
Deciding What to Include or Exclude
A robot cannot automatically decide whether a given page should be included in its index. A web server may serve documents that are only relevant in a local context (an internal library index, for example), that are only temporary, or that exist for other purposes. To some extent the decision depends on the user, and the robot may have no way of recognizing it; in practice robots end up storing almost everything they find. Note that even if a robot could decide to exclude a particular document from its database, it has already incurred the cost of retrieving it; a robot that fetches and then discards a large proportion of documents is very wasteful.
In an attempt to alleviate this situation, the robot community has adopted a "standard for robot exclusion". The standard describes how a simple plain-text file can be used to specify which parts of a site's URL space robots should stay out of (see Figure 1; a sketch of how a robot can comply with such a file follows the figure). It can also be used to warn robots away from black holes. Individual robots can be given specific directives, since some robots may behave more sensibly than others, or may be known to specialize in a particular area. The standard is voluntary, but it is very simple to implement, and there is considerable public pressure on robot authors to comply.
Deciding how to traverse the Web is a related problem. Given that most web servers are organized hierarchically, a breadth-first traversal from the top down to a limited depth is likely to find a broader and more evenly spread set of documents and servers more quickly than a depth-first traversal, and is therefore better suited to resource discovery. A depth-first traversal, on the other hand, is more likely to reach personal home pages, with their links to other, potentially new, addresses, and is therefore more likely to discover new sites.
# /robots.txt for http://www.site.com/
User-agent: *             # attention all robots:
Disallow: /cyberworld/map # infinite URL space
Disallow: /tmp/           # temporary files
Figure 1: An example /robots.txt file
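A robot can comply with such a file using very little code. The sketch below uses Python's standard robots.txt parser against the file in Figure 1; the robot name and the path being tested are illustrative.

# Sketch of checking the exclusion file from Figure 1 before fetching a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.site.com/robots.txt")
rp.read()                                              # fetch and parse the file

if rp.can_fetch("ExampleRobot/1.0", "http://www.site.com/tmp/scratch.html"):
    print("allowed")
else:
    print("disallowed by /robots.txt")                 # /tmp/ is excluded in Figure 1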
Summarizing Documents
How to summarize a document for indexing is quite a different problem. Early robots simply stored document titles and anchor texts, but newer robots use more advanced mechanisms and generally consider the entire content of the document.
These methods are good general-purpose summarization techniques that can be applied automatically to every page, but they cannot match the quality of a manual index written by the author. HTML does provide a mechanism for attaching meta information to a document: the author specifies a <META> element, for example one whose name and content attributes list keywords describing the page. However, no semantics are defined for particular values of this tag's attributes, which severely limits its usefulness, even though the mechanism itself is promising.
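As an illustration of how a robot might read such meta information (an assumption, not a prescribed method), the sketch below collects the name/content pairs of <meta> elements with Python's standard HTML parser; the keyword values are purely illustrative.

# Sketch of extracting <meta name="..." content="..."> pairs from a page.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name, content = d.get("name"), d.get("content")
            if name and content:
                self.meta[name.lower()] = content

parser = MetaExtractor()
parser.feed('<html><head><meta name="Keywords" content="Ford, Car, Maintenance">'
            '</head><body></body></html>')
print(parser.meta)    # {'keywords': 'Ford, Car, Maintenance'}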
Another measure is precision: the proportion of retrieved documents that are actually relevant, which for searches over such databases tends to be low. Advanced features such as Boolean operators, weighted matches as in WAIS (Wide Area Information Servers), or relevance feedback can improve this, but the enormous amount of information on the Internet keeps the problem acute.
Document Classification
Web users often ask for documents organized by subject. GENVL [17], for example, allows a subject hierarchy to be maintained manually, which falls outside the scope of this article. Robots could present a subject hierarchy automatically, but this requires some mechanism for automatic document classification [18].
The META tag discussed above would give authors a mechanism for classifying their own documents, but the question remains which classification system to use. Even traditional libraries do not use a single standard scheme; several systems are in use, alongside home-grown conventions. There is little hope of a single uniform solution on the Web.
Determining Document Structure
Perhaps the most difficult issue is that the Web does not consist of a flat set of independent files. Usually a server offers a set of related pages: a welcome page, perhaps a page of tables, perhaps some background material, and some pages of actual data. A service provider announces the service on the welcome page. A robot has no way of distinguishing these pages, so it may end up indexing a page of data or a background image that is not the intended entry point. It can happen, for example, that a robot stores the address of some part of "The Perl FAQ" rather than the FAQ itself, which is not what a user searching for it wants.

This problem could be avoided if pages were never indexed out of context, but most pages are written for one particular access structure, and a page taken out of that context is often hard to understand. For example, a page describing a project's goals may simply refer to "the project" without ever giving its full name or linking back to the welcome page.

Another problem concerns URLs that have been superseded. When system administrators reorganize their URL structure, they usually provide a mechanism to keep the old URLs working, so that existing links do not break. On some servers this is implemented with a redirect: when the old URL is requested, HTTP returns the new URL to the user. On others it is implemented with symbolic links, which give the robot no indication that an alternative URL exists. An indexing robot will then store the deprecated links, forcing the administrator to keep the backward-compatibility mechanism in place indefinitely.
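For the redirect case, a robot can at least record the address the server redirects to rather than the deprecated one. A minimal sketch, with an illustrative URL:

# Sketch of recording the post-redirect ("canonical") URL when indexing a page.
from urllib.request import urlopen

def fetch_canonical(url):
    with urlopen(url, timeout=10) as resp:
        return resp.geturl(), resp.read()   # geturl() is the URL after any redirects

# final_url, body = fetch_canonical("http://www.site.com/old/path.html")  # illustrative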
A related issue is that a robot may index a mirror rather than the original site. If both the mirror and the original are indexed, the database contains duplicate information and bandwidth has been wasted; if only the mirror is indexed, users may be referred to information that is not the most up to date.
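One plausible mitigation (an assumption, not something the systems discussed here are known to do) is to compare content checksums, so that a mirror serving byte-identical documents is indexed only once:

# Sketch of duplicate detection via content checksums.
import hashlib

seen = {}

def content_key(body: bytes) -> str:
    return hashlib.md5(body).hexdigest()

def record(url, body):
    key = content_key(body)
    if key in seen:
        return seen[key]        # already indexed under another URL (e.g. a mirror)
    seen[key] = url
    return url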
Ethics
We have seen that robots are useful, but that they place high demands on bandwidth and raise fundamental problems when indexing the Web. A robot author therefore has to weigh these issues when developing and deploying a robot. This becomes an ethical question: "Is the operation of this robot justified?" It is a grey area, and opinions differ widely.
When some broadly accepted boundaries of acceptable behaviour first became apparent (after incidents in which robots placed excessive load on servers), the author published a set of guidelines for robot writers [19], as a first step towards identifying the problems and promoting understanding. These guidelines can be summarized as follows:
- Reconsider: think about whether a robot is really needed.
- Be accountable: make sure the robot can be identified by server administrators and that its author is easy to contact, and test it on local machines first (see the sketch after this list).
- Be moderate in resource consumption: prevent rapid fire, eliminate unnecessary retrievals, and comply with the robot exclusion standard.
- Monitor operation: keep a constant eye on the robot's logs.
- Share results: make the raw results, or higher-level results, available to other robots.
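For the accountability guideline, the sketch below shows one way a robot can identify itself in its requests by sending User-Agent and From headers; the robot name, contact address, and target URL are illustrative.

# Sketch of an identifiable request: server logs will show who is crawling and why.
from urllib.request import Request, urlopen

req = Request(
    "http://www.site.com/",
    headers={
        "User-Agent": "ExampleRobot/1.0 (+http://example.org/robot-info.html)",
        "From": "robot-owner@example.org",
    },
)
with urlopen(req, timeout=10) as resp:
    page = resp.read()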
David Eichmann [20] draws a further ethical distinction between "service agents" (robots that build publicly accessible information bases) and "user agents" (client-side robots acting for a single user), and identifies separate ethics for each.
In practice most robot developers have taken these guidelines to heart and strive to minimize their robots' impact. The robots mailing list speeds up discussion of new issues as they arise, and this public scrutiny of robot behaviour itself provides a measure of community enforcement of acceptable conduct [21].
The maturing of the robot field means that data providers have less to fear. Robot exclusion in particular is valuable for those who do not want robots visiting their sites, as it gives them a way to keep robots out. However, as the Internet continues to grow, the appearance of more robots on the Web is inevitable, and some of them will inevitably misbehave.
Alternatives for Resource Discovery
Robots are expected to remain in use for Internet-wide information retrieval. However, given the practical, fundamental, and ethical problems described above, it is worth exploring alternatives, such as ALIWEB [22] and Harvest [23].

ALIWEB is based on a simple model for distributing an indexing service over the Web, inspired by Archie [24]. In this model, any host on the Web may offer an index of its own information; it indexes only local resources, not third-party ones. In ALIWEB this index takes the form of IAFA templates [25], a simple plain-text format for describing information resources (see Figure 2). The templates can be written by hand or generated automatically, for example from the titles and META elements of the documents in a tree. The ALIWEB gathering engine retrieves these index files over the normal Web access protocol and combines them into a searchable database. Note that ALIWEB is not a robot: it does not recursively retrieve documents in order to build its index.
Template-Type: SERVICE
Title: The ArchiePlex Archie Gateway
URL: /public/archie/archieplex/archieplex.html
Description: A full hypertext interface to Archie.
Keywords: Archie, Anonymous FTP.

Template-Type: DOCUMENT
Title: The Perl Page
URL: /public/perl/perl.html
Description: Information on the Perl Programming Language.
 Includes hypertext versions of the Perl 5 Manual and the latest FAQ.
Keywords: Perl, Programming Language, Perl-FAQ

Figure 2: An example IAFA index file
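The format lends itself to very simple processing. The sketch below parses records laid out like Figure 2 (blank lines between templates, "Field: value" lines, indented continuation lines); it follows the layout shown above rather than implementing the full IAFA specification.

# Sketch of parsing IAFA-style index records as laid out in Figure 2.
def parse_templates(text):
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():                      # blank line ends a record
            if current:
                records.append(current)
                current, last_field = {}, None
            continue
        if ":" in line and not line.startswith((" ", "\t")):
            field, value = line.split(":", 1)
            current[field.strip()] = value.strip()
            last_field = field.strip()
        elif last_field:                          # indented continuation of previous field
            current[last_field] += " " + line.strip()
    if current:
        records.append(current)
    return records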
This approach has several advantages. The quality of a manually created index is combined with a mechanism for automatic updating. The resulting integration of information goes further than traditional "hotlists", which maintain only locally kept lists of pointers. Because the information is in a machine-readable format, the search interface can offer extra ways of constraining queries. Because only the index information is retrieved, the network cost is very small. And the simplicity of the model and of the index files makes it easy for information providers to participate.
The approach also has disadvantages. Maintaining index information by hand places a burden on the information provider, even though in practice the index for a server does not need to change very often. Generating the index automatically from TITLE and META tags has been tried experimentally, but this requires a local robot, and the quality of the resulting index is questionable. Another limitation is that information providers currently have to register their index files with a central registry, which limits scalability. Finally, updating is not optimized: even when only one record in an index file changes, the whole file is retrieved.
ALIWEB has been in use since October 1993 and its results have been well received. The main operational difficulty is that people find the model hard to understand: they initially try to register their own HTML files rather than IAFA index files. Another problem is that ALIWEB is a personal project, run in the author's spare time and without funding, so development is slow.
Harvest is a distributed resource-discovery system released by the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD). It provides software that automatically indexes documents, efficiently replicates and caches the index information on remote hosts, and makes the data searchable through a query interface. Initial reactions to the system have been positive. One drawback of Harvest is that it is a large and complex system requiring considerable human and computing resources, which puts it out of reach of many information providers.
Perhaps the most exciting aspect of Harvest is that it can act as a common platform with which existing databases can interoperate. Connecting other systems to this platform is straightforward; experiments have shown that ALIWEB can operate as a broker within Harvest. This gives ALIWEB access to Harvest's storage and search facilities, and gives Harvest a low-cost entry mechanism.
As resource-discovery mechanisms, both systems are attractive alternatives to robots: ALIWEB provides a simple, high-level index, while Harvest provides a comprehensive indexing system built on lower-level information. Neither, however, indexes third-party information without that party's active participation, so robots can be expected to remain in use for that purpose, alongside systems such as ALIWEB and Harvest.
Conclusion
On today's World Wide Web, robots are used for many different purposes, including Web-wide resource discovery. Their use raises both fundamental and ethical issues. Experience has been gained with the practical and ethical problems of operating robots, but they are likely to continue causing occasional incidents, and the fundamental problems limit what they can achieve. Alternatives such as ALIWEB and Harvest are far more efficient and give authors a way to manage their own index information; we expect such systems to become popular and to operate alongside robots. In the longer term, however, robots on their own will prove too slow, too expensive, and too inefficient.
References
Berners-Lee, T., R. Cailliau, A. Luotonen, H. F. Nielsen and A. Secret. "The World-Wide Web". Communications of the ACM, v. 37, n. 8, August 1994, pp. 76-82.

Seeley, Donn. "A Tour of the Worm". USENIX Association Winter Conference 1989 Proceedings, January 1989, pp. 287-304.

Gray, M. "Growth of the World-Wide Web", Dec. 1993. <http://www.mit.edu:8001/AFT/SIPB/User/mkgray/ht/web-growth.html>

Fielding, R. "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

Berners-Lee, T., D. Connolly et al. "Hypertext Markup Language Specification 2.0". Work in progress of the HTML working group of the IETF. <ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-html-spec-00.txt>

Luotonen, A., K. Altis. "World-Wide Web Proxies". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

Eichmann, D. "The RBSE Spider - Balancing Effective Search against Web Load". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

Berners-Lee, T., R. Fielding, F. Nielsen. "Hypertext Transfer Protocol". Work in progress of the HTTP working group of the IETF. <ftp://nic.merit.edu/documents/internet-drafts/draft-fielding-http-spec-00.txt>

Spero, S. "Analysis of HTTP Performance Problems", July 1994. <http://sunsite.unc.edu/mdma-release/http-prob.html>

Spero, S. "Progress on HTTP-NG". <http://info.cern.ch/hypertext/www/protocols/http-ng/http-ng-status.html>

De Bra, P. M. E. and R. D. J. Post. "Information Retrieval in the World-Wide Web: Making Client-based Searching Feasible". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

Spetka, Scott. "The TkWWW Robot: Beyond Browsing". Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.

Slade, R. "Risks of Client Search Tools". RISKS-FORUM Digest, v. 16, 31 August 1994.

Riecken, Doug. "Intelligent Agents". Communications of the ACM, v. 37, n. 7, July 1994.

Pinkerton, B. "Finding What People Want: Experiences with the WebCrawler". Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.

Koster, M. "A Standard for Robot Exclusion".

Eichmann, D. "Ethical Web Agents". Proceedings of the Second International World-Wide Web Conference, Chicago, United States, October 1994.

Koster, Martijn. "WWW Robots, Wanderers and Spiders". <http://www.robotstxt.org/wc/robots.html>

Koster, Martijn. "ALIWEB - Archie-Like Indexing in the Web". Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.

Bowman, Mic, Peter B. Danzig, Darren R. Hardy, Udi Manber and Michael F. Schwartz. "Harvest: Scalable, Customizable Discovery and Access System". Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, July 1994.

Deutsch, P., A. Emtage. "Archie - An Electronic Directory Service for the Internet". Proc. USENIX Winter Conference, pp. 93-110, January 1992.

Deutsch, P., A. Emtage, M. Koster and M. Stumpf. "Publishing Information on the Internet with Anonymous FTP". Work in progress of the Integrated Internet Information Retrieval working group. <ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-iiir-publishing-02.txt>