
Create a smart web spider

- How to use Java network objects and HTML objects (translation)

Author: Mark O. Pendergast

Original: http://www.javaworld.com/javaworld/jw-11-2004/jw-1101-spider.html

Summary

Have you ever wanted to build your own database of web sites that meet specific criteria? Web spiders, sometimes called web crawlers, are programs that follow links from one site to another, examining content and recording locations. Commercial search sites use web spiders to populate their databases; researchers can use spiders to gather relevant information. You can build your own spider to search for content, hosts, and page characteristics such as text density and embedded multimedia content. This article tells you how to use Java's HTML and networking classes to create your own powerful web spider.

This article describes how to build an intelligent web spider on top of standard Java networking objects. The heart of the spider is a recursive routine that searches based on keyword/phrase criteria and page characteristics. The search progress is displayed graphically in a JTree structure. I mainly cover issues such as handling relative URLs, preventing circular references, and monitoring memory/stack usage. In addition, I show how to use Java networking objects to retrieve and parse remote web pages.

The spider sample program

The sample program consists of the user interface class SpiderControl; the web-searching class Spider; two classes, UrlTreeNode and UrlNodeRenderer, used to display results in a JTree; and two classes, IntegerVerifier and VerifierListener, that help validate numeric input in the user interface. The complete code and documentation are linked at the end of the article.

The SpiderControl interface consists of three tabbed panes: one for setting the search parameters, a second that displays the resulting search tree (JTree), and a third that displays errors and status information, as shown in Figure 1.

Figure 1: The search parameters tab

The search parameters include the maximum number of sites to visit, the maximum search depth (links to links to links), the list of keywords/phrases, the hosts to restrict the search to, and the starting site or portal. Once the user has entered the search parameters and pressed the Start button, the web search begins and the second tab displays the progress of the search.

Figure 2: The search tree

An instance of the Spider class performs the web search in a separate thread. A separate thread is used so that the SpiderControl module can continually update the search tree display and handle the Stop Search button. As the Spider runs, it keeps adding nodes (UrlTreeNode) to the JTree in the second tab. Search tree nodes that contain the keywords or phrases are displayed in blue (UrlNodeRenderer).

When the search is complete, the user can review the statistics for each site and can view a site with an external browser (the default is Internet Explorer located in the Program Files directory). The statistics include the number of keyword matches, the total character count, the total image count, and the total link count.

The Spider class

The Spider class is responsible for searching the web given a starting point (portal), a list of keywords and hosts, and limits on the search depth and size. Spider inherits from Thread, so it can run in a separate thread. This allows the SpiderControl module to continually update the search tree display and handle the Stop Search button.

The constructor accepts the search parameters along with references to an empty JTree and an empty JTextArea. The JTree is used to build a hierarchical record of the sites visited as the search progresses. This gives the user visible feedback and helps track where the Spider's recursive search has been. The JTextArea displays error and progress messages.

The constructor stores the parameters in class variables and initializes the JTree to display its nodes with the UrlNodeRenderer class. The search does not start until SpiderControl calls the run() method.
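The constructor itself is not listed in this article. A minimal sketch of what it might look like follows; the field and parameter names here are my own assumptions for illustration, not taken from the downloadable source:

// Hypothetical sketch of the Spider constructor described above; field
// and parameter names are assumptions made for this illustration only.
public Spider(String startSite, int maxSites, int maxDepth,
              String[] keywordList, String[] domainList,
              JTree searchTree, JTextArea messageArea)
{
   this.startSite = startSite;        // portal / starting URL
   this.maxSites = maxSites;          // site-count limit
   this.maxDepth = maxDepth;          // recursion-depth limit
   this.keywordList = keywordList;    // keywords/phrases to match
   this.domainList = domainList;      // hosts the search is restricted to
   this.searchTree = searchTree;      // tree used to record visited sites
   this.messageArea = messageArea;    // area for error/progress messages
   searchTree.setCellRenderer(new UrlNodeRenderer()); // custom node display
}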

The run() method executes in its own thread. It first determines whether the entry site is a web reference (starting with http, ftp, or www) or a local file reference. It then makes sure the entry site has the proper notation, resets the run statistics, and calls searchWeb() to begin the search:

public void run()
{
   DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel(); // get our model
   DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
   String urllc = startSite.toLowerCase();
   if(!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
      !urllc.startsWith("www."))
   {
      startSite = "file:///" + startSite; // note you must have 3 slashes!
   }
   else // http missing?
      if(urllc.startsWith("www."))
      {
         startSite = "http://" + startSite; // tack on http://
      }
   startSite = startSite.replace('\\', '/'); // fix bad slashes
   sitesFound = 0;
   sitesSearched = 0;
   updateStats();
   searchWeb(root, startSite); // search the web
   messageArea.append("Done!\n\n");
}

searchWeb() is a recursive method that accepts as parameters a parent node of the search tree and the web address to search. searchWeb() first checks that the given site has not already been visited and that the depth and site limits have not been exceeded. searchWeb() then yields so SpiderControl can run (updating the interface and checking whether the Stop Search button has been pressed). If all is in order, searchWeb() continues; otherwise it returns.
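The article does not list these opening checks. A minimal sketch of how they might look, assuming the helper names urlHasBeenVisited() and depthLimitExceeded() and the sitesSearched counter mentioned elsewhere in this article, plus a hypothetical stopSearch flag, is shown below:

// Hypothetical sketch of the guard checks at the top of searchWeb();
// the exact structure and the stopSearch flag are assumptions based on
// the article's description, not the actual source code.
public void searchWeb(DefaultMutableTreeNode parentNode, String urlstr)
{
   if(urlHasBeenVisited(urlstr))       // avoid circular references
      return;
   if(depthLimitExceeded(parentNode))  // user-selected depth limit
      return;
   if(sitesSearched > maxSites)        // user-selected site limit
      return;
   Thread.yield();                     // let SpiderControl update the UI
   if(stopSearch)                      // Stop Search button pressed?
      return;
   sitesSearched++;
   updateStats();
   // ... URL validation and parsing continue here, as shown below ...
}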

Before searchWeb() starts reading and parsing the site, it first verifies that the URL object created from the address has the proper type and host. The URL's protocol is checked to confirm that it is an HTML address or a file address (there is no point searching mailto: and other protocols). The file extension (if present) is then checked to confirm that it is an HTML file (there is no point parsing PDF or GIF files). Once that is done, the host is checked against the user-specified host list via the isDomainOk() method:

...
URL url = new URL(urlstr); // create the URL object from a string
String protocol = url.getProtocol(); // ask the URL for its protocol
if(!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
{
   messageArea.append("    Skipping: " + urlstr + " not a http site\n\n");
   return;
}
String path = url.getPath(); // ask the URL for its path
int lastdot = path.lastIndexOf("."); // check for file extension
if(lastdot > 0)
{
   String extension = path.substring(lastdot); // just the file extension
   if(!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
      return; // skip everything but html files
}
if(!isDomainOk(url))
{
   messageArea.append("    Skipping: " + urlstr + " not in domain list\n\n");
   return;
}

At this point, searchWeb() is fairly certain that it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, and opens an input stream to parse the file. The following sections cover the details of parsing HTML files, handling relative URLs, and controlling the recursion.

Parsing the HTML file

There are two ways to parse an HTML file looking for A HREF tags - a hard way and an easy way.

If you choose the hard way, you create your own parsing rules using Java's StreamTokenizer class. With that technique, you must specify the word and whitespace characters for the StreamTokenizer object and then pick off the tokens to find the tags, their attributes, and the text between tags. That is too much work.
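To illustrate why that approach is cumbersome, here is a rough sketch of my own (not part of the sample program) that scans a page for href tokens with StreamTokenizer; it ignores quoting, comments, scripts, and many other cases a real parser would have to handle:

// Rough illustration of the "hard way" using StreamTokenizer; this is NOT
// from the sample program and skips many cases (quoting, comments, scripts).
import java.io.*;
import java.net.URL;

public class HardWayScanner {
   public static void main(String[] args) throws IOException {
      URL url = new URL(args[0]);
      Reader r = new BufferedReader(new InputStreamReader(url.openStream()));
      StreamTokenizer st = new StreamTokenizer(r);
      st.wordChars('/', '/');   // keep URL characters inside word tokens
      st.wordChars(':', ':');
      st.wordChars('=', '=');
      st.wordChars('"', '"');
      while(st.nextToken() != StreamTokenizer.TT_EOF) {
         // every attribute and every piece of text arrives as a raw token,
         // so the program must reassemble tags and attributes itself
         if(st.ttype == StreamTokenizer.TT_WORD &&
            st.sval.toLowerCase().startsWith("href="))
            System.out.println("possible link: " + st.sval);
      }
      r.close();
   }
}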

The easy way is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are poorly documented in the Java docs. Using ParserDelegator takes three steps: first create an InputStreamReader for your URL, then create an instance of a ParserCallback object, and finally create a ParserDelegator object and call its public method parse():

UrlTreeNode newnode = new UrlTreeNode(url); // create the data node
InputStream in = url.openStream(); // ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // convert the stream to a reader
DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
SpiderParserCallback cb = new SpiderParserCallback(treenode); // create a callback object
ParserDelegator pd = new ParserDelegator(); // create the delegator
pd.parse(isr, cb, true); // parse the stream
isr.close(); // close the stream

parse() accepts an InputStreamReader, an instance of a ParserCallback object, and a flag specifying whether the CHARSET tags should be ignored. The parse() method then reads and decodes the HTML file, calling methods on the ParserCallback object each time it finishes decoding a tag or HTML element. In the sample code, I implemented ParserCallback as an inner class of Spider, which lets the ParserCallback access Spider's methods and attributes. Classes based on ParserCallback can override the following methods:

- handleStartTag(): called when a starting HTML tag is encountered, such as <a>
- handleEndTag(): called when an ending HTML tag is encountered, such as </a>
- handleSimpleTag(): called when a tag with no matching end tag is encountered
- handleText(): called when the text between tags is encountered

In the sample code, I override handleSimpleTag() so that my code can process HTML's BASE and IMG tags. The BASE tag tells what URL to use when resolving relative URL references. If no BASE tag is present, the current URL is used to resolve relative references. handleSimpleTag() accepts three parameters: an HTML.Tag object, a MutableAttributeSet containing all of the tag's attributes, and the position within the file. My code checks the tag to see whether it is a BASE object instance; if so, the HREF attribute is extracted and saved in the page's data node. This attribute is used later when constructing the URL addresses of linked sites. Each time an IMG tag is encountered, the page's image count is updated.

I override handleStartTag() so that the program can process HTML's A and TITLE tags. The method checks whether the t parameter is in fact an A tag; if so, the HREF attribute is extracted.

fixHref() is used to clean up sloppy references (changing backslashes to forward slashes and adding missing trailing slashes); the link is then resolved against the base URL to create a URL object, and searchWeb() is called recursively to process the link. If the method encounters a TITLE tag, it clears the variable that stores the last text encountered, so that the title's end tag is guaranteed to see the correct value (some web pages have no title text between their TITLE tags).

I override handleEndTag() so that HTML's TITLE end tag can be processed. This end tag indicates that the preceding text (held in lastText) is the page's title. That text is then stored in the page's data node. Because adding the title information changes the data node's display, the nodeChanged() method must be called so the tree can update itself.

I override the handleText() method so that the text of the HTML page can be checked against the keywords and phrases being searched for. handleText() accepts as parameters an array of characters and the position of that text within the file. handleText() first converts the character array into a String object, converting everything to uppercase in the process. Each keyword/phrase in the search list is then checked against the String object using its indexOf() method. If indexOf() returns a non-negative result, the keyword/phrase is present in the page's text. If a keyword/phrase is present, the match is recorded in the node's match list and the statistics are updated:

/**
 * Inner class used to handle HTML parser callbacks
 */
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
   /** URL node being parsed */
   private UrlTreeNode node;
   /** Tree node */
   private DefaultMutableTreeNode treenode;
   /** Contents of last text element */
   private String lastText = "";
   /**
    * Creates a new instance of SpiderParserCallback
    * @param atreenode search tree node that is being parsed
    */
   public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
      treenode = atreenode;
      node = (UrlTreeNode)treenode.getUserObject();
   }
   /**
    * Handle HTML tags that don't have a start and end tag
    * @param t HTML tag
    * @param a HTML attributes
    * @param pos position within file
    */
   public void handleSimpleTag(HTML.Tag t,
                               MutableAttributeSet a,
                               int pos)
   {
      if(t.equals(HTML.Tag.IMG))
      {
         node.addImages(1);
         return;
      }
      if(t.equals(HTML.Tag.BASE))
      {
         Object value = a.getAttribute(HTML.Attribute.HREF);
         if(value != null)
            node.setBase(fixHref(value.toString()));
      }
   }
   /**
    * Take care of start tags
    * @param t HTML tag
    * @param a HTML attributes
    * @param pos position within file
    */
   public void handleStartTag(HTML.Tag t,
                              MutableAttributeSet a,
                              int pos)
   {
      if(t.equals(HTML.Tag.TITLE))
      {
         lastText = "";
         return;
      }
      if(t.equals(HTML.Tag.A))
      {
         Object value = a.getAttribute(HTML.Attribute.HREF);
         if(value != null)
         {
            node.addLinks(1);
            String href = value.toString();
            href = fixHref(href);
            try {
               URL referencedURL = new URL(node.getBase(), href);
               searchWeb(treenode,
                         referencedURL.getProtocol() + "://" + referencedURL.getHost() + referencedURL.getPath());
            }
            catch (MalformedURLException e)
            {
               messageArea.append("    Bad URL encountered: " + href + "\n\n");
               return;
            }
         }
      }
   }
   /**
    * Take care of end tags
    * @param t HTML tag
    * @param pos position within file
    */
   public void handleEndTag(HTML.Tag t,
                            int pos)
   {
      if(t.equals(HTML.Tag.TITLE) && lastText != null)
      {
         node.setTitle(lastText.trim());
         DefaultTreeModel tm = (DefaultTreeModel)searchTree.getModel();
         tm.nodeChanged(treenode);
      }
   }
   /**
    * Take care of text between tags, check against keyword list for matches, if
    * match found, set the node match status to true
    * @param data text between tags
    * @param pos position of text within web page
    */
   public void handleText(char[] data, int pos)
   {
      lastText = new String(data);
      node.addChars(lastText.length());
      String text = lastText.toUpperCase();
      for(int i = 0; i < keywordList.length; i++)
      {
         if(text.indexOf(keywordList[i]) >= 0)
         {
            if(!node.isMatch())
            {
               sitesFound++;
               updateStats();
            }
            node.setMatch(keywordList[i]);
            return;
         }
      }
   }
}

Handling absolute and relative URLs

When a link to a relative page is encountered, a complete link must be constructed from its base URL. The base URL may be explicitly defined in the page via a BASE tag, or it may be implied by the link of the current page itself. Java's URL object solves this problem for you: it provides a constructor that builds a complete URL from a base and a relative reference.

URL(URL context, String spec) accepts a link in the spec parameter and a base link in the context parameter. If spec is a relative link, the constructor uses context to build a fully referenced URL object. The URL class prefers URLs to follow the strict (Unix) format; using backslashes, as in Microsoft Windows, instead of forward slashes produces an invalid reference. Likewise, if spec or context refers to a directory (one containing index.html or default.html) rather than a file, it must have a trailing slash. The fixHref() method checks for these sloppy references and fixes them:

public static String fixHref(String href)
{
   String newhref = href.replace('\\', '/'); // fix sloppy web references
   int lastdot = newhref.lastIndexOf('.');
   int lastslash = newhref.lastIndexOf('/');
   if(lastslash > lastdot)
   {
      if(newhref.charAt(newhref.length()-1) != '/')
         newhref = newhref + "/"; // add missing /
   }
   return newhref;
}
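To make the resolution step concrete, here is a small self-contained example of my own (not from the sample program, and the example.com addresses are placeholders) showing how the two-argument URL constructor combines a base URL with a relative reference:

// Small illustration of relative-URL resolution with java.net.URL;
// not part of the sample program.
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeUrlDemo {
   public static void main(String[] args) throws MalformedURLException {
      URL base = new URL("http://www.example.com/articles/"); // note trailing slash
      URL absolute = new URL(base, "spider/page.html");       // relative spec
      System.out.println(absolute); // http://www.example.com/articles/spider/page.html
      URL alreadyAbsolute = new URL(base, "http://www.example.org/index.html");
      System.out.println(alreadyAbsolute); // an absolute spec overrides the base
   }
}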

Controlling recursion

searchWeb() is initially called to search the starting web address specified by the user. It then calls itself each time an HTML link is encountered. This forms the basis of a depth-first search and raises two problems. First, a dangerous memory/stack overflow can result from too many recursive calls. This happens when there is a circular reference, that is, one page links to another page that links back to the first, which is commonplace on the web. To prevent this, searchWeb() checks the search tree (via the urlHasBeenVisited() method) to determine whether the referenced page is already there. If it is, the link is ignored. If you choose to build a spider without a search tree, you still must maintain a list of visited sites (in a Vector or array) so you can determine whether you are revisiting a site.
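The article describes but does not list urlHasBeenVisited(). A minimal sketch of how it might walk the search tree follows; the getUrl() accessor on UrlTreeNode is an assumption made for this illustration:

// Hypothetical sketch of urlHasBeenVisited(); walks every node already in
// the search tree and compares stored URLs. The getUrl() accessor on
// UrlTreeNode is an assumption, not taken from the sample source.
private boolean urlHasBeenVisited(String urlstr)
{
   DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel();
   DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
   java.util.Enumeration e = root.breadthFirstEnumeration();
   while(e.hasMoreElements())
   {
      DefaultMutableTreeNode n = (DefaultMutableTreeNode)e.nextElement();
      Object userObject = n.getUserObject();
      if(userObject instanceof UrlTreeNode &&
         urlstr.equalsIgnoreCase(((UrlTreeNode)userObject).getUrl()))
         return true; // this URL is already in the tree
   }
   return false;
}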

The second problem with the recursion comes from depth-first searching and the structure of the web. Depending on the entry point chosen, a depth-first search can generate an enormous number of recursive calls before it ever returns to examine the second link on the initial page. This causes two undesirable results: first, a memory/stack overflow may occur, and second, the pages searched may be only distantly related to the initial entry point. To control this, I added a maximum-search-depth setting to the spider. The user selects the depth level (links to links to links); as each link is encountered, the current depth is checked by calling the depthLimitExceeded() method. If the limit has been reached, the link is ignored. The test simply checks the level of the node in the JTree.
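Since the test only looks at the node's level in the JTree, a sketch might look like the following (the method body and the maxDepth field name are assumptions for illustration):

// Hypothetical sketch of depthLimitExceeded(); DefaultMutableTreeNode's
// getLevel() returns the node's distance from the root, so the check is one line.
private boolean depthLimitExceeded(DefaultMutableTreeNode node)
{
   return node.getLevel() >= maxDepth; // maxDepth is the user-selected limit
}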

The sample program also adds a site limit, specified by the user, that stops the search after a given number of URLs have been examined (so the program can actually finish!). The site limit is controlled by a simple numeric counter, sitesSearched, which is updated and checked on each call to searchWeb().

UrlTreeNode and UrlNodeRenderer

URLTREENODE and URLNodereere are classs used to create personalized tree nodes in JTREE in the SpiderControl user interface. URLTREENODE contains URL information and statistics for each search site. URLTREENODE is stored in JTree as a standard defaultmutabletreenode object as a user object property. Data includes the ability to track keywords in nodes, the NRLs, nodes of nodes, the number of links, the number of pictures, and the number of characters, and whether the node meets the search rules.

UrlNodeRenderer is an extension of the DefaultTreeCellRenderer class. UrlNodeRenderer makes nodes that contain matching keywords display in blue, and it also adds a custom icon to the tree nodes. The customized display is achieved by overriding the getTreeCellRendererComponent() method (shown below). This method creates the Component object displayed in the tree. Most of the Component's attributes are set by the superclass; UrlNodeRenderer changes only the text color (foreground) and the icon:

public Component getTreeCellRendererComponent(
                     JTree tree,
                     Object value,
                     boolean sel,
                     boolean expanded,
                     boolean leaf,
                     int row,
                     boolean hasFocus) {
   super.getTreeCellRendererComponent(
                     tree, value, sel,
                     expanded, leaf, row,
                     hasFocus);
   UrlTreeNode node = (UrlTreeNode)(((DefaultMutableTreeNode)value).getUserObject());
   if(node.isMatch()) // set color
      setForeground(Color.blue);
   else
      setForeground(Color.black);
   if(icon != null) // set a custom icon
   {
      setOpenIcon(icon);
      setClosedIcon(icon);
      setLeafIcon(icon);
   }
   return this;
}

Conclusion

This article has shown you how to create a web spider and a user interface to control it. The user interface uses a JTree to track the spider's progress and record the sites visited. Alternatively, you could use a Vector to record visited sites and a simple counter to display progress. Other enhancements could include an interface that records keywords and sites in a database, the ability to search through multiple portals, the ability to rank sites by how much or how little text they contain, and the ability to search for synonyms of the keywords.

The Spider class shown in this article uses recursive calls in its search routine; alternatively, a new spider thread could be started for each link encountered. The benefit would be that connections to remote URLs could proceed concurrently, improving speed. Remember, however, that the JTree objects and the DefaultMutableTreeNode objects they contain are not thread-safe, so the programmer must handle synchronization.
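One common way to handle that synchronization, sketched below as my own suggestion rather than as part of the sample code, is to hand all tree mutations back to the Swing event-dispatch thread:

// Illustration only (not from the sample program): a helper that adds a
// node to the search tree on the Swing event-dispatch thread, since JTree
// and DefaultMutableTreeNode are not thread-safe.
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;

public class TreeUpdateHelper {
   public static void addNodeSafely(final DefaultTreeModel model,
                                    final DefaultMutableTreeNode parent,
                                    final Object userObject) {
      SwingUtilities.invokeLater(new Runnable() {
         public void run() {
            DefaultMutableTreeNode child = new DefaultMutableTreeNode(userObject);
            model.insertNodeInto(child, parent, parent.getChildCount());
         }
      });
   }
}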

Resources:

The source code and Java documentation for this article:

http://www.javaworld.com/javaworld/jw-11-2004/spider/jw-1101-spider.zip
