Writing an HTML File Analysis Program in Java

xiaoxiao2021-03-06 144

http://it.sohu.com/webcourse/webmonkey/1-teach/java/page8.html

Tianjin University, Cuiula

Abstract: Taking a practical perspective, this article describes how to write an HTML file analysis program with StreamTokenizer, an input-stream class in the Java language, and presents an example drawn from downloading the byte data of a web page.

1. Overview

At the core of a web server is the correct analysis of each tag in an HTML file, just as interpreting a programming language comes down to interpreting the reserved words in a source file. In practice, we often need to perform keyword analysis on a particular type of file. For example, you may need to download an HTML file together with the .gif, .class, and other files it refers to. To do that, the tags in the HTML file must be separated out so that the required file names and directories can be found. Before Java, this kind of work meant examining every character in the file to find the required parts, which made the program both large and error-prone.

In a recent project, the author used Java's StreamTokenizer input-stream class to analyze HTML files. The goal here is to start from a known web page, fetch its HTML file, analyze it, and then download the HTML files contained in the page (if it uses frames), the image files, and the class (Java applet) files.

2. The StreamTokenizer class

StreamTokenizer turns an input stream into a stream of tokens. The tokens fall into three categories: words (multi-character tokens), single-character tokens, and whitespace (including Java and C/C++ style comments).

The constructor of the StreamTokenizer class is:

StreamTokenizer(InputStream in)

This constructor has been deprecated since JDK 1.1 in favor of StreamTokenizer(Reader r), which is the form the code below uses.

The class has three public instance variables, ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value. To obtain the characters between the delimiters (that is, the contents of an HTML tag), read the variable sval.
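To make the three fields concrete, here is a minimal sketch that tokenizes a short string with the default syntax table and records ttype, sval, and nval for each token. The class name TokenFieldsDemo, the method describe, and the sample input are illustrative assumptions, not part of the original article.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the name and the sample input are illustrative only.
class TokenFieldsDemo {

    // Tokenizes the text with StreamTokenizer's default syntax table and
    // records which field carries the value of each token.
    static List<String> describe(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word:" + st.sval);    // string value is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number:" + st.nval);  // numeric value is in nval
                        break;
                    default:
                        // A single-character token: ttype is the character itself.
                        out.add("char:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("width = 640")) {
            System.out.println(line);
        }
    }
}
```

Run as written, main prints word:width, char:=, and number:640.0 on separate lines, because the default syntax table parses numbers into a double and treats '=' as an ordinary character.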
The next token is read by calling nextToken(). The method returns an int, and four special return values are defined:

StreamTokenizer.TT_NUMBER: the token is a number. Its value is a double and can be read from the instance variable nval.
StreamTokenizer.TT_WORD: the token is a non-numeric word (it may contain other characters as well). The word can be read from the instance variable sval.
StreamTokenizer.TT_EOL: the token is an end of line (returned only when eolIsSignificant(true) has been called).
StreamTokenizer.TT_EOF: the end of the stream has been reached.

For a single-character token, nextToken() returns the character itself, which is how the code below recognizes '<' and '>'.

Before calling nextToken(), set up the syntax table of the input stream so that the tokenizer can tell the different kinds of characters apart. whitespaceChars(int low, int hi) defines the range of characters treated as whitespace. wordChars(int low, int hi) defines the range of characters that make up words.

3. Program implementation

3.1 Implementing the HtmlTokenizer class

Before a token stream is analyzed, its syntax table must be set. Here, that means telling the program which words are HTML tags. The token stream for the HTML tags we need is defined below as a subclass of StreamTokenizer:

import java.io.*;

class HtmlTokenizer extends StreamTokenizer {

    // Token codes for the tags of interest. Only the tags this program
    // needs are defined here; extend the list as required.
    static final int HTML_TEXT = -1;
    static final int HTML_UNKNOWN = -2;
    static final int HTML_EOF = -3;
    static final int HTML_IMAGE = -4;
    static final int HTML_FRAME = -5;
    static final int HTML_BACKGROUND = -6;
    static final int HTML_APPLET = -7;

    boolean outsideTag = true; // true while the tokenizer is outside a tag

    // The constructor defines the syntax table of the token stream.
    public HtmlTokenizer(BufferedReader r) {
        super(r);
        this.resetSyntax();      // reset the syntax table
        this.wordChars(0, 255);  // every character can be part of a token
        this.ordinaryChar('<');  // the delimiters on either side of an HTML tag
        this.ordinaryChar('>');
    } // end of constructor

    public int nextHtml() {
        int token; // the current token
        try {
            switch (token = this.nextToken()) {
            case StreamTokenizer.TT_EOF:
                // The end of the stream has been reached.
                return HTML_EOF;
            case '<': // entering a tag
                outsideTag = false;
                return nextHtml();
            case '>': // leaving a tag
                outsideTag = true;
                return nextHtml();
            case StreamTokenizer.TT_WORD:
                // The token is a word: decide which tag it is.
                // The comparison strings are upper case to match toUpperCase().
                if (allWhite(sval))
                    return nextHtml(); // skip whitespace-only tokens
                else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                    return HTML_FRAME; // FRAME tag
                else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                    return HTML_IMAGE; // IMG tag
                else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                    return HTML_BACKGROUND; // BACKGROUND attribute
                else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                    return HTML_APPLET; // APPLET tag
                // No tag matched: control falls through to default.
            default:
                System.out.println("Unknown tag: " + token);
                return HTML_UNKNOWN;
            } // end of switch
        } catch (IOException e) {
            System.out.println("Error: " + e.getMessage());
            return HTML_UNKNOWN;
        }
    } // end of nextHtml

    // Returns true if s contains nothing but whitespace. The article omits
    // the body of this method; the line below is an assumed implementation.
    protected boolean allWhite(String s) {
        return s.trim().length() == 0;
    } // end of allWhite

} // end of class

The method above was tested by the author in a recent project. The operating system was Windows NT 4, and the development tool was Inprise JBuilder 3.
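The syntax-table setup used by the tokenizer class can also be tried in isolation. The following is a minimal sketch, assuming a plain StreamTokenizer over a StringReader rather than the article's subclass; the class name TagScanner, the method tagBodies, and the sample page are hypothetical and serve only as illustration. It collects the text found between each '<' and '>', which is where file names such as the value of a SRC attribute would then be looked for.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original article.
class TagScanner {

    // Returns the text found between each '<' and '>' in the given HTML.
    static List<String> tagBodies(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();      // start from an empty syntax table
        st.wordChars(0, 255);  // every character may be part of a word
        st.ordinaryChar('<');  // except the two tag delimiters, which are
        st.ordinaryChar('>');  // returned as single-character tokens

        List<String> tags = new ArrayList<>();
        boolean insideTag = false;
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '<') {
                    insideTag = true;
                } else if (token == '>') {
                    insideTag = false;
                } else if (token == StreamTokenizer.TT_WORD && insideTag) {
                    tags.add(st.sval.trim()); // the whole tag body is one word
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<html><body background=\"bg.gif\">"
                + "Hello <img src=\"logo.gif\"></body></html>";
        for (String tag : tagBodies(page)) {
            System.out.println(tag);
        }
    }
}
```

Because spaces and quotes are declared word characters, each tag body arrives as a single word token, attributes included. Text outside the tags, such as "Hello " in the sample page, is tokenized the same way and is skipped here by the insideTag check.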

This article is provided by China Computer World.

When reposting, please cite the original address: https://www.9cbs.com/read-97332.html
