http://it.sohu.com/webcourse/webmonkey/1-teach/java/page8.html
Writing an HTML File Analysis Program in Java
Tianjin University, Cuiula

Abstract: Taking a practical perspective, this article describes how to write an HTML file analysis program with StreamTokenizer, the input-stream tokenizer class of the Java language, and illustrates the implementation with the example of downloading web pages as byte streams.

1. Overview

The core of a web browser is the correct analysis of each tag in an HTML file, just as the core of a programming-language interpreter is the interpretation of the reserved words in a source file. In practical applications we often need to analyze the keywords of a particular type of file. For example, to download an HTML file together with the .gif, .class, and other files it references, we have to separate out the tags in the HTML file and find the required file names and directories. Before Java appeared, this kind of work meant examining every character in the file to find the required parts, which made the program both large and error-prone. In a recent project the author used Java's StreamTokenizer class to analyze HTML files. The task was to start from a known web page, analyze its HTML file, and then download the HTML files it contains (when the page uses frames), its image files, and its class (Java applet) files.

2. The StreamTokenizer class

StreamTokenizer turns an input stream into a stream of tokens. Tokens fall into three categories: words (multi-character tokens), single-character tokens, and whitespace (which includes Java and C/C++ style comments).

The constructor of the StreamTokenizer class is:

StreamTokenizer(InputStream in)

Since JDK 1.1 this constructor has been deprecated in favor of StreamTokenizer(Reader r), which is the form the listing in this article uses.

The class has three public instance variables, ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value respectively. When we need the characters between delimiters (that is, the text of a tag in HTML), we read sval.
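As a minimal sketch of how ttype, sval, and nval are typically read, the following loop tokenizes a short string with StreamTokenizer's default syntax table. The class name TokenDemo, the method describe, and the sample input are assumptions made for illustration; they are not part of the original article.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not from the article.
class TokenDemo {
    // Tokenizes the input with the default syntax table and returns a
    // readable description of every token up to the end of the stream.
    static List<String> describe(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + st.sval);   // string value of a word
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER:" + st.nval); // numeric value as a double
                        break;
                    default:
                        // a single-character token: ttype is the character itself
                        out.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("width = 42 ;"));
    }
}
```

With the default syntax table, letters form words, digits are parsed as numbers, and characters such as '=' come back as single-character tokens. The HTML analysis in this article replaces that table so that only '<' and '>' act as delimiters.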
The next token is read by calling nextToken(). The method returns an int, and there are four predefined return values:

StreamTokenizer.TT_NUMBER: the token read is a number; its value is a double and can be read from the instance variable nval.
StreamTokenizer.TT_WORD: the token read is a word (a non-numeric token, which may also contain other characters); the word can be read from the instance variable sval.
StreamTokenizer.TT_EOL: the token read is an end of line.
StreamTokenizer.TT_EOF: the end of the stream has been reached.

For a single-character token, nextToken() returns the value of the character itself.

Before calling nextToken(), set up the syntax table of the input stream so that the tokenizer can tell the different kinds of characters apart. whitespaceChars(int low, int hi) defines the range of whitespace characters, and wordChars(int low, int hi) defines the range of characters that make up words.

3. Program implementation

(1) Implementation of the HTMLTokenizer class

Before a token stream can be analyzed, its syntax table must be set; in this case, that means telling the program which words are HTML tags. The token stream for the HTML tags we need is defined below as a subclass of StreamTokenizer:
import java.io.*;

class HTMLTokenizer extends StreamTokenizer {
    // Token codes for the tags of interest. Only the necessary ones are
    // defined here; extend the list as needed.
    static final int HTML_TEXT = -1;
    static final int HTML_UNKNOWN = -2;
    static final int HTML_EOF = -3;
    static final int HTML_IMAGE = -4;
    static final int HTML_FRAME = -5;
    static final int HTML_BACKGROUND = -6;
    static final int HTML_APPLET = -7;

    boolean outsideTag = true; // true while the tokenizer is outside a tag

    // The constructor defines the syntax table of the token stream.
    public HTMLTokenizer(BufferedReader r) {
        super(r);
        this.resetSyntax();      // reset the syntax table
        this.wordChars(0, 255);  // every character may be part of a word
        this.ordinaryChar('<');  // the delimiters on either side of an HTML tag
        this.ordinaryChar('>');
    } // end of constructor

    public int nextHTML() {
        int token; // current token
        try {
            switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // the end of the stream has been reached
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHTML();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHTML();
                case StreamTokenizer.TT_WORD:
                    // The current token is a word: determine which tag it is.
                    // sval is upper-cased, so the patterns must be upper case too.
                    if (allWhite(sval))
                        return nextHTML(); // skip runs of whitespace
                    else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                        return HTML_FRAME;      // FRAME tag
                    else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                        return HTML_IMAGE;      // IMG tag
                    else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                        return HTML_BACKGROUND; // BACKGROUND attribute
                    else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                        return HTML_APPLET;     // APPLET tag
                    // no tag of interest matched: fall through to default
                default:
                    System.out.println("Unknown Tag: " + token);
                    return HTML_UNKNOWN;
            } // end of switch
        } catch (IOException e) {
            System.out.println("Error: " + e.getMessage());
            return HTML_UNKNOWN;
        }
    } // end of nextHTML

    protected boolean allWhite(String s) {
        // The article omits this body. The line below is an assumed
        // implementation: true when s contains only whitespace.
        return s.trim().length() == 0;
    } // end of allWhite
} // end of class
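To show how a syntax table of this kind behaves, here is a self-contained sketch that applies the same resetSyntax / wordChars / ordinaryChar setup directly to a StreamTokenizer and collects the text of the tags of interest. The class name TagScanDemo, the method findTags, the sample HTML, and the decision to skip closing tags are all assumptions made for illustration, not the author's code.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not from the article.
class TagScanDemo {
    // Returns the raw text of every opening tag that mentions
    // FRAME, IMG, BACKGROUND or APPLET (case-insensitive substring test).
    static List<String> findTags(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();       // clear the default syntax table
        st.wordChars(0, 255);   // treat every character as part of a word
        st.ordinaryChar('<');   // except the two tag delimiters
        st.ordinaryChar('>');

        List<String> tags = new ArrayList<>();
        boolean insideTag = false;
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '<') {
                    insideTag = true;
                } else if (token == '>') {
                    insideTag = false;
                } else if (token == StreamTokenizer.TT_WORD && insideTag
                        && !st.sval.startsWith("/")) { // assumption: ignore closing tags
                    String upper = st.sval.toUpperCase();
                    if (upper.contains("FRAME") || upper.contains("IMG")
                            || upper.contains("BACKGROUND") || upper.contains("APPLET")) {
                        tags.add(st.sval); // sval holds everything between '<' and '>'
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<html><body background=\"bg.gif\"><img src=\"a.gif\">"
                + "<p>text</p><applet code=\"X.class\"></applet></body></html>";
        for (String tag : findTags(page)) {
            System.out.println(tag);
        }
    }
}
```

Because every character except '<' and '>' is a word character, each tag arrives as a single word in sval, attributes included. The file names to download could then be extracted from that string in a later step.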
The method above was tested by the author in a recent project. The operating system was Windows NT 4, and the development tool was Inprise JBuilder 3.
This article is provided by China Computer World.