Mapping XML to Java (1)
Mapping XML documents to Java objects with the SAX API
Summary: At run time, the SAX API is superior to the DOM API. This article explores how to map XML documents to Java objects with SAX. SAX is not as intuitive to use as DOM, so we first need to become familiar with how SAX is used. (3,000 words; translator's note: the English original is 3,000 words)
Robert Hustead
XML is very hot right now. Because XML is self-describing data (translator's note: "self-describing data" is a very common term in the English literature; XML can use DTDs and other mechanisms to describe its own nature, its format, or parts of its content), it can store data encoded in many different ways. XML is often used as a medium for exchanging data between heterogeneous systems. Data in XML format can easily be output from all kinds of systems, such as COBOL programs and C programs.
However, building systems with XML poses two challenges. First, generating XML data is a simple process, but the reverse, using that data from within a program, is not. Second, today's XML technologies are easy to misapply, which leads to slow speed and heavy memory consumption. In systems that use XML as their basic data exchange format, slow speed and memory consumption have proven to be the two bottlenecks.
Among the general-purpose XML processing tools currently available, some are better than others. The SAX API has features that matter for code with demanding performance requirements. In this article, we will develop some coding patterns for the SAX API. With these patterns, you can write XML-to-Java mapping code that is fast and consumes little memory, even for fairly complex XML structures (recursive structures excepted).
In Part 2, we will tackle recursive XML structures, in which some XML elements represent lists of lists. We will also develop a class library that handles the navigational side of the SAX API. This library simplifies SAX-based XML mapping.
The parsing code is like a compiler
Writing code that consumes XML is like writing a compiler. Almost all compilers turn source code into an executable program in three steps. First, a lexer module groups characters into the words (tokens) that the compiler recognizes; this is called lexical analysis. A second module, the parser, analyzes groups of tokens to recognize legal syntax. Finally, a third module takes the resulting series of legal constructs and generates executable code. Sometimes source parsing and code generation are interleaved.
To parse XML data with Java, we go through a similar process. First, we analyze each character in the XML document to recognize legal XML tokens, such as start tags, attributes, end tags, and CDATA sections.
Then we verify that these tokens form legal XML constructs. If a document consists entirely of legal constructs as defined by XML 1.0, it is a well-formed XML document. At the most basic level, for example, we have to make sure that every start tag has a matching end tag and that all attributes appear in the start tag in the correct form.
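To make the well-formedness check concrete, here is a minimal sketch that hands a string to a SAX parser and reports whether the parser accepted it. The class name WellFormedCheck and its method are hypothetical, and the sketch assumes the JAXP parser bundled with the JDK rather than any particular parser discussed in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the article's code.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, and every
            // well-formedness violation is reported as a fatal error.
            return false;
        } catch (Exception e) {
            // I/O or parser configuration problems are not about the document.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every start tag has a matching end tag.
        System.out.println(isWellFormed("<simple><name>Bob</name></simple>"));
        // The <name> start tag is never closed.
        System.out.println(isWellFormed("<simple><name>Bob</simple>"));
    }
}
```

The parser does the lexical analysis and the structural check in one call; the calling code only learns whether the document passed.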
In addition, if a corresponding DTD is available, we can optionally check that the XML constructs conform to the DTD's description, which tells us whether the document is valid as well as well-formed.
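The optional DTD check can be sketched as follows. This is an illustration under stated assumptions, not code from the article: the class DtdValidationCheck and the small inline DTD are hypothetical, and the sketch uses the JDK's SAXParserFactory with validation switched on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the article's code.
class DtdValidationCheck {

    /** Parses with DTD validation switched on and returns the validity errors. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void error(SAXParseException e) {
                            // Validity violations are recoverable errors,
                            // so parsing continues after each one.
                            errors.add(e.getMessage());
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Assumed DTD: a <simple> element must contain exactly one <name>.
        String dtd = "<!DOCTYPE simple [<!ELEMENT simple (name)>"
                + "<!ELEMENT name (#PCDATA)>]>";
        System.out.println(validityErrors(dtd + "<simple><name>Bob</name></simple>"));
        System.out.println(validityErrors(dtd + "<simple><location>New York</location></simple>"));
    }
}
```

Both documents in main are well-formed; only the second one violates the DTD, which is why validity is reported separately from well-formedness.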
Finally, we use the data in the XML document to do something meaningful. I call this "mapping XML into Java."
XML parsers
Fortunately, there are ready-made components (XML parsers) that carry out some of these compiler-like tasks for us. An XML parser handles all of the lexical analysis and parsing. Many of today's Java-based XML parsers follow one of two parsing standards: the SAX and DOM APIs. (Translator's note: a parser generally implements one of the standards rather than both at once.)
With these ready-made XML parsers, it may seem that nothing else is needed to use XML in Java. In fact, applying these XML parsers is itself quite an involved task.
SAX and DOM APIs
The SAX API is event-based. An XML parser that implements the SAX API generates events that correspond to the different features of the XML document being parsed. By handling these events in Java code, you can write programs that are driven by XML data.
The DOM API is an object-model-based API. An XML parser that implements DOM builds, in memory, a generic object model that represents the content of the XML document. Once the XML parser finishes parsing, memory holds a tree of DOM objects containing the structure and content of the XML document.
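As an illustration of the object-model approach, the following sketch builds the complete DOM tree and then navigates it to fetch one value. The class DomLookup and the sample document are assumptions made for this example; the DOM calls are the standard JAXP and org.w3c.dom ones shipped with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the article's code.
class DomLookup {

    /** Builds the whole DOM tree, then returns the text of the first matching element. */
    static String textOf(String xml, String tagName) {
        try {
            // Step 1: the parser reads the entire document into a tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Step 2: the program navigates the finished tree.
            NodeList matches = doc.getElementsByTagName(tagName);
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<simple date=\"7/7/2000\"><name>Bob</name>"
                + "<location>New York</location></simple>";
        System.out.println(textOf(xml, "name"));
        System.out.println(textOf(xml, "location"));
    }
}
```

Nothing can be looked up until parsing has finished, because the tree only exists once the whole document has been read.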
The concept of DOM comes from the HTML browser world, where a browser typically uses a common document object model to represent the loaded HTML document. Scripting languages such as JavaScript can then access these HTML DOM objects. The HTML DOM has been very successful in this area.
At first glance, the DOM API seems richer in features, and therefore better, than the SAX API. However, DOM has serious shortcomings when you design applications with demanding performance requirements.
The XML parsers that currently support DOM use an object model that creates many small objects to represent DOM nodes, each of which contains text or other nested DOM nodes. This seems natural enough, but it degrades performance. One of the most performance-sensitive operations in Java is the new operator. For every execution of the new operator, once all references to the resulting object are gone, the garbage collector is responsible for clearing that object from memory. The many small objects of the DOM API are generally discarded right after parsing, which thrashes the JVM's memory system.
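To get a feel for how many objects a DOM tree involves, the following sketch counts the nodes that a parser creates for a tiny document. The class DomNodeCount is hypothetical and serves only as an illustration; it counts nodes reachable as children and therefore leaves out attribute nodes, which are objects too.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the article's code.
class DomNodeCount {

    /** Counts the given node plus all of its descendants. */
    static int countSubtree(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countSubtree(child);
        }
        return count;
    }

    /** Parses the text into a DOM tree and counts its nodes. */
    static int countNodes(String xml) {
        try {
            Node document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return countSubtree(document);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The document node, three elements, and two text nodes: six objects
        // for two short strings, before attributes are even counted.
        System.out.println(countNodes(
                "<simple date=\"7/7/2000\"><name>Bob</name>"
                + "<location>New York</location></simple>"));
    }
}
```

Indenting the same document adds a whitespace text node between each pair of tags, so the count grows even though the data does not.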
Another shortcoming of DOM is that it loads the entire XML document into memory. For large documents, this becomes a problem. Again, because DOM is implemented as many small objects, and the JVM stores a few extra bytes of bookkeeping for each of those objects, the memory footprint ends up even larger than the XML document itself.
Another real annoyance is that many Java programs do not actually use DOM's generic object structure. Instead, as soon as the DOM structure is in memory, they copy the data into an object structure specific to their own problem domain, which is a tedious and redundant process.
Another subtle problem with the DOM API is that code written for it has to scan the XML document twice. The first pass reads the DOM structure into memory; the second locates the data of interest. Naturally, locating different pieces of data means walking back and forth through the DOM structure. By contrast, the SAX programming model supports locating and collecting XML data in a single pass.
Some of these issues could be resolved inside the parser by designing a better underlying data structure to represent the DOM objects. Problems such as multiple scans and the conversion between generic and specific object models, however, cannot be solved in the XML parser.
The advantages of SAX
Compared with the DOM API, the SAX API is a quite attractive solution. SAX has no generic object model, so there are no worries about the memory consumption and performance problems caused by overusing the new operator. Likewise, if you want to design an object model for your own problem domain, SAX imposes no redundant object model. And because SAX handles the XML document in a single pass, the processing time it needs is greatly reduced.
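The single-pass idea can be sketched as follows: a handler fills in a domain object directly while the parser reads the document, with no intermediate generic tree. The classes SimpleRecord and SimpleRecordMapper and the sample document are assumptions made for this illustration; they are not the mapping patterns developed later in this series.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical domain class for illustration.
class SimpleRecord {
    String date;
    String name;
    String location;
}

// Hypothetical handler: maps the document to a SimpleRecord in one pass.
class SimpleRecordMapper extends DefaultHandler {
    private final SimpleRecord record = new SimpleRecord();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the new element
        if (qName.equals("simple")) {
            record.date = attrs.getValue("date");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("name")) {
            record.name = text.toString().trim();
        } else if (qName.equals("location")) {
            record.location = text.toString().trim();
        }
    }

    /** Parses the text once and returns the populated domain object. */
    static SimpleRecord map(String xml) {
        try {
            SimpleRecordMapper mapper = new SimpleRecordMapper();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), mapper);
            return mapper.record;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SimpleRecord r = map("<simple date=\"7/7/2000\">\n  <name>Bob</name>\n"
                + "  <location>New York</location>\n</simple>");
        System.out.println(r.date + ", " + r.name + ", " + r.location);
    }
}
```

The only objects allocated are the ones the program actually wants to keep, plus one reusable text buffer.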
SAX does have its shortcomings, but they relate to the programmer rather than to the performance of the API itself. Let's take a look.
The first disadvantage is conceptual. Programmers are accustomed to navigating to get data. (Translator's note: the author means that programmers like to fetch data actively and take what they need, rather than have SAX throw the data at them for processing.) To find a file on a server, you navigate by changing directories. Similarly, to get data from a database, you write an SQL query. With SAX, this model is inverted. That is, you first set up your own code to listen for every piece of XML data that is available. That code is called only when XML data of interest turns up. The SAX API seems very awkward at first, but before long this way of thinking becomes a habit.
The second disadvantage is more dangerous. With SAX code, a naive, ad hoc approach backfires quickly, because the XML structure is navigated in the same pass in which the data is collected. Most people pay attention only to mapping the data and neglect the navigation, that is, the order in which the data arrives. If your code does not take into account the order in which the data stream will arrive, the code that does the navigating during SAX parsing ends up scattered and full of complicated mutual coupling. The problem is a bit like the problems caused by excessive reliance on global variables in ordinary programs. But if you learn to structure SAX code properly, it is even more straightforward than the DOM API.
(Translator's note: I had a lot of trouble understanding this passage and kept asking Mr. Robert about it. After repeated reading, I understand a little. When SAX parses XML it does not define the order of the data; the order is merely predictable. If the order changes, the code has to change with it, so keep this in mind when handling data. To build the so-called elegant code, my approach is to avoid overly complicated operations while collecting data and not to try to go back to events that have already occurred in order to get "previous" data.
The following is Mr. Robert's reply: "The point I'm making is that the navigation exists whether you are aware of the ... affect how you code. To directly address is to acknowledge the presence and impact of the navigational aspects explicitly during design. The opposite would be to ignore the aspects and instead have the navigational aspects just show up in little pockets of code in unrelated areas of the application.")
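One way to address the navigational aspect explicitly, sketched here under assumptions of my own, is to keep the current element path in one place and make every mapping decision against it. The class PathTrackingHandler and the order document are hypothetical and chosen only because the same tag name appears in two different contexts; this is not the class library that Part 2 of the series develops.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not part of the article's code.
class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>(); // root first, current element last
    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName); // navigation: we moved one level down
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Mapping decisions use the full path, not just the tag name, so the
        // two <name> elements below are kept apart without ad hoc flags.
        String where = String.join("/", path);
        if (where.equals("order/customer/name") || where.equals("order/item/name")) {
            values.put(where, text.toString().trim());
        }
        path.removeLast(); // navigation: we moved one level up
    }

    /** Parses the text and returns the collected values keyed by element path. */
    static Map<String, String> collect(String xml) {
        try {
            PathTrackingHandler handler = new PathTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<order><customer><name>Bob</name></customer>"
                + "<item><name>Widget</name></item></order>"));
    }
}
```

Without the path, a handler that sees only the tag name "name" would need flags set in one callback and read in another, which is exactly the scattered coupling described above.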
Basic SAX
There are currently two versions of the SAX API. We use the second version (see Resources) for our examples. The class and method names in version 2 differ from those in version 1, but the structure of the code is the same.
SAX is an API, not a parser, so this code is generic across XML parsers. To run the examples, you will need an XML parser that supports SAX version 2. I use Apache's Xerces parser (see Resources). Refer to your parser's getting-started documentation for information on how to invoke a SAX parser.
The SAX API specification is easy to read and contains a lot of detail. The main task in using the SAX API is to create a class that implements the ContentHandler interface, a callback interface through which the XML parser delivers the SAX events that occur as the XML document is parsed.
For convenience, the SAX API also provides a DefaultHandler adapter class that already implements the ContentHandler interface. Once you have implemented ContentHandler or extended the DefaultHandler class, you only need to direct the XML parser to parse a particular document.
Our first example extends DefaultHandler to print each SAX event to the console. This will give you a first impression of which SAX events occur and in what order.
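A handler of this kind can be sketched as follows. This is a minimal sketch rather than the Example1 listing itself: the class name EventPrinter and the exact trace format are assumptions, the events are collected in a string so that they can also be inspected by a program, and the XMLReader is obtained through the JDK's SAXParserFactory rather than through Xerces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not the article's Example1 class.
class EventPrinter extends DefaultHandler {
    private final StringBuilder log = new StringBuilder();

    @Override
    public void startDocument() {
        log.append("SAX Event: START DOCUMENT\n");
    }

    @Override
    public void endDocument() {
        log.append("SAX Event: END DOCUMENT\n");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        log.append("SAX Event: START ELEMENT [").append(qName).append("]\n");
        // Attributes arrive together with the start tag they belong to.
        for (int i = 0; i < attrs.getLength(); i++) {
            log.append("   ATTRIBUTE: ").append(attrs.getQName(i))
               .append(" VALUE: ").append(attrs.getValue(i)).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("SAX Event: END ELEMENT [").append(qName).append("]\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Whitespace between tags is reported too; trim it for readability.
        log.append("SAX Event: CHARACTERS [")
           .append(new String(ch, start, length).trim()).append("]\n");
    }

    /** Parses the text and returns one line per SAX event. */
    static String trace(String xml) {
        try {
            EventPrinter handler = new EventPrinter();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler); // install the callback object
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(trace("<simple date=\"7/7/2000\">\n  <name>Bob</name>\n</simple>"));
    }
}
```

Only the callbacks of interest are overridden; DefaultHandler supplies empty implementations for the rest of the ContentHandler interface.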
To begin, here is the sample XML document used in our first example:
<?xml version="1.0"?>
<simple date="7/7/2000">
    <name>Bob</name>
    <location>New York</location>
</simple>
The main method of Example1 creates an XMLReader, installs the handler, and parses the file:
public class Example1 extends DefaultHandler {
    // . . . (event-handling methods)
    public static void main(String[] args) {
        try {
            XMLReader xr = XMLReaderFactory.createXMLReader();
            // Install the ContentHandler. . .
            xr.setContentHandler(new Example1());
            // Parse the file. . .
            xr.parse(new InputSource(new FileReader("Example1.xml")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Finally, here is the output generated by running the first example on our sample XML document:
Example1 SAX Events:
SAX Event: START DOCUMENT
SAX Event: START ELEMENT [simple]
   ATTRIBUTE: date VALUE: 7/7/2000
SAX Event: CHARACTERS []
SAX Event: START ELEMENT [name]
SAX Event: CHARACTERS [Bob]
SAX Event: END ELEMENT [name]
SAX Event: CHARACTERS []
SAX Event: START ELEMENT [location]
SAX Event: CHARACTERS [New York]
SAX Event: END ELEMENT [location]
SAX Event: CHARACTERS []
SAX Event: END ELEMENT [simple]
SAX Event: END DOCUMENT
As you can see, the SAX parser calls the appropriate ContentHandler method for each SAX event that it encounters in the XML document. (To be continued)