Author: Ping Guo, Julie Basu, Mark Scardina and K. Karun
Choosing the Right XML Parsing Technology for Your Java Application
As XML becomes more widely used, parsing XML documents efficiently matters more and more, especially for applications that handle large volumes of data. A poorly chosen parsing approach can lead to excessive memory consumption and processing time, which hurts scalability.
XML parsers come in several varieties. Which one is best for you? This article examines three popular XML parsing techniques for Java and explains how to choose among them according to your application's requirements.
An XML parser takes a raw character stream as input and performs several operations on it. First, it checks that the XML data is well formed: every start tag has a matching end tag, and no elements overlap. Most parsers can also validate the document against a document type definition (DTD) or an XML Schema, verifying that its structure and content are what you specified. Finally, the parser makes the content of the XML document available to the application through a programming API.
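The well-formedness check described above can be tried in a few lines. The following is a minimal sketch, not taken from the original article: it assumes the JDK's standard JAXP classes rather than the Oracle parser classes used later, and the class name WellFormedCheck and method name isWellFormed are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events; its fatalError()
            // rethrows, so a syntax error surfaces as a SAXException.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // not well formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Matching start and end tags: prints true
        System.out.println(isWellFormed("<catalog><book><title>Sample</title></book></catalog>"));
        // Overlapping elements: prints false
        System.out.println(isWellFormed("<catalog><book></catalog></book>"));
    }
}
```

This sketch only checks syntax. Validation against a DTD or XML Schema is a separate step that has to be switched on in the parser factory.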
There are three popular XML parsing techniques for Java:
Document Object Model (DOM), a mature standard from the W3C.
Simple API for XML (SAX), the first widely adopted API for XML in Java and a de facto standard.
Streaming API for XML (StAX), a promising new parsing model defined in JSR-173.
Each of these techniques has its advantages and disadvantages.
The following XML document, books.xml, describes a book catalog and is used as the example throughout this article:
    <catalog>
      <book>
        <title>...</title>
        ...
      </book>
      <book>
        <title>...</title>
        ...
      </book>
    </catalog>
DOM Parsing
DOM is a tree-based parsing technique that builds a complete parse tree in memory. It gives the application full, dynamic access to the entire XML document.
Figure 1 shows the tree structure of the DOM parsing model. The Document node is the root of every DOM tree. It has at least one child node, the root element, which in the sample document is the catalog element. Another possible child is a DocumentType node describing the DTD, which our example does not define. The catalog element has child nodes, and those nodes have children of their own. A child node can be an element, text, a comment, a processing instruction, or similar information.
Figure 1: DOM tree
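The tree structure in Figure 1 can be made visible with a short recursive walk. The sketch below is an illustration rather than part of the original article: it assumes the JDK's standard JAXP DocumentBuilder instead of the Oracle parser, and the names DomTreeWalk and outline are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class DomTreeWalk {

    /** Parses the XML text and returns one indented line per DOM node. */
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out); // the Document node is the root of the tree
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends this node, then recurses into its children one level deeper. */
    private static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(typeName(node)).append(' ').append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    private static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "Document";
            case Node.DOCUMENT_TYPE_NODE: return "DocumentType";
            case Node.ELEMENT_NODE: return "Element";
            case Node.TEXT_NODE: return "Text";
            case Node.COMMENT_NODE: return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "ProcessingInstruction";
            default: return "Other";
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<catalog><book><title>Sample</title></book></catalog>"));
    }
}
```

For a one-book catalog the outline starts at the Document node, descends through the catalog, book, and title elements, and ends at the Text node that holds the title.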
The following example shows how the DOM API is used. It prints the titles of all the books in the catalog from the XML document above.
    import oracle.xml.parser.v2.DOMParser;  // assumed: the Oracle XDK parser class
    import org.w3c.dom.*;

    DOMParser parser = new DOMParser();
    parser.parse("books.xml");
    Document document = parser.getDocument();

    // Collect every <title> element in the document
    NodeList nodes = document.getElementsByTagName("title");
    for (int i = 0; i < nodes.getLength(); i++) {
        Element titleElem = (Element) nodes.item(i);
        Node childNode = titleElem.getFirstChild();
        // Print the title only if the first child is a text node
        if (childNode instanceof Text) {
            System.out.println("Book title is: " + childNode.getNodeValue());
        }
    }

This program takes the name of an XML file, builds a DOM tree, and uses the getElementsByTagName() method to find all the DOM element nodes for title elements. It then iterates over the list of title elements and prints the text of each one, using getFirstChild() to check that the element's first child node is the text between its start and end tags.

As you can see, DOM is very simple to use. Because the entire tree is built in memory, you can access the XML document randomly. You can also modify nodes through the DOM API, for example by appending a child node or by changing or deleting a node.

Although the in-memory tree provides good navigation support, there are parsing-strategy issues to consider. First, the entire XML document must be parsed at once; partial parsing is not possible. Second, loading the whole document and building the full tree in memory is expensive, especially when the document is large. The DOM tree is typically larger than the document itself, so it consumes a lot of memory. Third, the generic DOM node types are an advantage for interoperability but may not be the best fit for binding to object types.

DOM parsing suits some kinds of applications better than others. It is appropriate when the application needs random access to the XML document. A prime example is an XSL processor, which has to navigate through the whole document repeatedly while processing templates. Because DOM lets you update the document, it is also convenient for applications that modify documents, such as XML editors.

SAX Parsing
SAX is an event-driven "push" model for processing XML.
It is not a W3C standard, but it is a widely recognized API that most SAX parsers implement in a compliant way. Rather than building a tree of the entire document as DOM does, a SAX parser fires a series of events as it reads through the document. These events are pushed to event handlers, which provide access to the document's content.

There are three basic types of event handler: DTDHandler, for accessing the contents of XML DTDs; ErrorHandler, for low-level access to parsing errors; and ContentHandler, by far the most commonly used, for accessing the document's content.

Figure 2 shows how a SAX parser reports events through a callback mechanism. The parser reads the input document and pushes each event to MyContentHandler as it processes the document.
Figure 2: SAX reports the document to the application as a series of events

The following example does the same thing as the previous DOM example: it prints the title of each book. First, write a ContentHandler implementation class that extends DefaultHandler and overrides the methods for the events you are interested in. All other events fall through to DefaultHandler, which discards them. A custom ContentHandler provides the callback methods and must handle state management across start-element, end-element, and character events for all elements, not just title elements.

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class MyContentHandler extends DefaultHandler {
        // True while the parser is inside a <title> element
        boolean isTitle;

        // localName is populated only when the parser is namespace-aware
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("title"))
                isTitle = true;
        }

        public void endElement(String uri, String localName, String qName) {
            if (localName.equals("title"))
                isTitle = false;
        }

        public void characters(char[] chars, int start, int length) {
            if (isTitle)
                System.out.println(new String(chars, start, length));
        }
    }

Second, register your custom ContentHandler with the SAX parser and start parsing the XML document.
As it reads the document from beginning to end, the parser generates events and pushes them to the ContentHandler.

    // SAXParser is assumed to be the Oracle XDK class oracle.xml.parser.v2.SAXParser
    SAXParser saxParser = new SAXParser();
    MyContentHandler myHandler = new MyContentHandler();
    saxParser.setContentHandler(myHandler);
    saxParser.parse(new File("books.xml"));

SAX parsing offers better performance than DOM. It provides efficient low-level access to the content of an XML document. The biggest advantage of the SAX model is its low memory consumption: because the whole document never has to be loaded into memory at once, a SAX parser can parse documents larger than system memory. Nor do you have to create objects for every node, as DOM does. Finally, the SAX "push" model can be used in a broadcast setting, in which multiple ContentHandlers are registered and receive events in parallel rather than one after another in a pipeline.

The disadvantage of SAX is that you must implement event handlers that deal with every incoming event, and you must maintain the event state in your application code. Because a SAX parser does not convey structural information the way DOM does with its parent/child support, you have to keep track of where the parser is in the document hierarchy. The more complex your documents, the more complex your application logic becomes. And although the whole document need not be loaded into memory at once, a SAX parser still has to parse the entire document, just as DOM does.

Perhaps the biggest problem with SAX is that it has no built-in navigation support of the kind XPath provides. Combined with its single-pass parsing, this means it does not support random access. The limitation also shows up with namespaces: elements that inherit a namespace carry no annotation of it. These limitations make SAX a poor choice for manipulating or modifying documents.

Oracle and XML
The Oracle XML Developer's Kit (XDK) provides XML parsers for Java, C, and C++.
Each provides enterprise-grade implementations of the DOM and SAX interfaces. A technology preview of a StAX parser for Java is also available. These components can be downloaded from the Oracle Technology Network at otn.oracle.com/tech/xml.

Applications that only need to read XML content can benefit greatly from SAX parsing. Many B2B and EAI applications use XML as an encapsulation format and simply retrieve the data at the receiving end. Here SAX clearly outperforms DOM, because its efficiency yields high throughput. SAX 2.0 has a built-in filtering mechanism that makes it easy to output a subset of a document or to perform simple document transformations. Finally, SAX parsing is very useful for validating against DTDs and XML Schemas. In fact, Oracle uses SAX parsing internally for this validation, because it uses less memory and is more efficient than DOM.

StAX Parsing
StAX is an exciting new parsing technique that, like SAX, uses an event-driven model. Instead of SAX's push model, however, StAX uses a "pull" model for event processing. Rather than using callbacks, a StAX parser returns events when the application asks for them. StAX also provides user-friendly APIs for both reading and writing.

Whereas SAX returns different types of events to the ContentHandler, StAX returns its events to the application and can even provide them as objects.

Figure 3 shows that when the application requests an event, the StAX parser reads from the XML document as needed and returns the event to the application.
Figure 3: The application asks StAX to report one event at a time

StAX provides a factory for creating StAX readers, so applications can use the StAX interfaces without referring to the details of a particular implementation.
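The article mentions that StAX offers an API for writing as well as reading, but its examples cover only reading. The following is a minimal sketch of the writing side, assuming the standard javax.xml.stream API shipped with the JDK. The class name StaxWriteSketch, the method name writeCatalog, and the sample titles are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, for illustration only.
class StaxWriteSketch {

    /** Writes a catalog document with one book element per title. */
    static String writeCatalog(String... titles) {
        try {
            StringWriter out = new StringWriter();
            // The factory hides the concrete writer implementation
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("catalog");
            for (String title : titles) {
                writer.writeStartElement("book");
                writer.writeStartElement("title");
                writer.writeCharacters(title); // escapes markup characters as needed
                writer.writeEndElement();      // closes title
                writer.writeEndElement();      // closes book
            }
            writer.writeEndElement();          // closes catalog
            writer.writeEndDocument();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeCatalog("First sample title", "Second sample title"));
    }
}
```

As with reading, the application drives the process: it decides which element to write next, and the writer emits output in that order.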
Unlike DOM and SAX, StAX specifies two parsing models: the cursor model, which, like SAX, simply returns events; and the iterator model, which returns events as objects, providing a more natural interface at the cost of additional object creation. The following examples illustrate each model with the StAX parsing API.

The first example uses the cursor model to print the book titles from the XML book catalog above.

    import java.io.FileInputStream;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;

    XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("books.xml"));
    while (reader.hasNext()) {
        int eventType = reader.next();
        if (eventType == XMLEvent.START_ELEMENT
                && reader.getLocalName().equals("title")) {
            reader.next();  // move the cursor to the title's text
            System.out.println(reader.getText());
        }
    }

In this example, once the reader has been created, the application requests the next event by calling reader.next(), which moves the StAX parser's cursor to the next event. If that event is the start of an element named "title", the application calls reader.next() again to move the cursor forward, then retrieves the text of the title element with reader.getText().

The next example uses the iterator model, in which events are returned as objects.

    XMLEventReader eventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream("books.xml"));
    while (eventReader.hasNext()) {
        // In the final JSR-173 API, nextEvent() returns a typed XMLEvent
        XMLEvent event = eventReader.nextEvent();
        if (event instanceof StartElement
                && ((StartElement) event).getName().getLocalPart().equals("title")) {
            System.out.println(((Characters) eventReader.nextEvent()).getData());
        }
    }

In this example, each request for the next event advances the StAX parser to the next event and returns the corresponding event object. The application accesses the content through the event object, here using getData() to obtain the book title.

Next Steps
Visit the OTN XML Center: otn.oracle.com/tech/xml
Download the Oracle XDK: otn.oracle.com/tech/xml
Learn more about XML from Oracle University: oracle.com/education/om (keyword: XML)

The performance of the StAX cursor model is comparable to that of SAX parsing. With StAX, however, the application controls the parsing, which makes the code easier to write and maintain. StAX also offers the iterator model, which is easier to use, but creating event objects costs some performance. Whereas SAX requires the application to keep track of its location in the document, StAX returns events on request, so the application does not need to do this tracking. These examples do not show StAX filtering, which is considerably more powerful than SAX's.

Compared with DOM, StAX shares some of SAX's shortcomings, namely the lack of full navigation support. Traversing a document forward is easier in StAX than in SAX, though, because the application controls which events it gets and when. StAX's ability to modify documents is similar to SAX's, since both create a new document. Both the cursor model and the iterator model provide a writing API, but if you need to do more than simple transformation, modifying a document is fairly difficult.

StAX parsing can serve most applications that SAX serves, so any application suited to SAX parsing can also benefit from StAX. Moreover, StAX is the best choice when an application needs a streaming model for performance while retaining full namespace support. Finally, to process multiple inputs, as with a set of imported schemas, an application can simply request events from multiple StAX parsers and handle them in a single context, without starting multiple threads. StAX is especially useful in new areas such as Web services and JAX-RPC, where all of these capabilities are needed.
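The point about multiple inputs follows directly from the pull model: because the application decides when each parser advances, it can alternate between several readers in one thread. The sketch below illustrates this under stated assumptions. It uses the standard javax.xml.stream API, reads from strings rather than files so that it is self-contained, and the names StaxMultiInputSketch, nextTitle, and interleaveTitles are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class StaxMultiInputSketch {

    /** Advances the reader to the next title element and returns its text, or null at end of input. */
    static String nextTitle(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(reader.getLocalName())) {
                return reader.getElementText(); // leaves the cursor on the end tag
            }
        }
        return null;
    }

    /** Pulls titles from two documents in turn, using a single thread. */
    static List<String> interleaveTitles(String firstXml, String secondXml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader first = factory.createXMLStreamReader(new StringReader(firstXml));
            XMLStreamReader second = factory.createXMLStreamReader(new StringReader(secondXml));
            List<String> titles = new ArrayList<>();
            String a = nextTitle(first);
            String b = nextTitle(second);
            // Each parser moves only when the application asks it to
            while (a != null || b != null) {
                if (a != null) {
                    titles.add(a);
                    a = nextTitle(first);
                }
                if (b != null) {
                    titles.add(b);
                    b = nextTitle(second);
                }
            }
            return titles;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(interleaveTitles(
                "<catalog><book><title>A1</title></book><book><title>A2</title></book></catalog>",
                "<catalog><book><title>B1</title></book></catalog>"));
    }
}
```

With a push parser such as SAX, each parse call runs to completion before it returns, so interleaving two inputs like this would normally require separate threads or buffering.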
Choosing the Right Parsing Model
This article has described three standard parsing techniques for Java, explaining how each works, its advantages and disadvantages, and the applications it suits. The main points can be summarized in a few simple rules:
Use DOM parsing when your application needs to navigate or modify the document continually, or to access the whole document at random.
Use SAX parsing when you need simple read-only streaming and want solid implementations of a mature standard.
Use StAX parsing for streaming applications that require full namespace support, multi-document support, or an object interface.
Each parsing model has its place, and as distributed, service-based applications develop, optimized performance will become an important factor in their success.

Ping Guo (ping.guo@oracle.com) and Julie Basu (julie.basu@oracle.com) are members of Oracle Java and XML Technologies. Mark Scardina (mark.scardina@oracle.com) and K. Karun (k.karun@oracle.com) are members of the Oracle Core and XML Development group.

XML Parsing Techniques at a Glance

DOM parsing
    Advantages: Easy to use. Rich set of APIs for easy navigation. The entire tree is loaded into memory, allowing random access to the XML document.
    Disadvantages: The entire XML document must be parsed at once. Loading the whole tree into memory is expensive. Generic DOM nodes are not ideal for object-type binding, because objects must be created for all nodes.
    Best suited for: Applications that modify XML documents, and XSLT (not for applications that only read XML).

SAX parsing
    Advantages: The entire document need not be loaded into memory, so memory consumption is low. The push model allows multiple ContentHandlers to be registered.
    Disadvantages: No built-in document navigation support. No random access to the XML document. No support for modifying XML in place. No support for namespace scoping.
    Best suited for: Applications that only read data from XML (not for manipulating or modifying XML documents).

StAX parsing
    Advantages: Provides two parsing models, for simplicity and for performance. Multiple inputs are easily supported. Powerful filtering capabilities.
    Disadvantages: No built-in document navigation support. No random access to the XML document. No support for modifying XML in place.
    Best suited for: Applications that require a streaming model and namespace support (not for manipulating or modifying XML documents).