Characteristics and performance of XML document models in Java (extracted from the IBM developerWorks China website)


XML in Java: Document models, Part 1: Performance


A study of the characteristics and performance of XML document models in Java
Dennis M. Sosnoski (DMS@sosnoski.com)
President, Sosnoski Software Solutions, Inc.
September 2001

In this article, Java consultant Dennis Sosnoski compares the performance and functionality of several Java document models. The tradeoffs are not always clear when you choose a model, and switching later can require extensive recoding. The author places the performance results in the context of feature sets and standards compliance, and offers advice on making the right choice for your requirements. The article includes several charts and the source code for the test suite.

Java developers who work with XML documents in memory can choose between the standard DOM representation and several Java-specific models. This flexibility has helped establish Java as an excellent platform for XML work. However, as the number of models has grown, it has become harder to determine how they compare in features, performance, and ease of use.

This first article in a series on using XML in Java looks at the features and performance of some of the leading XML document models in Java. It includes the results of a set of performance tests (the test code is available for download, see Resources). The second article in the series will look at ease of use by comparing sample code that accomplishes the same tasks with the different models.

Document models

The number of document models available in Java keeps growing. For this article I have included the most widely used models along with a couple of choices that demonstrate especially interesting features that may not yet be widely known or used. Given the growing importance of XML Namespaces, I have included only models that support this feature. The models are listed below with brief introductions and version information.

To clarify the terminology used in this article:

Parser means the program that interprets the structure of an XML text document.
Document representation means the data structures a program uses to work with the document in memory.
Document model means a library and API that supports working with a document representation.

Some XML applications do not need a document model at all. If your application can gather the information it needs in a single pass through the document, you can probably program directly to the parser. This approach may take a little more work, but it always gives better performance than building a document representation in memory.

DOM

DOM (Document Object Model) is the official W3C standard for representing XML documents in a platform- and language-neutral manner. It serves as a good baseline for any Java-specific model: to be worth departing from the DOM standard, a Java-specific model should offer a significant performance or ease-of-use advantage over the Java DOM implementations.

The DOM definition makes heavy use of interfaces and inheritance for the different components of an XML document. This lets developers use a common interface for several different types of components, but it adds complexity to the API. Because DOM is language-neutral, its interfaces do not make use of common Java components such as the Collections classes.

This article covers two DOM implementations: Crimson and Xerces Java. Crimson is an Apache project based on the Sun Project X parser. It incorporates a full validating parser with DTD support. The parser is accessible through the SAX2 interface, and the DOM implementation can work with other SAX2 parsers. Crimson is open source, released under the Apache license. The version used for the performance comparisons is Crimson 1.1.1 (JAR file size 0.2MB), with the included SAX2 parser used to build the DOM from text files.

The other DOM implementation tested, Xerces Java, is another Apache project. Xerces was originally based on the IBM Java parser (commonly referred to as XML4J). (A redeveloped Xerces Java 2, currently in early beta, will eventually succeed it. The current version is sometimes referred to as Xerces Java 1.) As with Crimson, the Xerces parser can be accessed through the SAX2 interface as well as through DOM. However, Xerces does not provide any way to use the Xerces DOM with a different SAX2 parser. Xerces Java includes validation support for both DTDs and XML Schema (with only minimal limitations on the Schema support).

Xerces Java also supports a deferred node expansion mode for DOM (referred to in this article as Xerces deferred or Xerces def.), in which document components are initially represented in a compact format that is expanded to a full DOM representation only as it is used. This mode is intended to allow faster parsing and reduced memory use, especially for applications that may use only part of the input document. Like Crimson, Xerces is open source, released under the Apache license. The version used for the performance comparisons is Xerces 1.4.2 (JAR file size 1.8MB).

JDOM

JDOM aims to be a Java-specific document model that makes interacting with XML simpler and faster than using DOM. Because it was the first Java-specific model, JDOM has been heavily publicized and promoted. It is being considered for eventual use as a Java Standard Extension through Java Specification Request JSR-102. The form this will actually take is still under development, however, and the beta JDOM APIs have changed substantially between versions. JDOM has been under development since early 2000.

JDOM differs from DOM in two main respects. First, JDOM uses only concrete classes rather than interfaces.

This simplifies the API in some respects, but also limits flexibility. Second, the API makes extensive use of the Collections classes, which simplifies its use for Java developers who are already familiar with those classes.

The JDOM documentation states that its goal is to "solve 80% (or more) of Java/XML problems with 20% (or less) of the effort" (with the 20% presumably referring to the learning curve). JDOM is certainly usable for the majority of Java/XML applications, and most developers find the API much easier to understand than DOM. JDOM also includes fairly extensive checks on program behavior to prevent the user from doing anything that makes no sense in XML. However, it still requires a thorough understanding of XML to do anything beyond the basics (or, in some cases, even to understand the errors). Acquiring that understanding is probably a larger effort than learning either the DOM or the JDOM interfaces.

JDOM itself does not include a parser. It normally uses a SAX2 parser to parse and validate input XML documents (though it can also take a previously constructed DOM representation as input). It includes converters to output a JDOM representation as a SAX2 event stream, a DOM model, or an XML text document. JDOM is open source, released under a variant of the Apache license. The version used for the performance comparisons is JDOM Beta 0.7 (JAR file size 0.1MB), with the Crimson SAX2 parser used to build the JDOM representation from text files.

dom4j

Although dom4j represents a completely independent development effort, it started out as a kind of intelligent fork of JDOM. It incorporates many features beyond basic XML document representation, including integrated XPath support, XML Schema support (currently in alpha form), and event-based processing for very large or streamed documents. It also offers the option of building a document representation with parallel access through the dom4j API and a standard DOM interface. It has been under development since the second half of 2000, and the existing APIs have been preserved between recent releases.

To support all of these features, dom4j uses interfaces and abstract base classes. It makes extensive use of the Collections classes in the API, but in many cases it also provides alternatives that allow better performance or a more direct coding approach. The direct benefit is that dom4j provides considerably more flexibility than JDOM, at the cost of a more complex API.

Along with the goals of added flexibility, XPath integration, and handling of very large documents, dom4j shares the goals of JDOM: ease of use and intuitive operation for Java developers. It also aims to be a more complete solution than JDOM, with the goal of solving essentially all Java/XML problems. Within that goal, it places less emphasis than JDOM on preventing incorrect application behavior.

dom4j takes the same approach to input and output as JDOM, relying on a SAX2 parser for input handling and on converters to output a SAX2 event stream, a DOM model, or an XML text document. dom4j is open source, released under a BSD-style license that is essentially equivalent to an Apache-style license. The version used for the performance comparisons is dom4j 0.9 (JAR file size 0.4MB), with the bundled AElfred SAX2 parser used to build representations from text files. (Because of SAX2 option settings, one of the test files could not be handled by dom4j with the Crimson SAX2 parser used for the JDOM tests.)

Electric XML

Electric XML (EXML) is a spin-off from a commercial project that supports distributed computing.

It differs from the other models discussed so far in that it properly supports only a subset of XML documents, provides no support for validation, and has a more restrictive license. However, EXML offers the advantages of very small size and direct support for a subset of XPath, and it made an interesting candidate for this comparison because some recent articles have promoted it as an alternative to the other models.

EXML takes an approach similar to JDOM in avoiding interfaces, although it achieves some of the same effect by using abstract base classes (the main difference is that interfaces give greater flexibility for extending the implementation). It also differs from JDOM in avoiding the Collections classes. This combination gives it a very simple API, which in many ways resembles a simplified DOM API with added XPath operations.

EXML preserves whitespace in a document only when the whitespace is adjacent to non-whitespace text content, which limits EXML to a subset of XML documents. Standard XML requires that this whitespace be preserved when a document is read, unless the document DTD establishes that the whitespace is not significant. The EXML approach works well for the many XML applications where it is known in advance that this type of whitespace is insignificant, but it rules out EXML for documents where whitespace is expected to be retained (for example, applications that generate documents to be displayed or viewed in a browser). (See the sidebar Uses of whitespace for the author's proposal on this topic.)

This whitespace deletion can have a misleading effect on performance comparisons. Many types of tests scale with the number of components in the document, and each whitespace sequence deleted by EXML is a component in the other models. EXML is included in the results shown in this article, but keep this effect in mind when interpreting the performance differences.

EXML uses an integrated parser to build the document representation from a text document. Except by way of text, it provides no means of converting from or to DOM representations or SAX2 event streams. EXML is open source, released by The Mind Electric under a restrictive license that prohibits embedding it in certain types of applications or libraries. The version used for the performance comparisons is Electric XML 2.2 (JAR file size 0.05MB).

XML Pull Parser

XML Pull Parser (XPP) is a recent development that demonstrates a different approach to XML parsing. Like EXML, XPP properly supports only a subset of XML documents and provides no support for validation. It also shares the advantage of very small size. That advantage, combined with the pull-parser approach, made it a good alternative to include in this comparison.

XPP uses interfaces almost exclusively, but with only a small total number of classes. Like EXML, XPP avoids the Collections classes in its API. Overall, it offers the simplest document model API of any covered in this article.

The limitation that restricts XPP to a subset of XML documents is that it does not support entities, comments, or processing instructions in the document. XPP creates a document structure consisting only of elements, attributes (including Namespaces), and content text. This is a severe limitation for some types of applications. However, it generally has less impact on performance than the whitespace handling in EXML. Only one of the test files used in this article is incompatible with XPP, and the XPP results in the charts carry a note that they do not include this file.

The pull-parser support in XPP (referred to in this article as XPP pull) works by postponing parsing until a component of the document is accessed, and then parsing only as much of the document as is needed to construct that component. This technique is intended to allow very fast document screening or classification, especially where documents may need to be forwarded or discarded rather than fully parsed and processed.
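As an illustration of the pull style of screening, here is a minimal sketch that classifies a document by its root element and stops reading as soon as that element's start tag has been seen. It uses the StAX API (javax.xml.stream) bundled with current JDKs, which postdates this article and is not the XPP API; the class name, method name, and sample document are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: screens a document with a pull parser (StAX, not XPP).
final class PullScreeningSketch {

    /** Returns the local name of the root element, pulling events only up to its start tag. */
    static String rootElementName(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            while (reader.hasNext()) {
                // The application asks for the next event; the parser does not push events at it.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName(); // stop here, the rest is never requested
                }
            }
            return null; // not reached for a well-formed document
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<order priority=\"high\"><item>widget</item><item>gadget</item></order>";
        System.out.println(rootElementName(doc.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A screening application could route or discard the document based on the returned name and hand only the interesting documents to a full document model.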

Use of this approach is optional: if XPP is used in non-pull mode, it parses the entire document and builds the complete representation at the same time. Like EXML, XPP uses an integrated parser to build the document representation from a text document, and except by way of text it provides no means of converting from or to DOM representations or SAX2 event streams. XPP is open source with an Apache-style license. The version used for the performance comparisons is PullParser 2.0.1 Beta 8 (JAR file size 0.04MB).

Test details

The timing results shown come from tests using Sun Microsystems Java version 1.3.1, Java HotSpot Client VM 1.3.1-b24, running under Red Hat Linux 7.1 on a 1GHz Athlon system with 256MB of RAM. The initial and maximum JVM memory sizes were both set to 128MB for these tests, which I intended to represent a server-type execution environment.

In tests run with the default JVM memory settings of 2MB initial and 64MB maximum, the results for the models with larger JAR file sizes (DOM, JDOM, and dom4j) were much worse, especially in the average times for running the tests. This was probably due to inefficient operation of the HotSpot JVM when memory is limited.

Two of the document models (XPP and EXML) support direct input from a String or a character array. This type of direct input is not representative of real applications, so I avoided it in these tests. For both input and output I used Java streams wrapping byte arrays, which eliminates I/O effects on performance while keeping the stream interface that applications typically use for XML document input and output.

Performance comparisons

The performance comparisons in this article are based on parsing and working with a set of XML documents chosen to represent a wide range of applications:

much_ado.xml, the Shakespeare play marked up as XML. No attributes and a fairly flat structure (202K bytes).
periodic.xml, the periodic table of the elements in XML. Some attributes, also fairly flat (117K bytes).
soap1.xml, taken from the sample SOAP document in the specification. Heavy on namespaces and attributes (0.4K bytes, repeated 49 times for each test).
soap2.xml, a list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
nt.xml, the New Testament marked up as XML. No attributes and a very flat structure, with heavy text content (1047K bytes).
xml.xml, the XML specification, with the DTD reference removed and all entities defined inline. Presentation-style markup with heavy mixed content and some attributes (160K bytes).

For more information on the test platform, see the sidebar Test details, and see Resources for a link to the source code used for the tests.

Except for the very small soap1.xml document, all timings measure the elapsed time for each particular test on the document. For soap1.xml, the time measured is that for 49 consecutive copies of the document (enough copies to total about 20K bytes of text).

The test framework runs a particular test on a document several times in a loop (10 passes here), tracking the best time and the average time for that test, and then goes on to the next test on the same document. After completing the full sequence of tests on one document, it repeats the process for the next document. To prevent interactions between document models, only one model is tested in each execution of the test framework.

Timing benchmarks on HotSpot and similar dynamically optimizing JVMs are notoriously tricky; small changes in the test sequence often lead to large changes in the timing results. I have found this to be especially true of the average time to execute a particular piece of code. The best time is much more consistent, and it is the value shown in these results.
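The measurement loop described above can be sketched roughly as follows. This is not the author's test framework (that code is linked from Resources); it is a minimal illustration that uses the JDK's built-in JAXP DOM parser, reads the document from a stream wrapping a byte array, and tracks the best and average time over a number of passes. The class, method, and field names are assumptions made for the example, and System.nanoTime is a later addition to Java than the JVM used in the article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class: times document building, keeping best and average times.
final class BuildTimingSketch {

    /** Best and average elapsed time per pass, in nanoseconds. */
    static final class Timing {
        final long bestNanos;
        final long averageNanos;

        Timing(long bestNanos, long averageNanos) {
            this.bestNanos = bestNanos;
            this.averageNanos = averageNanos;
        }
    }

    /** Builds the document `copies` times per pass, for `passes` passes. */
    static Timing timeBuild(byte[] xml, int copies, int passes) throws Exception {
        if (copies < 1 || passes < 1) {
            throw new IllegalArgumentException("copies and passes must be positive");
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // One builder is reused for every parse; creating a parser per copy adds overhead.
        DocumentBuilder builder = factory.newDocumentBuilder();
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int pass = 0; pass < passes; pass++) {
            long start = System.nanoTime();
            for (int copy = 0; copy < copies; copy++) {
                // Stream over a byte array: no file I/O inside the timed region.
                Document doc = builder.parse(new ByteArrayInputStream(xml));
                if (doc.getDocumentElement() == null) {
                    throw new IllegalStateException("document has no root element");
                }
            }
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            total += elapsed;
        }
        return new Timing(best, total / passes);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<list><value>1</value><value>2</value></list>"
                .getBytes(StandardCharsets.UTF_8);
        // 49 copies per pass and 10 passes mirror the numbers quoted for soap1.xml.
        Timing timing = timeBuild(xml, 49, 10);
        System.out.println("best ns: " + timing.bestNanos
                + ", average ns: " + timing.averageNanos);
    }
}
```

Early passes usually include class loading and JIT compilation, which is one reason the average tends to be noisier than the best time.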
See the first test (document build time) for a comparison of results using average and best times.

Document build time

The document build time test looks at the time required to parse a text document and construct the document representation. For comparison, the SAX2 parse times using the Crimson and Xerces SAX2 parsers are also included in the chart, because most of the document models (all except EXML and XPP) use a SAX2 parse event stream as input to the document building process. Figure 1 shows the test results.

Figure 1. Document build time

For most of the test documents, the XPP pull build time is too short to measure (because in this case the document is not actually parsed); it shows only as a very short time for soap1.xml. For this file, the pull parser memory size and the associated creation overhead make XPP look relatively slow. This is because the test program creates a new copy of the pull parser for each copy of the document being parsed, and for soap1.xml 49 copies are used for each timing measurement. The overhead of allocating and initializing these parser instances is greater than the time required to repeatedly parse the text and build the document representation.

The author of XPP pointed out in an e-mail discussion that a real application could pool pull parser instances for reuse. If that were done, the overhead for the soap1.xml file would drop to a negligible level. For larger files, the pull parser creation overhead is negligible even without pooling.

XPP (with full parsing), Xerces with deferred node creation, and dom4j show the best overall performance in this test.

Xerces deferred does especially well on larger documents, but on small documents its overhead is much higher, higher even than that of the regular Xerces DOM. The deferred node creation approach also has high overhead when part of the document is first used, which reduces the advantage of fast parsing.

For the small soap1.xml file, Xerces shows high overhead in all its forms (SAX2 parsing, regular DOM, and deferred DOM). XPP (with full parsing) does especially well on this file, and for soap1.xml even EXML outperforms the SAX2-based models. Overall, though, EXML is the worst performer in this test, despite its advantage of discarding isolated whitespace content.

Document walk time

The document walk time test looks at the time required to walk the constructed document representation, visiting each element, attribute, and text content segment in document order. It is intended to represent the performance of the document model interfaces, which may be important for applications that repeatedly access information from a document once it has been parsed. Overall, the walk times are much shorter than the parse times. For applications that make only a single pass through each parsed document, the parse times will matter more than the walk times. Figure 2 shows the results.

Figure 2. Document walk time

XPP is far ahead of the rest in this test. Xerces DOM takes about twice as long as XPP. EXML takes almost three times as long as XPP, despite the advantage of discarding isolated whitespace content. dom4j is in the middle of the pack in this chart.

With XPP pull, the actual parsing of the document text does not take place until the document representation is accessed. This results in very high overhead on the first walk of the document (not shown in the chart). If the entire document representation is later accessed, the pull-parsing approach is a net loss in performance for XPP: the total time for the first two tests is longer with pull parsing than with normal XPP parsing (by 20% to 100%, depending on the document). However, the pull-parser approach still offers a sizable performance advantage when the parsed document is not accessed completely.

Xerces with deferred node creation shows similar behavior, with a drop in performance the first time the document representation is accessed (not shown in the chart). In the case of Xerces, however, the node creation overhead is about the same as the difference in performance from regular DOM creation at parse time. For larger documents, the total time for the first two tests with Xerces deferred is about the same as with the regular Xerces DOM. Deferred node creation looks like a good choice if you use Xerces on very large documents (perhaps 10KB or more).

Document modify time

This test looks at the time required to systematically modify the constructed document representation; the results are shown in Figure 3. It walks through the representation, deleting all isolated whitespace content and wrapping each non-whitespace content string in a newly added element. It also adds an attribute to each element of the original document that contained non-whitespace content. The test is intended to represent the performance of the document models across a range of modifications to documents. As with the walk times, the modify times are much shorter than the parse times, so the parse times will matter more for applications that make only a single pass through each parsed document.

Figure 3. Document modify time

EXML takes the lead in this test, but it has a performance advantage over the other models because it always discards isolated whitespace content during parsing. This means there is no content to delete from the EXML representation during the test. XPP is a close second in modification performance, and unlike EXML, the XPP test does include deletions. Xerces DOM and dom4j are close to the middle, and JDOM and Crimson DOM are again the worst performers.
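For readers who want a concrete picture of the modification this test describes, here is a minimal sketch against the standard W3C DOM interfaces as shipped in the JDK. It is not the author's test code: the wrapper element name "text", the attribute name "text", and the class and method names are all assumptions made for the example, and CDATA sections are left untouched for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example class: deletes isolated whitespace, wraps text, marks parents.
final class ModifySketch {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Recursively modifies the subtree rooted at the given element. */
    static void modify(Document doc, Element element) {
        boolean hadText = false;
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before changing the tree
            if (child.getNodeType() == Node.TEXT_NODE) {
                // trim() treats all characters up to U+0020 as whitespace, a slight
                // superset of XML whitespace; good enough for a sketch.
                if (child.getNodeValue().trim().isEmpty()) {
                    element.removeChild(child);           // isolated whitespace: delete
                } else {
                    Element wrapper = doc.createElement("text");
                    element.replaceChild(wrapper, child); // wrapper takes the text's place
                    wrapper.appendChild(child);           // text moves inside the wrapper
                    hadText = true;
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                modify(doc, (Element) child);
            }
            child = next;
        }
        if (hadText) {
            element.setAttribute("text", "true"); // mark elements that held real text
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a>\n  <b>hello</b>\n  <c/>\n</a>");
        modify(doc, doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

In documents formatted one element per line, most of the work in a pass like this is deleting whitespace-only text nodes, which is why a model that never creates them starts this test with an advantage.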

Text generation time

This test looks at the time required to output the document representation as a text XML document; the results are shown in Figure 4. This step is likely to be an important part of overall performance for any application that does not use XML documents exclusively internally, especially because the time required to output the document text is generally close to the time required to parse a document on input. To make the times directly comparable, this test uses the original document rather than the modified document produced by the preceding test.

Figure 4. Text generation time

The text generation time test shows less spread between the models than the earlier tests. Xerces DOM is the best performer, though not by much, and JDOM is the worst. EXML performs better than JDOM, but this is again due to EXML discarding whitespace content.

Many of the models provide options for controlling the format of the text output, and some of these options appear to affect the text generation time. This test used only the most basic output format for each model, so the results show only the default performance, not the best possible performance.

Document memory size

This test looks at the memory space used by the document representation. This is especially significant for developers who work with large documents or with many smaller documents at once. Figure 5 shows the results of this test.

Figure 5. Document memory size

The memory size results differ from the timing tests in that the values shown for the small soap1.xml file are for a single copy of the file, not the 49 copies used in the timing measurements. With most of the models, the memory used for this short document is too small to show on the scale of the chart.

Except for XPP pull (which does not really build the document representation until it is accessed), the differences between the models in memory size are relatively small compared with those seen in some of the timing tests. Xerces deferred has the most compact representation (which expands to the basic Xerces size when the representation is first accessed), followed by dom4j. EXML has the least compact representation, even though it discards whitespace content that the other models include.

Because even the most compact model occupies memory at least on the order of the size of the original document text (in bytes), all of the models seem to require excessive memory for large documents. XPP pull and dom4j offer the best support for very large documents by providing ways to work with only portions of the document. XPP pull does this by building only the parts of the representation that are actually accessed, while dom4j includes support for event-based processing that allows only part of a document to be built or processed at a time.

Java serialization

These tests measure the time and output size for Java serialization of the document representations. This is mainly a concern for applications that transfer representations between Java programs using Java RMI (Remote Method Invocation), including EJB (Enterprise JavaBeans) applications. Only the models that support Java serialization are included in these tests. The following three figures show the results.

Figure 6. Serialization output time
Figure 7. Serialization input time
Figure 8. Serialized document size

dom4j shows the best serialization performance for both output (generating the serialized form) and input (reconstructing the document from the serialized form), and Xerces DOM shows the worst. EXML takes times close to those of dom4j, but EXML again has the advantage of working with fewer objects in its representation because it discards whitespace content.

All of these results, for both time and size, would be much better if the documents were output as text and then parsed to rebuild the representation, instead of using Java serialization. The problem is the structure of an XML document representation as a large number of small objects. Java serialization cannot handle this type of structure efficiently, which leads to high overhead in both time and output size.

It is possible to design a serialized form for documents that is both smaller than the text representation and faster for input and output than text, but only by bypassing Java serialization.
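The text round trip suggested above (write the document out as text, then parse it to rebuild the representation) can be sketched with the JDK's standard JAXP classes as follows. This illustrates the general idea only, under the assumption that a DOM representation is being transferred; the class and method names are hypothetical, and the byte array stands in for whatever transport an application would actually use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical example class: transfers a DOM as XML text instead of serialized objects.
final class TextRoundTripSketch {

    /** Writes the document as XML text using an identity transformation. */
    static byte[] toText(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Parses XML text back into a DOM representation. */
    static Document fromText(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
    }

    public static void main(String[] args) throws Exception {
        Document original =
                fromText("<a x=\"1\"><b>hi</b></a>".getBytes(StandardCharsets.UTF_8));
        byte[] wire = toText(original);   // what would be sent to the other program
        Document copy = fromText(wire);   // what the receiver would rebuild
        System.out.println(copy.getDocumentElement().getNodeName()
                + " " + copy.getDocumentElement().getTextContent());
    }
}
```

The receiver pays one parse and the sender one text generation pass, which are the costs measured by the build time and text generation tests earlier in this article.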

(I have a project that implements a custom serialization format for XML documents, one that bypasses Java serialization. It is available as open source on our Web site; see Resources.)

Conclusions

Each of the Java XML document models has its strengths, but from a performance standpoint there are some clear winners.

XPP is the performance leader in most respects. Although it is a new model, XPP is an excellent choice for middleware-type applications that do not require validation, entities, processing instructions, or comments. It is especially suitable for applications that run as browser applets or in memory-constrained environments.

dom4j does not match the raw speed of XPP, but it does provide very good performance with a much more standardized and fully featured implementation, including built-in support for SAX2, DOM, and even XPath. Xerces DOM (with deferred node creation) also does well on most measurements, though it suffers on small files and on Java serialization. For general XML handling, both dom4j and Xerces DOM are good choices, and the preference between them depends on whether Java-specific features or cross-language compatibility matters more to you.

JDOM and Crimson DOM consistently rank poorly in the performance tests. Crimson DOM may still be worth considering for small documents, where Xerces does poorly. JDOM has little to recommend it from a performance standpoint, although its developers have said they intend to focus on performance before the official release. However, it will probably be difficult for JDOM to match the performance of the other models without some restructuring of the API.

EXML is very small (in JAR file size) and does well in some of the performance tests. Even with the advantage of deleting isolated whitespace content, though, EXML does not match XPP in performance. Unless you need one of the features that EXML supports and XPP lacks, XPP is probably the better choice in memory-constrained environments.

Currently, none of the models offers good performance for Java serialization, although dom4j does best. If you need to pass a document representation between programs, your best option is usually to write the document out as text and then parse it to rebuild the representation. Custom serialization formats may offer a better alternative in the future.

Sidebar: Uses of whitespace

XML normally requires whitespace to be preserved, but many XML applications use formats in which whitespace is included only for readability. For these applications, the EXML approach of discarding isolated whitespace works well.

Most of the documents used in these performance tests fall into this "whitespace for readability" category. They are formatted for human viewing, with at most one element per line. As a result, the number of extraneous whitespace content strings actually exceeds the number of elements in the documents. This adds a great deal of unnecessary overhead to every processing step.

Support for trimming this type of whitespace on input, as an option, would help the performance of all the document models (other than EXML) for applications with ignorable whitespace. As long as trimming is only an option, it would not hurt applications that need whitespace fully preserved. Support at the parser level would be best, because the parser already has to process the input character by character. All in all, an option of this kind would be very helpful to many XML applications.

Coming next

I have covered the basic features of several document models and shown performance measurements for several types of document operations. Keep in mind, though, that performance is only one factor in choosing a document model. For most developers, usability is at least as important as performance, and these models use a variety of APIs that may give you reasons to prefer one over another.

I will look at usability in a follow-up article, where I will compare sample code that accomplishes the same operations in the different models. Check back for the second part of this comparison. While you wait, you can share your comments and questions on this article through the forum linked below.

Resources

Participate in the discussion forum on this article.
If you need background knowledge, try the developerWorks tutorials XML programming in Java, Understanding SAX, and Understanding DOM.
Download the test program and the document model libraries used in this article from the download page.
View updated tests and test results on the home page for the test program.
Get details of the author's work on XML Serial (XMLS) encoding as an alternative to Java serialization.
Research or download the Java XML document models discussed in this article: Xerces Java, Crimson, JDOM, dom4j, Electric XML (EXML), and XML Pull Parser (XPP).
IBM WebSphere Application Server contains the Xerces Java XML4J parser. How-to information on the product's XML support can be found in the WAS Advanced Edition 3.0 online documentation.

About the author

Dennis Sosnoski (DMS@sosnoski.com) is the founder and chief consultant of Sosnoski Software Solutions, Inc., a Seattle-area company. He has more than 30 years of professional software development experience, and in recent years has concentrated on server-side Java technologies, including servlets, Enterprise JavaBeans, and XML. He has given a number of presentations on Java performance issues and general server-side Java technology, and he chairs the Seattle Java-XML SIG.


(c) Copyright IBM Corp. 2001, (c) Copyright IBM China 2001. All rights reserved.