Convert HTML content to PDF format

xiaoxiao2021-04-01  225

Author: rainy14f

Providing PDF file support for web pages

In this article, Nick Afshartous describes a way to convert HTML content to PDF. The method is useful when, for example, a web application wants to offer a "Download as PDF" function on its pages, which is convenient for printing and for saving content for later use. Commercial products for this exist, but Afshartous's conversion uses only open source components, so it is inexpensive and the source code of every component is available.

Putting web content into PDF format makes the content easier to distribute. Some applications must provide documents in an easy-to-print format, such as employee benefits documents. In fact, the law requires that Summary Plan Descriptions (SPDs) be printable even when they are provided online. Simply printing the web page is not enough, however, because the printed version must contain a table of contents and page numbers. To provide this, developers can convert the HTML content to PDF, and this article shows how.

The method introduced here uses only open source components. Some commercial products also support dynamic document generation; Adobe, for example, has a Document Server product line. However, the cost of commercial products is considerable. An open source solution avoids that cost and adds transparency, because the source code of the components can be inspected.

The conversion process has three steps:

1. Convert HTML to XHTML.
2. Convert XHTML to XSL-FO (Extensible Stylesheet Language Formatting Objects), using an XSL style sheet and an XSLT processor.
3. Pass the XSL-FO document to a formatter to generate the target PDF document.
This article first shows how to perform the conversion with the command-line interfaces of the tools, then describes how to do the same job from Java using the DOM interfaces.

Component versions: the code in this article was tested with the following versions:

    Component   Version
    JDK         1.5_06
    JTidy       r7-dev
    Xalan-J     2.7
    FOP         0.20.5

Using the command-line interface

Each step in the conversion generates an output file from an input file. The process can be represented as follows:

    HTML --(JTidy)--> XHTML --(Xalan + xhtml2fo.xsl)--> XSL-FO --(FOP)--> PDF

Starting with the command-line interfaces of the three tools is a good way to get going, but this approach is not suitable for production systems, because it writes temporary intermediate files to disk and the extra I/O degrades performance. This problem is resolved later, when we call the three tools from Java.

Step 1: Convert HTML to XHTML

The first step is to convert the HTML into a new XHTML file. Of course, if the file is already XHTML, this step is unnecessary. I use JTidy for this conversion. JTidy is a Java port of the Tidy HTML parser. During the conversion, JTidy automatically adds missing tags to create a well-formed XML document. I use the latest version on SourceForge, r7-dev. You can run JTidy with the following script:

    #!/bin/sh
    java -classpath lib/tidy.jar org.w3c.tidy.Tidy -asxml $1 > $2

The script sets the classpath and calls JTidy. The input file is passed to JTidy as a command-line argument. By default, the generated XHTML is written to standard output. The -modify switch can be used to overwrite the input file instead. The -asxml switch tells JTidy to produce its output as XML. The script is invoked like this:

    tidy.sh hello.html hello.xml

The contents of hello.html (input) and hello.xml (output) are as follows:

Hello World!

[...] is inserted automatically by JTidy.

Step 2: Convert XHTML to XSL-FO

Next, the XHTML is converted to XSL-FO, a language for specifying the print formatting of an XML document. I perform this conversion by applying a style sheet with an XSLT processor (Apache Xalan). The style sheet I use is xhtml2fo.xsl, provided by Antenna House, a company that sells a commercial XSL-FO formatter. The xhtml2fo.xsl style sheet specifies how to translate each HTML tag into a corresponding sequence of XSL-FO formatting commands. For example, the translation of the HTML h2 tag is defined as:

[code]

During the transformation, the XSLT template above is invoked each time an h2 tag is encountered. The html: prefix indicates that the h2 tag belongs to the HTML namespace. The namespaces used by the style sheet are declared as attributes of the top-level xsl:stylesheet element. At the top of xhtml2fo.xsl, we can see that it declares three namespaces, corresponding to the XSL, XSL-FO, and HTML languages.

...
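As a rough illustration of how a namespace-qualified template such as the html:h2 rule fires, the following Java sketch applies a tiny style sheet with the XSLT processor bundled in the JDK. The style sheet here is a hypothetical miniature, not the Antenna House xhtml2fo.xsl, and the class and method names are invented for the example; only the three namespace URIs are the standard ones for XSL, XSL-FO, and XHTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class NamespaceMatchSketch {
    // Hypothetical miniature style sheet (NOT xhtml2fo.xsl): it declares the
    // XSL, XSL-FO, and XHTML namespaces and translates html:h2 into an fo:block.
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
        + " xmlns:html='http://www.w3.org/1999/xhtml'"
        + " exclude-result-prefixes='html'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='html:h2'>"
        + "<fo:block font-weight='bold'><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the miniature style sheet to an XML string and returns the result.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The h2 element is in the XHTML namespace, so the html:h2 template applies.
        System.out.println(transform("<h2 xmlns='http://www.w3.org/1999/xhtml'>Hello</h2>"));
    }
}
```

An h2 element that is not in the XHTML namespace does not match html:h2; it falls through to XSLT's built-in rules, which copy only its text.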

The second line in the h2 template outputs the fo:block tag, with the attributes of the h2 attribute set generated as the properties and values of the block tag. Each XSL-FO block is a paragraph, and it is formatted according to the values of the block's attributes. The h2 attribute set is defined in the style sheet as:

10mm 10mm 1em 0.5em x-large bold

The process-common-attributes-and-children template is also defined in the style sheet. Its role is to check some common HTML attributes (such as lang, id, align, valign, and style) and generate the corresponding XSL-FO instructions. To trigger translation of any tags nested inside the h2 tag, process-common-attributes-and-children in turn applies the style sheet's templates to those nested tags.

Therefore, if the input is

Hello there

then processing the h2 tag also triggers the template that translates the tag nested inside it. The output of the translation of the h2 tag is:

[code]

We call Xalan to apply xhtml2fo.xsl. The UNIX script xalan.sh sets the classpath that Xalan needs before calling it:

    #!/bin/sh
    export CLASSPATH='.;./lib/xalan.jar;./lib/xercesImpl.jar;./lib/xml-apis.jar;lib/serializer.jar'
    java -classpath $CLASSPATH org.apache.xalan.xslt.Process -IN $1 -XSL xhtml2fo.xsl -OUT $2 -TT

Note that the ';' separator in the classpath is what a Windows JVM expects (for example, under Cygwin); a JVM running on UNIX expects ':' instead. Because Xalan needs an XML parser, the Apache Xerces and xml-apis JARs are also required. All of the JAR files can be found in the Xalan distribution. To create a new XSL-FO file by applying the style sheet to the XHTML, call the script:

    xalan.sh hello.xml hello.fo

I like to use Xalan's trace switch (-TT) to display the templates as they are applied. The hello.fo file is as follows:

<fo:simple-page-master page-width="auto" page-height="auto" master-name="all-pages"> Hello World Hello World - Hello World!

Step 3: XSL-FO to PDF

The third and final step is to pass the XSL-FO document to a formatter to generate the PDF. I use Apache FOP (Formatting Objects Processor). FOP partially implements the XSL-FO standard and provides its best support for the PDF output format. Support for PostScript is still at an early stage, and support for Microsoft's RTF is only planned. The FOP distribution contains the shell scripts fop.sh and fop.bat, which take the XSL-FO file as input and generate the target PDF file. Under UNIX:

    fop.sh hello.fo hello.pdf

The only prerequisite is to set the environment variable that tells this script where the FOP directory is. The file hello.pdf is FOP's output, and you can find it in the source code accompanying this article. Because FOP does not yet fully implement the XSL-FO standard, there are some limitations. For exactly which subset is implemented, see the detailed description in the Compliance section of the FOP website.

The Java program

Next, I show a Java program that uses the DOM APIs of the three tools from the steps above. It takes two command-line arguments, generates the corresponding PDF document automatically, and creates no temporary files. The program first creates an InputStream for the HTML file and passes that object to JTidy. JTidy has a method called parseDOM(), which can be used to generate the Document object of the output XHTML document.

    public static void main(String[] args) {
        // Open the HTML input file
        if (args.length != 2) {
            System.out.println("Usage: Html2Pdf htmlFile styleSheet");
            System.exit(1);
        }
        FileInputStream input = null;
        String htmlFileName = args[0];
        try {
            input = new FileInputStream(htmlFileName);
        } catch (java.io.FileNotFoundException e) {
            System.out.println("File not found: " + htmlFileName);
            System.exit(1); // nothing to parse without the input file
        }
        // Let JTidy parse the HTML and build an XHTML DOM
        Tidy tidy = new Tidy();
        Document xmlDoc = tidy.parseDOM(input, null);

JTidy's DOM implementation does not support XML namespaces. We must therefore modify the Antenna House style sheet so that it uses the default namespace. For example, the original is:

</fo:block>

After modification, it becomes:

This change must be applied to all of the templates in xhtml2fo.xsl, because the root of the Document object that JTidy generates is a tag such as:

The modified xhtml2fo.xsl is included in the source code supplied with this article. Next, the xml2FO() method calls Xalan to apply the style sheet to the DOM object generated by JTidy:

    Document foDoc = xml2FO(xmlDoc, args[1]);
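The effect of a DOM that lacks namespace support can be approximated with the JDK's own parser. In this sketch (an analogy only: it does not use JTidy, and the class and method names are hypothetical), the same XHTML is parsed twice, once with namespace awareness and once without. Assuming JTidy's DOM behaves like the second case, a template written as html:h2 would have nothing to match, which is consistent with the style sheet being rewritten to use unprefixed names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NamespaceAwarenessSketch {
    static final String XHTML =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body><h2>Hello</h2></body></html>";

    // Parses XHTML with the JDK parser and returns the namespace URI
    // that the resulting DOM reports for the first h2 element.
    static String h2Namespace(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(XHTML.getBytes(StandardCharsets.UTF_8)));
            Node h2 = doc.getElementsByTagName("h2").item(0);
            return h2.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespace-aware:   " + h2Namespace(true));
        System.out.println("namespace-unaware: " + h2Namespace(false));
    }
}
```

With namespace awareness, the h2 element reports the XHTML namespace URI; without it, getNamespaceURI() returns null, so the element is visible to XSLT only under its unqualified name.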

The xml2FO() method first calls getTransformer() to obtain a Transformer object for the specified style sheet. It then returns the Document that represents the result of the transformation:

    private static Document xml2FO(Document xml, String styleSheet) {
        DOMSource xmlDomSource = new DOMSource(xml);
        DOMResult domResult = new DOMResult();
        Transformer transformer = getTransformer(styleSheet);
        if (transformer == null) {
            System.out.println("Error creating transformer for " + styleSheet);
            System.exit(1);
        }
        try {
            // DOM in, DOM out: no intermediate file is written
            transformer.transform(xmlDomSource, domResult);
        } catch (javax.xml.transform.TransformerException e) {
            return null;
        }
        return (Document) domResult.getNode();
    }

Next, the main method opens a FileOutputStream whose name has the same prefix as the HTML input file. The result of calling the fo2PDF() method is written to that OutputStream:

    // Replace the input file's extension with .pdf
    String pdfFileName = htmlFileName.substring(0, htmlFileName.indexOf(".")) + ".pdf";
    try {
        OutputStream pdf = new FileOutputStream(new File(pdfFileName));
        pdf.write(fo2PDF(foDoc));
    } catch (java.io.FileNotFoundException e) {
        System.out.println("Error creating PDF: " + pdfFileName);
    } catch (java.io.IOException e) {
        System.out.println("Error writing PDF: " + pdfFileName);
    }
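The file name derivation above relies on indexOf("."), which fails when the input name has no dot and cuts too early when a directory name contains one. A slightly more defensive variant might look like the following sketch; pdfNameFor is a hypothetical helper that is not part of the article's source code.

```java
class PdfNameSketch {
    // Hypothetical helper: derives the PDF file name from the HTML file name
    // by replacing the extension of the last path segment, if it has one.
    static String pdfNameFor(String htmlFileName) {
        int lastSeparator = Math.max(htmlFileName.lastIndexOf('/'),
                                     htmlFileName.lastIndexOf('\\'));
        int lastDot = htmlFileName.lastIndexOf('.');
        // Only treat the dot as an extension marker if it is in the file name itself.
        String stem = (lastDot > lastSeparator)
                ? htmlFileName.substring(0, lastDot)
                : htmlFileName;
        return stem + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(pdfNameFor("hello.html"));
        System.out.println(pdfNameFor("docs.v2/hello"));
    }
}
```

Using lastIndexOf keeps names such as a.b.html intact up to the final extension, and the separator check leaves extensionless files in dotted directories alone.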

The fo2PDF() method uses the XSL-FO Document produced by the transformation to create a FOP Driver object. The PDF is generated by calling Driver.run(), and the result is returned as a byte array:

    private static byte[] fo2PDF(Document foDocument) {
        DocumentInputSource fopInputSource = new DocumentInputSource(foDocument);
        try {
            // Render into memory so that no temporary file is needed
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Logger log = new ConsoleLogger(ConsoleLogger.LEVEL_WARN);
            Driver driver = new Driver(fopInputSource, out);
            driver.setLogger(log);
            driver.setRenderer(Driver.RENDER_PDF);
            driver.run();
            return out.toByteArray();
        } catch (Exception ex) {
            return null;
        }
    }
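The listings call getTransformer() without showing it. The following is a minimal sketch of what such a helper might look like, using only the JDK's javax.xml.transform API; it is a guess at the shape of the method, not the author's code. The demo() method, the temporary style sheet, and the class name are all hypothetical and exist only to exercise the helper with a DOM-to-DOM transformation like the one in xml2FO(). The miniature style sheet matches unprefixed names, in the spirit of the modified xhtml2fo.xsl.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TransformerSketch {
    // Hypothetical stand-in for getTransformer(): compiles the style sheet
    // at the given path and returns null if it cannot be loaded.
    static Transformer getTransformer(String styleSheet) {
        try {
            return TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File(styleSheet)));
        } catch (TransformerConfigurationException e) {
            return null;
        }
    }

    // Writes a miniature style sheet to a temporary file, then transforms
    // a small DOM into an XSL-FO DOM without any other intermediate files.
    static Document demo() {
        String xsl =
              "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
            + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
            + "<xsl:template match='/'><fo:root><xsl:apply-templates/></fo:root></xsl:template>"
            + "<xsl:template match='h2'><fo:block><xsl:apply-templates/></fo:block></xsl:template>"
            + "</xsl:stylesheet>";
        try {
            File sheet = File.createTempFile("mini-xhtml2fo", ".xsl");
            sheet.deleteOnExit();
            Files.write(sheet.toPath(), xsl.getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document xhtml = factory.newDocumentBuilder().parse(new InputSource(
                    new StringReader("<html><body><h2>Hello</h2></body></html>")));

            DOMResult result = new DOMResult();
            getTransformer(sheet.getPath()).transform(new DOMSource(xhtml), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("demo transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Document fo = demo();
        System.out.println(fo.getDocumentElement().getNodeName());
    }
}
```

Returning null on failure mirrors the null check that xml2FO() performs on the result of getTransformer(); throwing an exception would be an equally reasonable design.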
