Summary
Since Adobe published the complete PDF reference in 1993, tools and libraries supporting PDF on all kinds of languages and platforms have sprung up in great numbers. Support for this Adobe technology in Java applications, however, has lagged behind. This is odd, because PDF is becoming the standard way for enterprise information systems to store and exchange information, and Java is particularly well suited to such applications. Only recently do Java developers seem to have gained mature, usable PDF support. PDFBox (an open source project under a BSD license) is a pure Java class library that lets developers read and create PDF documents. It provides the following features:
Text extraction, including Unicode characters.
Simple integration with text search engines such as Jakarta Lucene.
Encryption and decryption of PDF documents.
Import and export of form data in FDF and XFDF formats.
Appending content to an existing PDF document.
Splitting a PDF document into multiple documents.
Overlaying one PDF document on another.
PDFBox API
PDFBox is designed to represent a PDF document in an object-oriented way. The data in a PDF document is a collection of basic objects: arrays, booleans, dictionaries, numbers, strings, and binary streams. PDFBox defines these basic object types in the org.pdfbox.cos package (the COS model). You can use these objects to interact with a PDF document directly, but doing so requires a thorough understanding of the document's internal structure and higher-level concepts. For example, pages and fonts are both dictionary objects with special attributes; the PDF Reference describes these attributes, but looking them up is tedious.
For that reason the org.pdfbox.pdmodel package (the PD model) was created. It is built on the COS model but provides a high-level API (Figure 1) for accessing PDF document objects in a familiar way. Classes such as PDPage and PDFont in this package wrap the underlying COS objects. Note that although the PD model offers some excellent features, it is still under development, and in some cases you may need to drop down to the COS model to reach a specific PDF feature. Every PD model object provides a method that returns its corresponding COS model object. In general, then, you will work with the PD model, and when it falls short you can operate on the underlying COS model directly.
That is enough of an introduction to PDFBox; now for some examples. We start with how to read an existing PDF document:
PDDocument document = PDDocument.load("./test.pdf");
The statement above parses the specified PDF file and creates a document object for it in memory. To stay efficient with large documents, PDFBox keeps only the document structure in memory; images, embedded fonts, and page content are cached in a temporary file. Note: when you have finished with a PDDocument object, call its close() method to release the resources acquired when it was created.
Text extraction and Lucene integration
This is the age of information retrieval. Whatever medium information is stored in, applications are expected to support searching and indexing it, so it is critical to organize information into a searchable form. That is easy for plain text and HTML documents, but a PDF document contains a great deal of structural and meta information, which makes extracting its content far from simple. The PDF language is similar to PostScript: in both, objects are drawn as vectors at specific positions on the page. For example:
/Helv 12 Tf 0 13.0847 Td (Hello World) Tj
The instructions above set the font to 12-point Helvetica, move to the next line, and then print "Hello World". These instruction streams are usually compressed, and the order in which text appears on screen is not necessarily the order of the characters in the file. As a result, you cannot always extract strings directly from the raw PDF document. PDFBox's mature text extraction algorithm, however, lets developers extract the text of a document as it would appear in a PDF reader.
Lucene is a subproject of the Apache Jakarta project and a popular open source search engine library. Developers can use Lucene to build an index and run complex searches against it. Lucene only supports searching text content, so other kinds of data must be converted to text before Lucene can use them. Microsoft Word and StarOffice documents, for example, must be converted to text before they can be added to a Lucene index. PDF files are no exception, but PDFBox provides a dedicated integration class that makes it easy to include PDF documents in a Lucene index. Converting a basic PDF document to a Lucene document takes only one statement:
Document doc = LucenePDFDocument.getDocument(file);
This statement parses the specified PDF document, extracts its content, and creates a Lucene document object, which you can then add to a Lucene index. As mentioned above, a PDF document also carries metadata such as keywords, and it is important to capture this metadata when indexing. Table 1 lists the fields that PDFBox fills in when it creates the Lucene document. This integration lets developers easily use Lucene to index and search PDF documents.
Some applications, of course, need more sophisticated text extraction. In that case you can use the PDFTextStripper class directly, or subclass it to meet more complex requirements. By extending PDFTextStripper and overriding its showCharacter() method, you can control many aspects of text extraction. For example, you can use the x, y position information to restrict extraction to a particular block of text. You could ignore all text whose y coordinate is greater than a certain value, so that the document header is excluded.
Here is another example. It often happens that a set of PDF documents was generated from a form, but the original form data has been lost. That is, the documents contain text you are interested in, always at a similar location, but the data used to populate them is no longer available. For instance, you might have a set of envelopes that all carry a name and address in the same position. You can then use a class derived from PDFTextStripper to extract the desired fields, much like a screen-scraping tool that captures a region of the screen.
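The position-based filtering idea can be sketched without any PDF library. The following is a minimal, self-contained illustration and not PDFBox code: the TextChunk class, the keepBelow method, and the sample coordinates are all hypothetical, and it assumes a coordinate system in which y grows upward from the bottom of the page, so header text has the largest y values. In a real PDFTextStripper subclass, a comparable test would sit inside the overridden character-handling method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a piece of positioned text; not a PDFBox class.
final class TextChunk {
    final String text;
    final double x;
    final double y;

    TextChunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class RegionFilterSketch {
    // Keeps the text of chunks whose y coordinate is at or below maxY.
    // Assumes y grows upward, so this drops anything in the header band above maxY.
    static List<String> keepBelow(List<TextChunk> chunks, double maxY) {
        List<String> kept = new ArrayList<String>();
        for (TextChunk chunk : chunks) {
            if (chunk.y <= maxY) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample page: one header line and two address lines.
        List<TextChunk> page = Arrays.asList(
                new TextChunk("Quarterly Report", 72, 760),
                new TextChunk("John Smith", 72, 640),
                new TextChunk("12 Main Street", 72, 626));
        System.out.println(keepBelow(page, 700)); // prints [John Smith, 12 Main Street]
    }
}
```

Adding a lower bound and an x range to the same test would turn the header filter into a rectangular region filter, which is the shape needed for the envelope case.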
Encryption and decryption
A popular feature of PDF is the ability to encrypt document content, control access to it, and restrict a document to read-only use. A PDF document is encrypted with a master password and an optional user password. If a user password is set, a PDF reader (such as Acrobat) prompts for it before displaying the document. The master password authorizes changes to the document's content. The PDF specification lets the creator of a document restrict certain operations when users view it in Acrobat Reader. These restrictions include:
Printing
Modifying content
Extracting content
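In the standard security handler, these restrictions are stored as bit flags in a single integer. The following is a minimal sketch of how such flags can be set and tested. It assumes the bit positions given in the PDF Reference's table of user access permissions (bit 3 for printing, bit 4 for modifying, bit 5 for extracting, with the two lowest bits reserved as zero); the class, constant, and method names are hypothetical and are not part of the PDFBox API.

```java
// Hypothetical helper, not PDFBox API. The PDF Reference numbers bits from 1,
// so "bit 3" corresponds to the mask 1 << 2.
final class PermissionFlagsSketch {
    static final int PRINT = 1 << 2;    // bit 3: print the document
    static final int MODIFY = 1 << 3;   // bit 4: modify the contents
    static final int EXTRACT = 1 << 4;  // bit 5: copy or extract text and graphics

    // Assumed starting point: every bit set except the two reserved low bits.
    static final int ALL_ALLOWED = ~3;  // equals -4

    // Returns the permission value with the given operation denied.
    static int deny(int permissions, int flag) {
        return permissions & ~flag;
    }

    // Reports whether the given operation is still permitted.
    static boolean isAllowed(int permissions, int flag) {
        return (permissions & flag) != 0;
    }

    public static void main(String[] args) {
        int noPrinting = deny(ALL_ALLOWED, PRINT);
        System.out.println(noPrinting);                     // prints -8
        System.out.println(isAllowed(noPrinting, PRINT));   // prints false
        System.out.println(isAllowed(noPrinting, EXTRACT)); // prints true
    }
}
```

These flags are only advisory from the file's point of view: it is up to the viewing application to honor them once the document has been opened.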
A full discussion of PDF document security is beyond the scope of this article; interested readers can consult the relevant parts of the PDF specification. The PDF security model is pluggable, so different security handlers can be used when encrypting a document. As of this writing, PDFBox supports the standard security handler, which is the one most PDF documents use. To encrypt a document, you first specify a security handler and then encrypt with a master password and a user password. In the following code the document is encrypted so that the user can open it in Acrobat without entering a password, but cannot print it.
// Load the document
PDDocument pdf = PDDocument.load("test.pdf");
// Create the encryption options
PDStandardEncryption encryptionOptions = new PDStandardEncryption();
encryptionOptions.setCanPrint(false);
pdf.setEncryptionDictionary(encryptionOptions);
// Encrypt the document
pdf.encrypt("master", null);
// Save the encrypted document to the file system
pdf.save("test-output.pdf");
For a more detailed example, see the source code of the encryption tool class included in the PDFBox release: org.pdfbox.Encrypt. Many applications can generate PDF documents but offer no control over their security options. In that case, PDFBox can be used to intercept and encrypt a PDF document before it is sent to the user.
Form integration
When an application's output is a set of form field values, it is useful to be able to save the form to a file, and PDF is a good choice for this. Developers can write PDF instructions by hand to draw graphics, tables, and text, or convert the data to XML and use an XSL-FO template to create the PDF document. These approaches, however, are time consuming, error prone, and inflexible. For simple forms, a better approach is to create a template and then fill the given input data into it to generate a document. The Employment Eligibility Verification form, known as the "I-9 form", is one that most people are familiar with; see:
http://uscis.gov/graphics/formsfee/forms/files/i-9.pdf
You can list the form's fields with an example program included in the PDFBox release:
java org.pdfbox.examples.fdf.PrintFields i-9.pdf
Another example program sets the value of a specified text field:
java org.pdfbox.examples.fdf.SetField i-9.pdf name1 smith
Open the PDF document in Acrobat and you will see that the "Last Name" field has been filled in. The following code accomplishes the same thing:
PDDocument pdf = PDDocument.load("i-9.pdf");
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField("Name1");
field.setValue("Smith");
pdf.save("i-9-copy.pdf");
The following code can be used to extract the value of the form field just filled out:
PDField field = acroForm.getField("Name1");
System.out.println("First Name=" + field.getValue());
Acrobat supports importing and exporting form data in a dedicated file format, the "Forms Data Format". It comes in two flavors: FDF and XFDF. An FDF file stores form data in the same syntax as PDF, while XFDF stores it as XML. PDFBox handles both FDF and XFDF in a single class: FDFDocument. The following code snippet shows how to export FDF data from the I-9 form above:
PDDocument pdf = PDDocument.load("i-9.pdf");
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
FDFDocument fdf = acroForm.exportFDF();
fdf.save("exportedData.fdf");
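Because XFDF is plain XML, it helps to see what a small XFDF file looks like. The following is a minimal sketch that builds one by hand for a map of field names and values. It does not use PDFBox: the XfdfSketch class and its methods are hypothetical, and the element layout (an xfdf root in the http://ns.adobe.com/xfdf/ namespace containing fields, field, and value elements) follows the XFDF format as commonly documented. The field name and value are sample data only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of PDFBox.
final class XfdfSketch {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds an XFDF document holding one <field> element per map entry.
    static String toXfdf(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\">\n");
        xml.append("  <fields>\n");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            xml.append("    <field name=\"").append(escape(entry.getKey())).append("\">")
               .append("<value>").append(escape(entry.getValue())).append("</value>")
               .append("</field>\n");
        }
        xml.append("  </fields>\n");
        xml.append("</xfdf>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("Name1", "Smith");
        System.out.print(toXfdf(fields));
    }
}
```

In practice a library call such as the FDFDocument export is preferable, since it also handles nested field names and non-text field types that this sketch ignores.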
Steps for integrating forms with PDFBox:
1. Create a PDF form template using Acrobat or another visual tool.
2. Note the name of each form field you need.
3. Store the template where the application can access it.
4. When a PDF is requested, use PDFBox to parse the template and populate the specified form fields.
5. Return the filled-in PDF to the user.
Tools
In addition to the API described above, PDFBox provides a set of command line tools. Table 2 lists these tool classes with a brief description of each.
Note
The PDF specification runs to 1172 pages, and implementing it is a huge undertaking. Accordingly, the PDFBox release describes itself as a work in progress, and new features will be added gradually. Its main weakness is creating PDF documents from scratch. There are, however, other open source Java projects that fill this gap. The Apache FOP project, for example, generates PDF from a special XML document that describes the PDF to be produced, and iText provides a high-level API for creating tables and lists. The next version of PDFBox will support the new PDF 1.5 object streams and cross-reference streams. After that, support for embedding fonts and images will follow. With PDFBox's efforts, PDF technology can be expected to gain full support in Java applications.