"Four Weapons for Extracting Word and PDF Content with Java" (reposted from Liao River Digital)

ijsp.net 2003-08-15

Chris, 2003-07-01 19:04:00

Chris (chris@matrix.org.cn) graduated from the School of Information, Renmin University of China, in June 2003.

Many people run into the same problem when handling documents in Java: how do I get at the contents of Word, Excel, PDF and other documents? I looked into it, and summarize here several ways to extract Word and PDF content.

1. Using Jacob

Jacob is a bridge that connects Java to COM or Win32 functions. Jacob cannot read Word, Excel and other files directly. In principle you would have to write the DLL yourself, but that has already been done for you: Jacob's author provides it.

Download the Jacob JAR and DLL files: http://www.matrix.org.cn/down_view.asp?id=13

After downloading, put the files in place (the DLL in a directory on the PATH, the JAR file on the classpath). You can then write your own extraction program. Below is a simple example:

    import java.io.File;
    import com.jacob.com.*;
    import com.jacob.activeX.*;

    /**
     * Title: pdf extraction
     * Description: email: chris@matrix.org.cn
     * Copyright: Matrix Copyright 2003
     * Company: Matrix.org.cn
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class FileExtracter {
        public static void main(String[] args) {
            // Start Word through COM automation
            ActiveXComponent component = new ActiveXComponent("Word.Application");
            String inFile = "c:\\test.doc";
            String tpFile = "c:\\temp.htm";
            String otFile = "c:\\temp.xml";
            boolean flag = false;
            try {
                // Keep the Word window hidden
                component.setProperty("Visible", new Variant(false));
                // Declared as Object in the original listing; current Jacob versions use Dispatch
                Dispatch wordacc = component.getProperty("Documents").toDispatch();
                Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
                        new Object[] {inFile, new Variant(false), new Variant(true)},
                        new int[1]).toDispatch();
                // 8 = wdFormatHTML. The format argument is missing in the source
                // and is assumed here from the .htm target file.
                Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
                        new Object[] {tpFile, new Variant(8)}, new int[1]);
                // Close the document without saving changes
                Variant f = new Variant(false);
                Dispatch.call(wordfile, "Close", f);
                flag = true;
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always shut Word down, even if the conversion failed
                component.invoke("Quit", new Variant[] {});
            }
        }
    }

2. Using Apache's POI to extract Word and Excel

POI is an Apache project. Using it directly can feel cumbersome, but that does not matter: a simpler interface is provided for you. Download the packaged POI bundle: http://www.matrix.org.cn/down_view.asp?id=14

After downloading, put it on your classpath. The following example shows how to use it:

    import java.io.*;
    import org.textmining.text.extraction.WordExtractor;

    /**
     * Title: word extraction
     * Description: email: chris@matrix.org.cn
     * Copyright: Matrix Copyright 2003
     * Company: Matrix.org.cn
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class PdfExtractor {
        public PdfExtractor() {
        }

        public static void main(String args[]) throws Exception {
            FileInputStream in = new FileInputStream("c:\\a.doc");
            WordExtractor extractor = new WordExtractor();
            // Extract the plain text of the Word document
            String str = extractor.extractText(in);
            System.out.println("the result length is " + str.length());
            System.out.println("the result is " + str);
        }
    }

3. PDFBox

PDFBox is used to extract PDF files, although it does not yet work very well. First download PDFBox: http://www.matrix.org.cn/down_view.asp?id=12

The following example shows how to extract a PDF file using PDFBox:

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import java.io.*;
    import org.pdfbox.util.PDFTextStripper;
    import java.util.Date;

    /**
     * Title: pdf extraction
     * Description: email: chris@matrix.org.cn
     * Copyright: Matrix Copyright 2003
     * Company: Matrix.org.cn
     * @author chris
     * @version 1.0, anyone using this example, please retain this notice
     */
    public class PdfExtracter {
        public PdfExtracter() {
        }

        public String getTextFromPdf(String filename) throws Exception {
            String temp = null;
            PDDocument pdfdocument = null;
            FileInputStream is = new FileInputStream(filename);
            // The parser reads from the input stream opened above
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfdocument = parser.getPDDocument();
            // Write the extracted text into an in-memory buffer
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            OutputStreamWriter writer = new OutputStreamWriter(out);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfdocument.getDocument(), writer);
            writer.close();
            byte[] contents = out.toByteArray();
            String ts = new String(contents);
            System.out.println("the string length is " + contents.length + "\n");
            return ts;
        }

        public static void main(String args[]) {
            PdfExtracter pf = new PdfExtracter();
            PDDocument pdfdocument = null;
            try {
                String ts = pf.getTextFromPdf("c:\\a.pdf");
                System.out.println(ts);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

4. Extracting Chinese PDF files with xpdf

xpdf is an open source project. We can call its native program to extract the text of Chinese PDF files.
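The article gives no code for xpdf. One plausible reading of "calling" xpdf from Java is to launch its pdftotext command-line tool as a child process and read its output. The sketch below takes that route. The class name XpdfTextExtractor and its method names are hypothetical, the path to the pdftotext executable is supplied by the caller, and Chinese text is assumed to need xpdf's Chinese language support package configured separately. It requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not from the article: runs xpdf's pdftotext and returns its output.
class XpdfTextExtractor {

    // Builds the command line. "-enc UTF-8" selects the output encoding,
    // "-q" suppresses messages, and the final "-" sends the text to standard output.
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        return List.of(pdftotextPath, "-enc", "UTF-8", "-q", pdfPath, "-");
    }

    // Runs pdftotext on the given PDF and returns the extracted text.
    static String extractText(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Let the child's error output go to our own stderr so it cannot block on a full pipe
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String text;
        try (InputStream in = process.getInputStream()) {
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java XpdfTextExtractor <path-to-pdftotext> <file.pdf>");
            return;
        }
        System.out.println(extractText(args[0], args[1]));
    }
}
```

Running a separate process avoids writing JNI bindings, at the cost of starting one process per document. If the author meant a JNI call into xpdf's code, the approach would differ.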
