Lucene study notes

xiaoxiao2021-03-06  27

This article introduces Lucene's origin, development, and current status, along with a first application of Lucene. It can serve as an introduction to Lucene.

1. Origin and development

Lucene is a high-performance, pure-Java full-text search engine that is free and open source. It suits almost any application that needs full-text retrieval, especially cross-platform applications.

Lucene's author, Doug Cutting, is a veteran full-text search expert. At first he published Lucene on his own home page; it moved to SourceForge in March 2000 and was donated to Apache in 2001, where it became a subproject of the Jakarta project.

2. Current usage

After years of development, Lucene has many success stories in the full-text search field and has built a good reputation.

There are many Lucene-based full-text search products (Lucene itself is just a component, not a complete application) and projects that use Lucene around the world. Well-known ones include:

- Eclipse: the mainstream Java development tool, whose help system uses Lucene as its search engine

- Jive: a well-known forum system whose search function is based on Lucene

- iFinder: a website search system from Germany, based on Lucene (http://iFinder.Intrafind.org/)

- MIT DSpace Federation: a document management system (http://www.dspace.org/)

Many websites in China and abroad also use it as their full-text search engine. Well-known ones include:

- http://www.blogchina.com/weblucene/

- http://www.ioffer.com/

- http://search.soufun.com/

- http://www.taminn.com/

(For more cases, see http://wiki.apache.org/jakarta-lucene/poweredby)

Open source applications make up a large share of these cases, but commercial products and websites are even more numerous. It is no exaggeration to say that Lucene's arrival has greatly advanced the adoption of full-text search across industries and fields.

3. Preliminary application

As mentioned earlier, Lucene itself is just a component, not a complete application, so to put Lucene to work you have to do some secondary development on top of it.

Download and installation

First, download a copy from Lucene's official website, http://jakarta.apache.org/lucene/. The latest version at the time of writing is 1.4. The download is a compressed file called lucene-1.4-final.zip. Unzip it and you will find a file called lucene-1.4-final.jar, which is the Lucene component package. To use Lucene you only need to put lucene-1.4-final.jar on the classpath; the other unpacked files are for reference only.

Next, I use Eclipse to build a project that implements Lucene-based library creation, record loading, and record querying.

As shown in the figure above, this is the finished project. It has three source files, CreateDataBase.java, InsertRecords.java, and QueryRecords.java, which implement library creation, record loading, and retrieval respectively. The following sections analyze these three source files.

Library creation source code and notes

CreateDataBase.java

    package com.holen.part1;

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    /**
     * @author Holen Chen
     * Initialize the retrieval library
     */
    public class CreateDataBase {

        public CreateDataBase() {
        }

        public int createDataBase(File file) {
            int returnValue = 0;
            // Create the library directory if it does not exist yet
            if (!file.isDirectory()) {
                file.mkdirs();
            }
            try {
                // true: create a new library (or overwrite an existing one)
                IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true);
                indexWriter.close();
                returnValue = 1;
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            return returnValue;
        }

        /**
         * Pass in the library path and initialize the library
         * @param file
         * @return
         */
        public int createDataBase(String file) {
            return this.createDataBase(new File(file));
        }

        public static void main(String[] args) {
            CreateDataBase temp = new CreateDataBase();
            if (temp.createDataBase("e://lucene//holendb") == 1) {
                System.out.println("db init succ");
            }
        }
    }

Description: The most critical statement here is IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true).

The first parameter is the path of the library, that is, where you want the full-text search library to be stored, such as the "e://lucene//holendb" set in the main method. Lucene supports multiple libraries, and each library can be located anywhere.

The second parameter is the analyzer. Here it is the standard analyzer that ships with Lucene. The analyzer parses the whole text. The standard analyzer targets English (or other Latin-script languages, which are written with letters and separate words with spaces) and cuts an English passage into individual words. Word segmentation is one of the core technologies of full-text search. By default Lucene can only segment English and other Latin-script text and does not support double-byte text such as Chinese, Japanese, and Korean; Chinese word segmentation will be covered in later chapters.

The third parameter says whether to initialize the library. Here I set it to true. true means create a new library or overwrite an existing one; false means append to an existing library. Since this is a new library, it must be initialized. After initialization, the library directory contains a single file named segments, 1 KB in size. Note, however, that if this is run with true against a library that already contains records, all of its contents are lost and the library returns to its initial state, as if newly created, so use this method with caution.
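To make the analyzer's word splitting more concrete, here is a minimal sketch for space-separated text. It is not Lucene's StandardAnalyzer and does not use Lucene at all; the class name TokenizeSketch, the tokenize method, and the tiny stop-word list are assumptions made up for illustration. It splits on anything that is not a letter or digit, lowercases each word, and drops a few common words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: this is NOT Lucene's StandardAnalyzer.
final class TokenizeSketch {
    // A tiny, made-up stop-word list, just for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "is", "of", "the"));

    // Split on anything that is not a letter or digit, lowercase each word,
    // and drop the stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String word = raw.toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, full, text, search, engine]
        System.out.println(tokenize("Lucene is a full-text search engine"));
        // Four Chinese characters meaning "full-text search", written as escapes.
        // There are no spaces to split on, so the result is a single token.
        System.out.println(tokenize("\u5168\u6587\u68c0\u7d22").size()); // prints 1
    }
}
```

The second call hints at why text without spaces between words, such as Chinese, needs a different segmentation strategy.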

Record loading source code and notes

InsertRecords.java

    package com.holen.part1;

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /**
     * @author Holen Chen
     * Load records
     */
    public class InsertRecords {

        public InsertRecords() {
        }

        public int insertRecords(String dbpath, File file) {
            int returnValue = 0;
            try {
                // false: append to an existing library
                IndexWriter indexWriter = new IndexWriter(dbpath, new StandardAnalyzer(), false);
                this.addFiles(indexWriter, file);
                returnValue = 1;
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            return returnValue;
        }

        /**
         * Pass in the name of the file to load
         * @param file
         * @return
         */
        public int insertRecords(String dbpath, String file) {
            return this.insertRecords(dbpath, new File(file));
        }

        public void addFiles(IndexWriter indexWriter, File file) {
            Document doc = new Document();
            try {
                doc.add(Field.Keyword("filename", file.getName()));
                // Use only one of the following two statements:
                // the former indexes without storing, the latter indexes and stores
                // doc.add(Field.Text("content", new FileReader(file)));
                doc.add(Field.Text("content", this.chgFileToString(file)));
                indexWriter.addDocument(doc);
                indexWriter.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

        /**
         * Read content from a text file
         * @param file
         * @return
         */
        public String chgFileToString(File file) {
            String returnValue = null;
            StringBuffer sb = new StringBuffer();
            char[] c = new char[4096];
            try {
                Reader reader = new FileReader(file);
                int n = 0;
                while (true) {
                    n = reader.read(c);
                    if (n > 0) {
                        sb.append(c, 0, n);
                    } else {
                        // read() returns -1 at end of file
                        break;
                    }
                }
                reader.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            returnValue = sb.toString();
            return returnValue;
        }

        public static void main(String[] args) {
            InsertRecords temp = new InsertRecords();
            String dbpath = "e://lucene//holendb";
            // holen1.txt contains the keywords "holen" and "java"
            if (temp.insertRecords(dbpath, "e://lucene//holen1.txt") == 1) {
                System.out.println("add file1 succ");
            }
            // holen2.txt contains the keywords "holen" and "chen"
            if (temp.insertRecords(dbpath, "e://lucene//holen2.txt") == 1) {
                System.out.println("add file2 succ");
            }
        }
    }

Description: This class has three main methods: insertRecords(String dbpath, File file), addFiles(IndexWriter indexWriter, File file), and chgFileToString(File file).

The chgFileToString method reads a text file into a string variable.
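A read loop like the one in chgFileToString has to stop when Reader.read returns -1, which signals the end of the stream. The sketch below shows the same loop written against any Reader, so it can be tried without a file on disk. The class name ReadAllSketch and the method readAll are illustrative names, not part of the original project.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative helper; the names here are assumptions, not project code.
final class ReadAllSketch {
    // Read everything from the reader into a String, up to 4096 chars per call.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        // try-with-resources closes the reader even if read() fails
        try (Reader in = reader) {
            int n;
            // read() returns -1 at end of stream, which ends the loop
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("holen java"))); // prints holen java
    }
}
```

To read a real file, a FileReader can be passed in place of the StringReader.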

The insertRecords method loads a record; here it loads a single file into the full-text search library. The first parameter is the library path, and the second parameter is the file to load.

insertRecords calls addFiles, which does the actual loading of the file. The key line of code in addFiles is:

doc.add(Field.Keyword("filename", file.getName()));

Note that Lucene has no table in the strict sense. A statement like this one adds a field to the record on the fly: the field name is filename, and the content of the field is file.getName().

Commonly used Field methods are as follows:

Method | Tokenized | Indexed | Stored | Use
Field.Text(String name, String value) | Y | Y | Y | Title
...(String name, String value) | N | N | Y | File path
Field.UnStored(String name, String value) | Y | Y | N | Same as the second kind

To better understand the full-text search library, we can compare it with an ordinary relational database (such as Oracle or MySQL).

Item | Full-text search library (Lucene) | Relational database (Oracle)
Core function | Retrieval first, plus insert, delete, and update; suited to querying large blocks of text. | Insert, delete, and update are very convenient, with dedicated SQL commands, but retrieval over large text blocks (such as CLOB types) is inefficient.
Library | Similar to Oracle: you can build multiple libraries, and each library can be stored in a different location. | You can build multiple libraries; each library generally has control files, data files, and so on, which is more complicated.
Table | No strict table concept; a Lucene table is just a loose set of fields defined when records are loaded. | Strict table structure, with primary keys, field types, and so on.
Record | Because there is no strict table, a record is represented as an object; the corresponding class in Lucene is Document. | Record, matching the table structure.
Field | Field types are limited to text and date; fields generally do not support computation, and there are no functions. The field class in Lucene is Field, as in Document(field1, field2, ...). | Rich, powerful field types, as in Record(field1, field2, ...).
Query result set | The class representing query results in Lucene is Hits, as in Hits(doc1, doc2, doc3, ...). | In JDBC it is ResultSet, as in ResultSet(record1, record2, record3, ...).

The following figure compares the two kinds of library:

Search source code and notes

QueryRecords.java

    package com.holen.part1;

    import java.util.ArrayList;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    /**
     * @author Holen Chen
     * Search query
     */
    public class QueryRecords {

        public QueryRecords() {
        }

        /**
         * Search query; returns the result set
         * @param searchkey
         * @param dbpath
         * @param searchfield
         * @return
         */
        public ArrayList queryRecords(String searchkey, String dbpath, String searchfield) {
            ArrayList list = null;
            try {
                Searcher searcher = new IndexSearcher(dbpath);
                Query query = QueryParser.parse(searchkey, searchfield, new StandardAnalyzer());
                Hits hits = searcher.search(query);
                if (hits != null) {
                    list = new ArrayList();
                    int temp_hitslength = hits.length();
                    Document doc = null;
                    for (int i = 0; i ...

Description: The Searcher in this class performs the query and returns the results in a Hits object. Hits is comparable to the ResultSet in JDBC: it is a collection of Documents, each Document corresponds to a record, and a Document contains one or more Fields. The content of each field can be obtained with the Document.get("field name") method.
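As a rough mental model of what a searcher does, the sketch below keeps an in-memory map from each word to the names of the documents that contain it, then looks a word up. It does not use Lucene and is far simpler than Lucene's real index; TinyIndexSketch, add, and search are hypothetical names chosen for this illustration. The two sample documents mirror holen1.txt and holen2.txt from the loading step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index; it only illustrates the idea of word lookup.
final class TinyIndexSketch {
    // word -> names of the documents containing it, in insertion order
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Split the content into lowercase words and record which document has each word.
    void add(String filename, String content) {
        for (String word : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.get(word);
            if (docs == null) {
                docs = new LinkedHashSet<String>();
                postings.put(word, docs);
            }
            docs.add(filename);
        }
    }

    // Return the names of all documents containing the word (empty if none).
    List<String> search(String word) {
        Set<String> docs = postings.get(word.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return Collections.<String>emptyList();
        }
        return new ArrayList<String>(docs);
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.add("holen1.txt", "holen java");
        index.add("holen2.txt", "holen chen");
        System.out.println(index.search("holen")); // prints [holen1.txt, holen2.txt]
        System.out.println(index.search("java"));  // prints [holen1.txt]
    }
}
```

The list returned by search plays a role loosely similar to Hits: one entry per matching document.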

With these three classes, a simple Lucene-based full-text retrieval application is complete.

4. Summary

Lucene is very compact: it is just one JAR package. Add it to your project, call its interfaces, and your application gains full-text search.

From the preliminary application in the previous section, you can see that using Lucene is very simple and somewhat similar to JDBC. Application code centers on classes such as IndexWriter, Document, Field, and Searcher.

Lucene's structure is very clear, with each package responsible for one area: org.apache.lucene.search handles retrieval, org.apache.lucene.index handles indexing, org.apache.lucene.analysis handles word segmentation, and so on. Lucene's main operations are defined through abstract classes, which makes it easy to extend.

Compared with some commercial full-text search products, Lucene loads records quickly, because it uses incremental merging: it first builds small indexes and, when the time is right, merges the small indexes into a large index. As a result, the full-text search library can be updated in step with application data with little or no effect on system performance.
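The incremental merge idea can be pictured with sorted in-memory "segments": each new batch is kept as a small segment, and once enough small segments pile up they are merged into one larger segment. This is only a sketch of the general idea under assumed names (MergeSketch, MERGE_FACTOR, addSegment, segmentOf); it is not Lucene's actual merge policy or file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of incremental merging; not Lucene's implementation.
final class MergeSketch {
    static final int MERGE_FACTOR = 3; // assumed threshold, chosen for illustration

    // Each segment maps a term to the documents that contain it.
    final List<TreeMap<String, List<String>>> segments =
            new ArrayList<TreeMap<String, List<String>>>();

    // Adding is cheap: the new small segment is appended, old ones are untouched,
    // and a merge runs only once MERGE_FACTOR segments have accumulated.
    void addSegment(TreeMap<String, List<String>> small) {
        segments.add(small);
        if (segments.size() >= MERGE_FACTOR) {
            mergeAll();
        }
    }

    // Combine all current segments into a single larger one.
    private void mergeAll() {
        TreeMap<String, List<String>> merged = new TreeMap<String, List<String>>();
        for (TreeMap<String, List<String>> segment : segments) {
            for (Map.Entry<String, List<String>> e : segment.entrySet()) {
                List<String> docs = merged.get(e.getKey());
                if (docs == null) {
                    docs = new ArrayList<String>();
                    merged.put(e.getKey(), docs);
                }
                docs.addAll(e.getValue());
            }
        }
        segments.clear();
        segments.add(merged);
    }

    // Build a one-document segment from the document's terms.
    static TreeMap<String, List<String>> segmentOf(String doc, String... terms) {
        TreeMap<String, List<String>> segment = new TreeMap<String, List<String>>();
        for (String term : terms) {
            List<String> docs = new ArrayList<String>();
            docs.add(doc);
            segment.put(term, docs);
        }
        return segment;
    }

    public static void main(String[] args) {
        MergeSketch index = new MergeSketch();
        index.addSegment(segmentOf("doc1", "holen", "java"));
        index.addSegment(segmentOf("doc2", "holen", "chen"));
        System.out.println(index.segments.size()); // prints 2
        index.addSegment(segmentOf("doc3", "lucene"));
        System.out.println(index.segments.size()); // prints 1
        // prints {chen=[doc2], holen=[doc1, doc2], java=[doc1], lucene=[doc3]}
        System.out.println(index.segments.get(0));
    }
}
```

Most additions only append a small segment, and the more expensive merge happens occasionally, which is one way to see why loading can stay fast.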

Lucene is stable, simple, free, and open source. It is backed by the Apache Software Foundation, whose financial and technical resources are strong. It has been updated steadily over the past two years, and each new release draws coverage in the industry.

