Lucene: Getting Started and Usage (reposted from: Grass Bear Lace)

xiaoxiao2021-03-06  45

Lucene: getting started and usage

This article focuses on practical use and is aimed at Lucene beginners who are already familiar with Java programming.

1. Introduction to Lucene

1.1 Lucene History

The org.apache.lucene package is a full-text search toolkit written in pure Java. Lucene's author is a veteran full-text indexing and retrieval expert. He first published it on his own home page and contributed it to Apache in October 2001, where it became a subproject of the Apache Software Foundation's Jakarta project. Lucene is now widely used in full-text indexing and retrieval projects. It has also been ported to C#, where it is currently developed as Lucene.Net (although there have recently been reports that the port has been abandoned).

1.2 Lucene principle

Lucene's retrieval is index-based, that is, it trades space for time. A full-text index is built in advance over the files to be searched. At query time the index is looked up quickly to find where a search term occurs; each such location records the path of the file in which the term appears, or some key.

In a project that already uses a database, the reason for not searching with the database itself is this: an inexact query has to use the form LIKE '%keyword%', which makes the database go through every record and test whether the field matches %keyword%. This full scan is fatal when the database is huge or when a single field holds a large amount of data, because the match has to be run against every record. Lucene is therefore best suited to full-text search over document collections and to fuzzy search over very large databases, especially when the database stores XML or other large character-type data.
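The contrast between an index lookup and a LIKE-style scan can be sketched in plain Java. This is only a toy illustration under assumptions of my own: the class InvertedIndexSketch and its methods are hypothetical and are not part of Lucene, and Lucene's real index format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only: a toy inverted index, not Lucene's implementation.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private final List<String> docs = new ArrayList<String>();

    // Indexing: the cost is paid once, up front, and the postings map takes extra space.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(id);
        }
        return id;
    }

    // Index lookup: a single map access, however many documents were added.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        return ids == null ? Collections.<Integer>emptyList() : new ArrayList<Integer>(ids);
    }

    // LIKE '%keyword%' style scan: every document is examined on every query.
    List<Integer> scan(String keyword) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase().contains(keyword.toLowerCase())) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a full-text search toolkit");
        index.add("A database scans every record");
        index.add("Lucene trades space for time");
        System.out.println(index.search("lucene"));
        System.out.println(index.scan("lucene"));
    }
}
```

Both calls find the same two documents here, but search() touches only one map entry while scan() reads every document. Note that the toy index matches whole terms only, whereas the scan also matches substrings.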

2. Lucene download and configuration

2.1 Lucene download

Lucene is published under the Jakarta project: http://jakarta.apache.org/lucene/docs/index.html. The links below are mainly for Windows users; other users can find their downloads at the address above.

Lucene binary download (the .jar plus an example demo): http://apache.oregonState.edu/jakarta/lucene/binaries/lucene-1.4-final.zip

Lucene source code download: http://www.signal42.com/mirrors/apache/jakarta/lucene/source/lucene-1.4-final-src.zip

Lucene API documentation: http://jakarta.apache.org/lucene/docs/api/index.html

The Lucene version used in this article is lucene-1.4-final.jar.

2.2 Configuration of Lucene

First, make sure the basic Java environment is set up on your machine, so that Java source code can be compiled and run on your platform; if it is not, consult the relevant documentation. Then configure Lucene:

Ordinary users: add the location of the Lucene .jar to the CLASSPATH environment variable, for example: "D:/java/lucene-1.4-final/lucene-1.4-final.jar;".

JBuilder users: add it under "Project" - "Required Libraries".

JSP users: you can also put lucene-1.4-final.jar directly under /WEB-INF/lib.

3. Lucene examples (demos)

3.1 Demo description

The demos available include lucene-demos-1.4-final and XMLIndexingDemo. lucene-demos-1.4-final covers indexing of plain files and of HTML files; XMLIndexingDemo covers indexing of XML files. The main difference between them: a plain file is indexed simply by indexing its full text, whereas in HTML and XML files the markup (the tags) must not be indexed, so indexing them requires an additional parsing pass over the data stream to decide which content is useful. For the latter two, indexing therefore carries extra overhead, which can even exceed the cost of indexing itself; retrieval time is not affected.

lucene-demos-1.4-final ships inside lucene-1.4-final.zip. XMLIndexingDemo download address: http://cvs.cgi/jakarta-lucene-sandbox/Contributions/ XML-Indexing-Demo /

3.2 Running the demos

First add the demo .jar to the CLASSPATH, for example: "D:/java/lucene-1.4-final/lucene-demos-1.4-final.jar;", and make sure lucene-1.4-final.jar has already been added as well.

Next, build a full-text index of some files. In a DOS console, enter "java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src"; the trailing path is the folder to be indexed, for example: "java org.apache.lucene.demo.IndexFiles c:/test".

Then search the index: enter "java org.apache.lucene.demo.SearchFiles" and type a search term at the "Query:" prompt. The program prints the search results (the paths of the files in which the term appears). For the other demos, see /docs/demo.html. After running the demos, read their source code to learn more.

4. Indexing with Lucene

Now that we have a feel for Lucene, let's learn how to use it. Here is an example of building an index:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // IOException must be caught or declared by the caller
    // Create an IndexWriter; the index is saved in the directory "index"
    String[] stopStrs = {"his grandmother", "fuck"};
    StandardAnalyzer analyzer = new StandardAnalyzer(stopStrs);
    IndexWriter writer = new IndexWriter("index", analyzer, true);

    // Add a document
    Document doc = new Document();
    doc.add(Field.UnIndexed("id", "1")); // "id" is the field name, "1" is the field value
    doc.add(Field.Text("text", "Fuck, his grandmother, entry and use"));
    writer.addDocument(doc);

    // Finish up
    writer.optimize();
    writer.close();

With this example in mind, let's get familiar with indexing in Lucene:

4.1 Lucene's indexing interfaces

To learn indexing, you first need to be familiar with a few classes:

4.1.1 Analyzer

The analyzer's main job is filtering: when a document comes in, the analyzer keeps only the useful parts and removes the rest. You can also write your own analyzer as needed.

org.apache.lucene.analysis.Analyzer: an abstract class. The following two classes extend it.

org.apache.lucene.analysis.SimpleAnalyzer: a simple analyzer that supports the most basic Latin-script languages.

org.apache.lucene.analysis.standard.StandardAnalyzer: the standard analyzer. Besides Latin-script languages it also supports Asian languages and is more complete in some matching features. This class has a very important constructor, StandardAnalyzer(String[] stopWords), which lets you give the analyzer a list of stop words. Stop words can be used not only to keep useless words out of the search but also to block politically sensitive or illegal search keywords.
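The effect of a stop-word list can be sketched without Lucene. The class below is a hypothetical stand-in that I made up for illustration: it lower-cases the text, splits it on anything that is not a letter or digit, and drops the stop words. StandardAnalyzer's real tokenization rules are more involved, so treat this only as a picture of the idea.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: mimics the idea of stop-word filtering,
// not the actual tokenization rules of StandardAnalyzer.
class StopWordFilterSketch {
    private final Set<String> stopWords = new HashSet<String>();

    StopWordFilterSketch(String[] stopWords) {
        for (String word : stopWords) {
            this.stopWords.add(word.toLowerCase());
        }
    }

    // Lower-case the text, split on non-letter/non-digit characters, drop stop words.
    List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StopWordFilterSketch analyzer =
                new StopWordFilterSketch(new String[] {"a", "the", "they"});
        System.out.println(analyzer.tokens("They index the documents with a toolkit"));
    }
}
```

Because the filtering in this sketch happens token by token, a stop entry only has an effect if it is itself a single token.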

4.1.2 IndexWriter

IndexWriter has three constructors, taking the index location as a Directory, a File, or a file path String. For example, in IndexWriter(String path, Analyzer a, boolean create), path is the file path, a is the analyzer, and create says whether to rebuild the index (true: create the index, overwriting any existing one; false: extend the existing index). Some important methods:

Method                              Description
addDocument(Document doc)           Adds a document to the index
addIndexes(Directory[] dirs)        Merges the indexes stored in the given directories into this index
addIndexes(IndexReader[] readers)   Merges the given indexes into this index
optimize()                          Merges the index files and optimizes the index
close()                             Closes the writer

To cut down on I/O maintenance, IndexWriter creates a new small index file each time a certain number of documents has been indexed (in the author's tests the smallest batch unit was 10) and then periodically merges these small files into one index file. For this reason writer.optimize() must be called when indexing is finished, so that all the index files are merged and optimized.

4.1.3 org.apache.lucene.document

This package has two main classes.

a) org.apache.lucene.document.Document: a Document is similar to a record in a database. It consists of several fields (Field), and the fields can be of different kinds (see b). Some methods of Document:

Method                           Description
add(Field field)                 Adds a field (Field) to the Document
String get(String name)          Gets the value of the field with the given name
Field getField(String name)      Gets the field with the given name
Field[] getFields(String name)   Gets all the fields with the given name

b) org.apache.lucene.document.Field: the "field" mentioned above; it is one section of a Document.

Field constructor: Field(String name, String string, boolean store, boolean index, boolean token).

Indexed: if a field is indexed, it can be searched.

Stored: if a field is stored, its value can be read back from the search results.

Tokenized: if a field is tokenized, it has been turned into a sequence of tokens by the Analyzer. During tokenization the Analyzer extracts the text that needs to be indexed and removes redundant words (for example: a, the, they; see org.apache.lucene.analysis.StopAnalyzer.ENGLISH_STOP_WORDS and the org.apache.lucene.analysis.standard.StandardAnalyzer(String[] stopWords) API). A token is the basic unit of indexing and represents one indexed term, such as an English word or a Chinese character. Therefore, any text that contains Chinese must be tokenized.

Factory methods of Field:

Method                                Stored  Indexed  Tokenized  Use
Keyword(String name, String value)    Y       Y        N          Date, URL
Text(String name, Reader value)       N       Y        Y          Short text fields: title, subject
Text(String name, String value)       Y       Y        Y          Longer text fields, like "body"
UnIndexed(String name, String value)  Y       N        N
UnStored(String name, String value)   N       Y        Y
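What the three flags mean in practice can be modelled in a few lines of plain Java. The class FieldFlagsSketch below is hypothetical and exists only for this illustration; it is not Lucene's Field class, and the whitespace split stands in for whatever the Analyzer would actually do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the three Field flags; the names are not Lucene's API.
class FieldFlagsSketch {
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldFlagsSketch(String value, boolean stored, boolean indexed, boolean tokenized) {
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Terms that would be searchable for this field.
    List<String> indexTerms() {
        if (!indexed) {
            return Collections.<String>emptyList();   // not searchable at all
        }
        if (!tokenized) {
            return Collections.singletonList(value);  // the whole value is one term
        }
        List<String> terms = new ArrayList<String>();
        for (String term : value.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Value that could be read back from a search result, or null if not stored.
    String storedValue() {
        return stored ? value : null;
    }

    public static void main(String[] args) {
        // Flags follow the table above: stored, indexed, tokenized.
        FieldFlagsSketch keyword = new FieldFlagsSketch("2004-07-01 draft", true, true, false);
        FieldFlagsSketch unStored = new FieldFlagsSketch("Getting started", false, true, true);
        System.out.println(keyword.indexTerms() + " / " + keyword.storedValue());
        System.out.println(unStored.indexTerms() + " / " + unStored.storedValue());
    }
}
```

In this model a Keyword-style field is searchable only by its exact value, while an UnStored-style field is searchable word by word but cannot be read back from a hit.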

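5. Searching with Lucene

5.1 A simple search example

    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // IOException and ParseException must be caught or declared by the caller
    // Build the query from the search criteria
    Query query = QueryParser.parse("Getting Started", "text", analyzer);

    // Search
    Searcher searcher = new IndexSearcher("./index"); // "./index" is the location of the index files
    Hits hits = searcher.search(query);

    // Print the result set
    // (the loop is cut off in the source after "i"; the bound and body below are a minimal assumed completion)
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("id"));
    }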

5.2 Using Lucene's search interface

5.2.1 Query and QueryParser

The method mainly used is QueryParser.parse(String query, String field, Analyzer analyzer). For example: Query query = QueryParser.parse("Getting Started", "text", analyzer); here "Getting Started" is the search term, "text" is the name of the field to search, and analyzer is the analyzer.

5.2.2 Hits and Searcher

Main methods of Hits:

Method        Description
doc(int n)    Returns all the fields of the n-th document
length()      Returns the number of documents available in the result set

6. Other uses of Lucene

6.1 Modifying a Lucene index

The code for modifying an index is given below; read it alongside the Lucene API:

    /**
     * Add a new document to an existing index.
     * @param idStr String: the id to modify
     * @param valueStr String: the value to modify
     */
    public void addIndex(String idStr, String valueStr) {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(indexPath, analyzer, false); // false: extend the existing index
            writer.mergeFactor = 2; // works around a Lucene 1.4.2 bug in which the original index is not merged

            Document doc = new Document();
            // The Field factory name is missing in the source; Keyword (stored and indexed) is assumed here
            doc.add(Field.Keyword("id", idStr)); // "id" is the field name, idStr is the field value
            doc.add(Field.Text("text", valueStr));
            writer.addDocument(doc);

            writer.optimize();
            writer.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

    /**
     * Delete index entries.
     * @param idStr String
     */
    public void deleteIndex(String idStr) {
        try {
            Directory dir = FSDirectory.getDirectory(indexPath, false);
            IndexReader reader = IndexReader.open(dir);
            IndexXML.deleteIndex(idStr, reader); // helper in the author's own IndexXML class, not shown in this article
            reader.close();
            dir.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

6.2 Sorting Lucene search results

Sorting in Lucene is done mainly with org.apache.lucene.search.Sort. A Sort can be created directly from a field name or from standard SortField objects, but a field used for sorting must satisfy two conditions: it holds a single value per document, and it is indexed. Three types can be sorted: integers, floats and strings. Sorting search results by an integer id takes only the following:

    Sort sort = new Sort("id");
    Hits hits = searcher.search(query, sort);

You can also define more complex sort orders of your own; see the API for details.
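Why the type of a sort field matters can be seen without Lucene. The plain-Java sketch below (a hypothetical SortByFieldSketch class, not Lucene code) orders the same id values once as integers and once as strings; it makes no claim about how Lucene itself picks the comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: integer order versus string order of the same ids.
class SortByFieldSketch {
    // Compare the ids by their numeric value.
    static List<String> sortAsIntegers(List<String> ids) {
        List<String> sorted = new ArrayList<String>(ids);
        Collections.sort(sorted, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return sorted;
    }

    // Compare the ids character by character.
    static List<String> sortAsStrings(List<String> ids) {
        List<String> sorted = new ArrayList<String>(ids);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("10", "9", "2");
        System.out.println(sortAsIntegers(ids)); // prints [2, 9, 10]
        System.out.println(sortAsStrings(ids));  // prints [10, 2, 9]
    }
}
```

The two orders differ as soon as the ids have different numbers of digits, which is why it matters whether an id field is treated as an integer or as a string.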

7. Summary

Lucene brings very powerful capabilities to full-text indexing in Java; this article has given only a brief introduction to it.

