Introduction to Lucene, the Java-based full-text indexing engine


Lucene: a Java-based full-text indexing/retrieval engine

Lucene is not a complete full-text indexing application, but a full-text indexing engine toolkit written in Java. It can easily be embedded in all kinds of applications to add full-text indexing and retrieval capability to them.

About the author: Lucene's contributor Doug Cutting is a veteran full-text indexing/retrieval expert. He was a principal developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system project) and later a senior system architect at Excite; he is currently researching Internet low-level architecture. His goal in contributing Lucene is to bring full-text search capability to all kinds of small and medium-sized applications.

Lucene's history: the author first published it on his own site, www.lucene.com; it was later released on SourceForge, and at the end of 2001 it became a subproject of the Apache Foundation's Jakarta project: http://jakarta.apache.org/lucene/

Since then quite a few Java projects have used Lucene as their full-text indexing engine. Among the better known:

Jive: a web forum system.
Eyebrows: a mailing-list HTML archiving/browsing/searching system. The author of "The Lucene Search Engine: Powerful, Flexible, And Free" (see the references below) is one of the main developers of Eyebrows, and Eyebrows has become the main mailing-list archiving system of the Apache project.
Cocoon: an XML-based web publishing framework; its full-text retrieval part uses Lucene.
Eclipse: a Java-based open development platform; the full-text index of its help subsystem uses Lucene.

For Chinese users, the most pressing question is whether it supports full-text search in Chinese. From the introduction to Lucene's structure below, you will see that thanks to Lucene's well-designed architecture, support for Chinese retrieval can be added simply by extending its language-analysis interface.

The implementation mechanism of full-text retrieval

Lucene's API design is quite generic. Its input and output structures closely resemble the table ==> record ==> field structure of a database, so many traditional applications, files, databases, and so on can easily be mapped onto Lucene's storage structures and interfaces. Overall: you can think of Lucene first as a database system that supports full-text indexing.

Comparing Lucene with a database:

Lucene:
  index data source: doc(field1, field2...) doc(field1, field2...)
                        \  indexer  /
                      | Lucene Index |
                        / searcher \
  result output: Hits(doc(field1, field2...) doc(field1...))

Database:
  index data source: record(field1, field2...) record(field1...)
                        \ SQL: insert /
                        |  DB Index   |
                        / SQL: select \
  result output: results(record(field1, field2...) record(field1...))

Document: the basic "unit" of indexing; consists of multiple fields — corresponds to Record: one record, containing multiple fields.
Field: a field — corresponds to Field: a column.
Hits: the query result set, made up of matching Documents — corresponds to RecordSet: the query result set, made up of multiple Records.

Full-text search ≠ LIKE "%keyword%"

Thicker books usually carry a keyword index at the back (for example: Beijing: pages 12, 34; Shanghai: pages 3, 77 ...), which helps readers find the pages about a topic quickly. The principle by which a database index speeds up queries is the same — imagine how many times faster looking something up through a book's index is than leafing through the whole book page by page. Another reason an index is so effective is that it is sorted. The core of a retrieval system is a sorting problem.

Since database indexes are not designed for full-text search, the index is of no use when you query with LIKE "%keyword%": the query degenerates into a traversal, like leafing through the book page by page. So for database-backed services with fuzzy queries, LIKE is extremely harmful to performance. With several keywords — LIKE "%keyword1%" AND LIKE "%keyword2%" ... — the cost is easy to imagine.

The key to building an efficient retrieval system is therefore a reverse (inverted) index mechanism like a book's index: alongside the sequentially stored data source (say, a set of articles), keep a sorted keyword list that stores the keyword ==> article mapping. Each entry records [keyword ==> article number, number of occurrences (possibly even positions: start offset, end offset), frequency of occurrence]. Retrieval then turns the fuzzy query into several exact queries that can use the index, which improves the efficiency of multi-keyword queries by orders of magnitude. So the full-text search problem comes down, in the end, to a sorting problem.
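The keyword ==> article mapping described above can be sketched in a few lines of Java. This is a minimal illustration of the idea, not Lucene's actual index format; the class and field names are made up for the example.

```java
import java.util.*;

// Minimal sketch of a reverse (inverted) index: each keyword maps to a list
// of postings recording the document number and the term positions in it.
public class InvertedIndex {
    // one posting: which document, and the positions where the term occurs
    public static class Posting {
        public final int docId;
        public final List<Integer> positions = new ArrayList<>();
        Posting(int docId) { this.docId = docId; }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    // index one document given as a pre-tokenized array of keywords
    public void add(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> postings =
                index.computeIfAbsent(tokens[pos], k -> new ArrayList<>());
            Posting last =
                postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId);
                postings.add(last);
            }
            last.positions.add(pos); // occurrence count = positions.size()
        }
    }

    // a multi-keyword query is an intersection of exact index lookups --
    // no record-by-record traversal as with LIKE '%keyword%'
    public Set<Integer> search(String... keywords) {
        Set<Integer> result = null;
        for (String kw : keywords) {
            Set<Integer> docs = new TreeSet<>();
            for (Posting p : index.getOrDefault(kw, Collections.emptyList()))
                docs.add(p.docId);
            if (result == null) result = docs;
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

Each lookup is one hash probe plus a set intersection — this is what "turning the fuzzy query into several exact queries that can use the index" means in practice.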

It should be clear that fuzzy queries, compared with the exact queries a database handles, are a far less deterministic problem, which is also why the full-text retrieval of ordinary databases is so limited. Lucene's core contribution is that, through a special index structure, it implements the full-text index mechanism that traditional databases are not good at, and it provides extension interfaces that make it easy to customize for different applications. You can compare it with database fuzzy queries point by point:

Index: Lucene builds a reverse index from the data source. For LIKE queries, the database's conventional index is of no use at all; the data must be matched record by record, which is orders of magnitude slower than an indexed lookup.

Matching: Lucene matches by term, and through the language-analysis interface can support non-English languages such as Chinese. In a database, LIKE "%net%" also matches "netherlands", and a multi-keyword fuzzy match such as LIKE "%com%net%" cannot match the words in the other order, as in xxx.net...xxx.com.

Relevance: Lucene has a relevance (similarity) algorithm that ranks the best matches first. A database has no control over relevance: a record in which a word appears 5 times and one in which it appears once come out the same.

Result output: Lucene outputs the top 100 highest-scoring results through a special algorithm; the rest of the result set is read from a buffer in small batches on demand. A database returns the entire result set; when there are very many matches (say tens of thousands), a lot of memory is needed to hold the temporary result set.

Customizability: Lucene's language-analysis interfaces make it easy to customize index rules to the application's needs (including support for Chinese). Databases offer no such interface, or only a complex one, and are hard to customize.

Conclusion: Lucene suits high-load fuzzy-query applications that need sophisticated fuzzy-query rules and index large amounts of data; the database suits small data volumes, simple fuzzy-match rules, or only occasional fuzzy queries.

The biggest difference between full-text retrieval and a database application: let the most relevant top 100 results satisfy the needs of more than 98% of users.

Lucene's innovations:

Most search (and database) engines use a B-tree structure to maintain their index, and updating the index then causes a lot of I/O. Lucene improves on this slightly in its implementation: instead of maintaining a single index file, it keeps creating new small index files as the index is extended, and periodically merges these small files into the existing large index (the batch size can be tuned for different update policies). This improves indexing efficiency without hurting retrieval efficiency.
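The merge step can be pictured as a k-way merge of the small segments' sorted docID lists into one large sorted list. The sketch below is only a schematic of the strategy — each "segment" is modelled as a sorted list of ints, not as Lucene's actual file format.

```java
import java.util.*;

// Schematic of the incremental strategy: new documents go into small,
// cheap-to-write segments, which are periodically merged into one larger
// index. Merging k sorted lists costs one heap operation per element.
public class SegmentMerge {
    // k-way merge of sorted docID lists into a single sorted list
    public static List<Integer> merge(List<List<Integer>> segments) {
        // heap entries: {value, segmentIndex, offsetWithinSegment}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty())
                heap.add(new int[]{segments.get(s).get(0), s, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            List<Integer> seg = segments.get(top[1]);
            if (top[2] + 1 < seg.size())
                heap.add(new int[]{seg.get(top[2] + 1), top[1], top[2] + 1});
        }
        return merged;
    }
}
```

Because each small segment is written sequentially and merged in bulk, the expensive random I/O of updating a single large B-tree in place is avoided.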

Comparing Lucene with some other full-text retrieval systems/applications:

Incremental vs batch indexing: Lucene supports incremental indexing (append) as well as batch indexing of large data sets, and its interfaces are designed to optimize both batch indexing and small-batch incremental indexing. Many other open-source systems support only batch indexing, so any addition to the data source forces a full index rebuild.

Data source: Lucene does not define a data source, only a document structure, so it adapts very flexibly to all kinds of applications (as long as a suitable converter turns the data source into that structure). Many systems target web pages only and lack the flexibility to handle documents in other formats.

Index content: a Lucene document is made up of multiple fields, and you can control which fields are indexed and which are merely stored; indexed fields are further divided into tokenized and untokenized types — tokenized-index fields such as the title or article body, untokenized-index fields such as author or date. Lacking this granularity, other systems often index the entire document wholesale.

Language analysis: implemented through different extensions of the language analyzer: stop words such as "an", "the", "of" can be filtered out; Western words can be stemmed ("jumps", "jumped", "jumper" all reduce to "jump" for indexing/retrieval); non-English support, such as word segmentation for Asian and Arabic languages, can be added. Other systems lack a generic interface for this.

Query analysis: through the query-analysis interface you can customize your own query syntax rules, such as the AND/OR relations between multiple keywords.

Concurrent access: Lucene supports use by multiple users.

On word segmentation for Asian languages (Word Segment)

For Chinese, full-text indexing must first solve a language-analysis problem. In English, the words of a sentence are naturally separated by spaces, but in Chinese, Japanese, and Korean the characters of a sentence run together, so the first problem is how to cut the sentence into the "words" that will be indexed.

First, a single character certainly cannot serve as the unit of indexing; otherwise, when you search for "Shanghai" (上海), text containing only "sea" (海) would match as well.

But take a phrase like "北京天安门" (Beijing Tiananmen): how is a computer to cut it according to the word habits of Chinese readers — "Beijing | Tiananmen" or "Bei | jing | Tiananmen"? Making the computer separate words according to language habits usually requires a fairly rich word library (lexicon) so that the words in a sentence can be recognized accurately.
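A common lexicon-based approach is forward maximum matching: at each position, take the longest lexicon entry that matches, falling back to a single character for unknown text. Below is a toy sketch — the three-entry lexicon is hypothetical, and real segmenters are far more elaborate (ambiguity resolution, word frequencies, and so on).

```java
import java.util.*;

// Forward maximum matching against a word library: at each position try the
// longest candidate first, shrink until a lexicon entry matches, and emit a
// single character when nothing matches.
public class MaxMatchSegmenter {
    public static List<String> segment(String text, Set<String> lexicon, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String word = null;
            for (int j = end; j > i; j--) {          // longest first
                String cand = text.substring(i, j);
                if (lexicon.contains(cand)) { word = cand; break; }
            }
            if (word == null) word = text.substring(i, i + 1); // unknown char
            words.add(word);
            i += word.length();
        }
        return words;
    }
}
```

With a rich enough lexicon, "BeijingTiananmen" (written without spaces, like Chinese) comes out as "Beijing" + "Tiananmen" rather than arbitrary fragments.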

The other solution is an automatic segmentation algorithm: cut the text into overlapping pairs of characters according to a 2-gram (bigram) rule, for example "北京天安门" ==> "北京 京天 天安 安门".

At query time, the query string is cut by the same rules — whether the query is "Beijing" or "Tiananmen", it becomes the same kind of bigram tokens, combined with an AND relation between the multiple keywords — and can therefore be mapped onto the index correctly. This approach applies to other Asian languages as well: it works for Korean and Japanese too.
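The bigram rule itself is nearly a one-liner. The same function is applied both to indexed text and to query strings, so their tokens line up; this is a sketch of the rule, not Lucene's tokenizer API.

```java
import java.util.*;

// 2-gram (bigram) segmentation: every pair of adjacent characters becomes a
// token. Because index and query are cut by the same rule, any substring
// query of length >= 2 maps onto an AND of exact bigram lookups.
public class BigramTokenizer {
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

For example, bigrams("ABCD") yields the tokens AB, BC, CD — exactly the "北京天安门" ==> "北京 京天 天安 安门" pattern described above.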

The biggest advantage of automatic segmentation is that it has no lexicon maintenance cost and is simple to implement; the drawback is low index efficiency. But for small and medium applications, bigram segmentation is still good enough: a bigram index is roughly the same size as the source text, whereas for English the index file is generally only 30%-40% of the size of the original.

Implementation: automatic segmentation is very simple to implement; lexicon-based segmentation is complex to implement.

Queries: automatic segmentation increases the complexity of query analysis; lexicon-based segmentation is well suited to implementing sophisticated query syntax rules.

Storage efficiency: with automatic segmentation, index redundancy is high — the index is about as large as the source text; with a lexicon, efficiency is high — the index is only around 30% of the source text's size.

Maintenance cost: automatic segmentation has no lexicon maintenance cost; lexicon-based segmentation has a very high one — Chinese, Japanese, and Korean must each be maintained separately, and the lexicon also needs word-frequency statistics and so on.

Applicable domains: automatic segmentation suits embedded systems (limited runtime resources), distributed systems (no lexicon-synchronization problem), and multilingual environments (no lexicon maintenance cost); lexicon-based segmentation suits professional search engines with high demands on query and storage efficiency.

The language-analysis algorithms of today's larger search engines are generally based on a combination of the two mechanisms above. For more on Chinese language-analysis algorithms, search Google for the keywords "word segment search".

Installation and use

Download: http://jakarta.apache.org/lucene/

Note: some of Lucene's more complex lexical-analysis code is generated with JavaCC (JavaCC: Java Compiler Compiler, a pure-Java lexical-analyzer generator), so if you build from source, or need to modify the QueryParser or write your own lexical analyzer, you also need to download JavaCC from http://www.webgain.com/products/java_cc/.

The structure of Lucene: the indexing module (index) and the retrieval module (search) are the main entry points for external applications.

org.apache.lucene.search/      search entry point
org.apache.lucene.index/       indexing entry point
org.apache.lucene.analysis/    language analyzers
org.apache.lucene.queryParser/ query analyzer
org.apache.lucene.document/    storage structure
org.apache.lucene.store/       low-level IO/storage structure
org.apache.lucene.util/        some common data structures

A simple example demonstrating the use of Lucene:

Indexing: take file names from the command line, store the file path (path field) and contents (body field) in the index, and full-text index the contents. The unit of indexing is the Document object; each Document contains multiple Field objects. Depending on each field's properties and data-output requirements, you can choose different indexing/storage rules per field, listed as follows:

Method                                     Tokenized  Indexed  Stored  Use
Field.Text(String name, String value)      Yes        Yes      Yes     tokenized, indexed, and stored: e.g. title, article body
Field.Text(String name, Reader value)      Yes        Yes      No      tokenized and indexed, not stored: e.g. meta content that must be searchable but need not be returned for display
Field.Keyword(String name, String value)   No         Yes      Yes     indexed without tokenizing, and stored: e.g. date fields
Field.UnIndexed(String name, String value) No         No       Yes     not indexed, only stored: e.g. file path
Field.UnStored(String name, String value)  Yes        Yes      No      full-text indexed only, not stored

public class IndexFiles {
    // Usage: IndexFiles [index output directory] [files to index] ...
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        IndexWriter writer;
        // construct a new index writer with the given language analyzer
        // (third argument: false means append to an existing index
        // rather than create a new one)
        writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
        for (int i = 1; i < args.length; i++) {
            System.out.println("Indexing file " + args[i]);
            InputStream is = new FileInputStream(args[i]);
            // a document with two fields: path (stored, not indexed)
            // and body (full-text indexed, not stored)
            Document doc = new Document();
            doc.add(Field.UnIndexed("path", args[i]));
            doc.add(Field.Text("body", new InputStreamReader(is)));
            writer.addDocument(doc);
            is.close();
        }
        writer.close();
    }
}

The language analyzer is an abstract interface, so language analysis can be customized. Lucene ships with two fairly general analyzers by default, SimpleAnalyzer and StandardAnalyzer; neither supports Chinese, so to add segmentation rules for the Chinese language you need to modify (or extend) these two analyzers.

Lucene does not prescribe a format for the data source; it only provides a universal structure (the Document object) to accept indexing input. So the input data source can be anything — a database, a Word document, a PDF document, an HTML document ... — as long as you write the corresponding parser/converter to turn the data source into a Document object, it can be indexed.

For indexing large volumes of data, you can also improve batch-indexing efficiency by tuning the IndexWriter's file-merge frequency property.

The retrieval process and displaying the results:

Search results are returned as a Hits object, through which you can access the Document ==> Field contents again.

Assuming the full-text search is performed on the body field, you can print the path field and the match score of each query result:

public class Search {
    public static void main(String[] args) throws Exception {
        String indexPath = args[0], queryString = args[1];
        // point the searcher at the index directory
        Searcher searcher = new IndexSearcher(indexPath);
        // query parser: use the same language analyzer as at index time
        Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
        // search results are held in a Hits object
        Hits hits = searcher.search(query);
        // through hits we can access the fields of the matching documents
        // and each document's match score
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
        }
        searcher.close();
    }
}

Simplified query analyzer

My personal feeling is that since Lucene became a Jakarta project, too much time has gone into debugging an ever more complex QueryParser, most of whose features are unfamiliar to most users. Currently Lucene supports the following syntax:

Query  ::= ( Clause )*
Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

The intermediate logic includes: AND, OR, +, -, &&, ||, and so on, plus "phrase queries" and prefix/fuzzy queries for Western languages. For ordinary applications, my personal feeling is that some of these features are rather showy; in fact, a query-statement analyzer with functionality similar to Google's current one is already enough for most users. So Lucene's early, simpler version of QueryParser may still be the better choice.

Adding, modifying, and deleting a specified record (Document)

Lucene provides a mechanism for extending the index, so dynamically growing the index is no problem; modifying a specified record, however, apparently can only be done by deleting the record and then re-adding it. How do you delete a specified record? The method is very simple: at indexing time, index an extra field with the record's ID from the data source, and later delete the corresponding Document with the IndexReader.delete(Term term) method.

Sorting results by the value of some field

By default Lucene sorts results by its own relevance algorithm (score), but sorting by other fields is a question that comes up often on the Lucene development mailing list — many applications formerly based on a database need sorting by something other than the match score. From the principles of full-text retrieval we can see that any search that does not go through the index will be very inefficient; if field-based sorting has to access stored fields during the search, the speed drops drastically, so it is very undesirable.

But there is a compromise: during the search, only the docID and score already stored in the index can influence the ordering of results. So, besides sorting on score, the data source can be sorted by the target field in advance, before indexing, and the results then sorted by docID. This avoids both sorting Lucene's search results outside of Lucene and accessing a stored field value during the search.

Here you need to modify the HitCollector logic inside IndexSearcher:

...
scorer.score(new HitCollector() {
    private float minScore = 0.0f;
    public final void collect(int doc, float score) {
        if (score > 0.0f &&                     // ignore zeroed buckets
            (bits == null || bits.get(doc))) {  // skip docs not in bits
            totalHits[0]++;
            if (score >= minScore) {
                /* Original: Lucene enqueues the docID and the corresponding
                 * match score:
                 *   hq.put(new ScoreDoc(doc, score));   // update hit queue
                 * If doc or 1/doc is used in place of score, the hits come
                 * back ordered by docID ascending or descending. Assuming
                 * the data source was already sorted by the target field
                 * before indexing, sorting the results by docID effectively
                 * sorts them by that field. Sorting on a field can even
                 * combine score and docID in more complex ways. */
                hq.put(new ScoreDoc(doc, (float) 1 / doc));
                if (hq.size() > nDocs) {        // if hit queue overfull
                    hq.pop();                   // remove lowest in hit queue
                    minScore = ((ScoreDoc) hq.top()).score; // reset minScore
                }
            }
        }
    }
}, reader.maxDoc());

A more universal input/output interface

Although Lucene does not define a required input document format, more and more people are thinking of using a standard intermediate format as Lucene's data-import interface; other data, such as PDF, then only needs a parser that converts it into the standard intermediate format to be indexed. This intermediate format is mostly XML-based, and there are already 4 or 5 similar implementations:

Data source: Word   PDF   HTML   DB   other
                \     |     |    |   /
                 XML intermediate format
                          |
                    Lucene INDEX

Index process optimization

Indexing generally falls into two cases: small-batch incremental extension, and rebuilding the index for large volumes of data. During indexing, Lucene does not perform a write to the index files every time a new doc is added (file I/O is a very expensive resource).

Lucene first performs indexing operations in memory and writes files in batches of a certain size. The larger the batch interval, the fewer file writes, but the more memory is occupied; conversely, less memory is used, but file I/O becomes frequent and indexing correspondingly slow. There is a merge_factor parameter on IndexWriter that, after the indexer is constructed, helps you make full use of memory and reduce file writes according to your environment. In my experience: the default indexer writes once for every 20 records indexed, and each 50-fold increase of merge_factor roughly doubles indexing speed.
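The trade-off can be simulated without Lucene: buffer records in memory and pay one (expensive) file write per batch. The class below is a stand-in, not Lucene's IndexWriter; only the role of its batch-size parameter mirrors merge_factor.

```java
import java.util.*;

// Simulation of the batching trade-off: documents are buffered in memory
// and flushed in one "file write" per batch. A larger batch means fewer
// expensive writes at the cost of more memory held.
public class BatchedIndexer {
    private final int batchSize;               // analogous role to merge_factor
    private final List<String> buffer = new ArrayList<>();
    public int flushCount = 0;                 // how many writes happened

    public BatchedIndexer(int batchSize) { this.batchSize = batchSize; }

    public void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    public void close() { if (!buffer.isEmpty()) flush(); }

    private void flush() {
        // a real indexer would write an index segment to disk here
        flushCount++;
        buffer.clear();
    }
}
```

Indexing 100 records with a batch of 20 costs 5 writes; with a batch of 1000 it costs a single write at close — the same lever merge_factor gives you in Lucene.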

Retrieval cache optimization

One optimization in Lucene's full-text retrieval is that after the first index search it does not read the contents of all documents; instead it puts only the IDs of the top 100 highest-scoring results (TopDocs) into a result cache and returns. Compare this with a database: given a 10,000-row retrieval result set, the database must fetch the contents of every record before returning the result set to the application. So even when the total number of matches is large, Lucene does not consume much extra memory. For ordinary fuzzy-search applications, users do not need that many results anyway; the top 100 already satisfy more than 90% of retrieval demands.

If more results are requested after the first batch of cached results is used up, the Searcher re-runs the search and builds a cache twice the size of the previous one, filling it anew. So if you construct one Searcher and fetch results 1-120, the Searcher actually performs 2 search passes: after the first 100 are consumed, the cache is exhausted, so the Searcher re-searches and builds a 200-entry result cache, and so on for a 400-entry cache, an 800-entry cache. Since these caches disappear along with their Searcher object, you may want to cache result records yourself, keeping the number fetched at or below the cache size so as to make full use of the first-level result cache and not force Lucene to search repeatedly; you can also layer your own result caching on top.
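The doubling behavior can be sketched as follows. The cache and search function here are stand-ins for the Searcher's internal result cache described above, not Lucene's actual code; the search is injected as a function returning the top-n docIDs.

```java
import java.util.*;
import java.util.function.IntFunction;

// Sketch of result-cache doubling: the first fetch caches the top 100
// hits; a request past the end of the cache re-runs the search for twice
// the previous cache size (200, 400, 800, ...).
public class DoublingResultCache {
    private final IntFunction<List<Integer>> searchTopN; // re-runs the search
    private List<Integer> cache;
    public int searchCount = 0; // how many times the search was executed

    public DoublingResultCache(IntFunction<List<Integer>> searchTopN) {
        this.searchTopN = searchTopN;
    }

    public int get(int i) {
        if (cache == null) cache = run(100);       // first pass: top 100
        while (i >= cache.size()) {                // past the cache: double it
            List<Integer> bigger = run(cache.size() * 2);
            if (bigger.size() <= cache.size()) break; // no more results exist
            cache = bigger;
        }
        return cache.get(i);
    }

    private List<Integer> run(int n) {
        searchCount++;
        return searchTopN.apply(n);
    }
}
```

Fetching results 1-120 triggers exactly two search passes (a 100-entry cache, then a 200-entry cache), which is why keeping reads within the first cache avoids repeated searches.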

Another particularity of Lucene is that it filters out low-scoring results while collecting them. This, too, is a difference from database applications, which must return all results of a search.

Some attempts of my own:

A tokenizer supporting Chinese: there are 2 versions here, one generated with JavaCC that indexes the CJK portion one character per token, and one rewritten from SimpleTokenizer that supports number and letter tokens for English and indexes Chinese iteratively.

An indexer for XML data sources: XMLIndexer — any data source can be indexed with XMLIndexer, as long as it is converted to the specified XML according to the DTD.

Sorting by a field: a searcher that returns results in record-index order: IndexOrderSearcher. If you need the search results sorted by some field, let the data source be pre-sorted by that field (for example, PriceField); after indexing, using this searcher that returns results in record-ID order is then equivalent to having the results sorted by that field.

Learning more from Lucene

Lucene is indeed a model of object-oriented design:

All concerns are made easy to extend and reuse through one extra layer of abstraction: you can achieve your own goal by re-implementing a piece, without touching anything else.
The application entry points Searcher and Indexer are simple, and they call on a series of components to carry out the indexing/retrieval task cooperatively.
The duties of all the objects are very specific: in the search process, for example, QueryParser analyzes and converts the query string into a series of exact queries (Query); the index is read through the low-level index-reading structure IndexReader; and the corresponding Scorer scores and sorts the search results.
All functional modules are highly atomic, so each can be replaced with a new implementation without modifying the other modules.
Besides the flexible application interface design, Lucene also provides some language-analyzer implementations suitable for most applications, which is one of the important reasons new users get up to speed quickly.

These advantages are well worth learning from in future development. As a general-purpose toolkit, Lucene does give a lot of convenience to developers who need to embed full-text retrieval into their own applications.

In addition, through studying and using Lucene, I also came to understand why so many database optimization guidelines exist, for example:

Index as many of the queried fields as possible to speed up queries — but too many indexes slow down updates of the database table, and excessive sorting conditions on results are, in fact, one of the killers of performance.
Many commercial databases provide optimization parameters for bulk data inserts; their effect is similar to the indexer's merge_factor.
The 20%/80% principle: the quantity of results found is not the same as quality, especially for huge result sets; how to optimize the quality of the first few dozen results is often what matters most.
Let the application fetch from the database only what it actually needs: random access to a huge result set is a very resource-intensive operation, even for a large database.

Reference:

Apache: Lucene Project: http://jakarta.apache.org/lucene/
Lucene mailing-list archives: lucene-dev@jakarta.apache.org, lucene-user@jakarta.apache.org

The Lucene Search Engine: Powerful, Flexible, And Free: http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene_p.html

Lucene Tutorial: http://www.darksleep.com/puff/lucene/lucene.html

Notes on Distributed Searching with Lucene: http://home.clara.net/markharwood/lucene/

Introduction to search-engine tools: http://searchtools.com/

Search Engine Industry Research http://www.searchenginewatch.com/

Several papers by Lucene author Doug Cutting: http://lucene.sourceforge.net/publications.html

Lucene's .NET implementation: http://sourceforge.net/projects/nlucene

Another project by Lucene author Cutting: Nutch, a Java-based search engine: http://www.nutch.org/ http://sourceforge.net/projects/notch/

A comparison of lexicon-based and N-gram-based word segmentation: http://china.nikkeibp.co.jp/cgi-bin/china/news/int/int200302100112.html

Original article: http://www.chedong.com/tech/lucene.html

