Author: Cha Dong Email: chedongATbigfoot.com/chedongATchedong.com
abstract:
Lucene is a Java-based full-text index kit.
1. Introduction to the full-text index engine based on Java: About the author and Lucene history
2. Realization of full text: Lune full-text index and database index comparison
3. Introduction to the Chinese Tipnotes: Comparison Based on the Ratual Library and Automatic Practice Current Algorithm
4. Specific installation and use Introduction: System Structure Introduction and Demos
5. Hacking Lucene: Simplified query analyzer, delete implementation, custom sort, application interface extension
6. What can we learn from Lucene?
Java-based full-text index / retrieval engine - Lucene
Lucene is not a complete full full-text index application, but a full-text index engine toolkit written by Java, which can be easily embedded in a variety of applications to implement the full-text index / retrieval function for applications.
Lucene author: Lucene contributor DougCutting is a senior full-text indexing / retrieval expert, once the V-Twin search engine (one of the achievements of Apple's Copland operating system) the main developer, after a senior system architecture design in Excite Teacher, currently engaged in research on the underlying architecture of Internet. He contributed to Lucene's goal is to join a full-text search function for a variety of small and medium-sized applications.
Lucene's development process: I have published it in the author's own www.lucene.com, and later released in Sourceforge, became a subpredal item for the Apache Foundation Jakarta by the end of 2001: http://jakarta.apache.org/lucene/
There have been many Java projects that use Lucene as a full-text index engine in the future, which is more famous:
· JIVE: Web Forum System;
· Eyebrows: Mail list HTML Archive / Browse / Query System Mail list archive system.
· COCOON: XML-based web publishing framework, full-text search section uses Lucene
· Eclipse: Based on Java-based open development platform, the full-text index of the help part uses Lucene
For Chinese users, the most concerned issue is whether it supports Chinese full-text search. However, through the introduction of Lucene's structure, you will learn that due to the LUCENE's good architecture design, the support of Chinese can implement support for Chinese retrieval simultaneously.
Realization mechanism for full text search
The comparison of Lucene's API interface design is very common. The input and output structure is very similar to the table ==> record ==> field of the database, so many traditional applications, database, etc. can be easily mapped to Lucene's storage structure / interface. in. Overall: You can first treat Lucene as a database system that supports full-text indexes.
Compare lucene and databases:
Lucene database
Index Data Source: DOC (Field1, Field2 ...) DOC (Field1, Field2 ...) / indexer / _____________ | Lucene Index | -------------- / Searcher / Result Output: Hits (DOC1, Field2) DOC (Field1 ...)) Index Data Source: Record (Field1, Field2 ...) Record (Field1 ..) / SQL: INSERT / _____________ | DB INDEX | ----- -------- / SQL: SELECT / Result Output: Results (Record (Field1, Field2 ..) Record (Field1 ...)) Document: A "unit" that needs to index "unit" by multiple fields Composition RECORD: Record, contain multiple fields
Field: Field Field: Field
Hits: Query result set, composed of matching documents: Query result set, consisting of multiple Record
Full text search ≠ Like "% keyword%"
Usually relatively thick books often attach keywords (such as: Beijing: 12, 34, Shanghai: 3, 77 ...), it helps readers find the page number of related content more quickly. And the database index can greatly improve the speed principle of the query, imagine how much the index looks higher than one page after the retrieval of the book ... and the index is high, and the other reason is that it is Needs. The core is a sorting problem for the search system.
Since the database index is not designed for full-text index, the database index does not work when using the LIKE "% keyword%". When using the Like query, the search process becomes a traversal process similar to a page. Therefore, for database services containing fuzzy queries, LIKE is extremely harmful to performance. If you need to make a plurality of keywords: Like "% keyword1%" and lot "% keyword2%" ... It is also conceivable.
Therefore, the key to establishing an efficient search system is to establish a reverse index mechanism similar to the scientific index, and sequentially store the data source (such as a plural article), there is another keyword list, which is used. Storage Keyword ==> Article mapping relationship, using such a mapping relationship index: [Key words ==> "post-name number, number of times (even the location: position: start offset, end offset), appear Frequency], the retrieval process is to turn the fuzzy query into multiple processes that can utilize the accurate query of the index. Thus, it greatly improves the efficiency of multi-keyword queries, so the full text search problem is finally a sorting problem.
It can be seen that the exact query of the fuzzy query relative database is a very uncertain problem, which is also a limited reason for the limited reasons for full-text retrieval. The most core of Lucene is a full-text index mechanism that is not good at traditional databases through a special index structure, and provides an extended interface to facilitate customization of different applications.
You can compare the fuzzy query of the database through a table:
Lucene Full-text Index Engine Database
Index Setting data from the data source through full-text indexing For the Like query, the data traditional index is not used. The data needs to be ambiguously recorded by a convenient record, and there is a plurality of orders of magnitude more than the indexed search speed. The matching effect is matched by the word element (TERM), and the support for non-English such as Chinese can be implemented through the implementation of the language analysis interface. Using: Like "% net%" will also match Netherlands, and the fuzzy matching of multiple keywords: use the LIKE "% COM% Net%": You cannot match the word step down XXX.NET..xxx.com
The match is a matching degree algorithm, and the result of the degree (similarity) is higher than the result. There is no control of the degree of matching: For example, 5 words appear in the record and 1 time, the result is the same.
The result output through a special algorithm, outputs the highest gathering head 100 results, and the result set is read by the buffer small batch. Returns all result sets that require a lot of memory to save these temporary result sets when matching entries (such as tens of thousands).
The customizable is achieved by different language analysis interfaces, which can be convenient to customize the index rules that meet the application (including support for Chinese) without interfaces or interfaces, and cannot be customized.
Conclusion High-load fuzzy query applications, the rules of the fuzzy query, the amount of data of the index is low, the fuzzy matching rules are simple or need to fuzzy query data
The full-text search and database application have the greatest difference in: the most relevant head 100 results meet more than 98% of users' needs
LUCENE's innovation: Most search (database) engines are used with B tree structures to maintain indexes, and the index update will result in a large number of IO operations, Lucene is in realization: it is slightly improved: not to maintain an index File, but continuously creates a new index file when the index is expanded, then merged these new small index files into the original index (for different update policies, the size of the batch can be adjusted), so Under the premise of affecting the efficiency of the retrieval, the efficiency of the index is improved. Comparison of Lucene and Some Full Search System / Applications: Lucene Other Open Source Full-text Retrieval System Increment Index and Batch Index can be an index (APPEND), which can be batch indexes for large amounts of data, and interface design for optimizing bulk indexes And small batch incremental indexes. Many systems only support batch index, and sometimes data sources have additional additions also need to rebuild indexes. The data source Lucene does not define a specific data source, but a structure of a document, so it can be very flexible to adapt to various applications (as long as there is a suitable converter with a suitable converter into a corresponding structure), many systems are only for web pages, lacks Flexibility of other format documents. Index Content Crappiness The documentation is made up of multiple fields, and even those fields need to be indexed. Those fields do not need indexes, and the fields of the near-step index are also divided into types that require word and no description: Index, such as: Title, article content field does not need to perform a word index, such as: Author / Date field lacks versatility, often uses the entire document throughout the language to achieve different expansion of language analyzer: it can be filtered off Word: An the of the et al, Western Siculial Analysis: Both Jumps Jumped Jumper into JUMP to index / retrieve non-English support: to Asian language, the index support for the Arab language lacks universal interface implementation query analysis through query analysis interface implementation, You can customize your own query syntax rules: such as: - AND OR relationships between multiple keywords and other concurrent access to multi-user use for the use of unparalleled words on Asian languages (Word Segment) For Chinese, full-text index First, we must solve a problem with language analysis. For English, the words in the statement are naturally separated by spaces, but the word in the China-Japanese Korean statement of Asian languages is one word, all, first, first put the statement If you are indexing in "Word", this word is dilated is a big problem. First, you must not use a single character as an index unit, otherwise check "Shanghai", you can't make "sea" also match. But in a word: "Beijing Tiananmen", how does the computer use the Chinese language habits? "Beijing Tiananmen" or "North Beijing Tiananmen"? Let the computer can separate according to language habits, often requiring a machine to have a relatively rich word library to be more accurate to identify words in statements. Another solution is to adopt automatic cutting algorithms: Separate words according to the 2 yuan syntax (BIGRAM), such as "Beijing Tiananmen" ==> "Beijing Jingtian Tian Anmen".
In this way, in the query, whether it is inquiry "Beijing" or query "Tiananmen", put the query phrase according to the same rules: "Beijing", "Tiananmen", multiple keywords, "and" "The relationship is combined, and it can also be mapped to the corresponding index. This approach is common for other Asian languages: Korean, Japanese is common. Based on automatic segmentation is that there is no maintenance cost, simple, and the disadvantage is that the index efficiency is low, but for small and medium-sized applications, 2-element syntax is still sufficient. Based on 2 yuan, the index is generally large and the source file is similar, and for English, index files generally only have 30% -40% of the original document, automatic segmentation table cutting implementation is very simple to implement complex queries to increase query analysis. The complexity of complexity, suitable for comparing complicated query comparison rules storage efficiency index redundant, index is almost as high as the original, 30% of the original size of 30%, no warranty, no warranty, no warranty cost, very high : Sino-Japanese and other languages need to be maintained separately. It also needs to include word frequency statistics, etc. Applicable domain embedded system: operating environmental resource limited distributed system: no word synchronization problem multi-language environment: no word maintenance costs for professional search engines with high query and storage efficiency requirements are currently larger The language analysis algorithm of the search engine is generally based on the combination of the above two mechanisms. About Chinese language analysis algorithm, you can find more information on Google Keyword "Wordsegment Search". Install and use Download: http://jakarta.apache.org/lucene/ Note: Some of Lucene's more complex lexical analysis is generated with Javacc (Javacc: JavacAcompilerCompiler, pure Java's lexical analysis generator), so if The source code is compiled or modified in the QueryParser, customizes its own lexical analyzer, and needs to download Javacc from http://www.experimentalstuff.com/technologies/javacc/.
The composition of Lucene: Index Module (INDEX) and retrieval module (Search) is the main external application inlet org.InDex / cable introduction port Org.Apache .Lucene.Alysis / language analyzer org.apache.lucene.QueryParser / Query Analyzer org.apache.lucene.document / Storage Structure Org.Apache.lucene.Store/ underlay IO / Storage Structure ORG.Apache.lucene.UTIL / Some common data structures Simply demonstrate the use of Lucene: Index Process: Read file names (multiple) from the command line, store 2 fields of the file division (PATH field) and content (Body field), And the content is full-text index: the unit of index is the Document object, each Document object contains multiple field Field objects, which can choose different index / storage field rules, lists for different field properties and data output requirements. As follows: Methods Check the word Field.Text (String Name, String Value) YES YES YES: Title, content field Field.Text (String name, reader value) YES YES NO Split word index Store, such as: META information, not to return display, but need to retrieve content field.Keyword (String name, string value) No Yes Yes unsuitable index and store, such as: Date field Field.unIndexed (String Name, String Value) NO no yes does not index, only store, such as: file path field.unStored (String name, string value) YES YES NO only full-text index, not store public class indexfiles {// How to use :: IndexFiles [index output directory] [ Indexed file list] ... public static void main (string [] args) throws exception {string indexpath = args [0]; indexwriter Writer; // with a specified language The analyzer constructs a new write-indexer (the third parameter indicates whether it is an additional index) Writer = new model (INDEXPATH, New SimpleAly (), FRSE); for (INT i = 1; i // Constructs the Document object containing 2 fields Field // one is the path path field, does not index, only store // one is a content body field, and store Document doc = new document (); doc.add Field.unIndexed ("path", args [i])); doc.add ("Body", (Reader) New InputStreamReader (IS))); // Write a document to index Writer.AddDocument (DOC Is.close ();}; // Close the write-indexer Writer.close ();}} You can see during the index: • Language analyzer provides an abstract interface, so language analysis (AnalySer) is customized Although Lucene default provides 2 comparison of generic analyzers SimpleAnaly, and StandardAnalyser, Chinese is not supported by default, so the sections of the Chinese language are joined, and these 2 analyzers need to be modified. · Lucene does not specify the format of the data source, but only one universal structure (Document object) is provided to accept the index input, so the input data source can be: Database, Word document, PDF document, HTML document ... Designing the corresponding parser converter to index the data source into a docuement object. · For large quantities of data index, you can also increase the efficiency of bulk index by adjusting the INDEXERWRITE's file merge frequency properties. Retrieval process and results show: The search results returned to the HITS object, which can access the contents in Document ==> Field via it. Suppose the full-text search is performed according to the Body field, print the PATH field of the query result and the corresponding query, PUBLIC CLASS Search {public static void main (String [] args) throws exception {string indexpath = args [0 ], queryString = args [1]; // Point to the index directory Searcher Searcher = new indexsearcher (indexpath); // Query parser: Using and index the same language analyzer query query = queryparser.parse (querystring, " Body ", new Simpleanalyzer ()); // Search results Using HITS storage hits hits = search (query); // can access data and query of the corresponding field via HITS FOR (INT i = 0; i Hacking Lucene Simplified Query Analyzer Personally LUCENE Become a Jakarta project, drawing too much time for debugging increasing queryParser, and most of them are not familiar, current LUCENE supporting grammar: Query :: = (CLAUSE) * CLAUSE :: = [" ", "-"] [ Here, you need to modify the Hitcollector process in indexsearcher: ... scorer.score (new hitcollector () {private float minscore = 0.0f; public final void collect (int Doc, float score) {if (Score> 0.0F && / / IGNORE ZEROED BUCKETS (Bits == Null || Bits.get (DOC)) {// Skip Docs Not in BitStotalhits [0] ; if (Score> = Minscore) {/ * Original: Lucene will do DOCID and corresponding Match SCORE Export Results In list: * HQ.PUT (New Scoredoc (SCORE)); // Update Hit Queue * If you use DOC or 1 / DOC instead of score, you will implement according to DOCID or inverse Row * Assuming that the data source index has been ranked in accordance with a field, and the results are implemented in accordance with the DOCID sorting. * Sort for a field, even more complex score and docid fit. * / HQ .put (New Scoredoc (DOC, (FLOAT) 1 / DOC); if (hq.size ()> ndocs) {// if hit queue overfullhq.pop (); // remove lowest in hit queueminscore = ((Scoredoc) ) hq.top ()). score; // reset minscore}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}, reader.maxdoc ()); More universal input / output interface Although Lucene does not define a determined input document format, more and more People think that use a standard intermediate format as the Lucene's data import interface, then other data, such as PDF, only need to convert to a standard intermediate format through the parser to standard, and can perform data index. This intermediate format is mainly XML, similar to achieving 4, 5: Data Source: Word PDF HTML DB Other / | | | / XML Intermediate Format | Lucene Index currently has no parser for MSWORD documents, because word documentation Unlike ASCII-based RTF documents, you need to use a COM object mechanism analysis. This is the relevant information on Google's check: http://www.intrinsyc.com/products/enterprise_Applications.asp The other method is to convert the Word document to text: http://www.winfield.demon.nl/index .html Indexing process Optimized index generally divided into two cases, one is a small batch index expansion, one is a large quantity index reconstruction. During the index process, it is not a write operation of the index file every time the new DOC is added to the index (file I / O is a very consumable resource). Lucene first indexed operation in memory, and writes files based on certain batch. The larger the spacing of this batch, the less the number of writes of the file, but there will be a lot of memory. Easely occupy less memory, but the file IO operation is frequent, the index speed will be very slow. There is a MERGE_FACTOR parameter in IndexWriter to help you make full use of memory reduction files based on the application environment after constructing the indexer. According to my experience: Default Indexer is written once every 20 record indexes, each increases Merge_Factor by 50 times, and the index speed can be increased by about 1 times. Search Procedure Optimization Lucene supports memory index: Such search is raised than an order of speed based on file-based I / O. Http://www.onjava.com/lpt/a/3273 is also necessary to minimize the creation of IndexSearcher and the caching of the front desk of the search results. Optimization of Lucene's full-text retrieval is that after the first index search, the document is not read, and the ID of the top 100 results in all the results (TOPDOCS) will result in the result. The cache is returned, here you can compare the database search: If it is a 10,000 database retrieval result set, the database is necessary to get all the recorded contents will then return to the application result set. So even if the total number of match matches, the memory space for Lucene does not use a lot of memory spaces. For general fuzzy search applications, there is no so much result, and the head 100 has been able to meet more than 90% retrieval requirements. If the first batch of cache results are used to read more results, Searcher retrieves and generates a cache for 1 times the number of last search caches, and re-rewriting. So if you construct a Searcher to find 1-120 results, Searcher is actually a 2-time search process: After the first 100 is taken, the cache result is used, and the Searcher retries and constructs a 200 result cache, and so on 400 buffers, 800 cahes. Since each Searcher object disappears, these cache also access to that, you may want to cache the results record, the number of cache is try to guarantee the following to take full use of the first result cache, do not let Lucene retrieved multiple times, and The result cache can be classified. Another feature of Lucene is to filter out the result of low matching degree during the process of collecting results. This is also the difference between the results of the search needs to return the results of the search. Some of my try: · Support Chinese tokenizer: There are 2 versions here, one is generated by javacc, pressing a character to a token index for the CJK section, and the other is rewritten from SimpleTokenizer, supporting the number and letters to English, The Chinese is indexed in the iterative index. · Index based on XML data source: XMLINDEXER, so all data sources can be indexed with XMLINDXER as long as they can convert into designated XMLs according to DTD.