Author: Cha Dong Email: chedongATbigfoot.com/chedongATchedong.com write to: 2002/08 Last updated: 06/07/2003 22:37:01 Disclaimer: can be reproduced, reprint, please be sure to indicate the form of hyperlinks to the original article reprint Source and author information and this statement http://www.chedong.com/tech/th/lucene.html Keywords: Lucene Java Full-text Search Engine Chinese Word Segment Summary: Lucene is a full-text index toolkit package based on Java. Introduction to the full-text index engine of Java: Realization of the Author and Lucene's History Search: LuENE Full-text Index and Database Index Comparison Chinese Cut Foundation Mechanism Introduction: Based on the Ratual and Automatic Word Algorithm for Specific installation and use : System Structure Introduction and Demos Hacking Lucene: Simplified Query Analyzer, Delete Implementation, Custom Sort, Application Interface Extension From Lucene We can learn what Java-based full-text index / retrieval engine - Lucenelucene is not a complete Full-text index application, but is a full-text index engine toolkit written by Java, which can be easily embedded in various applications to implement the full-text index / retrieval function for applications. Lucene author: Lucene contributor DougCutting is a senior full-text indexing / retrieval expert, once the V-Twin search engine (one of the achievements of Apple's Copland operating system) the main developer, after a senior system architecture design in Excite Teacher, currently engaged in research on the underlying architecture of Internet. He contributed to Lucene's goal is to join a full-text search function for a variety of small and medium-sized applications. LUCENE's development process: I have published it in the author's own www.lucene.com, and later released in Sourceforge, became a subproject of the Apache Foundation Jakarta at the end of 2001: http://jakarta.apache.org/lucene/ There are many The Java project use Lucene as its full-text index engine, more famous: JIVE: Web Forum system; Eyebrows: Mail list HTML archive / browse / query system, this article's main reference document "Thelucene Search Engine: Powerful, flexible And Free "author is one of the main developers of the Eyebrows system, and Eyebrows has become the primary mailing list archive system of the current Apache project. COCOON: Based on the XML-based Web Publishing Framework, the full-text search section uses Lucene Eclipse: Java-based open development platform, the full-text index of the help part uses Lucene for Chinese users, the most concerned question is whether it supports Chinese full-text search . However, through the introduction of Lucene's structure, you will learn that due to the LUCENE's good architecture design, the support of Chinese can implement support for Chinese retrieval simultaneously. The full-text search implementation mechanism Lucene's API interface design is more common, and the input and output structure is very like the table ==> record ==> field of the database, so many traditional applications, databases, etc. can be more convenient to map to Lucene Storage structure / interface. Overall: You can first treat Lucene as a database system that supports full-text indexes.
Compare Lucene and Database: Lucene Database index data source: DOC (Field1, Field2 ...) DOC (Field1, Field2 ...) / indexer / _____________ | Lucene INDEX | ------------ - / searcher / result Output: Hits (DOC (Field1, Field2) DOC (Field1 ...) Index Data Source: Record (Field1, Field2 ...) Record (Field1 ..) / SQL: Insert / _____________ | DB Index | ------------- / SQL: SELECT / Result Output: Results (Record (Field1, Field2 ..) Record (Field1 ...) Document: A need to index Unit "One Document consists of multiple fields, containing multiple fields, Field: Field Field: Field Hits: Query results set, composed of matching documents RECORDSET: Query result set, composed of multiple Record full text search kind" % keyword% "Usually with a relatively thick books often attach keywords (such as: Beijing: 12, 34, Shanghai: 3,77 ...), it helps readers find the page number of related content more quickly. And the database index can greatly improve the speed principle of the query, imagine how much the index looks higher than one page after the retrieval of the book ... and the index is high, and the other reason is that it is Needs. The core is a sorting problem for the search system. Since the database index is not designed for full-text index, the database index does not work when using the LIKE "% keyword%". When using the Like query, the search process becomes a traversal process similar to a page. Therefore, for database services containing fuzzy queries, LIKE is extremely harmful to performance. If you need to make a plurality of keywords: Like "% keyword1%" and lot "% keyword2%" ... It is also conceivable. Therefore, the key to establishing an efficient search system is to establish a reverse index mechanism similar to the scientific index, and sequentially store the data source (such as a plural article), there is another keyword list, which is used. Storage Keyword ==> Article mapping relationship, using such a mapping relationship index: [Key words ==> "post-name number, number of times (even the location: position: start offset, end offset), appear Frequency], the retrieval process is to turn the fuzzy query into multiple processes that can utilize the accurate query of the index. Thus, it greatly improves the efficiency of multi-keyword queries, so the full text search problem is finally a sorting problem. It can be seen that the exact query of the fuzzy query relative database is a very uncertain problem, which is also a limited reason for the limited reasons for full-text retrieval. The most core of Lucene is a full-text index mechanism that is not good at traditional databases through a special index structure, and provides an extended interface to facilitate customization of different applications. You can compare the fuzzy query of the database through a table: Lucene Full-text Index Engine Database Index Sets the data in the data source to establish a reverse index by a full-text index For the Like query, the data traditional index is worth nothing.
The data needs to be ambiguously recorded by a convenient record, and there is a plurality of orders of magnitude more than the indexed search speed. The matching effect is matched by the word element (TERM), and the support for non-English such as Chinese can be implemented through the implementation of the language analysis interface. Using: Like "% Net%" will also match Netherlands, and the fuzzy matching of multiple keywords: use the LIKE "% COM% Net%": You cannot match the word step down xxx.net..xx.com matching degree The matching degree algorithm is ranked in front of the result of the degree (similarity). There is no control of the degree of matching: For example, 5 words appear in the record and 1 time, the result is the same. The result output through a special algorithm, outputs the highest gathering head 100 results, and the result set is read by the buffer small batch. Returns all result sets that require a lot of memory to save these temporary result sets when matching entries (such as tens of thousands). Deliveability is implemented through different language analysis interfaces, which can be convenient to customize the index rules that meet the application needs (including support for Chinese) without interface or interface complex, unable to customize the fuzzy query application of high loads, which requires the responsible blur query. The rules, the amount of the index is relatively large, the fuzzy matching rules are simple or need to be fuzzy query, the amount of data, the maximum difference between the database application is: let the most relevant head 100 results to meet more than 98% of the user's needs LUCENE's innovation: Most search (database) engines are used with B tree structures to maintain indexes, and the index update will result in a large number of IO operations, Lucene is in realization: it is slightly improved: not to maintain an index File, but continuously creates a new index file when the index is expanded, then merged these new small index files into the original index (for different update policies, the size of the batch can be adjusted), so Under the premise of affecting the efficiency of the retrieval, the efficiency of the index is improved. Comparison of Lucene and Some Full Search System / Applications: Lucene Other Open Source Full-text Retrieval System Increment Index and Batch Index can be an index (APPEND), which can be batch indexes for large amounts of data, and interface design for optimizing bulk indexes And small batch incremental indexes. Many systems only support batch index, and sometimes data sources have additional additions also need to rebuild indexes. The data source Lucene does not define a specific data source, but a structure of a document, so it can be very flexible to adapt to various applications (as long as there is a suitable converter with a suitable converter into a corresponding structure), many systems are only for web pages, lacks Flexibility of other format documents.
Index Content Crappiness The documentation is made up of multiple fields, and even those fields need to be indexed. Those fields do not need indexes, and the fields of the near-step index are also divided into types that require word and no description: Index, such as: Title, article content field does not need to perform a word index, such as: Author / Date field lacks versatility, often uses the entire document throughout the language to achieve different expansion of language analyzer: it can be filtered off Word: An the of the et al, Western Siculial Analysis: Both Jumps Jumped Jumper into JUMP to index / retrieve non-English support: to Asian language, the index support for the Arab language lacks universal interface implementation query analysis through query analysis interface implementation, You can customize your own query syntax rules: such as: - AND OR relationships between multiple keywords and other concurrent access to multi-user use for the use of unparalleled words on Asian languages (Word Segment) For Chinese, full-text index First, we must solve a problem with language analysis. For English, the words in the statement are naturally separated by spaces, but the word in the China-Japanese Korean statement of Asian languages is one word, all, first, first put the statement If you are indexing in "Word", this word is dilated is a big problem. First, you must not use a single character as an index unit, otherwise check "Shanghai", you can't make "sea" also match. But in a word: "Beijing Tiananmen", how does the computer use the Chinese language habits? "Beijing Tiananmen" or "North Beijing Tiananmen"? Let the computer can separate according to language habits, often requiring a machine to have a relatively rich word library to be more accurate to identify words in statements. Another solution is to adopt automatic cutting algorithms: Separate words according to the 2 yuan syntax (BIGRAM), such as "Beijing Tiananmen" ==> "Beijing Jingtian Tian Anmen". In this way, in the query, whether it is inquiry "Beijing" or query "Tiananmen", put the query phrase according to the same rules: "Beijing", "Tiananmen", multiple keywords, "and" "The relationship is combined, and it can also be mapped to the corresponding index. This approach is common for other Asian languages: Korean, Japanese is common. Based on automatic segmentation is that there is no maintenance cost, simple, and the disadvantage is that the index efficiency is low, but for small and medium-sized applications, 2-element syntax is still sufficient. Based on 2 yuan, the index is generally large and the source file is similar, and for English, index files generally only have 30% -40% of the original document, automatic segmentation table cutting implementation is very simple to implement complex queries to increase query analysis. The complexity of complexity, suitable for comparing complicated query comparison rules storage efficiency index redundant, index is almost as high as the original, 30% of the original size of 30%, no warranty, no warranty, no warranty cost, very high : Sino-Japanese and other languages need to be maintained separately. It also needs to include word frequency statistics, etc. Applicable domain embedded system: operating environmental resource limited distributed system: no word synchronization problem multi-language environment: no word maintenance costs for professional search engines with high query and storage efficiency requirements are currently larger The language analysis algorithm of the search engine is generally based on the combination of the above two mechanisms. About Chinese language analysis algorithm, you can find more information on Google Keyword "Wordsegment Search".
Install and use Download: http://jakarta.apache.org/lucene/ Note: Some of Lucene's more complex lexical analysis is generated with Javacc (Javacc: JavacAcompilerCompiler, pure Java's lexical analysis generator), so if The source code is compiled or needs to modify QueryParser, customize your own lexical analyzer, and you need to download Javacc from http://www.experimentalstuff.com/technologies/javacc/.
The composition of Lucene: Index Module (INDEX) and retrieval module (Search) is the main external application inlet org.InDex / cable introduction port Org.Apache .Lucene.Alysis / language analyzer org.apache.lucene.QueryParser / Query Analyzer org.apache.lucene.document / Storage Structure Org.Apache.lucene.Store/ underlay IO / Storage Structure ORG.Apache.lucene.UTIL / Some common data structures Simply demonstrate the use of Lucene: Index Process: Read file names (multiple) from the command line, store 2 fields of the file division (PATH field) and content (Body field), And the content is full-text index: the unit of index is the Document object, each Document object contains multiple field Field objects, which can choose different index / storage field rules, lists for different field properties and data output requirements. As follows: Methods Check the word Field.Text (String Name, String Value) YES YES YES: Title, content field Field.Text (String name, reader value) YES YES NO Split word index Store, such as: META information, not to return display, but need to retrieve content field.Keyword (String name, string value) No Yes Yes unsuitable index and store, such as: Date field Field.unIndexed (String Name, String Value) NO no yes does not index, only store, such as: file path field.unStored (String name, string value) YES YES NO only full-text index, not store public class indexfiles {// How to use :: IndexFiles [index output directory] [ Indexed file list] ... public static void main (string [] args) throws exception {string indexpath = args [0]; indexwriter Writer; // Construct a new write-indexer with the specified language parser (third parameter represents whether to add index) Writer = New Simpleanaly (), false; for (int i = 1; I // Write the document to index Writer.addDocument (); is.close ();}; // Close the write-indexer Writer.close ();}} The index process can be seen: Language analyzer provides abstract Interface, the language analysis is customized, although Lucene default provides two comparison universal analyzers SimpleAnaly, and Standardanalyser, these 2 analyzers are not supported by Chinese, so they want to join the Chinese language. The rules need to modify these 2 analyzers. Lucene does not specify the format of the data source, but only one universal structure (Document object) is provided to accept the index input, so the input data source can be: database, Word document, PDF document, HTML document ... as long as it is designed The corresponding parsing converter can index the data source to a docuement object. For large quantities of data index, it is also possible to increase the efficiency of bulk index by adjusting the INDEXERWRITE's file merge frequency properties. Retrieval process and results show: The search results returned to the HITS object, which can access the contents in Document ==> Field via it. Suppose the full-text search is performed according to the Body field, print the PATH field of the query result and the corresponding query, PUBLIC CLASS Search {public static void main (String [] args) throws exception {string indexpath = args [0 ], queryString = args [1]; // Point to the index directory Searcher Searcher = new indexsearcher (indexpath); // Query parser: Using and index the same language analyzer query query = queryparser.parse (querystring, " Body ", new Simpleanalyzer ()); // Search results Using HITS storage hits hits = search (query); // can access data and query of the corresponding field via HITS FOR (INT i = 0; i Hacking Lucene Simplified Query Analyzer Personally LUCENE Become a Jakarta project, drawing too much time for debugging increasing queryParser, and most of them are not familiar, current LUCENE supporting grammar: Query :: = (CLAUSE) * CLAUSE :: = [" ", "-"] [ Here, you need to modify the Hitcollector process in indexsearcher: ... scorer.score (new hitcollector () {private float minscore = 0.0f; public final void collect (int Doc, float score) {if (Score> 0.0F && / / IGNORE ZEROED BUCKETS (Bits == Null || Bits.get (Doc)) {// Skip Docs Not in Bits Totalhits [0] ; if (Score> = Minscore) {/ * Original: Lucene Docid and The corresponding matching degree score is in the list of results hits: * HQ.PUT (New Scoredoc (SCORE)); // Update Hit Queue * If you use DOC or 1 / DOC instead of score, you will implement it according to DOCID or Reverse row * assumes that the data source index has been placed in accordance with a field, and the results are implemented in accordance with the Docid sorting. * Sort for a field can even achieve more complex Score and Docid fit. * / HQ.PUT (New Scoredoc (Float) 1 / DOC); if (HQ.Size ()> ndocs) {// if Hit Queue overfull HQ.POP (); // Remove Lowest in Hit Queue Minscore = (Scoredoc) HQ.TOP ()). Score; // reset minscore}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}, reader.maxdoc ()); More universal input / output interface Although Lucene does not define a determined input document format, but getting The more people think of using a standard intermediate format as Lucene's data import interface, then other data, such as PDF, only need to convert to standard intermediate formats through the parser to standard. This intermediate format is mainly XML, similar to achieving 4, 5: Data Source: Word PDF HTML DB Other / | | | / XML Intermediate Format | Lucene Index currently has no parser for MSWORD documents, because word documentation Unlike ASCII-based RTF documents, you need to use a COM object mechanism analysis. This is the relevant information on Google's check: http://www.intrinsyc.com/products/enterprise_Applications.asp The other method is to convert the Word document to text: http://www.winfield.demon.nl/index .html index process Optimized index generally divided into two cases, one is a small batch index extension, one is a large quantity index reconstruction. During the index process, it is not a write operation of the index file every time the new DOC is added to the index (file I / O is a very consumable resource). Lucene first indexed operation in memory, and writes files based on certain batch. The larger the spacing of this batch, the less the number of writes of the file, but there will be a lot of memory. Easely occupy less memory, but the file IO operation is frequent, the index speed will be very slow. There is a MERGE_FACTOR parameter in IndexWriter to help you make full use of memory reduction files based on the application environment after constructing the indexer. According to my experience: Default Indexer is written once every 20 record indexes, each increases Merge_Factor by 50 times, and the index speed can be increased by about 1 times. Search Procedure Optimization Lucene supports memory index: Such search is raised than an order of speed based on file-based I / O. Http://www.onjava.com/lpt/a/3273 is also necessary to minimize the creation of IndexSearcher and the caching of the front desk of the search results. Optimization of Lucene's full-text retrieval is that after the first index search, the document is not read, and the ID of the top 100 results in all the results (TOPDOCS) will result in the result. The cache is returned, here you can compare the database search: If it is a 10,000 database retrieval result set, the database is necessary to get all the recorded contents will then return to the application result set. So even if the total number of match matches, the memory space for Lucene does not use a lot of memory spaces. For general fuzzy search applications, there is no so much result, and the head 100 has been able to meet more than 90% retrieval requirements. If the first batch of cache results are used to read more results, Searcher retrieves and generates a cache for 1 times the number of last search caches, and re-rewriting. So if you construct a Searcher to find 1-120 results, Searcher is actually a 2-time search process: After the first 100 is taken, the cache result is used, and the Searcher retries and constructs a 200 result cache, and so on 400 buffers, 800 cahes. Since each Searcher object disappears, these cache also access to that, you may want to cache the results record, the number of cache is try to guarantee the following to take full use of the first result cache, do not let Lucene retrieved multiple times, and The result cache can be classified. Another feature of Lucene is to filter out the result of low matching degree during the process of collecting results. This is also the difference between the results of the search needs to return the results of the search. Some of my try: Support Chinese tokenizer: Here is 2 versions, one is generated by javacc, pressing a character to a token index for the CJK section, and the other is rewritten from SimpleTokenizer, supporting the number and letters token, pair Chinese is indexed by iterative. The indexer based on the XML data source: XMLINDEXER, so all data sources can be indexed with XMLINDXER as long as the DTD is converted to the specified XML. Sort by a field: Search for the results of the record index sequence: IndexOrDersearcher, so if you need to let the search results are sorted according to a field, you can let the data are sequential (such as: Pricefield), so index After that, the result is equivalent to the result of the field sorting by using this sequence of recorded ID. Learn more Lune from Lucene is indeed a model of object design. All issues are convenient and reused by an additional abstraction layer: You can reach your own purpose by re-implementing your own purpose, but not other modules Need; Simple Application Inlet Searcher, Indexer, and call the underlying component collaborative completion search task; all objects tasks are very special: such as search procedures: QueryParser analysis converts query statements into a series of precise queries ( Query, read the index reader by the underlying index reading structure indexreader, and score / sort the search results with the corresponding scorer.