Lucene is a high-performance full-text retrieval engine built on an inverted index. The data structure and the algorithm that generates it are as follows:
0) Suppose there are two articles, 1 and 2.
Article 1's content is: Tom lives in Guangzhou, I live in Guangzhou too.
Article 2's content is: He once lived in Shanghai.

1) Since Lucene indexes and queries by keyword, we first have to obtain the keywords of these two articles, which usually requires the following steps:
a. What we have is the article content, i.e. a string, so we must first find all the words in it (word segmentation). English words are easy to handle because they are separated by spaces. Chinese words run together and need special word-segmentation handling.
b. Words such as "in", "once" and "too" carry no real meaning, and some very common words in Chinese likewise have no specific meaning. Words that do not represent a concept can be filtered out.
c. Users typically expect a search for "He" to also find "he" and "HE", so all words need to be converted to a uniform case.
d. Users typically expect a search for "live" to also find "lives" and "lived", so "lives" and "lived" need to be reduced to "live".
e. Punctuation in the article usually does not represent a concept and can also be filtered out.
In Lucene, the steps above are carried out by the Analyzer class.
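Steps a to e can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's Analyzer: the `STOP_WORDS` set, the `STEMS` table and the `analyze` function are hypothetical names, sized just for the two example articles, and article 1 is assumed to read "Tom lives in Guangzhou, I live in Guangzhou too."

```python
import re

# Hypothetical stop-word list and stemming table, just large enough for the example.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def analyze(text):
    """Return the keywords of `text` in order of appearance."""
    # a. find the words; e. punctuation is dropped because only letter runs match
    words = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for word in words:
        word = word.lower()              # c. unify case
        if word in STOP_WORDS:           # b. drop words that carry no concept
            continue
        keywords.append(STEMS.get(word, word))  # d. reduce to the base form
    return keywords

article1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article2 = "He once lived in Shanghai."
print(analyze(article1))  # ['tom', 'live', 'guangzhou', 'i', 'live', 'guangzhou']
print(analyze(article2))  # ['he', 'live', 'shanghai']
```

A real analyzer would use a proper stemming algorithm and a much larger stop-word list. The lookup table here only keeps the sketch short.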
After this processing:
All keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2 are: [he] [live] [shanghai]

2) Once we have the keywords, we can build the inverted index. The correspondence above is "article number" to "all keywords in the article". The inverted index inverts this relationship into "keyword" to "all articles containing that keyword". After inversion, articles 1 and 2 become:

Keyword      Article number
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
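The inversion itself is a small amount of code. The sketch below assumes the keyword lists produced above. The function name `build_inverted_index` is made up for illustration and says nothing about how Lucene implements this internally.

```python
from collections import defaultdict

# Keywords per article, as produced by the analysis step.
docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_inverted_index(docs):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    # Sort keywords and article numbers so the result is deterministic.
    return {kw: sorted(ids) for kw, ids in sorted(index.items())}

inverted = build_inverted_index(docs)
for keyword, doc_ids in inverted.items():
    print(keyword, doc_ids)
```

Running it prints one line per keyword in character order, with "live" mapped to articles 1 and 2.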
Knowing only which articles a keyword appears in is usually not enough. We also need to know how many times and at which positions the keyword appears in each article. There are usually two kinds of position: a) character position, which records the character offset of the word in the article (the advantage is that the keyword can be located quickly when highlighting it); b) keyword position, which records that the word is the n-th keyword in the article (the advantages are a smaller index and fast phrase queries). Lucene records the latter.
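To make the difference between the two kinds of position concrete, the sketch below computes both for article 1, assumed to read "Tom lives in Guangzhou, I live in Guangzhou too." The `positions` function and `STOP_WORDS` set are hypothetical. Character offsets are counted from 0 and keyword positions from 1. As an assumption that matches this article's example, filtered words do not take up a keyword position. Stemming is left out to keep the sketch short.

```python
import re

STOP_WORDS = {"in", "once", "too"}  # hypothetical stop-word list

def positions(text):
    """Return (word, character offset, keyword position) for each keyword."""
    result = []
    keyword_pos = 0
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in STOP_WORDS:
            continue  # filtered words do not consume a keyword position
        keyword_pos += 1
        result.append((word, match.start(), keyword_pos))
    return result

rows = positions("Tom lives in Guangzhou, I live in Guangzhou too.")
for row in rows:
    print(row)
# The two occurrences of "guangzhou" come out as
# ('guangzhou', 13, 3) and ('guangzhou', 34, 6).
```

The keyword positions are small consecutive integers, which is why they are cheaper to store than character offsets.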
After adding the "frequency" and "position" information, the index structure becomes:

Keyword      Article number [frequency]    Positions
guangzhou    1[2]                          3, 6
he           2[1]                          1
i            1[1]                          4
live         1[2], 2[1]                    2, 5, 2
shanghai     2[1]                          3
tom          1[1]                          1
Take the "live" row as an example of how to read this structure. "live" appears twice in article 1 and once in article 2, so what does its position list "2, 5, 2" mean? It has to be read together with the article numbers and frequencies. "live" appears 2 times in article 1, so "2, 5" are its two positions in article 1. It appears once in article 2, so the remaining "2" means that "live" is the 2nd keyword in article 2.

The above is the core of the Lucene index structure. Note that the keywords are arranged in character order (Lucene does not use a B-tree structure), so Lucene can locate a keyword quickly with a binary search.

• In the implementation, Lucene saves the three columns above as a dictionary file (Term Dictionary), a frequency file (frequencies) and a position file (positions). The dictionary file stores not only each keyword but also pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.

• Lucene uses the concept of a Field to express where a piece of information is located (for example in the title or in the body). When the index is built, the field information is also recorded in the dictionary file: each keyword carries field information, because every keyword must belong to one or more fields.
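The flattened "frequency plus positions" layout can be reproduced with a short sketch. It assumes the keyword lists of the two example articles. `build_positional_index` and `decode_positions` are illustrative names, not Lucene APIs, and the in-memory tuples merely stand in for the frequency and position files.

```python
from collections import defaultdict

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_positional_index(docs):
    """Return {keyword: (frequencies, positions)} in the flattened layout.

    frequencies is a list of (article number, count) pairs; positions is one
    flat list holding the positions for each article back to back.
    """
    per_doc = defaultdict(lambda: defaultdict(list))
    for doc_id, keywords in docs.items():
        for pos, keyword in enumerate(keywords, start=1):
            per_doc[keyword][doc_id].append(pos)
    index = {}
    for keyword in sorted(per_doc):
        freqs, flat = [], []
        for doc_id in sorted(per_doc[keyword]):
            plist = per_doc[keyword][doc_id]
            freqs.append((doc_id, len(plist)))
            flat.extend(plist)
        index[keyword] = (freqs, flat)
    return index

def decode_positions(freqs, flat):
    """Split the flat position list back into {article number: positions}."""
    result, start = {}, 0
    for doc_id, count in freqs:
        result[doc_id] = flat[start:start + count]
        start += count
    return result

index = build_positional_index(docs)
freqs, flat = index["live"]
print(freqs, flat)                    # [(1, 2), (2, 1)] [2, 5, 2]
print(decode_positions(freqs, flat))  # {1: [2, 5], 2: [2]}
```

The decoder shows why the frequencies are needed: without the counts there would be no way to tell where one article's positions end and the next article's begin.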
• To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are compressed: each keyword is stored as <prefix length, suffix>. For example, if the current word is "Arabic" and the previous word is "Arab", then "Arabic" is compressed to <4, ic>. Second, numbers are compressed extensively: a number is stored only as its difference from the previous value (which shortens the number and so reduces the bytes needed to store it). For example, if the current article number is 16389 (3 bytes to store uncompressed) and the previous article number is 16382, then only 7 is stored after compression (a single byte).
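Both compression ideas are easy to try out. The sketch below is an illustration under stated assumptions, not Lucene's file format: the function names are hypothetical, the terms are lowercased, and the byte counts assume a variable-length integer encoding with 7 payload bits per byte, which reproduces the 3-byte and 1-byte figures above.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_compress(sorted_terms):
    """Encode each term as (prefix length shared with the previous term, suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        n = shared_prefix_len(previous, term)
        encoded.append((n, term[n:]))
        previous = term
    return encoded

def delta_encode(sorted_numbers):
    """Store each number as its difference from the previous one."""
    deltas, previous = [], 0
    for number in sorted_numbers:
        deltas.append(number - previous)
        previous = number
    return deltas

def varint_size(n):
    """Bytes needed when each byte carries 7 bits of payload (an assumption)."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

print(prefix_compress(["arab", "arabic"]))  # [(0, 'arab'), (4, 'ic')]
print(delta_encode([16382, 16389]))         # [16382, 7]
print(varint_size(16389), varint_size(7))   # 3 1
```

Both schemes depend on sorted input: neighbouring terms share long prefixes only when the dictionary is in character order, and the differences stay small only when the article numbers are ascending.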
Next, we can see why an index is worth building by looking at how it is queried.
Suppose we query the word "live". Lucene first does a binary search on the dictionary and finds the word, then follows the pointer into the frequency file to read all the article numbers, and returns the result. The dictionary is usually very small, so the whole process takes only milliseconds.
If instead an ordinary sequential matching algorithm were used, with no index built and string matching run over the content of every article, the process would be quite slow, and when the number of articles is large the time taken is often unbearable.
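The two query strategies can be put side by side in a small sketch. The parallel `terms` and `postings` lists are a simplified stand-in for the dictionary file and the frequency file, with the list index playing the role of the pointer. `search_index` and `search_scan` are hypothetical names, and article 1 is assumed to read "Tom lives in Guangzhou, I live in Guangzhou too."

```python
from bisect import bisect_left

# Sorted term dictionary; each entry points (by list index) at its postings.
terms = ["guangzhou", "he", "i", "live", "shanghai", "tom"]
postings = [[1], [2], [1], [1, 2], [2], [1]]

def search_index(keyword):
    """Binary-search the dictionary, then follow the pointer to the postings."""
    keyword = keyword.lower()
    i = bisect_left(terms, keyword)
    if i < len(terms) and terms[i] == keyword:
        return postings[i]
    return []

articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}

def search_scan(keyword):
    """Baseline without an index: substring-match every article in turn."""
    keyword = keyword.lower()
    return [doc_id for doc_id, text in articles.items() if keyword in text.lower()]

print(search_index("Live"))  # [1, 2]
print(search_scan("Live"))   # [1, 2]
```

Both calls return the same articles here, but the work done is very different: the indexed lookup touches a handful of dictionary entries, while the scan reads the full text of every article, so its cost grows with the total size of the collection. The scan also matches raw substrings rather than analyzed keywords, so in general the two approaches do not always return the same results.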