Lucene inverted the prototype

xiaoxiao2021-03-06 57

Lucene is a high-performance Java full-text retrieval toolkit, which uses an inverted file index structure. This structure and the corresponding generating algorithm are as follows:

0) Two articles 1 and 2

The content of the article is: Tom Lives in Guangzhou, I Live in Guangzhou Too.

The content of the article 2 is: He Once Lived in Shanghai.

1) Since Lucene is based on keyword index and inquiry, first we have to obtain the key words of these two articles, usually we need to do the following treatment measures

a. We now have the article content, namely a string, let's first find all the words in the string, that is, a word. English words are better treated due to space separation. The Chinese words are connected together to require special particle treatment.

b. "in" in the article, "Once" "TOO", there is no practical meaning, "" Yes "in Chinese is usually no specific meaning, these do not represent the concept of the concept can be filtered out

c. Users typically hope that "He" can also be found when "HE" is found, so all words need to be unified.

d. Users typically hope that "live", "lived" articles can also be found when "live" is also found, so "lives", "Lived" needs to be restored to "Live".

e. Point symbols in the article usually do not represent a concept, or filter out

The above measures in Lucene are completed by the Analyzer class.

After treatment

All key words of article 1 are: [Tom] [Live] [GuangzhouGzhou] [i] [live] [guangzhou]

All key words of article 2 are: [HE] [Live] [Shanghai]

2) After you have a keyword, we can build an inverted index. The above correspondence is: "Article number" on "all keywords in the article." The inverted index poured this relationship and became: "Keyword" for "all articles with this keyword". Articles 1, 2 turns into the chassis

Key words articles

Guangzhou 1

HE 2

i 1

Live 1, 2

Shanghai 2

Tom 1

Usually I only know which articles in the article are not enough, we also need to know the number of times and the position of the keyword in the article, usually there are two positions: a) character position, that is, record the word is the first few Character (advantage is that the keyword is brightly positioned); b) Keyword position, that is, record the word is the second keyword in the article (the advantage is saving index space, phrase), LUCENE recorded This is this location.

After adding "Frequency" and "appearance location", our index structure becomes:

Key words articles number [frequency] appearance position

Guangzhou 1 [2] 3, 6

HE 2 [1] 1

i 1 [1] 4

Live 1 [2], 2 [1] 2, 5, 2

Shanghai 2 [1] 3

Tom 1 [1] 1

With a live behavior example, we explain that the structure appears 2 times in article 1, and there is a position in the article 2, what is the appearance of "2, 5, 2"? We need to analyze the article number and the frequency of appearance, 2 times in the article 1, then "2, 5" said that Live has two locations that appear in article 1, and the article 2 has occurred once, and the remaining "2 "Just say that Live is the second keyword in the article 2. The above is the core part of the Lucene index structure. We noticed that the keyword is arranged in character order (Lucene does not use B tree structure), so Lucene can quickly locate keywords with binary search algorithms.

At the time of implementation, Lucene saves the above three columns as a dictionary, frequency file (Frequencies), location file (positions). The dictionary document not only saves each keyword, but also preserves a pointer to the frequency file and location file, and the frequency information and location information of the keyword can be found through the pointer.

The concept of Fieldne is used in Lucene, used to express information in location (such as the title, in the article), in the built index, the field information is also recorded in the dictionary file, each keyword has a field information. (Because each keyword must belong to one or more Field).

In order to reduce the size of the index file, Lucene also uses compression technology to the index. First, the keywords in the dictionary document are compressed. Key words compression are , for example: the current word is "Arabic", the previous word is "Arab", then "Arabic" compression is < 3, language>. Second, a large number of uses the compression of the numbers, the number only saves the difference with the previous value (this can reduce the length of the number, thereby reducing the number of bytes required to save the number). For example, the current article number is 16389 (no compression to save with 3 bytes), the previous article is 16382, saving 7 after compression (only one byte).

Below we can explain why you want to build an index by explaining the query of the index.

Suppose you want to query the word "Live", Lucene first look for dictionary, find the word, read all article numbers by pointing to the pointer to the frequency file, and then return the result. The dictionary is usually very small, and thus, the entire process is in milliseconds.

The algorithm is matched in a normal order, but not to build the index, but a string of content of all articles, this process will be quite slow, and when the number of articles is large, the time is often unbearable.

转载请注明原文地址:https://www.9cbs.com/read-75816.html

9cbs

New Post(0)