Lucene inverted index principles (repost)

xiaoxiao 2021-03-05

Lucene is a high-performance full-text retrieval toolkit written in Java. It uses an inverted file index structure. The structure and the algorithm that builds it are as follows.

0) Suppose there are two articles, 1 and 2.

Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
Article 2: He once lived in Shanghai.

1) Because Lucene indexes and queries by keyword, we first have to extract the keywords of these two articles. This usually takes the following steps:

a. What we have is the article content, that is, a string. We first have to find all the words in that string, in other words tokenize it. English words are easy to handle because they are separated by spaces. Chinese words run together and need special word segmentation.
b. Words such as "in", "once" and "too" carry no real meaning of their own, and the same goes for Chinese function words such as "是". Words that do not represent a concept can be filtered out.
c. Users typically expect a search for "He" to also find articles containing "he" or "HE", so all words need to be converted to a single case.
d. Users typically expect a search for "live" to also find articles containing "lives" or "lived", so "lives" and "lived" need to be reduced to "live".
e. Punctuation usually does not represent a concept either, so it can also be filtered out.

In Lucene these steps are carried out by the Analyzer class. After this processing:

All keywords of article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2: [he] [live] [shanghai]

2) Once we have the keywords, we can build the inverted index. The mapping above goes from "article number" to "all keywords in the article". The inverted index reverses this relationship, so that it goes from "keyword" to "all articles that contain this keyword".
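The analysis steps a to e and the inversion in step 2 can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene's Analyzer: the class name AnalysisSketch, the three-word stop list and the two-entry stem table are hypothetical and cover only the two sample articles, and article 1 is assumed to read "Tom lives in Guangzhou, I live in Guangzhou too.", which is what its six-keyword list implies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative stand-in for the analysis steps; it does not use any Lucene class.
final class AnalysisSketch {

    // Assumed stop-word list: only the English words named in step b.
    private static final Set<String> STOP_WORDS = Set.of("in", "once", "too");

    // Toy stem table: only the inflections that occur in the two sample articles.
    private static final Map<String, String> STEMS = Map.of("lives", "live", "lived", "live");

    // Turns an English sentence into its list of keywords, in order of appearance.
    static List<String> analyze(String text) {
        List<String> keywords = new ArrayList<>();
        // Steps a and e: splitting on every non-letter run separates the words
        // and discards punctuation at the same time.
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces one empty token
            }
            String word = token.toLowerCase(Locale.ROOT); // step c: one case
            if (STOP_WORDS.contains(word)) {
                continue;                                 // step b: drop stop words
            }
            keywords.add(STEMS.getOrDefault(word, word)); // step d: reduce to base form
        }
        return keywords;
    }

    // Step 2: reverses "article number -> keywords" into "keyword -> article numbers".
    static Map<String, Set<Integer>> invert(Map<Integer, List<String>> keywordsByArticle) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> article : keywordsByArticle.entrySet()) {
            for (String keyword : article.getValue()) {
                index.computeIfAbsent(keyword, k -> new TreeSet<>()).add(article.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> keywordsByArticle = new TreeMap<>();
        keywordsByArticle.put(1, analyze("Tom lives in Guangzhou, I live in Guangzhou too."));
        keywordsByArticle.put(2, analyze("He once lived in Shanghai."));
        System.out.println(keywordsByArticle);
        System.out.println(invert(keywordsByArticle));
    }
}
```

With these inputs, analyze returns [tom, live, guangzhou, i, live, guangzhou] for article 1 and [he, live, shanghai] for article 2, and invert maps live to both article numbers. A real analyzer would use a proper stemmer and a language-specific tokenizer instead of the lookup tables assumed here.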
After inversion, articles 1 and 2 become:

Keyword      Article number
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1

Knowing only which articles a keyword appears in is usually not enough. We also need to know how many times the keyword appears in each article and where. Two kinds of position are commonly used: a) character position, which records the character offset at which the word starts in the article (the advantage is fast positioning when highlighting keywords); b) keyword position, which records the ordinal of the word among the article's keywords (the advantage is a smaller index and fast phrase queries). Lucene records the second kind.

After adding the "frequency" and "position" information, our index structure becomes:

Keyword      Article number [frequency]    Positions
guangzhou    1[2]                          3, 6
he           2[1]                          1
i            1[1]                          4
live         1[2], 2[1]                    2, 5, 2
shanghai     2[1]                          3
tom          1[1]                          1

Take the live row as an example. live appears twice in article 1 and once in article 2, and its positions are "2, 5, 2". What does that mean? The positions have to be read together with the article numbers and frequencies. live appears twice in article 1, so "2, 5" are its two positions in article 1. It appears once in article 2, so the remaining "2" says that live is the second keyword in article 2.

The above is the core part of the Lucene index structure.
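The frequency and position columns can be reproduced with a small positional index. The following is a minimal sketch in plain Java, assuming the keyword lists of the two articles are already available. PositionalIndexSketch, build and formatRow are hypothetical names, and the nested in-memory maps are an illustration only, not Lucene's actual storage format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative positional inverted index; it does not use any Lucene class.
final class PositionalIndexSketch {

    // keyword -> (article number -> 1-based keyword positions).
    // The frequency of a keyword in an article is the size of its position list.
    static Map<String, Map<Integer, List<Integer>>> build(Map<Integer, List<String>> keywordsByArticle) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> article : keywordsByArticle.entrySet()) {
            List<String> keywords = article.getValue();
            for (int i = 0; i < keywords.size(); i++) {
                index.computeIfAbsent(keywords.get(i), k -> new TreeMap<>())
                     .computeIfAbsent(article.getKey(), k -> new ArrayList<>())
                     .add(i + 1); // keyword position, counted from 1
            }
        }
        return index;
    }

    // Renders one row as "keyword article[frequency],... position,...".
    // Positions are flattened in article order, so frequencies are needed to split them again.
    static String formatRow(String keyword, Map<Integer, List<Integer>> postings) {
        List<String> articles = new ArrayList<>();
        List<String> positions = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> posting : postings.entrySet()) {
            articles.add(posting.getKey() + "[" + posting.getValue().size() + "]");
            for (int position : posting.getValue()) {
                positions.add(String.valueOf(position));
            }
        }
        return keyword + " " + String.join(",", articles) + " " + String.join(",", positions);
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> keywordsByArticle = new TreeMap<>();
        keywordsByArticle.put(1, List.of("tom", "live", "guangzhou", "i", "live", "guangzhou"));
        keywordsByArticle.put(2, List.of("he", "live", "shanghai"));
        build(keywordsByArticle).forEach((keyword, postings) ->
                System.out.println(formatRow(keyword, postings)));
    }
}
```

For the keyword live, formatRow yields "live 1[2],2[1] 2,5,2": the first two positions belong to article 1 because its frequency is 2, and the last one belongs to article 2.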

When reposting, please credit the original address: https://www.9cbs.com/read-33183.html
