The Effect of Different Chinese Word Segmentation Rules on the Lucene Index

xiaoxiao2021-03-06 21

Reposted from: Tian Chunfeng

In full-text indexing, building an inverted index requires splitting the sentences of each document into tokens; for details, see Che Dong's introduction.

As of the Lucene 1.3 release, indexing Chinese text is supported. Its default segmentation rule splits the text one character at a time; see the example below.

This post compares the effects of the following three Chinese segmentation methods on the Lucene index.

First: the default single-character segmentation;

Second: binary (two-character) segmentation (see Che Dong's article);

Third: word-based segmentation (using Xiao Dingdong's reverse maximum matching method).
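The three methods listed above can be sketched in a few lines of Python. This is only an illustration of the ideas, not the code of the Lucene analyzers compared in this post: the function names, the toy dictionary, the `max_len` limit, and the choice to fall back to a single character when no dictionary word matches are all assumptions made for the example.

```python
import re

# Runs of CJK ideographs; punctuation and spaces act as separators.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def single_char_tokens(text):
    """Single-character segmentation: every character is a token."""
    return [ch for run in CJK_RUN.findall(text) for ch in run]


def bigram_tokens(text):
    """Binary segmentation: overlapping two-character tokens within each run."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def reverse_max_match_tokens(text, dictionary, max_len=4):
    """Word-based segmentation: scan each run from right to left and
    take the longest dictionary word that ends at the current position.
    Assumption: an unmatched character becomes a one-character token."""
    tokens = []
    for run in CJK_RUN.findall(text):
        run_tokens = []
        end = len(run)
        while end > 0:
            for size in range(min(max_len, end), 0, -1):
                candidate = run[end - size:end]
                if size == 1 or candidate in dictionary:
                    run_tokens.append(candidate)
                    end -= size
                    break
        run_tokens.reverse()  # tokens were collected right to left
        tokens.extend(run_tokens)
    return tokens


# Hypothetical toy dictionary, far smaller than a real one.
toy_dictionary = {"搜索", "引擎", "使命"}
sample = "搜索引擎的使命"

print(single_char_tokens(sample))   # ['搜', '索', '引', '擎', '的', '使', '命']
print(bigram_tokens(sample))        # ['搜索', '索引', '引擎', '擎的', '的使', '使命']
print(reverse_max_match_tokens(sample, toy_dictionary))  # ['搜索', '引擎', '的', '使命']
```

Note that the quality of the word-based result depends entirely on the dictionary, while the first two methods need no dictionary at all.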

The effects of the three sections above are as follows:

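The original sentence: "搜索引擎的发展历史证明，没有做不到只有想不到，让人们更方便准确的获取信息是搜索引擎的使命。" (Roughly: "The history of search engines shows that nothing is impossible, only unimagined; helping people get information more conveniently and accurately is the mission of a search engine.")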

Lucene default segmentation results:

org.apache.lucene.analysis.standard.StandardAnalyzer:

[搜] [索] [引] [擎] [的] [发] [展] [历] [史] [证] [明] [没] [有] [做] [不] [到] [只] [有] [想] [不] [到] [让] [人] [们] [更] [方] [便] [准] [确] [的] [获] [取] [信] [息] [是] [搜] [索] [引] [擎] [的] [使] [命]

Binary segmentation results:

org.apache.lucene.demo.CJKAnalyzer:

[搜索] [索引] [引擎] [擎的] [的发] [发展] [展历] [历史] [史证] [证明] [没有] [有做] [做不] [不到] [到只] [只有] [有想] [想不] [不到] [让人] [人们] [们更] [更方] [方便] [便准] [准确] [确的] [的获] [获取] [取信] [信息] [息是] [是搜] [搜索] [索引] [引擎] [擎的] [的使] [使命]

Word-based (reverse maximum matching) segmentation results:

org.apache.lucene.demo.ChineseAnalyzer:

[搜索] [引擎] [的] [发展] [历史] [证明] [有] [做] [不到] [只有] [想] [不到] [人们] [更] [方便] [准确] [的] [获取] [信息] [是] [搜索] [引擎] [的] [使命]

In a Lucene index, the smallest indexing unit is the token. Roughly speaking, a token is a word in English; in Chinese, it is each bracketed [] item in the segmentation results above.
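To make the role of the token concrete, here is a minimal sketch of an inverted index that maps each token to the set of documents containing it. It is a toy in-memory structure, not Lucene's index format; `build_inverted_index` and the sample documents are hypothetical names and data chosen for the illustration.

```python
from collections import defaultdict


def build_inverted_index(docs, tokenize):
    """Map each token to the set of document ids that contain it.

    docs: dict of doc_id -> text
    tokenize: function that turns a text into a list of tokens
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# English: a token is a whitespace-separated word.
english_docs = {1: "search engine", 2: "search history"}
english_index = build_inverted_index(english_docs, str.split)

# Chinese with single-character segmentation: a token is one character.
chinese_docs = {1: "搜索引擎", 2: "搜索历史"}
chinese_index = build_inverted_index(chinese_docs, list)

print(sorted(english_index["search"]))  # [1, 2]
print(sorted(chinese_index["擎"]))      # [1]
```

Whatever the segmentation rule produces is what can later be looked up: a query term finds a document only if the same token was generated at indexing time.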

My test data: today's news from major websites and blogs, 212K of text documents in total, covering economics, politics, education, entertainment, and technology.

The statistics after Lucene generated the index are as follows:

Single-character segmentation:

[Figure: top 15 terms, single-character segmentation]

Word-based segmentation:

[Figure: top 15 terms, word-based segmentation]

The comparison shows that single-character segmentation produces fewer terms than word-based segmentation. The reason is obvious: only a little over 4,000 Chinese characters are in common use, so single-character segmentation yields roughly that many terms. Word-based segmentation is different: the dictionary I used for it contains more than 40,000 words.
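The vocabulary argument can be checked on a toy string. The sketch below counts distinct terms under single-character segmentation and under two-character segmentation; the latter is used here only as a dictionary-free stand-in for multi-character terms, which is an assumption of the example and not the word-based method measured in this post. The function names and the sample text are hypothetical.

```python
def single_chars(text):
    """Single-character segmentation of a string with no punctuation."""
    return list(text)


def bigrams(text):
    """Overlapping two-character terms of a string with no punctuation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def distinct_term_count(text, tokenize):
    """Number of different terms the tokenizer produces for the text."""
    return len(set(tokenize(text)))


text = "中国人民国民人中"

print(distinct_term_count(text, single_chars))  # 4
print(distinct_term_count(text, bigrams))       # 7
```

With n distinct characters there can never be more than n single-character terms, but there can be up to n * n two-character terms, so the term count of multi-character segmentation grows much further as the corpus grows.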

Intuitively, the more terms the index file contains, the faster the search and the more accurate the results.

Another interesting observation is how the index file size changes.

When my test data was about 80K in size, the index files generated by the two methods differed little in size. Once the data exceeded 100K, however, the single-character index file was already more than 30K larger than the word-based index file. Since I do not yet understand the index file format, I can only guess why this happens: because single-character segmentation produces fewer terms, more link information points to each term (and a search on that term also returns more results); the reverse holds for word-based segmentation.

The test data above was not filtered for common Chinese characters. Common characters such as 的 and 是 contribute nothing to search.
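The guess about link information can be explored with a toy postings structure. The sketch below is not Lucene's file format, and it says nothing certain about the real cause of the size difference; it only shows that, on the same text, fewer terms means more postings per term, and how a stop list removes common characters. The function names, the two sample documents, and the hand-segmented word version of them are assumptions made for the example.

```python
from collections import defaultdict

# Common characters mentioned in the text, used as a stop list.
STOP_TOKENS = frozenset({"的", "是"})


def build_postings(docs, tokenize, stop_tokens=frozenset()):
    """Map each term to a list of (doc_id, position) entries."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token not in stop_tokens:
                postings[token].append((doc_id, position))
    return postings


def average_postings_per_term(postings):
    """Average number of postings entries that point to one term."""
    return sum(len(entries) for entries in postings.values()) / len(postings)


# The same two documents, once as raw text (split per character) and
# once segmented by hand into words (split on spaces).
char_docs = {1: "搜索的引擎", 2: "搜集的索引"}
word_docs = {1: "搜索 的 引擎", 2: "搜集 的 索引"}

char_postings = build_postings(char_docs, list)
word_postings = build_postings(word_docs, str.split)
print(average_postings_per_term(char_postings) > average_postings_per_term(word_postings))  # True

# With the stop list applied, 的 no longer appears in either index.
char_filtered = build_postings(char_docs, list, STOP_TOKENS)
word_filtered = build_postings(word_docs, str.split, STOP_TOKENS)
print(average_postings_per_term(char_filtered))  # 1.6
print(average_postings_per_term(word_filtered))  # 1.0
```

Filtering the stop list before indexing shortens exactly the longest postings lists, since the most common characters are the ones that occur in nearly every document.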

When reposting, please cite the original address: https://www.9cbs.com/read-45484.html
