A Chinese word segmentation method for search engines
First, let's talk about why search engines segment text at all. In full-text retrieval, the content to be retrieved is split into shorter text sequences, and a lookup table (the index) is generated that maps each string to the text sequences that contain it. When a search statement is entered, it is segmented in the same way and compared against the index. This means that if the two are segmented differently, a document cannot be retrieved properly even though it contains the same text. There are two main segmentation methods: the word-parsing index and the text index. The word-parsing index splits text according to the smallest word units in a dictionary, that is, according to the meaning of words; ICTCLAS from the Chinese Academy of Sciences is an example. The text index does not consider the meaning of the words in the text and simply splits it into units of a fixed character length; Car Dong's binary segmentation method is an example.
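The mechanism described above can be sketched in a few lines of Python. This is only a minimal illustration, not how Lucene or ICTCLAS work internally; the function names `bigrams`, `build_index` and `search` and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict

def bigrams(text):
    """Text-index style segmentation: overlapping two-character units."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs, segment):
    """Map each term to the set of document ids whose text produced it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in segment(text):
            index[term].add(doc_id)
    return index

def search(index, query, segment):
    """Segment the query the same way and intersect the posting sets."""
    terms = segment(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
docs = {1: "中文分词方法", 2: "搜索引擎"}
index = build_index(docs, bigrams)
print(search(index, "分词", bigrams))
```

The same `segment` function is passed to both `build_index` and `search`, which is the condition the paragraph above insists on: index and query must be cut the same way for a match to be found.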
I compared these two methods in "The Impact of Different Chinese Word Segmentation Rules on the Lucene Index". The point I want to revisit here is a remark I made in that article: judging by intuition, the more terms there are in the index file, the faster results come back and the better the search performs.
That sentence left out one thing, namely the "quality" of the terms. For example, binary segmentation generates more terms than either single-character splitting or segmentation by word meaning, yet the search obviously does not get better. My opinion is that binary segmentation raises the hit rate for search keywords at the cost of lowering the relevance of the results.
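A small sketch may make the quality problem concrete. The example text and query are my own, and the word-level split is written by hand as an assumption; it is not the output of any real segmenter.

```python
def bigrams(text):
    """Split text into overlapping two-character terms."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc = "发展中国家"   # "developing countries"
query = "中国"       # "China"

bigram_terms = bigrams(doc)
# Hand-written word-level split, assumed for illustration only.
word_terms = ["发展中", "国家"]

bigram_hit = all(t in bigram_terms for t in bigrams(query))
word_hit = query in word_terms

print(bigram_terms)  # ['发展', '展中', '中国', '国家']
print(bigram_hit)    # True: the query hits, though the text is not about China
print(word_hit)      # False
```

The bigram index produces a hit for the keyword, but the hit is not relevant, which is the trade-off between hit rate and relevance described above.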
So, can segmenting according to the meaning of words solve the above problem? When we were discussing Grassland's development plan, Car Dong and LHELPER (the main developers of Weblucene) were very pointed on the question of whether it is fully suitable for full-text search.
Their conclusion may look like common sense, and it is the point I want to discuss here.
As mentioned above, a search engine can retrieve correctly only if the same segmentation is applied when the index is built and when the query is processed. This is exactly where segmentation by word meaning is prone to errors. When the search engine builds the index, it has the full text to work with and can match the words in the content against the dictionary through its algorithm (the accuracy of the ICTCLAS public beta is around 90%). When it segments the keywords entered by the user, on the one hand the text entered is short, and on the other hand users type only the keywords they have in mind (and that is before we even consider typos and things that are not words at all :-( ). This causes a difference between the segmentation at index time and the segmentation at query time.
The consequences of this difference are easy to imagine: the text index with binary segmentation may even turn out to be more effective.
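The mismatch can be reproduced with a toy example. The sketch below uses forward maximum matching against a tiny hand-made dictionary as a stand-in for a word-based segmenter; this is an assumption for illustration only and is far simpler than what ICTCLAS does. The names `fmm`, `bigrams` and `dictionary` are hypothetical.

```python
def fmm(text, dictionary, max_len=3):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    terms, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                terms.append(piece)
                i += size
                break
    return terms

def bigrams(text):
    """Split text into overlapping two-character terms."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Toy dictionary, assumed for this example.
dictionary = {"研究生", "研究", "生命", "起源"}

doc = "研究生命起源"   # "study the origin of life"
query = "生命起源"     # "origin of life"

doc_words = fmm(doc, dictionary)      # ['研究生', '命', '起源']
query_words = fmm(query, dictionary)  # ['生命', '起源']

word_match = set(query_words) <= set(doc_words)
bigram_match = set(bigrams(query)) <= set(bigrams(doc))

print(doc_words, query_words)
print(word_match)    # False: the text contains the query but is cut differently
print(bigram_match)  # True
```

Here the document and the query contain the same characters, yet the word-based cut at index time and the cut at query time disagree, so the query misses. The binary cut is the same on both sides and still matches.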
How do we choose, then? The following is a comment on the findings of a relevant organization:
The University of California, Berkeley, said in its report that "for the search statement, segmenting in units of words is more effective." As a report from a senior NTCIR team, it has high credibility. We can probably conclude that segmenting in units of words is the way to go. However, for the data being searched, text-index segmentation is sometimes used in order to prevent retrieval omissions.
The comment on Berkeley's report above can serve as the theoretical starting point for a Chinese segmentation method for search engines.
To summarize: segment by word meaning, and use the text-index segmentation method for the parts where segmentation by word meaning goes wrong. The purpose of segmentation in a search engine is not to cut out meaningful words, but to cut out keywords that can be found.
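One possible reading of this idea is sketched below. It is my own assumption about how the two methods could be combined, not a description of the Grassland module: at index time both dictionary words and bigrams are stored, and at query time dictionary words are used where they match while runs of unmatched characters fall back to bigrams. All function names and the toy dictionary are hypothetical.

```python
def bigrams(text):
    """Overlapping two-character terms; a lone character is kept as is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def fmm(text, dictionary, max_len=3):
    """Forward maximum matching with single-character fallback."""
    terms, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                terms.append(piece)
                i += size
                break
    return terms

def index_terms(text, dictionary):
    """Index both the word-level cut and the bigrams to avoid omissions."""
    return set(fmm(text, dictionary)) | set(bigrams(text))

def query_terms(text, dictionary):
    """Use dictionary words where they match; cut unmatched runs as bigrams."""
    terms, run = [], ""
    for piece in fmm(text, dictionary) + [None]:  # None flushes the last run
        if piece is not None and piece not in dictionary:
            run += piece  # unmatched single character, extend the current run
            continue
        terms.extend(bigrams(run))
        run = ""
        if piece is not None:
            terms.append(piece)
    return terms

# Toy dictionary, assumed for this example.
dictionary = {"研究生", "研究", "生命", "起源"}
indexed = index_terms("研究生命起源", dictionary)

print(query_terms("生命起源", dictionary))
print(set(query_terms("生命起源", dictionary)) <= indexed)
print(query_terms("生命科学", dictionary))
```

With this combination the query that missed under a purely word-based cut is found again, and a query containing an out-of-dictionary string is still cut into a searchable keyword. The price is a larger index, so this is a trade-off rather than a free improvement.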
The above is only a general idea, and there are many technical details still to consider.
I will flesh out the above ideas, modify the small horizontal methods, and put them into the Chinese word segmentation module of Grassland. Please follow Grassland's progress and join in.
Tian Chunfeng 20040108