Reprinted:
The author has long worked on automatic Chinese word segmentation. One of the original motivations was the simple belief that this research would help Chinese search engines on the WWW. Yet it is often hard to reach satisfactory segmentation accuracy in an open environment, and only recently did I come to appreciate this problem first-hand. Here I set down a few thoughts in the hope of prompting better ones from others.
"Interesting" experience with Chinese search engines
Let me begin with an "interesting" experience of my own. One day I wanted to find information about the Japanese kimono on the WWW. I opened the Yahoo China search engine (http://cn.yahoo.com/) and naturally entered "和服" (kimono) as the query.
The search results were completely unexpected: 255 "related websites" were found, but they included sites that had nothing to do with kimono, such as "China Talent Hotline GB - Provide information and service in recruitment and job hunting", where "和服" matches only as part of "和服务" ("and service"). Checking all 255 websites one by one was more than I could bear, so I started a new search (independent of the previous search results; the same applies below) with "和服" and "Japan", hoping to narrow the range. This time I found only one website actually related to kimono: "Ningbo Jiangdong Star Silk Belt Factory GB - engaged in the embroidery and manufacturing of Japanese kimono and belts".
The author could not believe that a site as large as Yahoo China would yield only this, so I tried "和服" and "clothing". This time 45 websites came back, but the only relevant one was still "Ningbo Jiangdong Star Silk Belt Factory", a retrieval accuracy of 1/45. I was puzzled: was I really sitting on a mountain of treasure and coming away empty-handed? Then a good phrase popped into my mind, "Japanese style". I typed "和服" and "Japanese style" and finally dug out a lot of "treasures": 1140 web pages were returned (I do not know why; I chose "related sites" and did exactly the same as before, but what came back was "related web pages"), and plenty of them were about kimono, such as "kimono culture" and "... comparison of fiber products market ...". Having finally succeeded, I felt relieved.
On reflection, though, the matter is not so simple. If I had not thought of "Japanese style", how many other words would I have had to try? How many relevant pages are there that I still know nothing about? These questions are not easy to answer. Searching seems to be an "art", not a "technology".
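The false hit in this story can be reproduced in a few lines. The sketch below is only an illustration: the two document strings and their word segmentations are made up by hand (they are not the text of the actual pages, and no real segmenter is called). It contrasts matching the query as a raw character string with matching it against whole segmented words.

```python
# Illustrative strings only; not the actual text of the pages mentioned above.
QUERY = "和服"  # "kimono"

# Each document: raw text plus a hand-made (assumed) word segmentation.
DOCS = {
    "job_site": ("提供信息和服务", ["提供", "信息", "和", "服务"]),
    "kimono_page": ("日本和服文化", ["日本", "和服", "文化"]),
}


def substring_hits(query, docs):
    """Match the query as a raw character string (no word segmentation)."""
    return [name for name, (text, _) in docs.items() if query in text]


def word_hits(query, docs):
    """Match the query only against whole segmented words."""
    return [name for name, (_, words) in docs.items() if query in words]


print(substring_hits(QUERY, DOCS))  # both documents match
print(word_hits(QUERY, DOCS))       # only the kimono page matches
```

Without segmentation, "和服" is found inside "和服务" ("and service"), so the job site is returned along with the kimono page. With word-level matching the job site drops out, provided the segmentation itself is correct.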
Preliminary test of Chinese search engine performance
This experience prompted me to make a preliminary investigation of the performance of Chinese search engines. I was studying at the University of Hong Kong at the time, so I asked 50 Hong Kong students each to type a word they were interested in as a query to Yahoo Hong Kong (http://hk.yahoo.com/), and then examined the retrieval accuracy for each query. Retrieval accuracy is defined as the number of retrieved websites (pages) that are truly related to the query, divided by the number of websites (pages) retrieved. If more than 50 websites (pages) were retrieved, only the top 50 were examined.
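The definition above can be written as a small function. This is a minimal sketch, assuming each returned result has already been judged relevant or not by a person; the function name and the sample judgments are hypothetical.

```python
def retrieval_accuracy(relevant_flags, cutoff=50):
    """Share of examined results that are truly relevant to the query.

    relevant_flags: one boolean per returned site/page, in ranked order.
    cutoff: only the first `cutoff` results are examined (50 in the text).
    """
    examined = relevant_flags[:cutoff]
    if not examined:
        return None  # no results returned; accuracy is undefined
    return sum(examined) / len(examined)


# Hypothetical judgments for one query: 60 results, the first 20 relevant.
flags = [True] * 20 + [False] * 40
print(f"{retrieval_accuracy(flags):.1%}")  # 20 of the top 50 -> 40.0%
```

The earlier search for "和服" and "clothing" fits the same definition: one relevant site among 45 returned gives 1/45.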
The 50 query words and their retrieval accuracy (%) are shown in Table 1.
The search results show that Yahoo Hong Kong performed no word segmentation. The average retrieval accuracy was only 48.8%, so about half of what came back was garbage. Table 2 lists some retrieval examples. The errors are quite varied and touch every aspect of automatic Chinese word segmentation: overlapping ambiguity (such as "research ecology theory and application"; the underline marks the query word, here and below), combinational ambiguity (such as "promoted people-oriented education"), Chinese personal names (such as "Shandong Anli Law Firm"), foreign personal names (such as "Helen and John" and "Introducing wine well magazine"), Chinese place names (such as "Biyang Shuangmiao Street"), foreign place names (such as "Egypt and Jordan"), institution names (such as "Hawthorn Treatment Center"), abbreviations (such as "medium larger ERP software"), and so on.
To estimate roughly how much a word segmentation system would affect a Chinese search engine, the author took 122 example sentences related to these 50 words (78 "retrieval error" examples and 44 "retrieval correct" examples, some of which appear in Table 2) and segmented them automatically with CSEG&TAG, a Chinese word segmentation system developed at Tsinghua University. The segmentation results are shown in Table 3.
Overall, the segmentation accuracy on these 122 sentences is 76.2%. Assuming this figure reflects the segmentation results for all sentences retrieved for the 50 words, retrieval accuracy could rise from 48.8% to 76.2%. So although current word segmentation systems are still some way from the ideal, and their effect on a search engine is a case of "every gain has its drawback", on balance the benefits outweigh the costs. In other words, word segmentation techniques are usable in search engines.
Further analysis of the 29 sentences that CSEG&TAG segmented incorrectly shows that they fall into two categories. In the first category (11 sentences), the error is basically caused by failure to handle an unregistered (out-of-vocabulary) word correctly, but fortunately the boundaries of the query word do not become entangled with the surrounding words (such as "United Machinery Co., Ltd."). In the second category (18 sentences), the system either gets the boundary of the query word wrong (such as "palm weather therapy center") or joins components that should not form a "word" (such as "Introduction to the Society and the Tenth Asian Medical Association"). For a search engine, the first category has exactly the same effect as doing no word segmentation at all.
So if these 11 sentences are counted as well, the retrieval accuracy for the 50 words can be expected to rise from 76.2% to 85.2%. The second category is a fatal flaw for a search engine; it is the situation we least want and most fear to meet. On closer inspection, some of these cases can be solved by simple rules (for example, when the string is followed by a number, the two should generally be separated), but most are not easy to deal with, and in the WWW environment we cannot predict how many similar cases we will encounter.
Experience tells us that however much effort is invested, a word segmentation system will never reach perfection in an open environment. This means that when building a Chinese search engine we must first accept a basic assumption: even the best Chinese word segmentation system will make unpredictable errors on real text; reaching 90% segmentation accuracy is already something to be thankful for, and errors are inevitable and normal. Any work on the mechanisms or algorithms of Chinese search engines that aims to improve recall and precision (accuracy) must proceed from this basic assumption; otherwise it is like climbing a tree to catch fish.
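The two percentages quoted in this section follow directly from the sentence counts reported in the text. The short check below uses only those counts; the variable names are chosen for illustration.

```python
TOTAL = 122           # example sentences segmented by CSEG&TAG (78 + 44)
ERRORS = 29           # sentences segmented incorrectly
HARMLESS_ERRORS = 11  # first category: errors that leave the query word intact

correct = TOTAL - ERRORS                         # correctly segmented sentences
baseline = correct / TOTAL                       # segmentation accuracy
adjusted = (correct + HARMLESS_ERRORS) / TOTAL   # first category counted as hits

print(f"{baseline:.1%}")  # 76.2%
print(f"{adjusted:.1%}")  # 85.2%
```

The remaining 18 sentences (29 minus 11) are the second category, the errors that actually damage retrieval.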
Future R&D directions
In view of the above discussion, the author believes that a Chinese word segmentation system for search engines must be based on a mixed character-and-word model, and that the corresponding text retrieval mechanism must likewise mix characters and words. Research on this model and mechanism is bound to be a frontier and hot topic for Chinese word segmentation systems and Chinese search engines over the next few years.
Another lesson is that a Chinese search engine responds very differently to different words. For example, even without word segmentation the retrieval accuracy for "cheongsam" still reaches 100%, while that for "native" is 0. We need to examine every Chinese word exhaustively, one by one: what are its "response" characteristics with respect to Chinese search engines? Is there a simple fix (for example, "native" occurs almost exclusively inside "customs")? Or is no solution available at all at the current level of research? And so on. Such a survey would be very valuable groundwork for a new generation of Chinese search engines built on word segmentation techniques.
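The article does not spell out how a mixed mechanism would work. One possible reading, offered purely as an assumption, is to try a word-level match first and fall back to a raw character match, ranking word-level hits higher. In the sketch below the documents, their segmentations (including one deliberately wrong segmentation), and the function names are all hypothetical.

```python
def match_level(query, text, words):
    """Return 'word', 'character', or None for one document.

    'word'      : the query equals one of the segmented words.
    'character' : the query occurs only as a raw substring of the text.
    """
    if query in words:
        return "word"
    if query in text:
        return "character"
    return None


def search(query, docs):
    """Return (name, level) pairs, word-level matches ranked first."""
    order = {"word": 0, "character": 1}
    hits = []
    for name, (text, words) in docs.items():
        level = match_level(query, text, words)
        if level is not None:
            hits.append((name, level))
    return sorted(hits, key=lambda hit: order[hit[1]])


# Hand-made examples; the third entry simulates a segmentation error.
DOCS = {
    "job_site": ("提供信息和服务", ["提供", "信息", "和", "服务"]),
    "kimono_page": ("日本和服文化", ["日本", "和服", "文化"]),
    "missegmented_page": ("和服腰带", ["和", "服", "腰带"]),
}

for name, level in search("和服", DOCS):
    print(name, level)
```

Under these assumptions the correctly segmented kimono page ranks first. The wrongly segmented page stays reachable through the character-level fallback, at the price of keeping the job-site false hit in the lower tier.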