Difficulties in Chinese word segmentation
With mature segmentation algorithms available, is the problem of Chinese word segmentation easy to solve? Far from it. Chinese is a very complex language, and getting a computer to understand it is harder still. Two major problems in Chinese word segmentation have never been fully overcome.
1. Ambiguity resolution
Ambiguity means that the same sentence can be segmented in two or more ways. Take the phrase 表面的: because both 表面 ("surface") and 面的 (a minivan taxi) are words, the phrase can be split as 表面 / 的 or as 表 / 面的. This is called cross ambiguity. Cross ambiguity of this kind is very common; the 和服 ("kimono") example earlier is in fact an error caused by cross ambiguity. 化妆和服装 ("makeup and clothing") can be split as 化妆 / 和 / 服装 or as 化妆 / 和服 / 装. Lacking human knowledge to understand the text, a computer can hardly tell which split is correct.
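To make cross ambiguity concrete, here is a minimal sketch that enumerates every way a string can be split into dictionary words. The function name `all_segmentations` and the toy dictionary are assumptions made for illustration, not part of any real segmenter or lexicon.

```python
from functools import lru_cache


def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


# Toy dictionary: an assumption for illustration only.
toy_dictionary = {"表", "的", "表面", "面的", "化妆", "和", "服装", "和服", "装"}

for phrase in ("表面的", "化妆和服装"):
    for words in all_segmentations(phrase, toy_dictionary):
        print(phrase, "->", " / ".join(words))
```

Each phrase yields two candidate splits, and nothing in the dictionary alone says which one is intended. Choosing between them needs information from outside the dictionary.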
Compared with combination ambiguity, cross ambiguity is relatively easy to handle; combination ambiguity has to be resolved by looking at the whole sentence. For example, in 这个门把手坏了 ("this door handle is broken"), 把手 ("handle") is a word, but in 请把手拿开 ("please move your hand away"), 把手 is not a word. Likewise, in 将军任命了一名中将 ("the general appointed a lieutenant general"), 中将 ("lieutenant general") is a word, but in 产量三年中将增长两倍 ("output will increase twofold within three years"), 中将 is no longer a word. How is a computer to recognize these cases?
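The following sketch shows why a purely local, dictionary-driven method cannot settle combination ambiguity. It uses forward maximum matching as a stand-in for a generic dictionary-based segmenter. The function name, the `max_word_len` parameter, and the toy dictionary are all assumptions made for illustration.

```python
def forward_maximum_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary: an assumption for illustration only.
toy_dictionary = {"这个", "门", "把手", "坏了", "请", "把", "手", "拿开"}

print(forward_maximum_match("这个门把手坏了", toy_dictionary))
# ['这个', '门', '把手', '坏了']  (把手 is correctly a word here)
print(forward_maximum_match("请把手拿开", toy_dictionary))
# ['请', '把手', '拿开']  (wrong: the intended reading is 请 / 把 / 手 / 拿开)
```

The matcher treats 把手 identically in both sentences because it never looks beyond the current position. Telling the two readings apart requires evidence from the rest of the sentence.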
Even if computers could resolve both cross ambiguity and combination ambiguity, one more difficulty remains: true ambiguity. True ambiguity means that, given a sentence, even a human reader cannot tell which strings should be words and which should not. For example, 乒乓球拍卖完了 can be split as 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis paddles are sold out") or as 乒乓球 / 拍卖 / 完 / 了 ("the table tennis ball auction is over"). Without other sentences for context, probably no one can say whether 拍卖 ("auction") counts as a word here.
2. New word recognition
New words are known in technical terminology as unregistered words: words that are not included in the dictionary but are nonetheless genuine words. The most typical case is personal names. A person easily understands that in the sentence 王军虎去广州了 ("Wang Junhu went to Guangzhou"), 王军虎 is a word because it is someone's name, but this is hard for a computer to recognize. Suppose we add 王军虎 to the dictionary as a word: there are so many names in the world, and new ones appear all the time, that collecting them is a huge undertaking in itself. Even if that work could be completed, problems would remain. For example, in the sentence 王军虎头虎脑的 ("Wang Jun is sturdy and lively"), does 王军虎 still count as a word?
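The effect of a missing name, and the side effect of adding it, can be sketched with a small dictionary-based segmenter. Forward maximum matching is again used only as a stand-in. The function name, the `max_word_len` parameter, and both toy dictionaries are assumptions made for illustration.

```python
def forward_maximum_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; unknown characters become single tokens.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionaries: assumptions for illustration only.
base_dictionary = {"去", "广州", "了", "虎头虎脑", "的"}
with_name = base_dictionary | {"王军虎"}

# Without the name in the dictionary, it falls apart into single characters.
print(forward_maximum_match("王军虎去广州了", base_dictionary))
# ['王', '军', '虎', '去', '广州', '了']

# Adding the name fixes that sentence...
print(forward_maximum_match("王军虎去广州了", with_name))
# ['王军虎', '去', '广州', '了']

# ...but now breaks a sentence where the same characters are not a name.
print(forward_maximum_match("王军虎头虎脑的", with_name))
# ['王军虎', '头', '虎', '脑', '的']  (intended: 王军 / 虎头虎脑 / 的)
```

Enlarging the dictionary therefore trades one kind of error for another, which is why name recognition cannot be reduced to collecting names.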
Besides personal names, new words also include organization names, place names, product names, trademarks, abbreviations, and elliptical expressions, all of which are hard to handle. These are also exactly the words people use most often, so for a search engine, new word recognition in the segmentation system is very important. The accuracy of new word recognition has become one of the key measures of the quality of a segmentation system.
Applications of Chinese word segmentation
In natural language processing today, Chinese processing technology lags far behind the processing of Western languages. Many methods used for Western languages cannot be applied to Chinese directly, precisely because Chinese requires the segmentation step. Chinese word segmentation is the foundation of other Chinese information processing, and search engines are only one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all depend on segmentation. The need for segmentation may hold back some research, but it also creates opportunities for some companies, because foreign computer processing technology that wants to enter the Chinese market must first solve the Chinese word segmentation problem. In research on the Chinese language, Chinese researchers have a clear advantage over those abroad.
Segmentation accuracy is very important for a search engine, but if segmentation is too slow it is unusable no matter how accurate it is. A search engine has to process hundreds of millions of web pages, and if segmentation takes too long it seriously slows the rate at which the engine's content is updated. For a search engine, therefore, both the accuracy and the speed of segmentation must meet a high standard. Most current work on Chinese word segmentation is done at research institutes and universities: Tsinghua University, Peking University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, the IBM research institute, Microsoft Research China, and others all have their own research teams. Among commercial companies, almost none apart from 海量科技 (Hylanda) specializes in Chinese word segmentation. Most of the technology developed at research institutions cannot be turned into products quickly, and the resources of a single specialist company are limited, so Chinese word segmentation still has a long way to go before it can better serve more products.

References
Note: Because the following references were not published in journals, no source is given. Download links for the articles can be found by searching Google or Baidu.
[1] Chinese Search Engine Technology Revealed: Web Spiders.
[2] Chinese Search Engine Technology Revealed: Ranking Technology.
[3] Chinese Search Engine Technology Revealed: System Architecture.
[4] Robots & Spiders & Crawlers: How Web and Intranet Search Engines Follow Links to Build Indexes. Author: Avi Rappoport, 2001.
[5] Guidelines for Robot Writers. Author: Martijn Koster, 1993.
Reposted from: e800.com.cn