Preface

The rapid growth of information has made search engines the preferred tool for finding it, and large search engines such as Google, Baidu, and Zhongsou (China Search) are a constant topic of discussion. As the value of the search market has grown, more and more companies have developed their own search engines, such as Alibaba's business search and 8848's shopping search, and search engine technology has naturally become a focus of attention for technical staff.

Research on search engine technology began more than ten years earlier abroad than in China, and search engines have developed continuously since the earliest of them, Archie. Research on search engines in China began at the end of the last century. In many fields foreign products and technologies dominate, especially where a technology had been studied abroad for many years before work started at home; operating systems, word processors, and browsers are examples. Search engines are an exception. Although research on search engine technology started earlier abroad, China has good search engines of its own, such as Baidu (http://www.baidu.com) and Zhongsou (http://www.zhongsou.com). In Chinese-language search today, the results of domestic engines are comparable to those of foreign engines. An important reason for this lies in the difference between the Chinese and English languages themselves: for a computer, Chinese requires word segmentation.

What is Chinese word segmentation

As is well known, English is written in units of words separated by spaces, whereas Chinese is written in units of characters, and the characters of a sentence must be joined together to express a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form a single word.
Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting. For "我是一个学生" (I am a student), the segmentation result is: 我 / 是 / 一个 / 学生 (I / am / a / student).

Chinese word segmentation and search engines

How much does Chinese word segmentation matter to a search engine? For a search engine, the most important thing is not to find every result, since nobody can read through matches drawn from billions of web pages; it is to put the most relevant results first. This is called relevance ranking. The accuracy of segmentation often directly affects the relevance ranking of the results. The author recently looked up information about Japanese kimonos for a friend, entered "和服" (kimono) into several search engines, and found many problems in the results. This example is used below to show how segmentation affects search results.

The test was run on three existing Chinese search engines by searching directly for the keyword "和服" on Google (http://www.google.com), Baidu (http://www.baidu.com), and Zhongsou (http://www.zhongsou.com).

Searching all Simplified Chinese web pages on Google for "和服" returned 507,000 results, and 14 of the top 20 had nothing to do with kimonos. The first page contained errors such as the following:

"Communication Information: Rising Technical and Service Development Network Security Market"
"Use Pure HTML General Data Management and Services - Developer - ZDNET ..."
"Chen Huilin ... Makeup and Apparel ..."
":: Minister: China's Overseas Consular Protection and Service Guide (2003 Edition) ..."
"Products and Services"

In the Chinese originals, each of these titles contains the character sequence "和服" only as the junction of "和" (and) with a following word such as "服务" (service) or "服装" (apparel). Only three results on the first page were truly about kimonos.
Searching web pages on Baidu for "和服" returned 287,000 results in total, and 6 of the top 20 were unrelated. The first page contained errors such as "Fujian Jinjiang Heng and Garment Co., Ltd.".

On Zhongsou, the same search returned 26,917 results in total, and the top 20 were all pages related to kimonos.

The errors in these results are caused by inaccurate segmentation. As far as the author knows, Google's Chinese segmentation uses technology provided by a company named Basis Technology (http://www.basistech.com), Baidu uses segmentation technology developed in-house, and Zhongsou uses technology provided by China Massive Technology (http://www.hylanda.com). Clearly, the accuracy of Chinese segmentation has a considerable effect on search engine results.

Chinese word segmentation technology

Chinese word segmentation is a natural language processing technology. People can tell from their own knowledge which parts of a sentence are words and which are not, but how can a computer be made to understand this? The procedure it follows is a segmentation algorithm.

Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.

1. Segmentation based on string matching

This method, also called mechanical segmentation, matches the character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string matching methods divide into forward matching and reverse matching; by which length is preferred, into maximum (longest) matching and minimum (shortest) matching; and by whether segmentation is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with tagging.
Several commonly used mechanical segmentation methods are as follows:

1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (minimizing the number of words cut from each sentence).

These methods can also be combined; for example, forward maximum matching and reverse maximum matching can be combined into bidirectional matching. Because single Chinese characters are often words in their own right, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Reported statistics give an error rate of 1/169 for forward maximum matching and 1/245 for reverse maximum matching. This accuracy is still far from meeting practical needs. Segmentation systems in actual use treat mechanical segmentation as a first pass and improve accuracy further with various other kinds of language information.

One approach is to improve the scanning method. Known as feature scanning or marker segmentation, it first identifies and cuts out words with obvious features in the string to be analyzed and uses them as breakpoints. This divides the original string into smaller strings, which are then passed to mechanical segmentation, lowering the matching error rate. Another approach combines segmentation with part-of-speech tagging: rich part-of-speech information informs the segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, which greatly improves accuracy.

A general model can be built for mechanical segmentation methods. There are specialist academic papers on this, and it is not discussed in detail here.
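As an illustration of the two most common mechanical methods, the following minimal sketch implements forward and reverse maximum matching over a toy dictionary. The function names, the `max_len` parameter, the single-character fallback, and the four-entry dictionary are assumptions made for this example; they are not taken from any of the systems mentioned in this article.

```python
# Toy dictionary (hypothetical): makeup, and, kimono, clothing.
DICTIONARY = {"化妆", "和", "和服", "服装"}


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


sentence = "化妆和服装"  # "makeup and clothing"
print(forward_max_match(sentence, DICTIONARY))  # ['化妆', '和服', '装']
print(reverse_max_match(sentence, DICTIONARY))  # ['化妆', '和', '服装']
```

On this input the forward scan wrongly extracts "和服" (kimono), while the reverse scan gives the intended reading. A bidirectional matcher could run both and treat any disagreement as a sign of ambiguity that needs further information to resolve.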
2. Segmentation based on understanding

This method identifies words by having the computer simulate a person's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control component. Coordinated by the master control component, the segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences and uses it to judge segmentation ambiguities; in other words, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

3. Segmentation based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters occur next to each other therefore reflects fairly well how credible it is that they form a word. One can count the frequency of each combination of adjacent characters in a corpus and calculate their mutual information. The mutual information of two Chinese characters X and Y is defined from the probability that they occur adjacent to each other. It reflects how tightly the two characters are bound together; when it exceeds a certain threshold, the pair can be considered to form a word. This method needs only frequency counts over the character combinations in a corpus and no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction.
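The mutual-information idea can be sketched in a few lines. The source's formula did not survive, so this example assumes the common pointwise definition M(x, y) = log2( P(x, y) / (P(x) P(y)) ), where P(x, y) is the relative frequency of the adjacent pair and P(x), P(y) are the relative frequencies of the individual characters. The function name and the three-sentence corpus are hypothetical, and a real system would need a far larger corpus.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Return a score for every adjacent character pair in the corpus.

    Assumed definition: log2(P(xy) / (P(x) * P(y))).
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)                     # single-character counts
        pairs.update(zip(sentence, sentence[1:]))  # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


# Tiny illustrative corpus; the sentences are made up for this sketch.
corpus = ["我是学生", "学生很多", "我很好"]
scores = mutual_information(corpus)

# "学生" (student) occurs twice as a pair and scores higher than the
# accidental neighbours "生很", which straddle a word boundary.
print(scores["学生"] > scores["生很"])  # True
```

Pairs whose score exceeds a chosen threshold would be treated as candidate words. On a corpus this small, rare characters inflate the scores, which hints at why frequency statistics alone are not reliable.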
However, this method has limitations. It often extracts character groups that co-occur frequently but are not words, such as "这一" (this), "之一" (one of), "有的" (some), "我的" (my), and "许多的" (many); its recognition accuracy for common words is poor; and its time and space costs are high. Statistical segmentation systems in practical use therefore rely on a basic dictionary of common words for string-matching segmentation while using statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation and adds the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.

Which segmentation algorithm is most accurate has no settled answer. No mature segmentation system can rely on a single algorithm; it has to combine different ones. As the author understands it, China Massive Technology's segmentation algorithm uses "compound segmentation". The term is borrowed from compound prescriptions in Chinese medicine, where different drugs are combined to treat a disease; likewise, recognizing Chinese words requires several algorithms to handle different problems.

Difficulties in segmentation

With mature segmentation algorithms available, is the problem of Chinese word segmentation easily solved? Far from it. Chinese is a very complex language, and getting a computer to understand it is harder still. Two major problems in segmentation have not been fully solved.

1. Ambiguity recognition

Ambiguity means that the same sentence can be cut in two or more ways. For example, in "表面的", both "表面" (surface) and "面的" (minibus taxi) are words, so the phrase can be cut as "表面 / 的" or as "表 / 面的". This is called crossing ambiguity. Crossing ambiguity is very common; the "和服" (kimono) example above is in fact an error caused by it.
"Makeup and clothing" can be divided into "makeup and clothing" or "makeup and clothing". Due to no one's knowledge, it is difficult to know which solution is correct. Cross Ambiguity is relatively easy to handle, and combined ambiguity must be determined according to the entire sentence. For example, in the sentence "this door handle", "handle" is a word, but in the sentence "Please take the hand", "handle" is not a word; in the sentence, "General appointed a middle general" "Lord" is a word, but in the sentence "three years will grow twice", "the middle will" is no longer the word. How do these words computers to identify? If cross ambiguous and combined ambiguity computers can solve, there is a problem in ambiguity, which is true.
True ambiguity means that, given the sentence alone, even a person cannot tell which parts should count as words. For example, "乒乓球拍卖完了" can be cut as "乒乓 / 球拍 / 卖 / 完 / 了" (the table tennis paddles are sold out) or as "乒乓球 / 拍卖 / 完 / 了" (the table tennis ball auction is over). Without context, probably no one can say whether "拍卖" (auction) counts as a word here.

2. New word recognition

New words, known in the technical literature as unregistered words, are words that are not included in the dictionary but can fairly be called words. The most typical are personal names. A person easily understands that in the sentence "王军虎去广州了" (Wang Junhu went to Guangzhou), "王军虎" is a word because it is a person's name, but a computer finds this hard to recognize. If "王军虎" were included in the dictionary as a word, then, with so many names in the world and new ones appearing all the time, collecting them all would be a huge undertaking. Even if that work could be completed, problems would remain: in the sentence "王军虎头虎脑的" (Wang Jun looks strong and sturdy), can "王军虎" still count as a word?

Besides personal names, there are organization names, place names, product names, trademarks, abbreviations, and elided forms. All of these are hard to handle, and they are exactly the words people use often, so for a search engine, new word recognition in the segmentation system is very important. The accuracy of new word recognition has become one of the key measures of a segmentation system's quality.

Applications of Chinese word segmentation

In natural language processing today, Chinese processing technology lags well behind Western-language processing technology. Many Western-language processing methods cannot be applied to Chinese directly, precisely because Chinese requires the segmentation step. Chinese word segmentation is the foundation of other Chinese information processing, and search engines are only one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all need segmentation.
Because Chinese needs segmentation, some research may be held back, but this also creates opportunities for some companies: foreign computer processing technology that wants to enter the Chinese market must first solve the Chinese segmentation problem, and in research on the Chinese language, Chinese researchers have a very clear advantage over foreigners.

Segmentation accuracy is very important for a search engine, but so is speed. If segmentation is too slow, it is unusable for a search engine however accurate it is, because a search engine has to process hundreds of millions of web pages, and slow segmentation would seriously delay the updating of its content. For a search engine, therefore, both the accuracy and the speed of segmentation must meet high requirements. Most research on Chinese segmentation is currently done at research institutes and universities: Tsinghua University, Peking University, the Chinese Academy of Sciences, Beijing Language College, Northeastern University, IBM Research, and Microsoft Research China all have their own research teams, whereas among commercial companies almost none besides Massive Technology specializes in Chinese segmentation. The technology developed at research institutions mostly cannot be turned into products quickly, and the resources of a single specialist company are limited, so Chinese segmentation technology still seems to have a long way to go before it can serve more products well.

Recommended reading

Some of the following articles have not been published in magazines, so no source is listed for them; download links can be found by searching Google or Baidu.

[1] Chinese search engine technology revealed: network spider.
[2] Chinese search engine technology revealed: sorting technology.
[3] Chinese search engine technology revealed: system architecture.