Chinese search engine technology revealed: Chinese word segmentation

xiaoxiao2021-03-06 90

http://www.fullsercher.com/

Chinese Full-Text Retrieval Network 2004-9-18 23:15:14 Winter

Keywords: Chinese search engine technology revealed

The rapid growth of information has made the search engine the tool of choice for finding information, and large search engines such as Google, Baidu, and China Search have long been a topic of discussion. As the value of the search market has grown, more and more companies have developed their own search engines, such as Alibaba's business search and 8848's shopping search, and search engine technology has naturally become a hot topic among technical staff.

Research on search engine technology began more than ten years earlier abroad than in China. Counting from the earliest system, Archie, through the engines that followed it, search engines now have a history of more than ten years, whereas research in China began only at the end of the last century. In many fields foreign products and technologies dominate, especially where a technology had been studied abroad for many years before work started in China: operating systems, word processing software, browsers, and so on. Search engines are an exception. Although research started much earlier abroad, excellent search engines have emerged in China, such as Baidu (http://www.baidu.com) and Zhongsou (http://www.zhongsou.com/). In the field of Chinese search, domestic engines are now not far behind foreign ones in result quality. An important reason for this is that Chinese and English are written differently, and the technology this involves for the computer is Chinese word segmentation.

What is Chinese word segmentation

As is well known, English is written in units of words, and words are separated by spaces. Chinese is written in units of characters, and the characters of a sentence run together with no separators between words. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form one word. Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting. For "我是一个学生" the segmentation result is: 我 是 一个 学生.

Chinese word segmentation and search engines

How much does Chinese word segmentation affect a search engine? For a search engine, the most important thing is not to find all the results, because finding every result among billions of web pages has little value when no one can read them all. The most important thing is to put the most relevant results first, which is called relevance ranking. The accuracy of word segmentation often directly affects the relevance ranking of search results. The author recently looked for some information about Japanese kimono for a friend, entered "和服" (kimono) into a search engine, and found many problems in the results. The example below illustrates how segmentation affects search results, using tests on existing Chinese search engines. The test method is to search directly on Google (http://www.google.com/), Baidu (http://www.baidu.com), and Zhongsou (http://www.zhongsou.com/) with "和服" as the keyword.

Entering "和服" on Google and searching all Simplified Chinese pages returns 507,000 results, and 14 of the top 20 have nothing to do with kimono.

The first page contains the following errors:

"Communication Information: Rising develops the network security market with technology and service"
"General data management and services using pure HTML - Developer - ZDNet ..."
"Chen Huilin ... does her own makeup and apparel"
":: Ministry of Foreign Affairs: Guide to China's Overseas Consular Protection and Services (2003 Edition) ..."
"Products and Services"

and so on. In each of these titles the characters 和服 appear only as part of a longer phrase such as 和服务 ("and service") or 和服装 ("and apparel"). Only three results on the first page are really about kimono.

Entering "和服" on Baidu returns 287,000 pages in total, and 6 of the top 20 have nothing to do with kimono. The first page contains errors such as "Fujian Jinjiang Henghe Garment Co., Ltd." (恒和服装, where 和服 straddles the company name and the word for garment).

On Zhongsou, the search returns 26,917 results in total, and the top 20 are all pages related to kimono.

The errors in these search results are caused by inaccurate word segmentation. As far as the author knows, Google's Chinese segmentation uses technology provided by a company called Basis Technology (http://www.basistech.com/), Baidu uses segmentation technology developed in-house, and Zhongsou uses technology provided by the domestic company Hylanda Technology (http://www.hylanda.com). It can be seen that the accuracy of Chinese word segmentation has a considerable bearing on the relevance and accuracy of search engine results.

Chinese word segmentation technology

Chinese word segmentation is a natural language processing technology. Given a sentence, a person can tell from their own knowledge which character strings are words and which are not, but how can a computer be made to understand this? The procedure that does so is the segmentation algorithm. Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.

1. Segmentation based on string matching

This method, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching methods can be divided into forward matching and reverse matching. By which length is tried first, they can be divided into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they can be divided into pure segmentation methods and integrated methods that combine segmentation and tagging. Several common mechanical segmentation methods are:

1) forward maximum matching (from left to right);
2) reverse maximum matching (from right to left);
3) minimum segmentation (minimize the number of words cut out of each sentence).

These methods can also be combined with each other. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because a single Chinese character can be a word on its own, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistics show that the error rate of forward maximum matching used alone is 1/169, and that of reverse maximum matching used alone is 1/245. This accuracy is still far from meeting practical needs. Segmentation systems in actual use treat mechanical segmentation as a first-pass step and rely on various other kinds of linguistic information to further improve segmentation accuracy.
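The forward and reverse maximum matching methods described above can be sketched as follows. This is only a minimal illustration, not the code of any engine mentioned in the article: the function names, the `max_len` parameter, and the toy dictionaries are assumptions made for the example, and any single character not found in the dictionary is simply emitted as a one-character word.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fallback (an assumption of this sketch): a single character
            # is always accepted, even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


# Hypothetical toy dictionaries, chosen only for this illustration.
STUDENT_DICT = {"我", "是", "一个", "学生"}
SERVICE_DICT = {"技术", "和", "和服", "服务"}

print(forward_max_match("我是一个学生", STUDENT_DICT))
print(reverse_max_match("我是一个学生", STUDENT_DICT))
print(forward_max_match("技术和服务", SERVICE_DICT))
print(reverse_max_match("技术和服务", SERVICE_DICT))
```

With these toy dictionaries both directions agree on 我 是 一个 学生, but on 技术和服务 ("technology and service") the forward scan yields 技术 / 和服 / 务 while the reverse scan yields 技术 / 和 / 服务, which shows how the scanning direction alone can change the result.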

One approach is to improve the scanning method, known as feature scanning or marker segmentation: words with obvious features are identified and cut out of the string first, and these words are used as breakpoints to divide the original string into smaller strings, which are then segmented mechanically, reducing the matching error rate. Another approach is to combine segmentation with part-of-speech tagging, using rich part-of-speech information to help with segmentation decisions and, in turn, checking and adjusting the segmentation results during tagging, thereby greatly improving segmentation accuracy.

A general model can be established for mechanical segmentation methods. There are specialized academic papers on this subject, which are not discussed in detail here.

2. Segmentation based on understanding

This method identifies words by having the computer simulate a person's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to judge segmentation ambiguities, which simulates the way a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is general and complex, it is difficult to organize the various kinds of linguistic information into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

3. Segmentation based on statistics

In form, a word is a stable combination of characters, so the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur next to each other therefore reflects fairly well how credible the group is as a word. One can count the frequency of each combination of adjacent characters in a corpus and compute their mutual information. The mutual information of two characters is defined from the adjacent co-occurrence probability of two Chinese characters X and Y, and it reflects how tightly the characters are bound together. When the tightness is above a certain threshold, the character group can be regarded as a possible word. This method needs only frequency counts of character groups in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction. However, it has certain limitations. It often extracts common character groups that co-occur frequently but are not words, such as "这一" ("this"), "之一" ("one of"), "有的" ("some"), "我的" ("my"), and "许多的" ("many"). Its recognition accuracy for common words is poor, and its time and space costs are large. Statistical segmentation systems in practical use all rely on a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation, while using statistical methods to identify new words. Combining string frequency statistics with string matching in this way keeps the speed and efficiency of matching-based segmentation while gaining the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.

Which segmentation algorithm is more accurate is still an open question. No mature segmentation system can rely on a single algorithm; it has to integrate different algorithms. As the author understands it, the approach adopted in practice is a "compound segmentation method". "Compound" here is equivalent to the concept of a compound prescription in Chinese medicine, in which different drugs are combined to treat a disease. In the same way, identifying Chinese words requires several algorithms to handle different problems.
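The mutual information idea from the statistical method can be sketched as below. The article does not give its formula, so this sketch assumes the standard pointwise mutual information, log2(P(X,Y) / (P(X) P(Y))), with probabilities estimated from raw counts. The function name, the toy corpus, and the choice of base-2 logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter


def adjacent_pair_mutual_information(corpus):
    """Score every adjacent character pair in the corpus by pointwise mutual information.

    Assumption of this sketch: P(X) is the share of X among all characters,
    and P(X, Y) is the share of the pair XY among all adjacent pairs.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Pairs are counted within a sentence only, never across sentences.
        pair_counts.update(zip(sentence, sentence[1:]))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus; a real system would use a large corpus.
TOY_CORPUS = ["我是学生", "他是学生", "学生很多", "我很好"]

scores = adjacent_pair_mutual_information(TOY_CORPUS)
# Print pairs from most to least tightly bound.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In a real system a pair would be kept as a candidate word only when its score exceeds a chosen threshold. A corpus this small is far too sparse for the scores to be meaningful beyond showing the mechanics; for instance 学生 scores higher than 生很 here, but other non-word pairs also score highly.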
Difficulties in segmentation

With mature segmentation algorithms available, is the problem of Chinese word segmentation easy to solve? Far from it. Chinese is a very complex language, and making a computer understand it is harder still. In Chinese word segmentation there are two major problems that have never been completely solved.

1. Ambiguity recognition

Ambiguity means that the same sentence may be cut in two or more ways. For example, in 表面的, both 表面 ("surface") and 面的 are words, so the phrase can be segmented as "表面 的" or as "表 面的". This is called cross ambiguity. Cross ambiguity of this kind is very common, and the "和服" (kimono) example above is in fact an error caused by cross ambiguity.
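The cross ambiguity in 表面的 can be made concrete by listing every way a dictionary allows the phrase to be cut. The sketch below is only an illustration: the function name and the toy dictionary are assumptions, and the dictionary deliberately contains just the entries needed to reproduce the two segmentations discussed above.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into a sequence of dictionary words."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Extend each segmentation of the remainder with this head word.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary for the example in the text.
TOY_DICTIONARY = {"表", "的", "表面", "面的"}

for segmentation in all_segmentations("表面的", TOY_DICTIONARY):
    print(" ".join(segmentation))
```

Both "表 面的" and "表面 的" come out as valid dictionary segmentations, so a dictionary lookup alone cannot decide between them. Choosing the right one needs extra information such as word frequencies or context.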

