Chinese Search Engine Technology Revealed: Chinese Word Segmentation (Part 3)

xiaoxiao 2021-03-06

Chinese word segmentation is a natural language processing technology. Given a sentence, a person can tell which character sequences are words and which are not, but how can a computer be made to understand this? The procedure it uses is a word segmentation algorithm. Existing segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.

1. Segmentation based on string matching

This approach is also called mechanical segmentation. Following a given strategy, it matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary. If a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, string matching methods divide into forward matching and reverse matching. By which length is given priority, they divide into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with the tagging process, they divide into pure segmentation methods and integrated methods that combine segmentation with tagging. The commonly used mechanical segmentation methods are:

1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (making the number of words cut from each sentence as small as possible).

These methods can also be combined with one another. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because single Chinese characters can be words on their own, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistical results show an error rate of 1/169 for forward maximum matching and 1/245 for reverse maximum matching.
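
The matching strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not a production segmenter: the sample dictionary, the `max_len` parameter, and the tie-breaking rule in `bidirectional_match` (prefer fewer words, then fewer single-character words, then the reverse result) are assumptions chosen for the example.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


def bidirectional_match(text, dictionary, max_len):
    """Run both scans and keep the result that looks better (assumed heuristic)."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    return fwd if singles(fwd) < singles(rev) else rev


# Hypothetical toy dictionary and sentence ("study the origin of life").
SAMPLE_DICT = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命的起源"

print(forward_max_match(SENTENCE, SAMPLE_DICT, 3))    # ['研究生', '命', '的', '起源']
print(reverse_max_match(SENTENCE, SAMPLE_DICT, 3))    # ['研究', '生命', '的', '起源']
print(bidirectional_match(SENTENCE, SAMPLE_DICT, 3))  # ['研究', '生命', '的', '起源']
```

On this sentence the forward scan greedily takes 研究生 ("graduate student") and strands 命, while the reverse scan recovers 研究 / 生命, which shows on a small scale why reverse matching tends to hit fewer ambiguities.
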
The accuracy of pure mechanical matching, however, is far from meeting practical needs. Segmentation systems in actual use treat mechanical segmentation only as a first-pass step and draw on various other kinds of linguistic information to further improve accuracy. One approach is to improve the scanning method, known as feature scanning or marker segmentation: words with obvious features are identified and cut out of the string first, and with these words as breakpoints the original string is divided into smaller strings, which are then passed to mechanical segmentation, reducing the matching error rate. Another approach is to combine segmentation with part-of-speech tagging, using the rich part-of-speech information to support segmentation decisions, while the tagging process in turn checks and adjusts the segmentation results, greatly improving accuracy. A general model can be built for mechanical segmentation methods. There are specialized academic papers on the subject, so it is not covered in detail here.

2. Segmentation based on understanding

This approach identifies words by having the computer simulate a person's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control component. Under the coordination of the master control component, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on in order to judge segmentation ambiguities. In other words, it simulates the human process of understanding a sentence. This approach requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is general and complex, it is difficult to organize the various kinds of linguistic information into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.
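
Returning to the feature scanning (marker segmentation) idea described earlier, a rough sketch looks like this: cut the string at marker words first, then hand each smaller piece to a mechanical segmenter. The marker list, the toy dictionary, and the function names are all hypothetical, and forward maximum matching stands in for whichever mechanical method a real system would use.

```python
import re

# Hypothetical markers: punctuation plus one conjunction, chosen only for illustration.
MARKERS = ["，", "。", "但是"]


def split_on_markers(text, markers):
    """Split text at marker words, keeping the markers as separate pieces."""
    pattern = "(" + "|".join(re.escape(m) for m in markers) + ")"
    # The capturing group makes re.split return the markers too; drop empty strings.
    return [piece for piece in re.split(pattern, text) if piece]


def forward_max_match(text, dictionary, max_len):
    """Plain forward maximum matching on one marker-free piece."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def marker_segment(text, dictionary, markers):
    """Identify markers first, then segment the strings between them mechanically."""
    max_len = max(len(w) for w in dictionary)
    words = []
    for piece in split_on_markers(text, markers):
        if piece in markers:
            words.append(piece)  # a marker is already a finished token
        else:
            words.extend(forward_max_match(piece, dictionary, max_len))
    return words


TOY_DICT = {"我们", "研究", "汉字", "没有", "结果"}
print(marker_segment("我们研究汉字，但是没有结果。", TOY_DICT, MARKERS))
# ['我们', '研究', '汉字', '，', '但是', '没有', '结果', '。']
```

Note that 但是 is not in the toy dictionary at all. It is recognized purely as a marker, and the dictionary lookup only ever sees the shorter strings between breakpoints.
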
3. Segmentation based on statistics

A word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible it is that they form a word. One can count the frequency of each combination of adjacent characters in a corpus and compute their mutual information. The mutual information of two characters is defined from the adjacent co-occurrence probability of two Chinese characters x and y. Mutual information reflects how tightly the characters are bound together.
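
The text does not spell out the formula, so the sketch below assumes the standard pointwise mutual information, MI(x, y) = log2(P(x, y) / (P(x) P(y))), where P(x, y) is the probability that x is immediately followed by y. The three-line corpus and the function name are made up for the example. A real system would estimate these probabilities from a large corpus.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Return {(x, y): MI} for every adjacent character pair seen in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # zip pairs each character with its right-hand neighbor.
        pair_counts.update(zip(sentence, sentence[1:]))

    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical miniature corpus.
scores = mutual_information(["中国人", "中国", "人中"])
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("".join(pair), round(value, 2))
```

In this toy corpus the pair 中国 scores highest, which matches the intuition in the text: characters that keep appearing side by side are more tightly bound and more likely to form a word.
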

When reposting, please cite the original address: https://www.9cbs.com/read-45545.html
