[Title] A Review of Research on Chinese Automatic Word Segmentation
[Original Source] Contemporary Linguistics
[Original Issue] 200101
[Title Note] This study was funded by the National Natural Science Foundation of China (Grant No. 69705005) and the National Key Basic Research Development Program (Project No. G1998030507).
[Author] Sun Maosong / Zou Jiayan
[About the author] Sun Maosong, Tsinghua University
Zou Jiayan, City University of Hong Kong
Sun Maosong, Department of Computer Science, Tsinghua University, Beijing 100084. Email: LKC-DCS@mail.tsinghua.edu.cn
Zou Jiayan, Language Information Sciences Research Centre, City University of Hong Kong
[Summary] This article first discusses the practical necessity and the feasibility of research on Chinese automatic word segmentation. It then focuses on three basic issues in this research (resolution of segmentation ambiguity, processing of unknown words, and construction of language resources) and surveys the methods that have emerged over more than a decade. Finally, it offers some personal views on future research priorities in the field.
【Key words】 Chinese information processing / Chinese automatic word segmentation / segmentation ambiguity resolution / unknown word processing / language resource construction
【text】
1. The practical necessity and feasibility of Chinese automatic word segmentation
As is well known, Chinese text has no explicit word-boundary markers comparable to the spaces in English. Put plainly, the task of Chinese automatic word segmentation is to have a machine automatically insert spaces between the words of a Chinese text. Whenever automatic word segmentation is mentioned, two typical questions arise. One comes from outsiders: this looks quite ordinary and hardly "advanced", so what is it good for? The other comes from insiders: research on automatic word segmentation has been going on in full swing for more than ten years, yet no system that passes muster has emerged (by contrast, Japanese also has a word segmentation problem, but a Japanese segmentation system accepted within the community already exists). It has almost become an "eternal" topic in Chinese information processing. Is there any hope, then, of producing something truly presentable?
The first question concerns the practical necessity of automatic word segmentation, and the answer is clear. The current environment is encouraging: China is moving rapidly toward an information society, marked by a sharp increase in Chinese web pages on the Internet and the rapid spread of Chinese electronic publications and Chinese digital libraries. Research on Chinese natural language processing that takes unrestricted text as its main object is rising accordingly, and its importance is increasingly evident. Chinese automatic word segmentation is the first basic "procedure" that any Chinese natural language processing system can hardly avoid, and its role can scarcely be overestimated. Only after this hurdle is cleared can a Chinese processing system be said to bear a first mark of "intelligence", and only then do the various subsequent analysis methods built on the word level have a stage on which to perform. Otherwise the system remains bound to the character level and cannot amount to much. Specifically, automatic word segmentation plays an extremely important role in many real-world applications (automatic retrieval, classification, and summarization of Chinese text; automatic proofreading of Chinese text; Chinese-to-foreign-language machine translation; post-processing for Chinese character recognition and Chinese speech recognition; Chinese speech synthesis; sentence-based Chinese character keyboard input; conversion between simplified and traditional Chinese characters; and so on) (Wu Z. M. and Tseng G. 1993; Wu Z. M. and Tseng G. 1995; Nie J. Y. and Brisebois M., et al. 1996; Sun M. S. and Lin F. Z., et al. 1996). Two examples illustrate this intuitively.
[Text Retrieval]
Suppose text A contains sentence (1a) and text B contains sentence (1b):
(1) a. The 和服 (kimono) │ 务 (must) │ be finished in three days and presented to the general's residence.
b. The Wangfu Hotel's facilities │ 和 (and) │ 服务 (service) │ are first class.
Obviously, text A is about the Japanese "kimono" (和服) and text B is about the hotel's "service" (服务); the two have nothing in common. If the string 和服务 is segmented incorrectly in either sentence, retrieval will return absurd results.
[Text-to-speech conversion]
Consider the pronunciation of the character 查 in sentences (2a) and (2b):
(2) a. They are here to │ 查 (investigate) │ Jintai │ hitting someone.
b. The good │ Zha Jintai (查 as a surname) │ is known far and wide.
In sentence (2a), 查 is a verb and should be read chá; in sentence (2b) it is a surname and should be read zhā.
The second question concerns the feasibility of automatic word segmentation. Although no fully affirmative conclusion can yet be drawn, the outline of the answer has become reasonably clear after more than a decade of work. After all, compared with research at the syntactic and semantic levels, research at the word level is far less difficult, and whether in computational linguistics or in general linguistics, its results are more mature and more plentiful. The accumulated work has reached the point where it is ready to bear fruit. If automatic syntactic and semantic analysis of Chinese remains a distant prospect, then Chinese automatic word segmentation, which faces the same object, is only a few steps from success (of course, even once that goal is reached the results will not be perfect). The prototype Chinese word segmentation systems of Sproat R. and Shih C. L., et al. (1996) and Sun M. S. and Shen D. Y., et al. (1997) already show the features needed to handle unrestricted text, and they have taken a big step in the right direction.
The rest of this paper is organized as follows. Section 2 focuses on the basic issues in Chinese automatic word segmentation and surveys the methods that have emerged over more than a decade (the references listed at the end include most of the representative papers in the field). Section 3 offers some personal views on future research.
2. Basic problems and main solutions in Chinese automatic word segmentation
2.1 Segmentation ambiguity and its treatment
2.1.1 Basic types of segmentation ambiguity
Segmentation ambiguity is a "roadblock" in research on Chinese automatic word segmentation. Liang Nanyuan (1987) was the first to investigate it systematically. He defined two basic types of segmentation ambiguity:
Definition 1. A Chinese character string AJB is called an intersection-type segmentation ambiguity if AJ and JB are both words (A, J, and B each being a string of Chinese characters). The string J is then called the intersection string.
[Example] Intersection-type segmentation ambiguity: 结合成
(3) a. 结合 │ 成
b. 结 │ 合成
where A = 结, J = 合, B = 成.
Definition 2. A Chinese character string AB is called a polysemous-combination-type segmentation ambiguity if A, B, and AB are all words.
[Example] Polysemous-combination-type segmentation ambiguity: 起身 ("get up")
(4) a. 他站 │ 起 │ 身 │ 来。 (He stood up.)
b. 他明天 │ 起身 │ 去北京。 (He sets off for Beijing tomorrow.)
For intersection-type segmentation ambiguity, he also defined the chain length:
Definition 3. The set of intersection strings of an intersection-type segmentation ambiguity is called its intersection string chain, and the number of intersection strings is called the chain length.
For example, the intersection-type ambiguity 结合成分子 contains the words 结合, 合成, 成分, and 分子; its set of intersection strings is {合, 成, 分}, so the chain length is 3.
The concepts involved in these definitions capture the basic structural characteristics of Chinese segmentation ambiguity and have therefore remained in use.
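As a rough illustration of Definitions 1 to 3, the following Python sketch finds the intersection strings in a character string, counts the chain length, and checks the purely formal condition of Definition 2. The toy lexicon `LEXICON` and all function names are assumptions made for this example; they do not come from Liang Nanyuan (1987).

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"结合", "合成", "成分", "分子", "起", "身", "起身"}

def word_spans(text, lexicon):
    """All (start, end) spans of `text` whose substring is a lexicon word."""
    return [(i, j)
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in lexicon]

def intersection_strings(text, lexicon):
    """Intersection strings J of all word pairs AJ, JB in `text` (Definition 1)."""
    spans = word_spans(text, lexicon)
    overlaps = set()
    for s1, e1 in spans:
        for s2, e2 in spans:
            # AJ = text[s1:e1] and JB = text[s2:e2] share J = text[s2:e1].
            if s1 < s2 < e1 < e2:
                overlaps.add((s2, e1))
    return [text[s:e] for s, e in sorted(overlaps)]

def chain_length(text, lexicon):
    """Number of intersection strings in `text` (Definition 3)."""
    return len(intersection_strings(text, lexicon))

def is_formal_combination(a, b, lexicon):
    """Formal condition of Definition 2: A, B, and AB are all words."""
    return a in lexicon and b in lexicon and (a + b) in lexicon

print(intersection_strings("结合成分子", LEXICON))  # ['合', '成', '分']
print(chain_length("结合成分子", LEXICON))          # 3
print(is_formal_combination("起", "身", LEXICON))   # True
```

The sketch tests only the machine-checkable conditions; whether a given split is actually possible in real text is a human judgment that it does not attempt to model.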
Liang Nanyuan (1987) carried out statistics on a 48,092-character sample of natural science and social science text and found 518 intersection-type ambiguities and 42 polysemous-combination-type ambiguities. On this basis, segmentation ambiguity occurs in Chinese text at a rate of about 1.2 per 100 characters, and the ratio of intersection-type to polysemous-combination-type ambiguities is about 12:1. Interestingly, the survey of Liu Ting and Wang Kaizhu (1998) produced the opposite result: in Chinese text the ratio of intersection-type to polysemous-combination-type ambiguities is about 1:22. The reason is that Definition 2 has a gap. Sun M. S. and Benjamin K. T. (1995) conjectured that adding one restriction would reflect Liang's original intent:
Definition 2'. A Chinese character string AB is called a polysemous-combination-type segmentation ambiguity if (1) A, B, and AB are all words; and (2) there exists at least one context C in which the segmentation A │ B is grammatically and semantically valid.
For example, the string 平淡 ("plain") satisfies Definition 2 but not Definition 2' (because the segmentation 平 │ 淡 never occurs in real text). Liu and Wang counted strings such as 平淡 as polysemous-combination-type ambiguities, whereas Liang did not. Since far more strings satisfy Definition 2 than Definition 2', the reversal of the ratio is not surprising.
On closer analysis, Definitions 1 and 2 are stated entirely from the machine's point of view, whereas Definition 2' adds human judgment. Sun Maosong, Huang Changning, et al. (1997) argue that the name "polysemous-combination type" given in Definition 2 is not rigorous (in fact, some intersection-type ambiguities are also polysemous combinations) and easily causes confusion; in keeping with the purely formal name "intersection type", a name such as "containment type" or "covering type" may be more appropriate.
Dong Zhendong (1997) used another pair of names: intersection-type ambiguity is called "accidental ambiguity", and polysemous-combination-type ambiguity is called "inherent ambiguity". "The difference between the two is that the context giving rise to the former is highly individual, accidental, and hard to predict", "whereas the latter is predictable." This formulation characterizes the nature of the two types rather deeply and is thought-provoking, but the accuracy of the names is still open to discussion.
Table 1. Types of segmentation ambiguity
Intersection-type segmentation ambiguity
  Definition: Definition 1
  Nature: accidental ambiguity
  Examples: "land area", "peace, etc.", "determination", "and software", "in construction", "department"
Covering-type segmentation ambiguity
  Definition: Definition 2' (true ambiguity); Definition 2 excluding Definition 2' (pseudo ambiguity)
  Nature: inherent ambiguity
  Examples: "get up", "handle", "one line", "triangle", "height", "entry", "conclusion"
Sun Maosong and Zuo Zhengping (1998) pointed out that segmentation ambiguity should be further divided into "true ambiguity" and "pseudo ambiguity". For example, among intersection-type ambiguities, 地面积 ("land area") is a true ambiguity (both 地 │ 面积, as in "these plots of land are really not small in area", and 地面 │ 积, as in "thick snow has accumulated on the ground", occur), whereas 和软件 ("and software") is a pseudo ambiguity (although the two segmentations 和软 │ 件 and 和 │ 软件 are both formally possible, in real text the string is without exception segmented as 和 │ 软件). Likewise, among covering-type ambiguities, 起身 ("get up") is a true ambiguity and 平淡 ("plain") is a pseudo ambiguity.
Incidentally, this paper arranges the types of segmentation ambiguity in a table (see Table 1), in the hope of helping to clear up a long-standing conceptual confusion.
Two basic observations can be made about segmentation ambiguity: 1) According to the statistics of Sun Maosong and Zuo Zhengping (1998) on a corpus of 100 million characters, the length of intersection-type ambiguities ranges from 3 to 14 characters (for example, a string glossed as "improve the people's living standards"), the length of the intersection string ranges from 1 to 3 characters (for example, "like an arrow on the bowstring"), and the chain length ranges from 1 to 9 (for example, "the living standards of the Chinese people and beautification");
2) Intersection-type and covering-type ambiguities are often entangled with each other, which adds further variation. As Figure 1 shows, 19 possible segmentations can be derived from a single string (arcs indicate words).
Figure 1. Some basic types of mixed ambiguity
2.1.2 Detection and resolution of segmentation ambiguity
Handling segmentation ambiguity involves two parts: (1) detecting the ambiguity; (2) resolving it. Logically, the two can be treated as relatively independent steps.
First, detection. The "maximum matching method" (strictly speaking, longest-word-first matching) is the earliest and also the most basic method of Chinese automatic word segmentation; it was introduced in the journal Text Reform as early as 1963 (Liu Yongquan 1988). Liu Yuan and Liang Nanyuan (1986) were the first to apply it in a Chinese automatic word segmentation system. Depending on the direction in which the sentence is scanned, it is divided into forward maximum matching, MM (left to right), and reverse maximum matching, RMM (right to left). The maximum matching method in effect merges ambiguity detection and resolution into a single process and outputs one unique segmentation for the input sentence. According to the experiments of Liang Nanyuan (1987), with no knowledge other than the lexicon, the segmentation error rate of maximum matching lies between 1 per 169 characters and 1 per 245 characters, and the method has the advantage of being simple and fast. Guo J. (1997) gives a rigorous formal account of the working principle of maximum matching. In addition, Jie Chunyu, Liu Yuan, et al. (1989) gave a complete analysis of the structure of the maximum matching method and its time efficiency.
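To make the two scanning directions concrete, here is a minimal Python sketch of forward and reverse maximum matching. The toy lexicon, the `max_len` window, and the rule that a single character is always accepted as a fallback are assumptions made for this illustration, not details taken from the systems cited above.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"结合", "合成", "成分", "分子"}

def forward_mm(sentence, lexicon, max_len=4):
    """Forward maximum matching: scan left to right, longest word first."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def reverse_mm(sentence, lexicon, max_len=4):
    """Reverse maximum matching: scan right to left, longest word first."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

print(forward_mm("结合成分子", LEXICON))  # ['结合', '成分', '子']
print(reverse_mm("结合成分子", LEXICON))  # ['结', '合成', '分子']
```

On this string the two directions commit to different word boundaries, and each returns exactly one segmentation, which is the sense in which maximum matching folds detection and resolution into one step.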
The "bidirectional maximum matching method" is derived from maximum matching and combines MM with RMM. Sun M. S. and Benjamin K. T. (1995) note that for about 90.0% of the sentences in Chinese text, the MM and RMM segmentations coincide and are correct; for about 9.0% of sentences the MM and RMM segmentations differ, but one of the two is necessarily correct (ambiguity detection succeeds); and for fewer than 1.0% of sentences, either the MM and RMM segmentations coincide but are wrong, or they differ and neither is correct (ambiguity detection fails). This is why bidirectional maximum matching is widely used in practical Chinese information processing systems.
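The detection idea can be sketched as follows: run both directions and flag the sentence as ambiguous when the results differ. The lexicon, the function names, and the trick of implementing the reverse scan by mirroring the input are assumptions of this sketch rather than a description of any cited system.

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"结合", "合成", "成分", "分子", "起身"}

def forward_mm(sentence, lexicon, max_len=4):
    """Forward maximum matching with single characters as a fallback."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def reverse_mm(sentence, lexicon, max_len=4):
    """Reverse maximum matching, run as forward matching on the mirrored input."""
    mirrored_lexicon = {word[::-1] for word in lexicon}
    mirrored = forward_mm(sentence[::-1], mirrored_lexicon, max_len)
    return [word[::-1] for word in reversed(mirrored)]

def bidirectional_mm(sentence, lexicon):
    """Return (MM result, RMM result, True if they differ)."""
    mm = forward_mm(sentence, lexicon)
    rmm = reverse_mm(sentence, lexicon)
    return mm, rmm, mm != rmm

print(bidirectional_mm("结合成分子", LEXICON))
# (['结合', '成分', '子'], ['结', '合成', '分子'], True)
print(bidirectional_mm("他站起身来", LEXICON))
# (['他', '站', '起身', '来'], ['他', '站', '起身', '来'], False)
```

The second call shows the kind of case that goes undetected: both directions agree on 起身, so no ambiguity is flagged even though example (4a) requires 起 │ 身.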
Obviously, bidirectional maximum matching has blind spots in ambiguity detection. For ambiguity detection, two other valuable pieces of work are the "minimum word count method" of Wang Xiaolong, Wang Kaizhu, et al. (1989) (its detection ability is stronger than that of bidirectional maximum matching, while the number of candidate segmentations it produces increases only slightly) and the "full segmentation method" of Ma Zhi (1996) (it enumerates all possible segmentations exhaustively and so detects ambiguity without blind spots, but at the cost of producing a large amount of segmentation "garbage"). The problem has not been completely solved to this day. If bidirectional maximum matching is regarded as one extreme (the simplest) and full segmentation as the other (the most complex), the goal should be to find a compromise between the two that (almost) eliminates detection blind spots while keeping the number of candidate segmentations from growing without bound.
Next, resolution. For more than a decade, researchers have mobilized almost every "fashionable" computational technique in artificial intelligence to tackle segmentation ambiguity, each showing its own prowess like the Eight Immortals crossing the sea. Typical means include: "relaxation" (Fan C. K. and Tsai W. H. 1988), "augmented transition networks" (Huang Xiangxi 1989), "phrase structure grammar" (Liang Nanyuan 1990; Yao Tianshun, Zhang Guiping, et al. 1990; Yeh C. L. and Lee H. J. 1991; Han Shixin, Wang Kaizhu 1992), [...] (... et al. 1996), [...] (Lai B. Y. and Sun M. S., et al. 1997; Shen Dayang, Sun Maosong, et al. 1997a; Sun Maosong, Zuo Zhengping 1999), and the "Brill transformation-based method" (Palmer D. D. 1997). These explorations reflect different facets of the computation involved in ambiguity resolution, and each has achieved results within a certain scope. Taken as a whole, however, either the treatment is too coarse; or the study is fairly thorough but the computational power of the model itself is weak; or only a framework has been set up and the matter barely touched; or the scale of the experiment is too small to be persuasive.
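The trade-off between exhaustive segmentation and a fewest-words criterion, discussed above for ambiguity detection, can be made concrete with a brute-force Python sketch. It is only an illustration: the lexicon and function names are assumptions, the enumeration is exponential in the worst case, and it is not the algorithm of Wang Xiaolong, Wang Kaizhu, et al. (1989) or of Ma Zhi (1996).

```python
# Toy lexicon, assumed for illustration only.
LEXICON = {"结合", "合成", "成分", "分子"}

def all_segmentations(sentence, lexicon):
    """Every way of cutting `sentence` into lexicon words or single characters."""
    if not sentence:
        return [[]]
    results = []
    for size in range(1, len(sentence) + 1):
        head = sentence[:size]
        if size == 1 or head in lexicon:
            for rest in all_segmentations(sentence[size:], lexicon):
                results.append([head] + rest)
    return results

def fewest_word_segmentations(sentence, lexicon):
    """Keep only the candidates with the smallest number of words."""
    candidates = all_segmentations(sentence, lexicon)
    fewest = min(len(candidate) for candidate in candidates)
    return [c for c in candidates if len(c) == fewest]

print(len(all_segmentations("结合成分子", LEXICON)))  # 8
for candidate in fewest_word_segmentations("结合成分子", LEXICON):
    print(candidate)
# ['结', '合成', '分子']
# ['结合', '成', '分子']
# ['结合', '成分', '子']
```

Even on a five-character string, exhaustive enumeration yields eight candidates, most of them "garbage", while the fewest-words filter keeps three, including 结合 │ 成 │ 分子, which neither matching direction can produce with this lexicon.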
Through continued practice, people have come to recognize more deeply that without sufficient linguistic knowledge as support, advanced computational means are merely a pewter spearhead that looks like silver: impressive to look at but of no real use. In terms of the linguistic knowledge it draws on, ambiguity resolution has evolved from shallow to deep and from simple to complex:
1) Some systems (especially early ones) mainly use simple information such as word frequency, morpheme type (free or bound), and the structure of the ambiguity (Fan C. K. and Tsai W. H. 1988; Li Guochen, Liu Kaiying, et al. 1988; Wang Yongcheng, Su Haiju, et al. 1990; Chen K. J. and Liu S. H. 1992; Ma Zhi 1996).
2) Sun M. S. and Lai B. Y., et al. (1992) revealed the role of syllable information in automatic word segmentation.
3) He Kekang, Xu Hui, et al. (1991) estimated that 95.0% of segmentation ambiguities can be resolved with knowledge at or below the syntactic level, and only 5.0% must resort to semantic and pragmatic knowledge. Rule-based segmentation systems (Huang Xiangxi 1989; Liang Nanyuan 1990; Yao Tianshun, Zhang Guiping, et al. 1990; Yeh C. L. and Lee H. J. 1991; Han Shixin, Wang Kaizhu 1992; Xu Hui, He Kekang, et al.) rely mainly on lexical and syntactic rules to detect and resolve ambiguity. Their weakness is that the rule set is compiled subjectively by hand and is therefore plagued by "inherent" problems of systematicity, effectiveness, consistency, and maintainability.
4) To overcome the drawbacks of hand-written rule sets, some researchers have tried a different, statistical route. Lai B. Y. and Sun M. S., et al. (1992; 1997), Chang C. H. and Chen C. D. (1993), and Bai Shuanhu (1995) combine automatic word segmentation with Markov-chain-based automatic part-of-speech tagging, using part-of-speech bigram statistics extracted from a manually annotated corpus to resolve segmentation ambiguity (the tagging result is fed back to segmentation, and the two run in parallel). A preliminary experiment (Lai B. Y. and Sun M. S., et al. 1997) shows that, compared with "maximum matching segmentation first, automatic part-of-speech tagging afterwards" (no feedback from tagging; the two run serially), this approach raises segmentation accuracy and tagging accuracy by 1.3% and 1.4%, respectively.
(5) The two of them began their romance in January of the previous year.
Segmentation A. ... 是 │ 从头 │ 年 │ 元月 │ ...
verb, adverb, temporal measure word, time word
Segmentation B. ... 是 │ 从 │ 头年 │ 元月 │ ...
verb, preposition, time word, time word
Although the word frequencies of 从头 and 年 are higher than those of 从 and 头年, the probability of the tag sequence "verb, adverb, temporal measure word, time word" is far lower than that of "verb, preposition, time word, time word", so segmentation B is selected as the result.
5) Wu A. D. and Jiang Z. X. (1998) go further. They hold that in most cases segmentation ambiguity can be handled properly within a local span of the input sentence, but some complex ambiguities can be resolved only over a wider span of the sentence. When this happens, their system performs a full syntactic analysis of the sentence; if the analysis fails, the corresponding segmentation is rejected:
(6) 在这些企业中国有企业有十个。 (Among these enterprises, ten are state-owned.)
Segmentation A. 在 │ 这些 │ 企业 │ 中 │ 国有 │ 企业 │ 有 │ 十 │ 个。
Segmentation B. 在 │ 这些 │ 企业 │ 中国 │ 有 │ 企业 │ 有 │ 十 │ 个。
No credible syntax tree can be obtained for segmentation B, so it is rejected.
Of course, the deeper the analysis, the stronger the dependence on the quality and scale of the knowledge base, and the greater the cost in time and space (moreover, a Chinese syntactic parser that works on real text is hardly achievable in the foreseeable future, which is another factor to consider). At times one cannot help feeling caught in a circular puzzle: resolving ambiguity, a relatively "simple" subtask, seems to demand more than word segmentation itself can deliver. Behind this "paradox" lies a "subtext" that is quite instructive for the design of Chinese natural language processing systems; it is not expanded upon here.
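The statistical strategy in item 4), illustrated by example (5), can be sketched as scoring each candidate segmentation by lexical probabilities multiplied by part-of-speech bigram probabilities and keeping the best one. All numbers below are hypothetical values chosen for illustration, not statistics from any cited corpus, and the tag names and function names are likewise assumptions.

```python
import math

# Hypothetical tag-bigram probabilities P(tag | previous tag).
TRANSITION = {
    ("verb", "adverb"): 0.05,
    ("adverb", "measure"): 0.01,
    ("measure", "time"): 0.02,
    ("verb", "preposition"): 0.20,
    ("preposition", "time"): 0.30,
    ("time", "time"): 0.10,
}
# Hypothetical lexical probabilities P(word | tag).
LEXICAL = {
    ("是", "verb"): 0.1,
    ("从头", "adverb"): 0.004,
    ("年", "measure"): 0.01,
    ("从", "preposition"): 0.05,
    ("头年", "time"): 0.0005,
    ("元月", "time"): 0.002,
}

def log_score(tagged_words):
    """Log probability of a tagged segmentation under a first-order Markov chain."""
    score, previous = 0.0, None
    for word, tag in tagged_words:
        score += math.log(LEXICAL[(word, tag)])
        if previous is not None:
            score += math.log(TRANSITION[(previous, tag)])
        previous = tag
    return score

SEG_A = [("是", "verb"), ("从头", "adverb"), ("年", "measure"), ("元月", "time")]
SEG_B = [("是", "verb"), ("从", "preposition"), ("头年", "time"), ("元月", "time")]

best = max([SEG_A, SEG_B], key=log_score)
print([word for word, _ in best])  # ['是', '从', '头年', '元月']
```

With these made-up values the lexical terms alone slightly favor segmentation A, but the tag-sequence terms outweigh them, so segmentation B wins, mirroring the reasoning given for example (5).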
Another piece of work worth mentioning is that of Sun Maosong and Zuo Zhengping (1999). They found that the top 4,619 high-frequency intersection-type ambiguous strings extracted from a real Chinese corpus of 100 million characters cover 59.20% of all intersection-type ambiguities in that corpus (their coverage of another, completely independent corpus is 50.85%, which indicates that the distribution of high-frequency intersection-type ambiguities is relatively stable). Of these, 4,279 are pseudo ambiguities (such as "and software", "give full play", "love is invincible"), with coverage as high as 53.35%. Since the resolution of a pseudo ambiguity does not depend on context, they proposed a simple but effective strategy: store the correct (unique) segmentation of each such string in a table, so that high-frequency intersection-type pseudo ambiguities are resolved by direct table lookup. In essence, this is a memory-based model.
2.2 Unknown words and their processing
Unknown words fall roughly into two categories: 1) newly emerging general words or technical terms; 2) proper nouns, such as Chinese person names, transliterated foreign names, place names, and organization names (broadly, groups, enterprises, and institutions). The first category is foreseeable in theory and can be added to the lexicon by hand (although this is only the ideal and is not easy to achieve in practice); the second category is entirely unforeseeable, and no lexicon, however large, can cover it.
Sun Maosong and Zou Jiayan (1995) pointed out that in real text (even in general-domain text), unknown words affect segmentation accuracy more than segmentation ambiguity does. Unknown word processing is indispensable in a practical word segmentation system.
For the first category of unknown words, the usual practice is to generate a candidate word list automatically with the support of a large corpus (an unsupervised machine learning strategy), and then to screen out the new words by hand and add them to the lexicon. Since a segmented corpus of sufficient size is at present as unattainable as the moon in the water or flowers in a mirror, existing research in this direction is without exception based on the distribution of Chinese character n-grams (n ≥ 2) in a very large raw corpus. Sproat R. and Shih C. L. (1993) borrowed "mutual information" from information theory to describe the binding strength between any two Chinese characters. Sun M. S. and Shen D. Y., et al. (1998) extended this idea and proposed the t-test difference between Chinese characters as a useful supplement to mutual information. Huang Xuanjing, Wu Lide, et al. (1996) introduced the Pearson χ² statistic for testing independence in fourfold contingency tables from classical statistics, performed an internal association analysis on arbitrary character strings of 2, 3, and 4 characters, and derived candidate words from the result. The work of Nie J. Y. and Jin W. Y., et al. (1994) and of Liu Ting, Wu Yan, et al. (1998) uses only the relatively simple string frequency. All of these statistics (mutual information, t-test difference, χ² statistic, string frequency) depend on a very large corpus; Sun Maosong and Zou Jiayan (1995) call them global statistics.
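As an illustration of the mutual-information idea, the sketch below estimates pointwise mutual information for adjacent character pairs from raw counts. The toy corpus and the function name are assumptions of this example, and the simple maximum-likelihood estimate used here is not necessarily the exact formulation of Sproat R. and Shih C. L.

```python
import math
from collections import Counter

def mutual_information(corpus):
    """Pointwise mutual information (in bits) of each adjacent character pair."""
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, assumed for illustration only.
CORPUS = "北京很大上海很大北京很美上海很美"
scores = mutual_information(CORPUS)

# Characters inside a word bind more tightly than characters across a boundary.
print(scores["北京"] > scores["京很"])  # True
print(scores["上海"] > scores["海很"])  # True
```

On a corpus this small the estimates are not meaningful in themselves; the point is only that a pair that always occurs together scores higher than a pair whose second character also appears freely elsewhere, which is why such statistics need a very large corpus to be reliable.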
The second category of unknown words is usually handled as follows. First, using statistical knowledge drawn from a collection of proper nouns (such as surname characters and their frequencies) together with hand-summarized structural rules, the system guesses which character strings in the input sentence may be proper nouns and assigns each a confidence value. It then uses contextual information indicative of that type of proper noun (such as titles), together with global and local statistics (see below), to make the final identification. Existing work covers four common types of proper noun: Chinese person names (Zhang Junsheng, Chen Yude, et al. 1992; Song Yugong, Zhu Hong, et al. 1993; Sun Maosong, Huang Changning, et al. 1995), transliterated foreign names (Sun Maosong, Zhang Weijie 1993), Chinese place names, and organization names (Chen H. H. and Lee J. C. 1994; Zhang Xiaoheng, Wang Lingling 1997). Judging from the reported experimental results, transliterated foreign names are identified best, Chinese person names next, Chinese place names after that, and organization names worst. The intrinsic difficulty of the tasks increases in the same order. Shen Dayang, Sun Maosong, et al. (1997b) particularly emphasized the value of local statistics in unknown word processing. Local statistics, as opposed to global statistics, are statistics (usually string frequencies) computed from the current article, and their validity is generally limited to that article. Sun Maosong and Zou Jiayan (1995) demonstrated the effect of local statistics with the following example:
(7) Feng Jun(fa), a member from Henan, is willing to donate 1,000 hundred-day-red plants free of charge.
Segmentation A. Henan │ member │ Feng Junfa │ wishes (愿) │ free of charge │ to donate │ hundred-day-red │ 1000 │ plants.
Segmentation B. Henan │ member │ Feng Jun │ vows (发愿) │ free of charge │ to donate │ hundred-day-red │ 1000 │ plants.
Taken in isolation, sentence (7) cannot be decided between segmentation A and segmentation B even with syntactic or semantic analysis (both are reasonable). Only by breaking out of the confines of the sentence boundary can the decision be made, in a unit larger than the sentence: the discourse. For example, if "Feng Junfa" appears elsewhere in the text, take segmentation A; if "Feng Jun" appears, take segmentation B. Clearly, local statistics have much in common with the "short-term memory" mechanism in psychology and the "cache" mechanism in computer technology.
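A minimal sketch of this use of local statistics: count how often each candidate name occurs in the rest of the article and prefer the better-supported one. The function name, the tie-breaking rule, and the Chinese character forms 冯俊发 and 冯俊 for the two names are assumptions made for this illustration; the cited work does not necessarily decide in this way.

```python
def choose_name(context, long_name, short_name):
    """Pick the candidate name better supported by the rest of the article.

    `context` is the article text without the ambiguous sentence itself.
    Assumes `long_name` starts with `short_name`, as in sentence (7).
    """
    long_count = context.count(long_name)
    # Occurrences of the short name that are not part of the long name.
    short_count = context.count(short_name) - long_count
    if long_count > short_count:
        return long_name
    if short_count > long_count:
        return short_name
    return None  # the local evidence does not decide

# Hypothetical article contexts (character forms of the names are assumed).
CONTEXT_A = "冯俊发今年六十岁。冯俊发种花多年。"
CONTEXT_B = "冯俊今年六十岁。冯俊种花多年。"

print(choose_name(CONTEXT_A, "冯俊发", "冯俊"))  # 冯俊发
print(choose_name(CONTEXT_B, "冯俊发", "冯俊"))  # 冯俊
```

The counts are valid only inside the current article, which is exactly what distinguishes them from global statistics gathered over a very large corpus.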
In general, the introduction of unknown words gives rise to new segmentation ambiguities, which makes the situation faced by a word segmentation system even more complicated. Sun M. S. and Shen D. Y., et al. (1997) explicitly distinguish: 1) ambiguity between ordinary words (Section 2.1); 2) ambiguity between ordinary words and unknown words; 3) ambiguity between unknown words.
Consider sentence (8):
(8) Wang Linjiang loves to play football.
The Chinese person name recognition module proposes the candidates "Wang Lin", "Wang Linjiang", "Lin Jiang", "Lin Jiang'ai", and "Jiang'ai". Among them, overlapping candidates such as "Wang Lin" and "Wang Linjiang", "Wang Lin" and "Lin Jiang", "Wang Linjiang" and "Lin Jiang'ai", "Wang Linjiang" and "Jiang'ai", "Lin Jiang" and "Lin Jiang'ai", "Lin Jiang" and "Jiang'ai", and "Lin Jiang'ai" and "Jiang'ai" give rise to ambiguity between unknown words, while the ordinary word "love" (ai) and the candidate "Jiang'ai" give rise to ambiguity between an ordinary word and an unknown word.
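The conflicts in sentence (8) amount to overlapping spans, which the following sketch enumerates. The Chinese character form of the sentence and of the candidate names is an assumption made for this illustration, as are the function names.

```python
from itertools import combinations

# Assumed character form of sentence (8) and of the candidate names.
SENTENCE = "王林江爱踢足球"
NAME_CANDIDATES = ["王林", "王林江", "林江", "林江爱", "江爱"]

def span(sentence, candidate):
    """(start, end) of the first occurrence of `candidate` in `sentence`."""
    start = sentence.index(candidate)
    return start, start + len(candidate)

def overlaps(a, b):
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def conflicts(sentence, candidates):
    """All pairs of candidates whose spans overlap."""
    spans = {c: span(sentence, c) for c in candidates}
    return [(x, y) for x, y in combinations(candidates, 2)
            if overlaps(spans[x], spans[y])]

name_conflicts = conflicts(SENTENCE, NAME_CANDIDATES)
print(len(name_conflicts))  # 9

# Candidates that clash with the ordinary word 爱 ("love").
word_span = span(SENTENCE, "爱")
print([c for c in NAME_CANDIDATES if overlaps(span(SENTENCE, c), word_span)])
# ['林江爱', '江爱']
```

Under these assumed spans, every pair of the five candidates except 王林 and 江爱 overlaps, which gives a sense of how quickly conflicts multiply once unknown-word candidates are introduced.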
It must be noted that research on unknown words is still at an initial stage; in methodology, and especially in computational models for local statistics, major breakthroughs are still needed. The two examples above are not elaborated further here; the reader may wish to reflect on them carefully.
2.3 Language resource construction
A good automatic word segmentation system cannot do without the support of the necessary language resources. Three resources matter most: a general-purpose lexicon, a segmented and part-of-speech-tagged corpus, and a very large raw corpus. On the one hand, they are rich "mines" from which the various kinds of knowledge needed by a word segmentation system can be extracted (for example, the static distribution of segmentation ambiguity, which depends on the lexicon; the statistical parameters of the disambiguation model, and even of a hidden Markov model, which can be learned from the segmented and tagged corpus; and global statistics, which can be computed automatically from the very large raw corpus). On the other hand, the segmented and tagged corpus can serve as test material for quantitative evaluation of the performance of an automatic word segmentation system. The construction of language resources is therefore an indispensable part of research on automatic word segmentation.
The main difficulties on this front are in fact "classic" problems that have long been outstanding in Chinese linguistics, such as the boundaries between words and morphemes and between words and phrases, the word class system, and the classification of specific words. For reasons of space they are not discussed at length. Only a few words are said here about the first problem (in effect, the so-called segmentation standard). The segmentation standard directly affects the quality of the lexicon and of the segmented corpus. Although a national standard exists (State Bureau of Technical Supervision 1993; Liu Yuan et al. 1994), and some institutions have drawn up their own (Huang Juren, Chen Kejian, et al. 1997), these standards are hard to apply (for example, the national standard's criterion for what counts as a word, "closely combined and stably used", is difficult to put into operation), which makes it difficult to build a lexicon and a segmented corpus with good consistency (Sun Maosong 1999). In response, Liang Nanyuan, Liu Yuan, et al. (1991) and Sun Maosong, Zhang Lei (1997) proposed a solution that "combines human and machine, and combines the qualitative with the quantitative", and carried out experiments of a certain scale, but it will take time before this idea becomes truly practical.
In short, on this front linguistics has ample room to show its strength, and computational linguistics is opening its arms, eagerly and sincerely, to the embrace of linguistics. Conversely, the nature of language computation (a system must cover all the language phenomena it intends to process) will compel linguistics to describe and analyze language more comprehensively and more deeply, and to distill theory from that work, so that it reflects the true face of the language and becomes more general.
3. Future research priorities
In December 1995, the intelligent computer group of the 863 Program under the State Science and Technology Commission organized an evaluation of automatic word segmentation, in which several systems from China took part. The results under open test conditions were as follows: the highest segmentation accuracy was 89.4%; the correct resolution rate was at most 78.0% for intersection-type ambiguity and at most 59.0% for covering-type ambiguity; and for unknown word recognition, the highest correct rate was 58.0% for person names and 65.0% for place names (Liu Kaiying 1997). In March 1998, the State Science and Technology Commission organized a second evaluation, with results similar to the first. This means that even for automatic word segmentation, the lowest-level and simplest task in Chinese analysis, there is still some distance to go before it is truly practical, and hard and meticulous effort is still required.
This less-than-optimistic situation does not undermine the feasibility of automatic word segmentation argued for in Section 1. Although the amount of engineering still to be done is large, in terms of intrinsic difficulty automatic word segmentation is after all not a case of "carrying Mount Tai across the North Sea": it is something not yet done, rather than something that cannot be done. What, then, should future research do in order to reach the ideal state? Drawing on their own research experience, the authors believe the following work should be undertaken:
1) Build a widely accepted, high-quality general-purpose lexicon as soon as possible. This is what ensures that all other work on automatic word segmentation rests on a solid and reliable footing.
2) Establish a set of Chinese word segmentation and part-of-speech tagging standards that colleagues in the field recognize and follow, and develop a balanced segmented and part-of-speech-tagged corpus on the scale of millions of characters, as well as a raw corpus on the scale of tens of millions of characters or more. Results should be shared among groups as far as possible to avoid simple duplication.
3) With the support of a general-purpose lexicon and a very large corpus, systematically identify the intersection-type ambiguities that are high in frequency and stable (that is, largely independent of domain), which may be called general-purpose ambiguities, and devise targeted solutions.
4) Research on covering-type ambiguity is very weak, and statistical means seem to fall short of it; new countermeasures should be explored.
5) Refine the existing recognition mechanisms for the various types of proper noun, and add recognition mechanisms for Japanese names and ethnic minority names.
6) Study mechanisms for handling conflicts among the recognition mechanisms for the various types of proper noun.
7) Continue to tap the potential of global statistics and local statistics, while taking care to overcome their side effects.
8) [...]
9) [...] (... Zheng Jiaheng 1992), build a more reasonable evaluation model for automatic word segmentation, and strive to make evaluation work authoritative, open, and continuous.
10) Under the guidance of machine learning theory, study how to acquire structured linguistic knowledge from linear or semi-structured sequences of language units, and study complementary strategies that combine supervised and unsupervised learning, so as to maximize the adaptive capacity of automatic word segmentation systems in complex open environments.
【references】
Chang, C. H. and Chen C. D. 1993. A Study on Integrating Chinese Word Segmentation and Part-of-Speech Tagging. Communications of COLIPS 3.2.69-77.
Chen, H. H. and Lee J. C. 1994. The Identification of Organization Names in Chinese Texts. Communications of COLIPS 4.2.131-142.
Chen, K. J. and Liu S. H. 1992. Word Identification for Mandarin Chinese Sentences. Proceedings of the 14th International Conference on Computational Linguistics, 101-107. Nantes.
Fan, C. K. and Tsai W. H. 1988. Automatic Word Identification in Chinese Sentences by the Relaxation Technique. Computer Processing of Chinese and Oriental Languages 4.1.33-56.
Guo, J. 1997. Critical Tokenization and Its Properties. Computational Linguistics 23.4.569-596.
Lai, B. Y., Sun M. S., et al. 1992. Tagging-Based First Order Markov Model Approach to Chinese Word Identification. Proceedings of 1992 International Conference on Computer Processing of Chinese and Oriental Languages, Florida.
---. 1997. Chinese Word Segmentation and Part-of-Speech Tagging in One Step. Proceedings of International Conference: 1997 Research on Computational Linguistics, 229-236. Taipei.
Nie, J. Y., Brisebois M., et al. 1996. On Chinese Word Segmentation and Word-Based Text Retrieval. Proceedings 1996, 405-412. Singapore.
Nie, J. Y., Jin W. Y., et al. 1994. A Hybrid Approach to Unknown Word Detection and Segmentation of Chinese. Proceedings of International Conference on Chinese Computing 1994, 405-412. Singapore.
Palmer, D. D. 1997. A Trainable Rule-Based Algorithm for Word Segmentation. Proceedings of the 35th Annual Meeting of ACL and 8th Conference of the European Chapter of ACL. Madrid.
Sproat, R. and Shih C. L. 1993. A Statistical Method for Finding Word Boundaries in Chinese Text. Computer Processing of Chinese and Oriental Languages 4.4.336-249.
Sproat, R., Shih C. L., et al. 1996. A Stochastic Finite-State Word Segmentation Algorithm for Chinese. Computational Linguistics 22.3.377-404.
Sun, M. S. and Benjamin K. T. 1995. Ambiguity Resolution in Chinese Word Segmentation. Proceedings of the 10th Asia Conference on Language, Information and Computation, 121-126. Hong Kong.
Sun, M. S., Lai B. Y., et al. 1992. Some Issues on Statistical Approach to Chinese Word Identification. Proceedings of the 3rd International Conference on Chinese Information Processing, 246-253. Beijing.
Sun, M. S., Lin F. Z., et al. 1996. Linguistic Processing for Chinese OCR & TTS. Proceedings of the 2nd International Conference of Virtual Systems and Multimedia, 27-42. Gifu.
Sun, M. S., Shen D. Y., et al. 1997. CSeg & Tag 1.0: A Practical Word Segmenter and POS Tagger for Chinese Texts. Proceedings of the 5th Conference on Applied Natural Language Processing, 119-126. Washington DC.
---. 1998. Chinese Word Segmentation without Using Lexicon and Hand-Crafted Training Data. Proceedings of the 36th Annual Meeting of Association of Computational Linguistics and the 17th International Conference on Computational Linguistics, 1265-1271. Montreal.
Wu, A. D. and Jiang Z. X. 1998. Word Segmentation in Sentence Analysis. Proceedings of the 1998 International Conference on Chinese Information Processing, 169-180. Beijing.
Wu, Z. M. and Tseng G. 1993. Chinese Text Segmentation for Text Retrieval: Achievements and Problems. Journal of the American Society for Information Science 44.9.532-542.
---. 1995. ACTS: An Automatic Chinese Text Segmentation System for Full Text Retrieval. Journal of the American Society for Information Science 46.1.83-96.
Yeh, C. L. and Lee H. J. 1991. ...
Bai Shuanhu, 1995, Chinese word segmentation and part-of-speech tagging. "Progress and Application of Computational Linguistics" Beijing: Tsinghua University Press, 56-61.
Cao Huanguang, Zheng Jiaheng, 1992, An evaluation model for the quality of automatic word segmentation software. "Journal of Chinese Information Processing" No. 4, 57-61.
Dong Zhendong, 1997, A casual discussion of Chinese word segmentation research. "Applied Linguistics" No. 1, 107-112.
State Bureau of Technical Supervision, 1993, National Standard of the People's Republic of China GB/T 13715-92. Beijing: China Standards Press.
Huang Zhenren, Chen Kejian, 1997, Design ideas and content of the "Chinese Word Segmentation Norms for Information Processing". "Applied Linguistics" No. 1, 92-100.
Huang Weijing, Wu Lide, et al., 1996, Machine-learning-based word segmentation without a manually compiled dictionary. "Pattern Recognition and Artificial Intelligence" No. 4, 297-303.
Huang Xiangxi, 1989, A "generate-test" method for automatic word segmentation of Chinese. "Journal of Chinese Information Processing" No. 4, 42-49.
Han Shixin, Wang Ying Ying, 1992, Research on word segmentation based on phrase structure. "Journal of Chinese Information Processing" No. 3, 48-53.
He Kekang, Xu Hui, 1991, Design principles of an expert system for automatic word segmentation of written Chinese. "Journal of Chinese Information Processing" No. 2, 1-14.
Jie Chunyu, Liu Yuan, 1989, On methods of Chinese automatic word segmentation. "Journal of Chinese Information Processing" No. 1, 1-9.
1988, Chinese automatic word segmentation and the treatment of ambiguous combination structures. "Journal of Chinese Information Processing" No. 3, 27-33.
Liang Nanyuan, 1987, CDWS: an automatic word segmentation system for written Chinese. "Journal of Chinese Information Processing" No. 2, 44-52.
---, 1990, Knowledge for computer automatic word segmentation of Chinese. "Journal of Chinese Information Processing" No. 2, 29-33.
Liang Nanyuan, Liu Yuan, et al., 1991, Principles for developing the "Modern Chinese Common Word List for Information Processing" and a discussion of problems. "Journal of Chinese Information Processing" No. 3, 26-37.
Liu Kaiying, 1997, Research on evaluation techniques for automatic word segmentation of modern Chinese. "Applied Linguistics" No. 1, 101-106.
Liu Ting, Wu Yan, et al., 1998, A Chinese automatic word segmentation system combining string frequency statistics and word matching. "Journal of Chinese Information Processing" No. 1, 17-25.
1998, Thoughts and experiments on the segmentation of ambiguous fields. "Journal of Chinese Information Processing" No. 2, 63-64.
Liu Yongquan, 1988, On the problem of words. "Journal of Chinese Information Processing" No. 2, 47-50.
Liu Yuan, Liang Nanyuan, 1986, A basic project of Chinese processing: modern Chinese word frequency statistics. "Journal of Chinese Information Processing" No. 1, 17-25.
Liu Yuan et al., 1994, "Modern Chinese Word Segmentation Norms for Information Processing and Automatic Word Segmentation Methods" Beijing: Tsinghua University Press and Guangxi Science and Technology Press.
Research and implementation of an evaluation-based Chinese automatic word segmentation system. "Treatises on Language Information Processing" Beijing: Tsinghua University Press and Guangxi Science and Technology Press, 2-36.
Shen Dayang, Sun Maosong, 1995, Automatic identification of Chinese place names. "Progress and Application of Computational Linguistics" Beijing: Tsinghua University Press, 68-74.
Shen Dayang, Sun Maosong, et al., 1997a, Information integration and optimal path search in a Chinese word segmentation system. "Journal of Chinese Information Processing" No. 2, 34-47.
---, 1997b, Application and implementation of local statistics in the identification of Chinese unregistered words. "Language Engineering" Beijing: Tsinghua University Press, 127-132.
Song Yugui, Zhu Hong, et al., 1993, A personal name recognition method based on corpus and rules. "Computational Linguistics Research and Application" Beijing: Beijing Language College Press, 150-154.
Sun Maosong, 1999, On the consistency of Chinese corpora. "Applied Linguistics" No. 2, 87-90.
Sun Maosong, Huang Changning, et al., 1995, Automatic identification of Chinese names. "Journal of Chinese Information Processing" No. 2, 16-27.
---, 1997, Using Chinese character bigram relations to resolve overlapping ambiguity in Chinese automatic word segmentation. "Computer Research and Development" No. 5, 332-339.
Automatic identification of transliterated English names. "Computational Linguistics Research and Application" Beijing: Beijing Language College Press, 144-149.
Sun Maosong, Zhang Lei, 1997, Human and machine coexisting, quality and quantity unified: on strategies for developing a Chinese word list for information processing. "Applied Linguistics" No. 1, 79-86.
Several theoretical issues in the study of Chinese automatic word segmentation. "Applied Linguistics" No. 4, 40-46.
Sun Maosong, Zuo Zhengping, 1998, Overlapping ambiguity in real Chinese text. "Research on Chinese Measurement and Computing" Hong Kong: Hong Kong City University Press, 323-338.
---, 1999a, An algorithm for resolving three-character-long overlapping ambiguities in Chinese. "Journal of Tsinghua University" No. 5, 101-103.
Sun Maosong, Zuo Zhengping, 1999b, The role of high-frequency maximal overlapping ambiguity fields in Chinese automatic word segmentation. "Journal of Chinese Information Processing" No. 1, 27-34.
Wang Xiaolong, Wang Youth, et al., 1989, The minimum word segmentation problem and its solution. "Chinese Science Bulletin" No. 13, 1030-1032.
Wang Yongcheng, Su Haiju, et al., 1990, Automatic processing of Chinese words. "Journal of Chinese Information Processing" No. 4, 1-10.
Yao Tianshun, Zhang Guiping, et al., 1990, A rule-based Chinese automatic word segmentation system. "Journal of Chinese Information Processing" No. 1, 37-43.
Xu Bingzhen, Zhan Jian, 1993, A word segmentation method based on neural networks. "Journal of Chinese Information Processing" No. 2, 36-44.
Xu Hui, He Kekang, et al., 1991, Implementation of an expert system for automatic word segmentation of written Chinese. "Journal of Chinese Information Processing" No. 3, 38-47.
Zhang Junsheng, Chen Yude, et al., 1992, Chinese name identification using a multi-corpus approach. "Journal of Chinese Information Processing" No. 3, 7-15.
Zhang Xiaoheng, Wang Lingling, 1997, Identification and analysis of Chinese organization names. "Journal of Chinese Information Processing" No. 4, 21-32.