Yu Shi Wen
1. Significance of developing language information processing technology
The author believes that Chinese information processing can be roughly divided into two levels. One is the character level, namely Chinese character information processing; the other is the language level. This article discusses only Chinese information processing at the language level. All human languages are used to communicate information and spread knowledge, and the various natural languages (such as Chinese and English) share deep similarities. Compared with the character level, therefore, Chinese language information processing has more in common with the processing of other languages. Of course, it also has its own characteristics, and this article naturally pays more attention to those characteristics.
As society becomes increasingly information-based, the desire to communicate with computers in natural language grows ever stronger. Natural language understanding is a fascinating and challenging topic in computer science. From the point of view of computer science, the task of natural language understanding is to build a computational model that can understand natural language as people do. Because of the inherent complexity of natural language, and because people still do not know the mechanism by which they themselves understand it, giving an essential definition of "understanding" is extremely difficult. Since language is the carrier of information, whether a computer understands natural language is generally judged from the viewpoint of practical information processing. If a computer achieves (1) human-machine dialogue, (2) machine translation, or (3) automatic summarization, it is considered to have natural language understanding. Because such practical systems must not only analyze the text or speech input to the computer but also generate language, computer science uses terms such as "natural language processing" or "language information processing" alongside "natural language understanding". To achieve the various functions of language information processing, people are developing techniques for lexical analysis, syntactic analysis, semantic analysis, and context analysis of natural language, and are accumulating language data resources such as electronic dictionaries and corpora. Some of these technologies and resources have already become products, and others will be integrated into new information processing systems. The development of Chinese information processing technology has huge potential.
Because language is closely tied to thought and culture, the study of language has become a point of breakthrough in modern Western philosophy and the humanities. Linguistics is the leading science among the humanities. It is a bridge between the humanities and the natural sciences, and its place in the scientific system as a whole is comparable to that of philosophy and mathematics. With the introduction of mathematical methods and computer technology, contemporary linguistics itself has made a leap forward, and many interdisciplinary branches have appeared, of which computational linguistics is the most active. Current language research abroad revolves around one central issue: the language problems involved in developing intelligent computers. Chinese language studies in China lag significantly in this regard. If computer experts join with linguists to carry out research on language information processing, this will not only narrow the gap but also drive the development of the humanities as a whole.
The nature of intelligence is one of the great scientific problems of our time. To achieve natural language understanding, we must ultimately understand how people understand language and how children learn their native language. Different linguistic theories explain human language differently, and the debates among them cannot be settled because the functioning of the brain, the material basis of intelligent activity (including language activity), is not yet thoroughly understood. Building a cognitive model on a computer that simulates language understanding (current natural language processing systems are prototypes of such a model) provides an observable "window" onto the activity of the black box that is the brain. Computers have successfully simulated logical thinking, and the simulation of imagistic thinking and inspiration is also being explored. The study of natural language understanding can contribute to breakthroughs in the science of intelligence.
2. Difficult history of language information processing research
The application of digital electronic computers to non-numerical fields was first attempted in language information processing. Machine translation experiments began shortly after the electronic computer appeared. However, whether compared with the development of computer technology itself or with computer applications in other fields, progress in language information processing has been quite slow and its road tortuous. The first boom took place in the United States, from the 1950s to the early 1960s. In 1966, the Automatic Language Processing Advisory Committee of the US National Academy of Sciences published a report that poured cold water on machine translation, and language information processing entered a quiet period. From the late 1970s, thanks to rapid advances in computer technology and the development of linguistic theory, as some machine translation systems and natural language interfaces to databases entered practical use, and still more because of social demand, language information processing research re-entered a period of prosperity. Its most visible sign was that a considerable number of language information processing products entered the market. However, the road has not been smooth. The two large-scale machine translation research programs completed in the early 1990s (the European Community's EUROTRA and Japan's ODA project with four neighboring countries) failed to achieve their expected goals. The corpus-based statistical methods advocated by some scholars in the early 1990s also ran into serious obstacles. A considerable number of experts at home and abroad are reflecting on the status quo, theoretical foundations, and technical routes of natural language processing. Some scholars believe the field has not yet surmounted the "semantic barrier", and new breakthroughs are brewing. In recent years the Internet has expanded rapidly, and information pours in like a tide. The main carrier of this information is still natural language.
People are eager to develop natural language information processing technology for automatic text classification, literature retrieval, information extraction, language translation, automatic summarization, and automatic exploration, so as to accelerate the spread of information, knowledge, and culture and to promote social, economic, and scientific progress. Clearly this is a challenge facing every country, and it gives the development of language information processing technology a powerful new impetus. China was one of the earliest countries in the world to carry out machine translation research, but large-scale, relatively systematic research on natural language processing did not begin until the mid-1980s, which is relatively late. Given China's national conditions, Chinese scholars have concentrated on developing practical systems; theoretical research is relatively weak and has produced fewer results. Although some systems have achieved considerable economic benefits, on the whole China's language information processing research still lags behind the current international level. This phenomenon may exist in other fields of science and technology as well. Within language information processing, we need to focus on the special problems related to Chinese.
At the level of semantic analysis and context analysis, the author sees mostly what the various natural languages have in common, and finds it hard to believe the optimistic estimate that Chinese will get ahead of other languages in semantic analysis. At the level of syntactic analysis, by contrast, the author sees special difficulties that the analysis of Chinese encounters.
Compared with English and with Japanese, an agglutinative language, the external characteristic of Chinese as a typical analytic language is that it lacks both morphological change and agglutinative elements that serve as syntactic markers. The author believes that, among existing Chinese grammatical systems, Mr. Zhu Dexi's phrase-based grammar is the one most in line with the reality of Chinese and with the needs of information processing. The phrase-based grammar reveals how the external characteristics of Chinese affect Chinese syntactic analysis, and these effects are the essence of the difficulty of automatic analysis of Chinese. Its account of the characteristics of Chinese syntax can be summarized as follows:
(1) the relation from words to phrases is one of "composition", while the relation from phrases to sentences is one of "realization"; the construction principles of Chinese phrases and of Chinese sentences are basically the same;
(2) the same word in Chinese can serve as a variety of syntactic components in a syntactic structure, with no change of form;
(3) the constituents of a phrase can themselves be phrases of various types; the subject-predicate structure stands on an equal footing with the other syntactic structures and can itself serve as a predicate;
(4) although the internal order of each type of phrase is fixed, the word order of Chinese sentences is quite flexible;
(5) although function words in Chinese have important syntactic functions, in many cases they can be omitted;
(6) written Chinese has no spaces between words, so more language information is lost in writing.
The quality of Chinese-English machine translation is far poorer than that of English-Chinese machine translation, which confirms in practice the correctness of this theoretical reasoning.
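Point (6) can be made concrete with a small sketch. The function below is not a method proposed in this article; it simply enumerates every way of splitting an unspaced string into words from a lexicon. The function name `segmentations`, the toy lexicon, and the example string are illustrative assumptions.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into consecutive words from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon; a real dictionary would be far larger.
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源"}

for seg in segmentations("研究生命起源", TOY_LEXICON):
    print(" / ".join(seg))
```

With this toy lexicon the string admits two segmentations, 研究 / 生命 / 起源 ("study the origin of life") and 研究生 / 命 / 起源 (beginning with "graduate student"). The lexicon alone cannot choose between them, which is the information that the missing word boundaries fail to supply.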
Given the pivotal position of syntactic analysis in most language information processing systems actually in operation, it is worthwhile to recognize clearly the special difficulties of automatic analysis of Chinese. Only by recognizing the difficulties can we find ways to overcome them. Of course, China's language information processing research also has its own advantages. China began its research under the guidance of more advanced linguistic theories and in a more advanced computing environment, and so avoided some of the detours that developed countries took in their early exploration. Chinese is one of the most important languages in the world. The human and language data resources that language engineering requires are abundant in China and relatively inexpensive. Chinese scholars can give full play to their talents, shoulder the responsibility that social development has entrusted to them, and make inventions, creations, and contributions.
3. Thoughts on the development strategy of language information processing technology
3.1 Basic Project - Establish a large-scale comprehensive language knowledge base
People have no difficulty communicating with one another in natural language, because communication always takes place in a particular setting, the knowledge of the two parties (both language knowledge and real-world knowledge) necessarily overlaps, and the purpose of the exchange is generally presupposed. Computers currently have no such knowledge. Real-world knowledge is boundless and can only be handled for specific domains, but language knowledge is common to all domains. Establishing a large-scale comprehensive language knowledge base is therefore an essential piece of basic engineering. This knowledge base covers lexical and syntactic knowledge as well as semantic and even pragmatic knowledge. Its basic language units include words and also morphemes and phrases. It includes raw corpora and corpora processed at multiple levels. Dictionary databases with rich knowledge content and a standardized storage format are essential. To support machine translation, the knowledge base must contain not only knowledge of Chinese but also knowledge relating Chinese to other languages. After more than ten years of hard work, China has accumulated a great deal in this area, but the results are scattered and uneven in quality, and they now need to be integrated and developed further. The "Modern Chinese Syntax Information Dictionary" developed by the Institute of Computational Linguistics at Peking University can become a building block of this comprehensive language knowledge base. The institute's work on this grammar dictionary can also serve as a reference for building the knowledge base.
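The kind of dictionary record described above might be sketched as follows. This is a minimal illustration only, not the format of any existing dictionary: the names `UnitKind`, `Entry`, `KnowledgeBase`, and every field and attribute value are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class UnitKind(Enum):
    """Basic language units held in the knowledge base."""
    MORPHEME = "morpheme"
    WORD = "word"
    PHRASE = "phrase"


@dataclass
class Entry:
    """One dictionary record (hypothetical layout)."""
    form: str
    kind: UnitKind
    pos: str                                               # part of speech
    syntax: Dict[str, str] = field(default_factory=dict)   # syntactic attributes
    semantics: Dict[str, str] = field(default_factory=dict)
    translations: Dict[str, List[str]] = field(default_factory=dict)


class KnowledgeBase:
    """In-memory store keyed by written form; one form may have several entries."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Entry]] = {}

    def add(self, entry: Entry) -> None:
        self._entries.setdefault(entry.form, []).append(entry)

    def lookup(self, form: str, pos: Optional[str] = None) -> List[Entry]:
        found = self._entries.get(form, [])
        return [e for e in found if pos is None or e.pos == pos]


kb = KnowledgeBase()
kb.add(Entry("花", UnitKind.WORD, "noun", translations={"en": ["flower"]}))
kb.add(Entry("花", UnitKind.WORD, "verb",
             syntax={"takes_object": "yes"}, translations={"en": ["spend"]}))

for e in kb.lookup("花"):
    print(e.pos, e.translations["en"])
```

Keeping one record per form and part of speech lets the same written word carry different syntactic attributes and different translation equivalents, which is what a machine translation system would need to consult.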
3.2 Theoretical Exploration - Theoretical System and Computational Model for Chinese
In this regard, we should learn from advanced foreign theories and methods and keep in step with international research. For example, the various computational linguistic models based on complex feature sets and unification proposed by foreign scholars are worth borrowing. The semantic analysis and corpus-based statistical methods advocated by foreign scholars are also worth learning from, but the actual situation of Chinese calls for the study of Chinese grammar rules suited to computer processing. The author believes that, in light of China's actual situation, a technical route that combines machine processing with expert proofreading, and rules with statistics, could complete the multi-level processing of a large-scale corpus in a fairly short time. If the work is organized properly, we may get ahead of others on this goal. Once we have a corpus that meets the requirements, we may be able to construct a probabilistic grammar, and so end the awkward situation in which language rules are stated as admitting no exceptions while there are always language phenomena that do not conform to them. Considering the special difficulties of automatic analysis of Chinese, the author thinks research on restricted Chinese has great practical significance and value. Restricted Chinese is not a mere expedient; it can be a milestone in the history of language information processing. Restricted Chinese may become a common language for people of Chinese descent around the world and promote exchange and cooperation among them. Research on restricted Chinese will promote the standardization and modernization of Chinese and raise its international status. Restricted Chinese may become a high-speed train on the information highway, fully loaded with Chinese culture.
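One common way to combine rules with statistics, offered here only as an illustrative sketch and not as the author's own procedure, is to estimate a probability for each grammar rule by relative frequency in a syntactically annotated corpus. The tree encoding (nested tuples), the function names, and the two-sentence toy treebank are assumptions made for the example.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) for each node of a tree written as nested tuples."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        # Pre-terminal node: the right-hand side is the word itself.
        yield label, tuple(children)
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)


def estimate_rule_probabilities(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank of two annotated sentences.
TREEBANK = [
    ("S", ("NP", ("N", "我们")), ("VP", ("V", "研究"), ("NP", ("N", "语言")))),
    ("S", ("NP", ("N", "学生")), ("VP", ("V", "来"))),
]

probabilities = estimate_rule_probabilities(TREEBANK)
for (lhs, rhs), p in sorted(probabilities.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

In this toy corpus VP is expanded once as V NP and once as V alone, so each rule receives probability 0.5. A rule is no longer simply valid or invalid; it carries a weight drawn from the corpus, which leaves room for phenomena that an absolute rule would exclude.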
3.3 Product development - supporting theoretical research and basic projects
Although the theory and technology of language information processing are still immature, existing technology and language data resources can already be used to develop products that meet market needs or to improve information technology products. Since investment in theoretical research and basic engineering is always insufficient, devoting some effort to product development and using the income to support theoretical research and basic engineering, so as to form a virtuous cycle among theory, foundations, and applications, is undoubtedly a viable technical route overall. For a small unit, however, attending to all of these at once is often exhausting, and this too is a source of distress in the language information processing community.
3.4 Talent Cultivation - Vigorously Cultivate Talents for Computational Linguistics
To promote the development of natural language processing technology and enhance China's competitiveness in this high-tech field, it is important to train talent vigorously, especially young talent, in computational linguistics, the interdisciplinary field that underpins natural language processing technology. Many foreign universities that have linguistics departments have, over the past 10 years, also established computational linguistics departments or programs, and some major universities in the United States have set up doctoral degrees in computational linguistics. China has neither such departments and programs nor master's and doctoral degree programs in the field. At present, doctoral and master's students can only be trained under other disciplines (computer science or linguistics). The author hopes that doctoral and master's programs in computational linguistics can be established on a trial basis at some suitably equipped universities, to accelerate the training of senior professionals in the field of language information processing.