Feng Zhiwei, Institute of Applied Linguistics, State Language Commission
Traditional language research has served language teaching, the collation of documents, and social and historical research. Such research is carried out entirely for people; it has gone on for nearly two thousand years and has achieved considerable results. Since the emergence of the electronic computer, the transmission and exchange of information between people and computers has become an issue, so in addition to continuing human-oriented language research, it has become necessary to carry out computer-oriented language research. Scholars began to use computer technology to study and process natural language and to build a variety of natural language processing systems on computers. Computer-oriented language research began in the 1950s and has since made great progress, becoming an important emerging discipline: natural language processing.
Studying and processing natural language by computer generally involves three steps. First, the problem to be studied is formalized linguistically (linguistic formalism), so that it can be expressed strictly and regularly in a definite mathematical form. Second, this strict and regular mathematical form is expressed as an algorithm (computational formalism). Third, a computer program is written according to the algorithm, so that it is realized on the computer (computer implementation). Studying natural language processing therefore requires knowledge not only of linguistics but also of mathematics and computer science. Natural language processing has thus become an interdisciplinary field at the boundary of linguistics, mathematics, and computer science, one that spans the liberal arts, science, and engineering.
Computer-oriented language research started with research on machine translation systems.
In 1946, when the electronic computer had only just appeared and was being widely applied to numerical computation, people already thought of using computers to translate one or several languages into another language or several other languages. From the early 1950s, machine translation was a central topic in research on natural language processing systems. At that time the main approach was "word-for-word" translation, a simple technique that was not based on any understanding of natural language, and it did not deliver the expected translation quality. In the mid-1960s, researchers turned to basic issues such as grammar, semantics, and pragmatics, and attempted to make the computer understand natural language. Many scholars held that the most intuitive way to judge whether a computer understands natural language is to let people converse with it: if the computer can answer questions posed in natural language, that shows it has understood the language. This is how "man-machine dialogue" systems (or "natural language understanding" systems) came about. The theory and methods of natural language processing gradually took shape, matured, and were refined in the course of these specific studies.
Research on machine translation systems is a long-standing branch of computer-oriented language research. The idea of using machines to translate languages was put forward as far back as the ancient Greek era. At that time, people tried to design an ideal language to replace the many and varied natural languages, in order to make it easier for people of different nations to exchange ideas. Many schemes were proposed, and some of them already considered how to analyze language by mechanical means.
In the early 1930s, the French scientist G. Artsrouni proposed the idea of using a machine to translate languages.
In 1933, the Soviet inventor Troyanskii (П. П. Троянский) designed a machine for translating one language into another by mechanical means, and registered his invention on September 5 of the same year.
However, because of the limited technical level of the 1930s, Troyanskii's translation machine was never built. The development of machine translation systems began at the end of the 1940s. It can be divided into three periods: the pioneering period, the revival period, and the prosperity period.
(1) Pioneering period (1954-1970): In 1946, J. P. Eckert and J. W. Mauchly of the University of Pennsylvania designed and built ENIAC, the world's first electronic computer. The astonishing speed of the electronic computer prompted people to think about renewing translation technology. In the same year that the electronic computer appeared, the British engineer A. D. Booth and the American scientist W. Weaver, while discussing the range of applications of the electronic computer, raised the idea of using the computer for automatic translation. In 1949, Weaver published a memorandum titled "Translation", which formally proposed machine translation. Besides raising a number of general issues, the memorandum makes two points worth noting. First, he held that translation resembles the process of deciphering a code. He said: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" Second, he held that the original text and its translation "say the same thing", so translating language A into language B means passing from language A through a "universal language" or "intermediate language" and then converting into language B; this "universal language" or "intermediate language" can be assumed to be common to all of humanity. Evidently, Weaver regarded machine translation as nothing more than a mechanical process of deciphering a code, and did not see its complexity in lexical analysis, syntactic analysis, and semantic analysis. Thanks to the enthusiasm of scholars and the strong support of industry, machine translation research in the United States flourished for a time.
In 1954, Georgetown University, in cooperation with IBM, carried out the first machine translation experiment. The Soviet Union, the United Kingdom, and Japan subsequently conducted machine translation experiments as well, and machine translation boomed. The development of early machine translation systems was strongly influenced by Weaver's ideas described above, and many machine translation researchers equated the translation process with the process of deciphering a code. As a result, the readability of the translations was poor, and the systems were difficult to put to practical use. In 1964, the US National Academy of Sciences set up the Automatic Language Processing Advisory Committee (ALPAC) to investigate machine translation research, and in November 1966 the committee published a report titled "Language and Machines". The ALPAC report took a negative attitude toward machine translation. It claimed that there was not yet much reason to give machine translation strong support, and it also pointed out that machine translation research had run into a "semantic barrier" that was difficult to overcome.
Under the influence of the ALPAC report, machine translation research in many countries fell to a low ebb. Many machine translation research units ran into administrative and funding difficulties, the machine translation boom suddenly vanished, and a period of depression set in. Even during this depression, however, France, Japan, Canada, and other countries persisted in machine translation research, so in the early 1970s machine translation began to revive.
China was the fourth country in the world to carry out machine translation research, after the United States, the Soviet Union, and the United Kingdom. Japan, which is at the forefront of machine translation today, only began machine translation research in 1958, later than China. Compared with the development of machine translation abroad, machine translation in China went through a very particular period, a stagnation period, owing to the influence of the Cultural Revolution. In addition, because China's foundation in the theory and methods of machine translation was very thin, each period of machine translation in China lagged slightly behind the corresponding period abroad. These are the distinctive features of the development of machine translation in China.
(1) Pioneering period (1956-1966)
During this period, Chinese scholars carried out preliminary exploration and experiments in machine translation. In 1956, the state included machine translation research in its development plan for scientific work, and it became a research topic. The name of the topic was "Machine translation, the establishment of rules for natural language translation, and the mathematical theory of natural language".
In 1957, the Institute of Linguistics of the Chinese Academy of Sciences, in cooperation with the Institute of Computing Technology, carried out a Russian-Chinese machine translation experiment that translated nine different types of fairly complex sentences. During this pioneering period, the Beijing Foreign Languages Institute, the Beijing Russian Institute, the South China Institute of Technology in Guangzhou, and the Harbin Institute of Technology also set up machine translation research groups and carried out machine translation experiments from Russian or English.
(2) Stagnation period (1966-1975)
During this period, apart from some theoretical exploration carried out under extremely harsh conditions, there was no machine translation research or experimentation.
(3) Revival period (1975-1987)
During this period, machine translation research in China regrouped and began to recover.
In November 1975, a collaborative machine translation research group made up of specialists in information science, linguistics, and computing was set up at the Institute of Scientific and Technical Information of China. Taking 5,000 bibliographic title entries in metallurgy as its material, the group drew up a translation scheme. In May 1978, a sampling test covering 20 samples was run on the Model 111 computer. During this period, Chinese scholars also carried out translation experiments for French-Chinese, German-Chinese, and Japanese-Chinese translation, and for multilingual translation from Chinese into French, English, Japanese, Russian, and German, with some success.
(4) Prosperity period (1987-present)
This period is marked by the appearance of the "Translation Star No. 1" machine translation system. After "Translation Star No. 1", a series of practical, commercial machine translation systems followed, and machine translation in China moved toward the stage of practical use and commercialization.
Another area of computer-oriented language research is the development of natural language understanding systems.
Natural language understanding studies how to make computers understand and use human natural language, so that a computer grasps the meaning of natural language and can answer, in natural language and through dialogue, the questions that people put to it. A natural language understanding system can serve as the natural language man-machine interface for expert systems, knowledge engineering, information retrieval, and office automation, and has great practical value.
After the United States published the ALPAC report rejecting machine translation, the machine translation research of the pioneering period fell to a low ebb, and research on computer processing of natural language gradually turned to natural language understanding. Scholars adopted a variety of ingenious methods in trying to build computer systems that understand natural language. The most intuitive way to judge whether a computer understands natural language is man-machine dialogue: from the computer's answers to questions, one can see whether it has understood. Research in this direction made encouraging progress. Thus, when machine translation was struggling at its low ebb at the end of the 1960s, research on natural language understanding sprang up as a new force and flourished, and by the time machine translation made its comeback, natural language understanding had already borne abundant fruit.
The development of natural language understanding systems can be divided into two stages: first-generation systems and second-generation systems. First-generation systems are based on the analysis of word classes and word order, and often use statistical methods in the analysis. Second-generation systems begin to introduce semantic and even pragmatic factors, and almost completely abandon statistical techniques.
First-generation natural language understanding systems can be divided into four types:
(1) Special format systems: The early natural language understanding systems were mostly special format systems, which carry out man-machine dialogue in a special format chosen to suit the content of the dialogue. In 1963, R. Lindsay designed the SAD-SAM system at the Carnegie Institute of Technology in the United States, using the IPL-V list-processing language. It uses a special format to conduct man-machine dialogue about kinship relations: the system builds a database of kinship relations, accepts questions about kinship posed as English sentences, and answers in English. In 1968, D. Bobrow designed the STUDENT system at the Massachusetts Institute of Technology. The system reduces the English sentences of high-school algebra word problems to a number of basic patterns; it can understand the English sentences in these problems, set up the equations, solve them, and give the answer. In the early 1960s, B. Green built the BASEBALL system at the Lincoln Laboratory, also using the IPL-V list-processing language. The system's database stores data on the 1959 American League baseball season, and the system can answer questions about the games. Its syntactic analysis is weak and the input sentences must be very simple, with no conjunctions and no comparative forms of adjectives or adverbs. It relies mainly on a machine dictionary to recognize words, uses 14 word categories, and represents every question with a special canonical expression.
(2) Text-based systems: Some researchers were dissatisfied with the format restrictions of special format systems, because within a specialized field it is most convenient to converse with a system that is not bound to a special format structure. This led to text-based systems. The PROTOSYNTHEX-I system, designed in 1966 by R. F. Simmons, J. F. Burger, and R. E. Long, works by storing and retrieving text information.
(3) Limited logic systems: Limited logic systems improve further on text-based systems. In such a system, natural language sentences are replaced by a more formal notation, and this notation forms a self-contained limited logic system in which some inference can be performed. In 1968, B. Raphael built the SIR system at the Massachusetts Institute of Technology using the LISP language. It defines 24 matching patterns for English, and input English sentences are matched against these patterns to identify their structure. In going from storing facts in the database to answering questions, the system can handle some of the concepts commonly used in conversation, such as set inclusion and spatial relations, and can perform simple logical inference. The machine can also learn during the conversation, remember what it has learned, and carry out some elementary intelligent activities. In 1965, J. Slagle built the DEDUCOM system, which performs deductive inference in information retrieval. In 1966, F. B. Thompson built the DEACON system, which manages a fictitious military database through English, using a ring structure and the concept of approximate English. In 1968, C. Kellogg built the CONVERSE system on the IBM 360/67 computer; it can reason from 1,000 facts about 120 cities in the United States.
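The pattern-matching idea behind a limited logic system can be illustrated with a small sketch. The following Python code is only an illustration under stated assumptions, not a reconstruction of SIR: the two sentence patterns, the class name `TinyLogic`, and the reply strings are all hypothetical, since the text does not list SIR's 24 patterns. It shows how a matched sentence can be stored as a set-inclusion fact and how a question can be answered by simple transitive inference.

```python
import re

# Hypothetical patterns for illustration; SIR's actual 24 patterns are not given in the text.
PATTERNS = [
    (re.compile(r"^every (\w+) is an? (\w+)$"), "assert_subset"),
    (re.compile(r"^is every (\w+) an? (\w+)$"), "ask_subset"),
]


class TinyLogic:
    """Stores set-inclusion facts and answers inclusion questions."""

    def __init__(self):
        # Maps a set name to the names of the sets that directly include it.
        self.supersets = {}

    def _included_in(self, a, b, seen=None):
        """Return True if a is included in b, following stored facts transitively."""
        if a == b:
            return True
        if seen is None:
            seen = set()
        if a in seen:  # guard against cycles in the stored facts
            return False
        seen.add(a)
        return any(self._included_in(p, b, seen) for p in self.supersets.get(a, ()))

    def handle(self, sentence):
        """Match a sentence against the patterns, then store a fact or answer a question."""
        text = sentence.lower().strip().rstrip(".?")
        for pattern, kind in PATTERNS:
            match = pattern.match(text)
            if match is None:
                continue
            a, b = match.groups()
            if kind == "assert_subset":
                self.supersets.setdefault(a, set()).add(b)
                return "I understand."
            return "Yes." if self._included_in(a, b) else "Insufficient information."
        return "Statement form not recognized."


bot = TinyLogic()
print(bot.handle("Every boy is a person."))
print(bot.handle("Every person is an animal."))
print(bot.handle("Is every boy an animal?"))
```

The last question is answered from two separately stated facts, which is the kind of simple logical inference the paragraph describes; sentences that match no pattern are rejected rather than guessed at.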
(4) General deductive systems: General deductive systems use a standard mathematical notation (such as the notation of predicate calculus) to express information. The full body of results that logicians have obtained in theorem proving can be used to build an effective deductive system, so that any question can be expressed in the manner of theorem proving, the required information can actually be deduced, and the answer can be given in natural language. A general deductive system can express complex information that is not easily expressed in a limited logic system, which further improves the capability of natural language understanding systems. In 1968-1969, C. Green and B. Raphael built the QA2 and QA3 systems, which use predicate calculus and formatted data to perform deduction and answer questions, responding in English. These are typical representatives of general deductive systems.
Since 1970, a number of second-generation natural language understanding systems have emerged. Most of these are procedural deductive systems that make extensive use of semantic, contextual, and pragmatic analysis. The better-known systems are LUNAR, SHRDLU, MARGIE, SAM, and PAM.
The LUNAR system is a natural language information retrieval system designed by W. Woods in 1972. Its purpose is to help geologists compare and evaluate the chemical analysis data on the composition of the lunar rock and soil brought back by Apollo 11. The system uses a formal query language to represent the semantics of a question: it gives the question a semantic interpretation, and finally executes the formal query against the database to produce the answer.
The SHRDLU system, built by T. Winograd at the Massachusetts Institute of Technology in 1972, is a system for commanding a robot in natural language. It combines syntactic analysis, semantic analysis, and logical inference, which greatly strengthens its capacity for language analysis. The system's conversational partner is a toy robot with a simple "hand" and "eye". It works on a table on which toy blocks of different colors, sizes, and shapes are placed, such as cubes, pyramids, and boxes. Following the operator's commands, the robot can manipulate these blocks and move them to build new block structures. During the man-machine dialogue, the operator gets visual feedback on the commands given to the robot and can observe in real time how the robot carries them out. A television screen also displays a simulated image of the robot, together with a vivid picture of it conversing in English, by teletype, with a real live person.
The MARGIE system was developed by R. Schank in 1975 at the Artificial Intelligence Laboratory of Stanford University in the United States. Its purpose is to provide an intuitive model of natural language understanding. The system first converts an English sentence into a conceptual dependency representation, then reasons from the relevant information held in the system and infers a large number of facts from that representation. Because understanding a sentence always involves much more than what the sentence literally expresses, the system provides 16 types of inference, such as cause, effect, specification, and function. Finally, it converts the results of the inference into English for output.
The SAM system was built at Yale University in 1975. It uses "scripts" to understand stories written in natural language. A script describes a standardized sequence of events in some human activity (such as eating at a restaurant or seeing a doctor). Schank and Abelson assumed that everyone naturally acquires such scripts through their own experience. In understanding a story, these scripts can be used to establish the context of the events and so to anticipate the events the story is likely to contain. With the script as background, the system understands the natural language text, reasons about the characters, places, and events of the story, and assigns them new information in the course of the reasoning. Finally, using the "paraphrase" method, the computer retells the original story on the basis of what it has understood. The retold story is much richer than the original, because the computer, much like a sensible person, adds the new information derived during reasoning and so makes the story more vivid.
For example, given this simple story as input: "John went into a restaurant. He sat down. He got angry. He left." the SAM system outputs: "John was hungry. He decided to go to a restaurant. He walked into one. The waiter paid no attention to him. So John got angry. He decided to leave the restaurant." The computer infers that John left the restaurant because no one served him after he sat down. The reason is that the restaurant script contains an item in which the waiter brings the menu; the input contains no such event but does mention John's anger, so SAM makes this inference.
The PAM system was built by R. Wilensky at Yale University in 1978.
The PAM system can also interpret stories, answer questions, make inferences, and produce summaries. In addition to the event sequences of "scripts", it introduces the "plan" as a basis for understanding a story. A plan is the means that a character in the story adopts to achieve a goal. To understand a story through plans, the system must find the characters' goals and the actions they take to achieve those goals. The system provides a "plan box" that stores information about various goals and about the various means of achieving them. In this way, when understanding a story, the system only needs to look up in the plan box the information that coincides with part of the story in order to understand what the goal of the story is. When a story's plot cannot be matched against a script, the plan box can still supply information about general goals, so the understanding of the story does not fail. For example, for rescuing someone who has been kidnapped, the general goal "rescue" in the plan box covers the subgoal of reaching the kidnappers' hideout and the various methods of killing the kidnappers, so the next actions can be anticipated. The system can also infer goals from themes. For example, given the story: "John loves Mary. Mary was kidnapped by bandits." the PAM system expects John to take action to rescue Mary. Although the story contains no such statement, the system infers from the "love theme" in the plan box that John will take action to rescue Mary.
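The script mechanism described above for SAM, which PAM extends with plans, can be illustrated with a small sketch. The following Python code is a toy under stated assumptions, not SAM's actual method: the script contents, the event labels, and the function name `explain_anger` are hypothetical. It shows how a stored event sequence lets a program notice which expected event is missing at the point where a character gets angry.

```python
# Hypothetical restaurant script: an ordered list of expected events.
RESTAURANT_SCRIPT = [
    "enter restaurant",
    "sit down",
    "waiter brings menu",
    "order",
    "eat",
    "pay",
    "leave",
]


def explain_anger(script, story):
    """Return the script event that was expected but missing when anger appears.

    Walks through the story in order, tracking the position of the latest
    event that matches the script. When the story reports "angry", the next
    script event after that position is taken as the unmet expectation.
    Returns None if the story has no "angry" event or nothing further is expected.
    """
    last_index = -1
    for event in story:
        if event == "angry":
            next_index = last_index + 1
            return script[next_index] if next_index < len(script) else None
        if event in script:
            last_index = script.index(event)
    return None


story = ["enter restaurant", "sit down", "angry", "leave"]
missing = explain_anger(RESTAURANT_SCRIPT, story)
print(f"Inferred cause of anger: '{missing}' did not happen.")
```

For the John story, the missing event is the waiter bringing the menu, which corresponds to the inference reported for SAM. A real system would represent events in a structured form rather than as fixed strings, and would choose among many scripts.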
Scholars have also studied further the relationship between language understanding and memory, summarized various specific knowledge structures, integrated syntax, semantics, knowledge, and inference, and built two fast-reading systems, FRUMP and IPP. Each of these systems stores more than 2,000 English words. They do not need to analyze the text word by word; instead they skip certain words and extract the main information in the story. Such systems can automatically produce summaries of some news stories from newspapers.
The systems above are all written natural language understanding systems, whose input and output are written text. Spoken natural language understanding systems also involve complex technologies such as speech recognition and speech synthesis, and are clearly a more difficult problem. Research on spoken natural language understanding has also made progress in recent years.
Research on natural language understanding in China started late, 17 years later than abroad. Early natural language understanding systems had already been built abroad in 1963, whereas China built its first two Chinese natural language understanding models in 1980, both implemented in the form of man-machine dialogue.
In the mid-1980s, under the influence of fierce international competition over new-generation computers, research on natural language understanding received more attention in China. "Natural language understanding and man-machine interface" was included in the development plan for new-generation computers. The number of research units increased, and the research community grew.
At present, in addition to machine translation systems and natural language understanding systems, computer-oriented language research has expanded to natural language man-machine interface systems, automatic information retrieval systems, terminology database systems, computer-aided instruction systems, automatic speech recognition and synthesis systems, automatic character recognition systems, language statistics, and other areas. Computer-oriented language research has become a focus of attention in modern science and technology. (End)
Bibliography:
1. Feng Zhiwei, Chinese Information Processing and Chinese Research, Beijing: Commercial Press, 1992.
2. Gazdar, Mellish, Natural Language Processing in Lisp.