Understanding the Chinese Language from the Viewpoint of the Computer
Zhou, Xiling
Professor
Beijing Information Technology Institute
Aug. 16, 1997
Abstract
It is important to distinguish two modes of information transfer: a detailed description mode and a curtailed hint mode. Historically, the traditional style of Chinese writing has aimed at being "terse yet rich in message". Chinese is therefore unusual in how widely it uses the hint mode, in various ways, for the sake of efficiency, compared with languages such as English and Japanese. This implies that understanding Chinese depends to a much greater degree on the common knowledge shared by the sender and the receiver of the information. The practice has been carried so far that in many cases the reader (or listener) has to work out the meaning of a sentence from common sense rather than from the result of syntactic analysis. Computer technology is still far from being able to understand hint-mode expressions on the basis of common sense and knowledge. Hence we have to restrict statements to the description mode before the computer can understand them.
Summary
Two modes must be distinguished in the process of transferring information: the detailed "description mode" and the brief "prompt mode". The tradition of Chinese emphasizes saying much in few words. Compared with other languages (such as English and Japanese), Chinese uses the prompt mode far more widely in both its written and its spoken form, which means that understanding a message depends to a greater extent on the knowledge shared by the writer (speaker) and the reader (listener). The practice is so general that the recipient must sometimes correct the information obtained from the sentence by means of common sense. Computer technology to date is far from able to infer meaning by relying mainly on common sense and expert knowledge, so natural-language statements fed to a computer must be limited to the description mode.
Foreword
Ancient sages and modern scholars alike advocate reading widely and thinking carefully. This paper, however, starts essentially from intuition about language and from daily life, and may be said to go against that teaching. The first reason is that I have only recently taken up natural language processing, a field quite unfamiliar to me, so I have no training in the basic skills of linguistics. The second is that I have long thought everything has its advantages and disadvantages: reading is a good thing, but reading is also "letting others run through your brain". It is like watching a TV drama, where you hand your eyes over to the director; and when you are busy with other work and swallow what you read whole, with no time to digest it, your own instinct gives way. The third is that linguists have researched and debated these questions for many years, so we had better discuss them first within the circle of software workers. Finally, as a newcomer to this field, I hope that even if what I say is wrong, shallow, or even laughable, software colleagues and linguists will be understanding. So much for the foreword.
Two modes of information transfer
When we want to pass on information with some meaning, we can generally take one of two approaches: the detailed "description mode" or the brief "prompt mode" (the hint mode of the abstract). For example:
- When writing a computer program in C, we can use only the basic statements of the language, or we can also call subroutines from a library, or the SVCs and APIs provided by the operating system. The former belongs to the "description mode"; the latter belongs to the "prompt mode".
- In painting, there is a style characterized by deliberately fine and meticulous detail, and there is also a style of simple outlines that brings out only the essentials (cartoons and children's drawing textbooks belong to this class). The former belongs to the "description mode"; the latter belongs to the "prompt mode".
- As in painting, the statements people use when they speak or write fall into two kinds: "description mode" and "prompt mode". The description mode pays attention to grammar: the structure prescribed by the syntax, and the relationships among the components of the sentence, correspond fairly closely to the structure of the objective thing the sentence describes. The prompt mode is different. It picks out only a few major factors of the thing to be described and lets the other party use the information in the environment, the information already contained in the context, and the knowledge he already has to supply what the statement omits. If a speaker observes the correspondence between syntax and the objective world, and distinguishes different situations finely, then what he says is more rigorous: he is using description sentences. Conversely, if what he says is relatively brief, he is using the prompt mode. Understanding then depends more on the context and on the knowledge shared by the writer and the audience to rule out ambiguity.
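The programming example in the first bullet can be sketched in code. The text speaks of C; Python is used here only for brevity, and the function names `total_described` and `total_prompted` are hypothetical, chosen for this illustration.

```python
def total_described(values):
    """'Description mode': every step is spelled out with basic statements."""
    result = 0
    for v in values:
        result = result + v
    return result


def total_prompted(values):
    """'Prompt mode': one library name stands for the whole procedure.

    The reader (and the interpreter) must already know what the
    built-in sum() does; that shared knowledge is the prerequisite.
    """
    return sum(values)
```

Both functions return the same value for the same input. The second is shorter only because the knowledge of how to add up a sequence has been moved out of the text and into the library.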
Having distinguished the two modes of information transfer, we should of course note the following:
- The description mode and the prompt mode are mixed in practice, especially when writing computer programs and when people talk to each other.
- To save time and energy, people tend to use the prompt mode whenever possible. But this requires a prerequisite: the sender and the receiver of the information must share certain knowledge. For a reader of cartoons to recognize a celebrity from an outline sketch, he must already have seen that celebrity in a newspaper, a magazine or on television; a compiler must be able to find the implementation of a library routine in the library in order to compile a call to it; in the same way, if an article contains the idiom 胸有成竹 ("xiong you cheng zhu"), the reader must already know the story these four characters allude to in order to understand the passage.
- In other words, using the prompt mode raises the demands placed on the receiver of the information.
Description mode
The grammars of the world's languages differ, but they probably all satisfy one common requirement: to reflect the things in the world and the relationships between those things. The words that correspond to things are, in every language, "substantives", which are equivalent to the concept of "object" or "entity" in computer software. To describe the relationships between things there are "predicates" (equivalent to "relationship" in computer software). To distinguish 1:1 from N:M relationships, the concepts of singular and plural arose. Substantives were later subdivided into nouns, pronouns, and so on; predicates were later subdivided into verbs, adjectives, prepositions, and so on. Many relationships, such as "hit", have a direction; to indicate which side does what, some languages developed the grammatical concept of "case", and their verbs distinguish "active" from "passive". Chinese has no case and no clear marking of active versus passive; it uses word order and function words (function words seem to be used mainly to indicate the direction of a relationship) to express these concepts.
"Prompt mode" in Chinese
There are two kinds of prompt in Chinese. One alludes, through an idiom, to a story that "everyone knows". For example:
He is simply "worn and worry".
The other omits several components of a complete description sentence, leaving only a key term. The omitted parts must be restored by the reader (listener) from his own knowledge. For example:
Not coming today.
Whether it is "I" who is not coming or "Old Zhang" who is not coming depends on whether the speaker is on the telephone or waiting for someone. As for the abbreviations that become popular in society during a certain period, such as:
Five words and three love.
this is even more the case.
Chinese words have basically no inflection for gender or number, and the verb itself does not mark active versus passive or past, present and future. In addition, China's cultural tradition has always valued saying much in few words. The prompt mode is therefore used more in Chinese than in other languages. Moreover, Chinese often lets the speaker violate the grammar rules required of a description sentence, as long as the listener can correctly interpret the "wrong" syntax or word order through semantics and context.
Take Du Fu's famous line 名岂文章著,官应老病休 as an example. The last five characters list five concepts:
official post, should, old, ill, retire
By normal syntax the line simply cannot be understood. To understand it, the words must be put into a different order, "old and ill, should retire from office", that is: "because of old age and illness, one should leave office".
This usage still survives in modern daily life. The most typical examples are two everyday expressions: 救火 (literally "rescue fire", meaning to fight a fire) and 恢复疲劳 (literally "recover fatigue", meaning to recover from fatigue).
The cause of this strange phenomenon may be revealed by the way a child learns to speak. When children learn to speak, most of them cannot yet organize words according to a syntax, and can only say the words for the most important concepts in their mind. For "rescue fire", for example, the rigorous statement should be:
Rescue life and property in the fire. Or: rescue life and property from the fire.
However, a speaker who lacks the ability or the time to say anything so complicated utters only the two most critical words, "rescue" and "fire". At the scene of a fire, although the word order might make a pedant misread it, the meaning of the two words is enough and causes no misunderstanding. Of course, before matches were invented, when people saw a precious fire about to go out and wanted to save it, "rescue fire" would have been a phrase fully compliant with modern Chinese grammar.
Similarly, "recover fatigue" is the prompt form of the following rigorous statement:
Recover from the state of fatigue.
Take a sentence that was first proposed by an earlier generation of linguists and has been discussed repeatedly in linguistic circles:
(on the table)
Chicken is not eating.
Some people say that "eat" is passive here, whereas if the sentence were spoken on a chicken farm it would be active. I think it can be explained more naturally from another point of view: it is in fact a simplified prompt form of speech, and the simplification can be imagined as follows:
I don't want to eat this chicken.
Chicken, I don't eat it.
Chicken, not eaten.
Chicken is not eating.
Although the final sentence violates conventional syntax, Chinese speakers accept it.
Another example that often comes up is:
Go is right.
Is the word "go" in this sentence a noun or a verb?
One school holds that "go" is ordinarily a verb but has been nominalized here, because only a substantive can act as the subject. It is equivalent to "going" in English, but since Chinese has no morphological change, this cannot be seen on the surface.
The opposing school says: no! "Go" is still a verb; otherwise how do you explain the following sentence?
Not go is right.
Is "go" here also a noun? "Not" can only modify a verb; "not" is never followed by a noun! So in Chinese grammar, nouns, verbs and adjectives can all act as the subject.
So far neither of the two opinions has persuaded the other. I think this sentence is actually an informal prompt form, and the problem is solved once this view is taken.
Imagine a workplace where everyone is discussing whether Old Zhang should go to a meeting. Expressed in rigorous description sentences, the two sides of the dispute are:
"The faction that advocates that Old Zhang go to the meeting" is right.
"The faction that advocates that Old Zhang not go to the meeting" is right.
After Old Zhang comes back from the trip, if the two sides start to argue again with the same prompt statements, the actual content becomes:
"The decision to go to the meeting this time" is right.
"The decision not to go to the meeting this time" is right.
In other words, in the complete statement it is a whole phrase that serves as the subject, and the "go" or "not go" of the simplified form is just the key verb extracted from that subject phrase.
If the prompt form of the example is translated literally as "Go is right.", I am afraid it cannot be considered proper English; one must say "The idea of letting him go is right." In general, languages such as English, Russian and Japanese lean towards the description mode, while Chinese tolerates more of the prompt mode. This feature makes Chinese more compact: the same amount of information takes up less space. Its disadvantage is, as Mr. Lu Xun said, that it is not precise enough. From the perspective of natural language understanding by computer, because so many prompt forms rely on the recipient to supply the omitted parts, the difficulty for the computer is greatly increased.
The ambiguity problem
There has been much discussion of ambiguity. 东西 (rendered here as "things") is a typical example. It can have several meanings:
〖"Things" has two different pronunciations, dong1xi1 and dong1xi5, and the same characters with different pronunciations are not the same word. However, only the computer's handling of written language is discussed here, so this distinction is not considered.〗
Things (objects) - go to the department store to buy things.
Things (direction) - East Chang'an Street is a street running east and west.
Things (endearment) - you lovable little thing.
Things (derogatory) - this person is really no good thing!
It is difficult for a computer to determine from context which meaning applies, unless the social and natural-science knowledge of an ordinary person is loaded into it. However, if people use writing-assistant software designed to help the computer understand Chinese, the software can look up each word in the dictionary during word segmentation and, if it finds that a word has several meanings, ask the writer which meaning he intends.
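The dictionary check described above might look like the following minimal sketch. The `SENSES` table, the function `resolve_sense` and the `ask` callback are all hypothetical names introduced for illustration; a real machine dictionary would be far larger and would carry more attributes per word.

```python
# Hypothetical machine dictionary: word -> list of recorded senses.
SENSES = {
    "东西": [
        "object",
        "direction (east-west)",
        "term of endearment",
        "derogatory term",
    ],
    "商店": ["shop"],
}


def resolve_sense(word, ask):
    """Return the sense of `word`, consulting the writer only when needed.

    `ask(word, senses)` stands in for the human-machine dialogue: it
    receives the candidate senses and returns the index the writer picks.
    """
    senses = SENSES.get(word, [])
    if not senses:
        return None          # word not in the dictionary
    if len(senses) == 1:
        return senses[0]     # unambiguous: no question needed
    return senses[ask(word, senses)]
```

For example, `resolve_sense("东西", lambda w, s: 1)` simulates a writer who picks the second listed sense, while an unambiguous word such as "商店" is resolved without any question being asked.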
It is worth noting that some ambiguous words change their meaning in a very concealed, context-dependent way. An American student studying Chinese in China said: Chinese is really strange. In the following sentences:
(a) The Chinese team beat the US team.
(b) The Chinese team defeated the US team.
it is in both cases your Chinese team that won. But in:
(c) The US team defeated.
it is still the US team that lost.
In fact, "defeat" in sentence (b) is a transitive verb meaning "cause ... to be defeated", whereas "defeat" in sentence (c) is an intransitive verb indicating that the subject is defeated. In other words:
If the sentence has both a subject and an object, "defeat" serves as a transitive verb, and the party represented by the object is of course the loser.
If the sentence has only a subject and no object, "defeat" can only be an intransitive verb, and the party represented by the subject is the loser.
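The two rules can be written down as a small function. This is only a sketch consistent with examples (b) and (c); the function name `outcome` and its return convention are assumptions made for this illustration.

```python
def outcome(subject, obj=None):
    """Return (winner, loser) for a sentence whose verb is 'defeat'.

    With an object the verb is read as transitive ("cause ... to be
    defeated"), so the subject wins and the object loses.  Without an
    object the verb is read as intransitive, so the subject is the
    loser and the sentence names no winner (None).
    """
    if obj is not None:
        return subject, obj
    return None, subject
```

Applied to sentence (b), `outcome("Chinese team", "US team")` gives `("Chinese team", "US team")`; applied to sentence (c), `outcome("US team")` gives `(None, "US team")`. The verb is the same in both calls; only the presence of the object changes the reading.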
There is another strange thing:
Throw the waste paper on the ground.
Throw the waste paper underground.
Here "on the ground" and "underground" both mean "on the surface of the floor".
Bury the waste paper underground.
Here "underground" means "underneath the ground". 〖"Underground" (地下) has two pronunciations, di4xia4 and di4xia5, and linguists regard them as two different words: the former means "under the ground", the latter "on the ground". But the average person hardly notices the difference even in speech, and in written language read by a computer it is harder still.〗
Chinese words carry no marking of word class, and the same character can often be used as a noun, as a verb, or even as an adjective. This "word-class ambiguity" also makes Chinese hard to understand. In classical texts such examples are more numerous:
道可道,非常道。名可名,非常名。 (The way that can be "wayed" is not the constant way; the name that can be "named" is not the constant name.)
君君,臣臣,父父,子子。 (Let the ruler be a ruler, the minister a minister, the father a father, the son a son.)
Little children.
This phenomenon also occurs in modern life. The host of a variety show on CCTV said:
This show lives very much.
There are also many examples in the daily life of ordinary people:
Husband asked: "Is there a baby?"
Wife A: "It's already big."
Strings of substantives
A "string of substantives" is a very common phenomenon in Chinese. In such a sentence or phrase you see only a series of nouns, pronouns, ..., and find no predicates, prepositions, ... To software people familiar with the ER (Entity-Relationship) model, this is equivalent to listing only some of the entities in an ER model of the objective world and completely omitting the relationships between them. What those relationships are is left to the reader to guess from the semantics of the substantives. The compactness and flexibility of Chinese compared with other languages, and also its imprecision, are largely related to this usage.
If two nouns N1 and N2 are adjacent in a statement, the relationship between them can take many forms; which one to choose often has to be determined from the meanings of N1 and N2. For example:
- If N1 and N2 are both place names, N1 limits the range of N2, as in: China, China, Xinjiekou
- Little Li, yellow hair. ---- subject-predicate
- Round-neck shirt. ---- the former modifies the latter
- Sometimes the decision also requires social or everyday knowledge from outside the statement:
"Lu Xun Memoirs"
This phrase contains two content words that denote entities: "Lu Xun" and "memoirs". What is the relationship between these two entities? Nothing in the phrase says. Guessing with the computer's mechanical mind, there are several possibilities:
Memoirs about Lu Xun
Memoirs written by Lu Xun
Memoirs collected by Lu Xun
Memoirs sold by Lu Xun
.........................
However, most people know that when the phrase is printed on a cover as a book title, only the first two interpretations are possible (but the computer cannot see this). People with a certain education further know that it means "memoirs (about) Lu Xun", not Lu Xun's own memoirs: they know that Lu Xun did not write about his own life, and they know that many people have written reminiscences of Lu Xun; or they have looked through the table of contents or the text of the book. By contrast,
"Khrushchev Memoirs"
should be understood as "memoirs (written by) Khrushchev", because the reader has seen newspaper reports about Khrushchev (the computer has no experience of reading newspapers).
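In ER terms, the phrase supplies two entities and leaves the relationship blank. The sketch below enumerates the candidate relationships mechanically and then uses a tiny piece of world knowledge to choose one. Everything in it is an assumption made for illustration: the relation list, the function names, and the set `WROTE_OWN_MEMOIRS`, which stands in for what an educated reader knows.

```python
# Candidate relationships a purely mechanical reader might try.
CANDIDATE_RELATIONS = ["about", "written by", "collected by", "sold by"]

# Hypothetical world knowledge: people known to have written their own memoirs.
WROTE_OWN_MEMOIRS = {"Khrushchev"}


def all_readings(person, thing="memoirs"):
    """Every reading the two bare entities allow, with no knowledge applied."""
    return [f"{thing} {rel} {person}" for rel in CANDIDATE_RELATIONS]


def preferred_reading(person, thing="memoirs"):
    """Pick between the two readings that make sense for a book title."""
    rel = "written by" if person in WROTE_OWN_MEMOIRS else "about"
    return f"{thing} {rel} {person}"
```

Here `preferred_reading("Lu Xun")` yields "memoirs about Lu Xun" and `preferred_reading("Khrushchev")` yields "memoirs written by Khrushchev". The choice depends entirely on the knowledge set, which is exactly the part a computer does not have unless someone loads it.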
Coverage of Chinese grammar
If you accept my division of Chinese expression into a description mode and a prompt mode, you can conclude:
At least within the limited subset of natural Chinese used to communicate with computers, one should never try to design or summarize a Chinese grammar that covers both description sentences and prompt sentences. A Chinese grammar should aim to cover only description sentences. Otherwise the grammar so induced will inevitably be unsystematic and full of exceptions, and will lose the normative role that a grammar should have. For example, if "Go is right." and "Not go is right." are taken to be regular, unabridged statements, then a Chinese grammar must allow not only substantives but also function words, adjectives and other words of the "predicate" category, which specify the properties of substantives or the relations between them, to act as the subject. And if a grammar permits everything, then that "law" is of no use.
I have heard that the relevant authorities commissioned experts to formulate a Chinese grammar that could be officially promulgated, but without success so far. Could this be the reason?
What counts as understanding a Chinese sentence?
There seem to be many answers to this question, for example:
If the computer can correctly build the syntax tree of the statement,
If the computer can correctly translate the statement into another language,
and so on,
then it is considered to have understood the sentence.
For our current task, the following specific requirements are proposed:
1. The computer can segment the sentence into words correctly.
2. For each word: if it is an ordinary word, the computer can find its explanation and attributes in the machine dictionary. If it is polysemous, one meaning is selected from the context or in dialogue with the user. If it is a special word such as a personal name, it is understood through dialogue with the user.
3. Pick out the substantives that represent entities and, by means of the predicates and function words, make clear the relationships among them.
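Requirement 1 can be illustrated with forward maximum matching, a simple dictionary-based segmentation technique. The paper does not say which segmentation method it has in mind, so this is a swapped-in example rather than the author's procedure; the function name `segment`, the lexicon and the `max_len` parameter are all assumptions.

```python
def segment(text, lexicon, max_len=4):
    """Split `text` into words by forward maximum matching.

    At each position, try the longest candidate first (up to `max_len`
    characters) and accept it if the lexicon contains it.  A single
    character is always accepted, so the loop cannot stall on text the
    lexicon does not cover.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# A toy lexicon for the department-store example.
LEXICON = {"百货", "商店", "百货商店", "东西", "买"}
```

With this toy lexicon, `segment("到百货商店买东西", LEXICON)` returns `["到", "百货商店", "买", "东西"]`. Greedy matching of this kind makes mistakes on genuinely ambiguous strings, which is one reason the procedure described later has the author confirm the segmentation.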
Differences between computers and people in understanding natural language
Because natural Chinese makes heavy use of the prompt mode, understanding text written in Chinese requires at least the following:
- To understand natural Chinese, the computer needs at least the common sense about social life and the natural-science knowledge of a primary-school graduate.
- It needs the ability to use that common sense to convert the prompt mode into the description mode.
- If a prompt-mode statement draws on the environment in which the dialogue takes place, the computer cannot understand it, because it lacks human-like senses and the associated information-processing ability. Even for a novel, it is extremely difficult to have the computer imagine the environment of the scene.
At present computers cannot meet these requirements; at least from an economic point of view it is not worthwhile to do so. In principle, therefore, the computer cannot be required to understand the semantics of prompt sentences. So, to enable the computer to understand natural Chinese, the first problem to solve is how to transform our everyday prompt forms into more complete description forms. At present this can only be done while a person is writing, with software that, through human-machine dialogue, gets the writer to restore the omitted components.
Suppose someone is writing a text that a computer is meant to understand, and the computer has completed word segmentation and obtained the author's approval of the result. Then:
- If it finds ambiguous words such as "things", "defeat", "underground", ..., it must ask the author to clarify the exact meaning intended.
- If it finds a predicate in a position reserved for a substantive, such as the subject or the object, the author is asked to rewrite.
For example: change "Go is right." to "The decision to go is right."
- If it finds a string of substantives with no predicate in the statement to explain the relationship between them, the author is asked to supply a predicate that states the relationship explicitly. For example: change "Lu Xun Memoirs" to "Memoirs about Lu Xun".
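The three checks can be sketched as a single review pass over segmented, tagged words. This is a deliberately simplified illustration, not the author's software: the token format, the category labels, the `AMBIGUOUS` set (including the Chinese forms assumed for "defeat" and "underground") and the function `review` are all hypothetical, and "predicate in subject position" is approximated as "the statement begins with a predicate".

```python
# Hypothetical list of words the dictionary marks as having several senses.
AMBIGUOUS = {"东西", "大败", "地下"}


def review(tokens):
    """Return the questions the assistant would put to the author.

    `tokens` is a list of (word, category) pairs, where category is one
    of "substantive", "predicate" or "function".
    """
    questions = []
    # Check 1: ambiguous words need their sense confirmed.
    for word, _ in tokens:
        if word in AMBIGUOUS:
            questions.append(f"Which sense of '{word}' do you mean?")
    # Check 2 (simplified): a predicate opening the statement sits
    # where a substantive subject is expected.
    if tokens and tokens[0][1] == "predicate":
        questions.append(
            f"'{tokens[0][0]}' is a predicate in subject position; please rewrite."
        )
    # Check 3: adjacent substantives with no predicate between them.
    for (w1, c1), (w2, c2) in zip(tokens, tokens[1:]):
        if c1 == "substantive" and c2 == "substantive":
            questions.append(f"What is the relation between '{w1}' and '{w2}'?")
    return questions
```

For the tagged statement `[("去", "predicate"), ("是", "function"), ("对的", "predicate")]` the pass raises one rewrite request, and for `[("鲁迅", "substantive"), ("回忆录", "substantive")]` it asks for the missing relationship. The author's answers, not the software, supply the omitted content.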
Conclusion:
- The contradiction between the pursuit of brevity and the pursuit of detail is the driving force of language development. The tradition of Chinese emphasizes brevity, so the process of understanding relies more on context and on the knowledge shared by the writer and the audience.
- In Chinese, description sentences and prompt sentences should be distinguished and treated differently.
- A Chinese grammar can only cover the construction rules of description sentences.
- A computer can only be required to understand Chinese expressed in description sentences. Our everyday prompt forms must therefore be transformed into more complete description forms. At present this can only be done at the time of writing, with software that uses human-machine dialogue to restore the omitted components.
Of course, "abridged" and "complete" are relative concepts. Where the compromise between brevity and detail lies depends on the environment and on the intelligence and knowledge of the parties involved.