Zhou Xing: xlzhou0421@vip.sina.com
Opening remarks
A:
In the literature on natural language processing you often see the phrase "formalize natural language". For example, I once read the following:
The information in natural language text is mainly addressed to people; its content (semantics) has no formal representation, so it is difficult for computers to process.
Or:
A prerequisite for processing information content is the formalization of that content. A rigorous formal theory must be established to provide a solid theoretical basis for content-based intelligent information processing.
Statements and approaches like these leave me with some doubts. What is "the formalization of content"? If natural language (content) really could be "formalized", could it still be called natural language? What does "formalization" mean? I really wish someone would give a "formal definition" of "formalization". Have you seen such a definition?
Formalization of process
B:
I also find "formalization" hard to pin down. When people talk about it, they bring up concepts such as "precise", "unambiguous", and the like, but none of these amounts to a "formal" definition. I have also looked for a "formal" definition of the term "formalization" and have not found one so far. I could not find an entry for "formalization" in the computer encyclopedias published in China either.
Chinese does not seem to have had a native word for "formalization". The term was probably translated from the English "formalize", apparently with the word "form" in mind. If the translator had thought first of "formal" (rather than "form"), "formalize" might have been rendered as "normalization", "standardization", or something similar.
A:
Doesn't "formalization" mean "form only, never mind the content"?
B:
The Chinese term is indeed misleading; it makes people think it means "form only, never mind the content". The English term "formalize" carries no such meaning. I think the approach can instead be understood as: strive to give everything we care about a corresponding form.
A:
Why should "everything we care about have a corresponding form"?
B:
Our most common way of studying objective things is to "build a model".
To "formalize" a thing is to abstract the object of study, discard its "non-critical" details, keep what matters for the purpose of the research, build a model of it, and give that model a precise and complete "formal representation". If the model is "suitably simplified", its behavior will be essentially the same as that of the original object, yet thanks to the simplification we can apply mathematical tools to it. This is the "modeling" process that people often talk about.
In other words, although the investigation or description of an objective thing can be extended in breadth or depth, a model can only take in what is relevant to our research goal. For the model to be valid, the content we care about must of course have a corresponding form in the model.
A:
Is "natural language processing" also studied through "modeling"?
B:
I think so. Among the experts I know who work on natural language processing, many come from science and engineering backgrounds. Naturally, they carry over the most common and effective research method of engineering into this field.
However, natural language as an object of study differs somewhat from the celestial bodies of astronomy, the devices and networks of electronics, the animals and plants of biology, and so on. Those objects are completely independent of our subjective world, whereas natural language is closely bound up with it, in two respects:
1) The various constituents of natural language (words, phrases, clauses, sentences, sentence groups, paragraphs, chapters, ...) are merely "symbols" used by the brain. Unlike "the Earth, oxygen, insects, ...", they are not material entities with stable, unchanging attributes that exist independently of our subjective world. Applying the way of thinking of the natural sciences to these constituents therefore runs into trouble, for example the debate over whether a word in natural language has a definite, or even a single fixed, word class. [In chemistry, we can determine which element of the periodic table something is from its ability to combine with other elements and from the compounds produced. If this idea is carried over into syntactic theory, and word classes are assigned according to "how words combine with other words" (that is, Mr. Zhu Dexi's "grammatical function"), the result is questionable. Yesterday you would never have thought of putting "youth", "China", or "environmental protection" after the adverb "very"; today, because it has become fashionable, you say along with everyone else: "She looks very youth in this dress", "My face is very China", "This kind of food is very environmental protection", ... So these words can now also be adjectives. In chemistry, however, oxygen is oxygen; it is never oxygen today and hydrogen or chlorine tomorrow.]
2) Producing and understanding natural language utterances is inseparable from processing in the brain, so this link, the human brain, should be included when "modeling". However, the classical linguistic theories represented by Chomsky leave this link out entirely. As a result, although these theories have contributed greatly to the processing of computer programming languages, they are powerless in the face of natural language.
From where I stand, linguistic theorists have not sufficiently understood or valued this difference. However, the situation seems to be starting to change.
A:
Where does traditional linguistic theory mainly show that it "leaves out the processing in the human brain"?
B:
I think it shows mainly in the failure to simulate the brain's processes of association, reasoning (judgment), and selection.
When dealing with language, the biggest difference between the brain and the computer is that the computer is rigid ("dead-brained"), while a person takes a flexible ("live") attitude. The two most important aspects of this flexibility are "supplementing" and "correcting" the information received from an utterance.
• In a natural language utterance, many "components" are often omitted. They do not appear in the surface form, yet they actually exist. When producing or interpreting such utterances, people naturally supply them by association, drawing on everyday common sense.
• Natural language utterances often contain expressions that are illogical or unconventional, and the listener naturally corrects them with everyday common sense when interpreting them.
A:
Can you give one or two examples?
B:
For example, once when I was in a taxi passing a pedestrian overpass across the Third Ring Road, I saw the following words on the bridge:
100c.c. above
At first I did not know what the words meant. But by combining them with the local context and drawing on my knowledge of modern urban traffic rules, I soon understood:
"This lane is only for vehicles with an engine displacement of 100 c.c. or above."
Obviously, the information I supplied cannot be read off from the literal text "100c.c. above".
A:
This example is too extreme. Utterances like this are probably very rare in ordinary conversation and writing, aren't they?
B:
Then what about the following passage? Isn't it quite ordinary, quite common?
I was in the store that day, I saw a pot of flowers, beautiful, but the price was very high, I was afraid to be awkward.
Anyone who hears it will, through "association, reasoning (judgment), and selection", understand it as:
That day (I) was in the store, (I) saw a pot of flowers, (this pot of flowers) was beautiful, but (the price of this pot of flowers) was very high, (if I bought it), I was afraid (I) would be ...
Can a computer that has no common sense and no ability to reason and judge arrive at such an interpretation?
A:
Your example reminds me of "The Peach Blossom Spring", which we read in middle school. It contains this sentence:
Saw the fisherman, were greatly startled, asked where from, answered in full, then invited back home, ...
The clauses between the commas also omit the subjects of the asking and answering, yet from the specific content of the speech and actions you can tell exactly who is speaking and who is acting:
(The villagers) saw the fisherman and were greatly startled, asked (the fisherman) where he came from, (the fisherman) answered (the villagers) in full, (the villagers) then invited (the fisherman) back to (their own) homes, ...
It seems that such terse expression is a tradition of Chinese.
However, these examples are all about "supplementing information". What about "correcting information"?
B:
"Recover fatigue" (for recovering from fatigue) and "rescue the fire" (for fighting a fire) are two classic examples. In real life, the need for "information correction" comes up all the time. A couple of days ago I heard someone on a TV program say: "I think this kind of food is very environmental protection!" The use of the noun "environmental protection" here is quite ungrammatical, yet we correctly understand it as "I think this kind of food conforms very well to environmental principles!"
Another time, on TV, a volunteer traffic assistant in Changping said to the reporter's camera: "Our Changping is still very peak on weekends." Although the sentence is awkward, I could still associate "peak" in the geographic sense (a mountain peak) with "the peak of a traffic flow curve", and, since a peak on that curve means traffic congestion, correctly understand what the volunteer traffic assistant meant.
A:
These examples of yours are rather "tricky". Most language in real life is probably fairly plain; it is not "missing arms and legs" and needs no roundabout thinking to understand!
B:
You say these examples are very "tricky"? You probably think they are just special cases. In fact, many ordinary sentences have a similar "tricky" nature, only to a lesser degree and less obviously. Why else, after decades, has the accuracy of machine translation remained so low? Some people say the accuracy is only 30%. If that figure is credible, I am afraid the other 70% of sentences have some "trickiness" in them.
A:
For the "tricky" examples you describe, there seem to be only two countermeasures:
1) Raise the computer's level of intelligence, bringing the processes of association, reasoning (judgment), and selection into the language processing model.
2) Forbid this kind of "tricky" sentence in any text intended for computer processing.
B:
In recent years some teams have followed the first path you describe (raising the computer's intelligence so that it imitates human association, reasoning, hypothesis, and verification) and have made remarkable progress.
A:
I'm afraid they will run into great difficulties?
B:
Of course. Only the teams actually walking this road know exactly how great the difficulties are. From a bystander's point of view: if you want a computer to associate, reason (judge), and select as the human brain does, you have to put all the background knowledge into the computer and give it the ability to use that knowledge. But natural language is closely tied to the knowledge humanity has accumulated over millennia, so that knowledge is not only vast but still growing explosively. Developers of such systems face a bottomless black hole, with no telling when they will reach the end. So, for the research teams working on the front line, we are especially interested in how they define their "technical boundaries" and divide up the work.
A:
So, what is the second way?
B:
The second path is more realistic. The question it raises is how to write comparatively "formalized" text, that is, how to keep the sentences in a document as free of "trickiness" as possible, or free of it altogether.
Formalization of text
A:
Yes. What you have just discussed is the "formalization of process", the question of "modeling". The "formalization of text", or the "formalization of statements", seems to be a different "formalization" problem.
It is generally held that a computer program written in a programming language is a formalized document, whereas an article written in natural language is not. Where does the distinction lie? How should the so-called "formalization of text" be understood?
B:
For the "formalization of text", an ordinary English dictionary offers an explanation. For example, the Longman Dictionary of Contemporary English explains "formalize" as: "to put (an agreement, plan, etc.) into clear written form."
A:
To some extent that captures the idea and is somewhat enlightening. But it explains the word from the standpoint of everyday social life. In computer software, the term "formalize" inherits the original spirit of the word but also takes on the coloring of the field, so it should be further elucidated with the computer in mind. Until we find an authoritative definition, why don't we discuss it ourselves, from the viewpoint of ordinary people (non-experts), and see whether we can work out what the term "formalization" should include?
B:
First, "write it down": all information should be "in black and white". Besides avoiding the situation where spoken words leave no proof, there is another important reason: what is said aloud is often affected by the surrounding context, including the speaker's facial expression. When a speaker says "You're really something!", it means completely different things depending on whether the expression is sincere or mocking. Once it is put into writing, detached from the speech situation, this interference is greatly reduced.
Second, things must be "stated clearly". But what counts as stated clearly? That is not easy to say.
A:
Yes. The same expression that is "clear" to a person is often "unclear" to a computer; what is "clear" to an adult is often "unclear" to a child; what is "clear" to an expert is often "unclear" to an ordinary person. A correct C program is "clear" to a computer equipped with a C compiler, and "unclear" to one without.
B:
Therefore, the first essential condition for being "stated clearly" is that the content conveyed can be combined with the other party's existing knowledge, that is, it can be "understood" by the other party.
A:
You mean that even when this condition is met, that is, when the other party's knowledge is enough to understand what you describe, there can still be two ways of stating it, "formalized" and "non-formalized". So where does the difference lie?
B:
I think the criterion should be whether the text can be processed using only conventional mental labor (by-the-book information processing).
Two kinds of "mental labor"
A:
What is "conventional mental labor (by-the-book information processing)"?
B:
Let me explain with an example. Consider the difference between the following two mathematics problems. The first is an ordinary arithmetic problem; the second was set by a "Hua Luogeng mathematics class" at a Beijing elementary school.
1) 35 × 24 = ?
2) Li Ming and Wang Bo each calculate the product of the same two integers. Li Ming misread the units digit of the multiplier and got 255; Wang Bo misread the tens digit of the multiplier and got 365. Question: what are the real multiplier and multiplicand?
The first problem can be solved by "conventional mental labor". This kind of labor has the following characteristics:
• There is an algorithm to follow
• The information to be processed is complete; that is, needed information is never missing
• The source of the information is determinate
• Processing yields an exact "result" (answer)
Most of what computers do today is work of this kind. In society, people who do everything "by the book" are engaged in this kind of labor as well.
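As a rough illustration of such by-the-book labor, the first problem can be worked by the schoolbook multiplication procedure: one partial product per digit, shifted by its place value. The sketch below is only one possible rendering; the function name long_multiply is hypothetical, and non-negative integers are assumed.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication for non-negative integers.

    Every step is prescribed, all inputs are given, and there is
    exactly one answer: the hallmarks of conventional mental labor.
    """
    total = 0
    place = 1  # place value of the current digit of b: 1, 10, 100, ...
    while b > 0:
        digit = b % 10                # take the lowest remaining digit of b
        total += a * digit * place    # add the shifted partial product
        b //= 10                      # drop that digit
        place *= 10
    return total


product = long_multiply(35, 24)
print(product)
```

For 35 × 24 the partial products are 35 × 4 = 140 and 35 × 2 × 10 = 700, which sum to 840. Nothing in the procedure requires guessing or looking outside the problem statement.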
In contrast, the second problem has to be solved by "intelligent mental labor (creative mental labor)", whose characteristics are:
• There is no determinate algorithm (at least, ordinary people do not know of one)
• The information is incomplete; sometimes there are only a few "hints". The handler has to find the missing information in the surrounding environment. In other words, some sources of information are indeterminate
• Processing does not necessarily yield one exact "result" (answer); there may be several, from which the "most likely (optimal)" one is chosen
Solving riddles in everyday life, "identifying the subject of a caricature", and so on belong to this kind of labor.
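The creative part of the second problem lies in turning the story into explicit constraints. Once that has been done, what remains is again conventional labor, as the following minimal sketch suggests. It rests on an interpretation that the problem text does not spell out and that should be treated as an assumption: one factor is a two-digit number, Li Ming misread only its units digit, Wang Bo misread only its tens digit, and both read the other factor correctly. The function and variable names are hypothetical.

```python
def solve_misread_puzzle(li_ming_result: int = 255, wang_bo_result: int = 365):
    """Brute-force search under the assumptions stated above.

    Returns every (misread_factor, other_factor) pair consistent with
    both wrong results.
    """
    solutions = []
    for true_factor in range(10, 100):            # assumed: a two-digit number
        tens, ones = divmod(true_factor, 10)
        for other in range(1, min(li_ming_result, wang_bo_result) + 1):
            # The correctly read factor must divide both wrong results.
            if li_ming_result % other or wang_bo_result % other:
                continue
            li_view = li_ming_result // other     # the number Li Ming thought he saw
            wang_view = wang_bo_result // other   # the number Wang Bo thought he saw
            li_ok = (10 <= li_view <= 99
                     and li_view // 10 == tens    # tens digit read correctly
                     and li_view % 10 != ones)    # units digit misread
            wang_ok = (10 <= wang_view <= 99
                       and wang_view % 10 == ones     # units digit read correctly
                       and wang_view // 10 != tens)   # tens digit misread
            if li_ok and wang_ok:
                solutions.append((true_factor, other))
    return solutions


print(solve_misread_puzzle())
```

Under these assumptions the only consistent pair is 53 and 5 (Li Ming saw 51, Wang Bo saw 73), so the true product would be 265. The search itself is routine; deciding which assumptions to encode is the part that needed a human reader.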
In natural language, sentences such as "Zhang San Plaza Li Si" can be understood from the literal text alone, and so belong to the information that conventional mental labor can handle. However, we often meet "incomplete sentences" that require the reader to "use his head" and fill in the missing information from the "context", for example the "100c.c. above" mentioned earlier.
A:
By your account, documents that are "stated clearly (formalized)" can be handled by conventional mental labor; or conversely, any document that can be handled by conventional mental labor is formalized. Can you say more about how to write a document so that conventional mental labor suffices to handle it?
B:
The writer of the document should keep the following in mind:
First, "the reader will understand the content only from the literal meaning of the document"; that is, both parties write and read the document according to a commonly recognized dictionary and grammar rules. The statements must therefore be complete. If a statement has "omitted components", the reader will never fill them in. [A compiler treats the programmer's program with exactly this attitude.]
Second, because the grammar of natural language was not carefully designed like that of a programming language, "ambiguity" arises. The classic example is a Chinese sentence about table tennis that can be segmented in two ways: it can be understood as "the table tennis paddles are sold out", or as "the table tennis balls have been auctioned off". If the writer is aware of this, he should add a marker (for example, a blank or quotation marks as a separator) or rephrase, for example "The table tennis paddles are all sold out." or "The table tennis balls have all been auctioned."
Third, a written document may reference text located elsewhere, but it must state explicitly where that information is to be obtained. SVC calls, APIs, and the C language in computer programs all abide by this principle. By contrast, the text on street traffic signs often does not: the reader has to supply the missing information from the local traffic environment and the traffic rules promulgated by the government.
A:
What if the writer produces an ambiguous sentence without realizing it?
B:
That is indeed a problem. Sometimes ambiguity is well hidden. The joke below turns on the ambiguity of the word "tooth", a word that at first sight seems to have none.
Xiao Ming's uncle: "Xiao Ming, does your tooth still hurt?"
Xiao Ming: "My tooth was left at the hospital. I don't know whether it still hurts."
Seriously, though: Xiao Ming used "tooth" to mean the tooth itself, while his uncle used it to refer to the place in the mouth where the tooth had been (the tooth bed), hence the ambiguity.
However, "formalization" is not an all-or-nothing attribute; it admits of degrees. What I said above are only some principles, and following them can raise the degree of "formalization" of a document. Of course, one can also imagine developing software to check objectively for "ambiguity" that may occur in a document.
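One very simple form such checking software could take is to enumerate every way of segmenting a sentence against a dictionary and to flag the sentence when more than one segmentation exists. The sketch below assumes that the table tennis example discussed earlier is the well-known Chinese string 乒乓球拍卖完了; the toy lexicon, its glosses, and the function name segmentations are all hypothetical, and a real checker would need far more than this.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon; the English glosses are approximate.
LEXICON = {
    "乒乓球",    # table tennis (ball)
    "乒乓球拍",  # table tennis paddle
    "拍卖",      # to auction
    "卖",        # to sell
    "完了",      # finished, all gone
}

parses = segmentations("乒乓球拍卖完了", LEXICON)
is_ambiguous = len(parses) > 1

for parse in parses:
    print(" / ".join(parse))
print("ambiguous:", is_ambiguous)
```

With this lexicon the string has two segmentations, one reading "auctioned" and one reading "paddles sold", so the checker would flag it. A check of this kind only catches segmentation ambiguity; the hidden lexical ambiguity in the "tooth" joke would pass unnoticed.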
Finally, I want to emphasize: whether something is "formalized" is not an attribute of the "text" alone. Whether a text is "formalized" depends not only on the text itself but also on what kind of processing you intend to apply to it.
"Formalized or not" also depends on what kind of processing you want to do
A:
Why? Usually, when we say that a statement is or is not "formalized", "processing" never comes into it.
B:
I just want to emphasize that whether something is "formalized" is not an attribute of the "text" alone; besides the text itself, it also depends on what processing you are going to apply to it.
For example, for "natural language understanding" (say, adding the statement to a knowledge base as new knowledge), the statement "a worm is a virus" is not "formalized". "Virus" can be understood as a "biological virus" or as a "computer virus", so there is an ambiguity. Until that ambiguity is resolved, you do not know in which area of the knowledge base to put the statement. So for the "knowledge update" task, "a worm is a virus" is not a formalized text. For Chinese-to-English machine translation, however, the situation is different, and "a worm is a virus" can be considered a formalized text: whichever meaning "virus" has, it can be translated into "virus" in exactly one way. For machine translation, the processing ends there; what comes afterwards is not its concern.
As for some simpler processing tasks (such as "word frequency statistics"), the requirements on the formalization of the text are lower still.
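The task-relative nature of "formalized" can be sketched as follows: a word counts as unambiguous for a task if all of its senses lead to the same outcome for that task. The sense inventory, the field names kb_area and english, and the function names below are all hypothetical and exist only to illustrate the "worm is a virus" example.

```python
# Hypothetical sense inventory for the word glossed here as "virus".
# Each sense records what two different tasks would do with it.
SENSES = {
    "virus": [
        {"kb_area": "biology", "english": "virus"},
        {"kb_area": "computing", "english": "virus"},
    ],
}


def outcomes(word, task_field):
    """Collect the distinct outcomes a task would produce across all senses."""
    return {sense[task_field] for sense in SENSES[word]}


def is_formal_for(word, task_field):
    """A word is unambiguous for a task if every sense gives the same outcome."""
    return len(outcomes(word, task_field)) == 1


formal_for_translation = is_formal_for("virus", "english")
formal_for_kb_update = is_formal_for("virus", "kb_area")

print("translation:", formal_for_translation)
print("knowledge update:", formal_for_kb_update)
```

Here the translation task sees a single outcome and the knowledge update task sees two, matching the argument above. A task such as word frequency counting never consults senses at all, which is one way to read the remark that its requirements are lower still.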
Why do people prefer to use an informal communication method in daily life?
A:
Now I have a question. Since the "formalized" way of stating things is more accurate and less open to misunderstanding, why do people prefer informal communication in daily life?
B:
First, lowering the formalization requirement greatly reduces the burden on the writer.
Conversely, lowering the degree of formalization of course increases the burden on the reader, who has to "decode" the text through higher mental activity. On the other hand, since the text contains fewer symbols, the low-level mental labor of parsing sentences during reading is reduced. More importantly, formalized text is usually longer, so that the important key words are drowned in a mass of less important words. Weighing the two against each other, for the human brain, which has higher intelligence and is good at association, reasoning, and pattern matching, the total burden does not increase and may even be lightened. [Here the computer is just the opposite of a person: it never minds the tedious and the trivial, but it is at a loss when asked to associate and reason.]
Another point: the "short-term memory" of the human brain is limited. The overly long sentences produced by heavy formalization are hard for people to take in, whereas a computer has no such memory problem. That is why an .html file offers a "browse" view for people to read alongside the lengthy "code" view.
In other words, "formalized text" is prepared for the contemporary computer, a "wooden-headed but fleet-footed mental slave"; "non-formalized text" is prepared for "intelligent beings", of which humans are the representative.
Of course, there is a question of degree here. Even for people, less formal is not always better; too little formalization causes misunderstanding.
Summary
A:
Having discussed this far, shall we summarize? Let's see what views we have arrived at.
B:
• Although the documents that contemporary computers process are mostly "formalized documents", "non-formalized documents" are not beyond processing. Processing them, however, requires advanced "mental labor" such as reasoning, association, search, hypothesis, and verification, and computer technology is still weak in this respect, so the results are often unsatisfactory.
• Therefore, whether reasoning, association, search, hypothesis, and verification need to intervene, and to what degree, can serve as the criterion for the degree of formalization of a text.
• This division is of course relative to "what processing is to be done".
• "Formalized" text can have its degree of formalization lowered by deleting or distorting information, so that the proportion of "formalized" components decreases.
• Conversely, "non-formalized text" (an article written in natural language) can have the proportion of "formalized" components raised by adding supplementary information. Rewriting an article written in natural language in XML format is a typical approach.
• Therefore, NLP problems are now being attacked from two ends: at one end, enriching the computer's knowledge base and raising the computer's level of knowledge; at the other, producing text with a relatively high degree of formalization that computers can process directly, either by manual markup or with the help of ("amphibious language") tools. Where in the tunnel these two lines of development will meet remains to be seen; let us wait and see.
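To make the XML point concrete, here is a minimal sketch of how markup can add the information that plain text leaves to the reader. The element name term and the attribute sense are hypothetical, not taken from any standard annotation scheme; the sentence is the "worm is a virus" example used earlier. Python's standard xml.etree.ElementTree module is used to read the markup back.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation: each <term> carries an explicit `sense` attribute,
# so a program no longer has to guess which kind of virus is meant.
ANNOTATED = (
    '<sentence>'
    '<term sense="computing">Worm</term> is a '
    '<term sense="computing">virus</term>.'
    '</sentence>'
)

root = ET.fromstring(ANNOTATED)

# The human-readable sentence is still recoverable by dropping the tags.
plain_text = "".join(root.itertext())

# The added, machine-readable information: which sense each term has.
senses = {term.text: term.get("sense") for term in root.iter("term")}

print(plain_text)
print(senses)
```

The plain sentence stays intact for human readers, while the sense attributes supply what a program would otherwise have to obtain by association and reasoning. The cost, as noted in the discussion of short-term memory, is a longer text that is harder for a person to read in its raw form.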
A:
Finally, I'd like to mention that there are two different views on the nature of "articles written in natural language".
• One holds that being "non-formalized" is the essence of natural language, a fact that cannot be changed.
• The other holds that we regard natural language as an informal language only because we have not yet found a language model that is ingenious and inclusive enough. We should therefore go on studying and inventing ever more refined or complex language models and eventually establish a complete formal system for natural language. When that time comes, we will be able to treat natural language the way we now treat computer programming languages.
B:
Which of these two views one holds may depend on "what formalization is", that is, on the definition of "formalization".
A:
By the view we have arrived at, we clearly belong to the first camp.
B:
Yes. However, we are not in a position to overturn the second view. Perhaps those who hold it can propose a definition of "formalization" suited to their view. Let us look forward to their exposition, which would take our understanding to the next level.