Natural Language Processing

zhaozj2021-02-16  61

1. Overview

1.1. Introduction

1.1.1. Background and goals

1.1.1.1.
1. Language is the carrier of thought and an important tool for interpersonal communication. Knowledge recorded in the form of language and text accounts for more than 80% of all knowledge in human history. As for computer applications, only about 10% of use is mathematical computation and less than 5% is process control; the remaining 85% is the processing of linguistic and textual information. Given this social demand, natural language understanding, as an important high-level direction of language information processing technology, has always been one of the core topics of artificial intelligence.
2. Since creating and using natural language is an expression of high human intelligence, research on natural language also helps to uncover the mystery of human intelligence and deepens our understanding of language ability and thought.

1.1.1.2. What is computational linguistics
Computational linguistics is the discipline that analyzes and processes natural language by building formal mathematical models and implements this analysis and processing on a computer, with the aim of having a machine simulate part or even all of human language ability. Computational linguistics is sometimes also called quantitative linguistics, mathematical linguistics, natural language understanding, natural language processing (NLP), or human language technology.

1.1.1.3. Turing test
In artificial intelligence and language information processing, it is generally held that the famous Turing test, described in 1950, can be used to judge whether a computer "understands" a natural language.

1.1.1.3.1. Imitation game
Scene: a male subject, a female subject, and an observer, in three different rooms labeled X, Y, and O.
Rules: the observer communicates with the subjects through a teletypewriter; the male subject tries to deceive the observer, and the female subject helps the observer.
Goal: the observer must determine the gender of the subject in room X.

1.1.1.3.2. Turing test
Scene: a human subject, a computer, and an observer, in three different rooms labeled X, Y, and O.
Rules: the observer communicates with the human subject and the computer by "some means"; the computer tries to deceive the observer, and the human subject helps the observer.
Goal: the observer must determine which room the human subject is in.

1.1.1.3.3. Total Turing test
Scene: a subject (a person or a computer) and an observer; the observer can see the subject.
Rules: the observer can communicate with the subject.
Goal: the observer must determine whether the subject is a person or a computer.

1.1.1.3.4. References
1. A. M. Turing, Computing Machinery and Intelligence, http://cogsci.ucsd.edu/~asaygin/tt/ttest.html, also at http://www.oxy.edu/departments/cog-sci/courses/1998/cs101/Texts/computing-machinery.html
2. Cao Shugen, "AI History and Problems", Institute of Computing Technology, Chinese Academy of Sciences
3. Roland Hausser, Springer, 1999.

1.1.2. Research history

1.1.2.1. 1950s
NLP research began in the United States in the early 1950s. At that time the United States, afraid of losing the space race, wanted access to technical literature and therefore developed machine translation systems, especially Russian-English ones. The approach was word-for-word translation. Because the cost was high and the efficiency low, financial support gradually dried up.

1.1.2.2. 1960s
Natural language understanding systems of the 1960s mostly performed no real grammatical analysis and relied mainly on keyword matching to identify the meaning of an input sentence. In these systems the designer stores in advance a large number of patterns containing certain keywords, and each pattern corresponds to one or more interpretations (also called responses). The system matches the current input sentence against the patterns one by one. As soon as a match succeeds, the sentence is considered interpreted, and the effect that the non-keyword parts of the sentence might have on its meaning is not considered.
SIR
SIR (Semantic Information Retrieval) was completed by B. Raphael in 1968 as part of his research at the Massachusetts Institute of Technology (MIT). The system was programmed in Lisp. It is a prototype of an "understanding" machine, because it can remember facts that the user tells it in English and then answer the user's questions by deduction from these facts. SIR accepts a restricted subset of English and matches input sentences against 24 keyword patterns of the following kinds:
* is *
* is part of *
Is * *?
How many * does * have?
What is the * of *?
The symbol "*" matches a noun in the input sentence; the noun may carry a modifier such as a, the, every, each, a quantifier, or a number. Whenever a pattern is matched, the corresponding action in the program is triggered.

STUDENT
In 1968, Bobrow completed another pattern-matching natural language understanding system, STUDENT. The system can understand and solve middle-school algebra problems.

ELIZA
ELIZA, designed by J. Weizenbaum at MIT in 1968, is perhaps the most famous of these pattern-matching natural language systems. The system simulates a psychotherapist (the machine) talking with a patient (the user).
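The keyword-pattern idea described for SIR can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reconstruction of Raphael's Lisp program: the pattern table is a small sample, the action names (`add_is_a`, `ask_count`, and so on) are hypothetical labels, and each "*" is assumed to match a single word optionally preceded by one modifier.

```python
import re

# Illustrative pattern table modeled on the "*" patterns quoted above.
# The action names on the right are hypothetical labels, not SIR's own.
PATTERNS = [
    ("* is part of *", "add_part_of"),
    ("is * *?", "ask_is_a"),
    ("how many * does * have?", "ask_count"),
    ("what is the * of *?", "ask_attribute"),
    ("* is *", "add_is_a"),
]

# A noun may be preceded by one modifier: a determiner, quantifier, or number.
MODIFIER = r"(?:(?:a|an|the|every|each|\d+)\s+)?"
NOUN = MODIFIER + r"(\w+)"


def compile_pattern(pattern):
    """Turn a "*" pattern into a regex in which every "*" captures one noun."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"^\s*" + NOUN.join(literal_parts) + r"\s*$", re.IGNORECASE)


COMPILED = [(compile_pattern(p), action) for p, action in PATTERNS]


def interpret(sentence):
    """Return (action, nouns) for the first matching pattern, or None."""
    for regex, action in COMPILED:
        match = regex.match(sentence)
        if match:
            return action, tuple(g.lower() for g in match.groups())
    return None


if __name__ == "__main__":
    for s in ["Every boy is a person", "Is John a person?", "Why not?"]:
        print(s, "->", interpret(s))
```

As in the systems described above, the first pattern that matches wins, and everything in the sentence that is not a keyword or a captured noun is ignored. That is why more specific patterns are listed before the general "* is *".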

TG
Noam Chomsky created transformational-generative grammar (TG). Syntactic analysis began to be used in machine translation.

1.1.2.3. 1970s
After the 1970s, a number of natural language understanding systems using syntactic-semantic analysis stood out. In the depth and difficulty of their language analysis they advanced considerably over the early systems. The representative systems of this period are LUNAR, SHRDLU, and MARGIE.

LUNAR
LUNAR was the first system that allowed people to converse with a computer database in ordinary English. It was designed in 1972 under the direction of W. Woods at BBN in the United States. The system helped geologists look up, compare, and evaluate the chemical analysis data for the lunar rock and soil samples brought back by the Apollo 11 spacecraft.

SHRDLU
SHRDLU was designed by T. Winograd in 1972 as his Ph.D. research at MIT. SHRDLU is a natural language understanding system that converses in English within a "blocks world". The system simulates a robot arm that can manipulate toy blocks on a table. The user commands the robot through human-machine dialogue to move the blocks, and the system answers and displays the corresponding scene. The purpose of the system was to show that it is possible for a computer to understand language.

MARGIE
MARGIE (Meaning Analysis, Response Generation, and Inference on English) is a system built by R. Schank and his students at the Artificial Intelligence Laboratory of Stanford University, intended to provide an intuitive model of natural language understanding.

1.1.2.4. 1980s
The most prominent feature of natural language understanding systems in the 1980s is their practical and engineering orientation. An important sign of this is the appearance on the international market of a number of commercial natural language human-machine interfaces and machine translation systems. Well-known examples include the English human-machine interface system Intellect, produced by the American company Artificial Intelligence Corporation; the system produced by the Frey company; the ASK interface developed at the California Institute of Technology; the European Community's machine translation among English, French, German, Spanish, Italian, and Portuguese, carried out successfully on the basis of the machine translation system developed at Georgetown University; the TAUM-METEO system; the ATLAS English-Japanese translation system developed by Fujitsu of Japan; and the Japanese translation systems developed by Hitachi of Japan. The "Translation Star" system developed by China Software Corporation during the Seventh Five-Year Plan period is another example.

Corpus linguistics
Corpus linguistics is a new branch of computational linguistics that emerged in the 1980s. "It studies the collection, storage, retrieval, and statistical analysis of machine-readable natural language text, syntactic annotation, syntactic-semantic analysis, and the application of corpora with these functions to quantitative language analysis, dictionary compilation, stylistic analysis, natural language understanding, and machine translation." Corpus linguistics began to rise because, above all, it met the need to process large-scale real text and proposed a new, corpus-based approach to natural language processing. This school holds that the true source of linguistic knowledge is large-scale living corpora. The task of computational linguists is to acquire automatically, from large-scale corpora, the knowledge needed to understand language. They must describe the linguistic facts contained in the corpus objectively rather than subjectively.

1.1.2.5. 1990s
In August 1990, at the 13th International Conference on Computational Linguistics held in Helsinki, the organizers put forward the strategic goal of processing large-scale real text and organized pre-conference sessions on "The role of large-scale corpora in building natural language systems", "Dictionary knowledge and representation", and "Electronic dictionaries", which heralded a new historical stage in language information processing.

1.1.2.6. 2

1.1.2.7. 21st century

1.1.2.8. References
1) Shijiyi, Huang Changning, Wang Jiaqin, "Artificial Intelligence Principles", Tsinghua University Press
2) Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, http://www-nlp.stanford.edu/fsnlp/
3) Weighing, "Based on Corpus and Contemporary Natural Language Processing Technology", http://www.icl.pku.edu.cn/research/papers/chinese/collection-2/zqlw6.htm

1.1.3. Research content

1.1.3.1. Studying the nature of language from a computational perspective
To study the nature of language from a computational perspective is to express one's understanding of the structure of language in a precise, formalized, and computable way, rather than in the descriptive statements used in other branches of linguistics.

1.1.3.2. Studying algorithms that take language as the object of computation
This means studying the procedures for processing language objects (mainly natural language objects, although formal language objects are also possible). It includes recognizing a language fragment (such as a phrase, sentence, or discourse), analyzing the structure and meaning of a language fragment (natural language understanding), and generating a language fragment that expresses a given meaning (natural language generation), and so on.

1.1.4. Levels of language analysis

1.1.4.1. Levels based on constituent size
1.1.4.1.1. Word
1.1.4.1.2. Phrase
1.1.4.1.3. Sentence
1.1.4.1.4. Paragraph
1.1.4.1.5. Discourse

1.1.4.2. Levels based on linguistic features
1.1.4.2.1. Phonology: the relationship between words and their pronunciation.
1.1.4.2.2. Morphology: how words are formed, such as friend-ly.
1.1.4.2.3. Syntax
1.1.4.2.4. Semantics
1.1.4.2.5. Pragmatics

1.1.5. Application areas
1.1.5.1. Machine translation and machine-aided translation
1.1.5.2. Speech recognition
1.1.5.3. Speech synthesis
1.1.5.4. Text classification
1.1.5.5. Information retrieval
1.1.5.6. Information extraction and automatic summarization
1.1.5.7. Human-machine interface
1.1.5.8. Story comprehension and question answering systems

1.1.6. Related disciplines

1.1.6.1. Interdisciplinary connections

1.1.6.2. Philosophy
How words and sentences come to have meaning, and how they pick out objects in the world. What beliefs, goals, and meaning are, and how they relate to language. The research method is natural language argumentation based on intuition.

1.1.6.3. Mathematics
1.1.6.3.1. Mathematical logic
1.1.6.3.2. Graph theory
1.1.6.3.3. Probability theory

1.1.6.4. Linguistics
The structure of language: how words form phrases, how phrases form sentences, what the meaning of a sentence is, and so on. Research tools: human intuitions about well-formedness and meaning, and mathematical tools such as formal language theory and model-theoretic semantics.

1.1.6.5. Psychology
Studies the processes of human language production and comprehension: how people identify the correct structure of a sentence, how they determine the correct meaning of a word, and when understanding takes place. The research methods are experimental techniques that measure human performance and statistical analysis of the observations.

1.1.6.6. Computer science
1.1.6.6.1. Artificial intelligence
1.1.6.6.2. Machine learning
1.1.6.6.3. Pattern recognition

1.1.6.7. Information science
1.1.6.7.1. Databases
1.1.6.7.2. Data mining
1.1.6.7.3. Data warehouses
1.1.6.7.4. Information extraction
1.1.6.7.5. Automatic summarization
1.1.6.7.6. Information classification
1.1.6.7.7. Information retrieval
1.1.6.7.8. Information filtering

1.2. Features of English

1.3. Features of Chinese

2. Volume

3. Lessment

4. Syntax

4.1.
Syntax studies how words form phrases, how words and phrases form correct sentences, and what role each word plays in the structure of a sentence.

4.1.1. Tasks of syntactic analysis
In the analysis of natural language, syntactic analysis has two main tasks:
1. Recognize the sentences of a language and determine the structure of an input sentence. Given a grammar G and the language L that it describes:
(1) given a string S, decide whether S belongs to L;
(2) given a string S, if S belongs to L, produce the tree structure of S.
2. Normalize syntactic structure. If a large number of possible input structures can be mapped to a smaller number of structures, subsequent processing (e.g., semantic analysis) is simplified. Some examples of structural normalization:
(1) Some constituents can be omitted or be "zero" in a sentence.
(2) Various transformations relate different surface structures to each other, such as active voice and passive voice.
(3) Normal word order and the so-called cleft structure: "That I like wine is evident." / "It is evident that I like wine."
(4) Nominal structure and verbal structure: "the barbarians' destruction of Rome" / "the barbarians destroyed Rome".
Transformations of this kind let subsequent processing deal with only a small number of structures.

4.1.2. Types of syntactic analysis
1. Traditional non-probabilistic methods / probabilistic methods (PCFG)
2. Partial parsing / shallow parsing
3. Top-down parsing (predictive parser) / bottom-up parsing (shift-reduce parser)
4. Deterministic parser / non-deterministic parser

4.1.3. Families of formal grammar
1) TG, GB, MP, ...
2) LFG, GPSG, HPSG, ...
3) PATR-II, DCG, FUG, ...
4) Tree-adjoining grammar (TAG)
5) Link grammar
6) Categorial grammar
7) Dependency grammar
8) Word grammar
...

4.1.4. Classification of contemporary formal grammar theories

4.1.5. Evolution of formal grammar theories

4.2. Theory

4.2.1. Formal languages and automata

4.2.1.1. Basic concepts

4.2.1.1.1. Basic concepts

4.2.1.1.1.1. Alphabet
An alphabet is a non-empty finite set of elements. The elements of an alphabet are called symbols, so an alphabet is also called a symbol set.

4.2.1.1.1.2. Word (also called string or symbol string) and empty word (also called empty string)
A word is a finite sequence of elements of an alphabet. The order of the symbols in a string matters. If a string x contains m symbols, its length is m, written |x| = m. The sequence that contains no symbols is called the empty word, written ε, with |ε| = 0. The set of all words over an alphabet Σ is written Σ*. Σ* is called the set of strings over the alphabet Σ.

4.2.1.1.1.3. Empty set
The empty set is the set that contains no elements, written ∅.

The n-fold concatenation of a set of strings V with itself (also called the n-th power of V) is written V^n = VV...V (n copies of V), with V^0 = {ε} by definition. Let V* = V^0 ∪ V^1 ∪ V^2 ∪ V^3 ∪ ...; V* is called the closure of V. Let V+ = V V*; V+ is called the positive closure of V. Clearly, εx = xε = x when x is a string, and {ε}X = X{ε} = X when X is a set of strings.

4.2.1.1.2. Regular expressions and regular sets
Regular expressions and regular sets are defined recursively:
1. ε and ∅ are regular expressions over Σ; the regular sets they denote are {ε} and ∅.
2. For any a ∈ Σ, a is a regular expression over Σ; the regular set it denotes is {a}.
3. If U and V are regular expressions denoting the regular sets L(U) and L(V), then (U|V), (UV), and (U)* are also regular expressions, denoting L(U) ∪ L(V), L(U)L(V) (concatenation), and (L(U))* (closure).
Only expressions obtained by a finite number of applications of these three steps are regular expressions over Σ, and only the sets denoted by such expressions are regular sets over Σ. If two regular expressions U and V denote the same regular set, they are said to be equivalent, written U = V.

4.2.1.2. Automata

4.2.1.2.1. State transition diagram
A state transition diagram is a finite directed graph. Its nodes represent states and are drawn as circles. States are connected by arrows (arcs). The label (a symbol or a string) on an arc is the input symbol or string that may appear in the state at the tail of the arc. A state transition diagram contains only finitely many states, some of which are initial states and some of which are final states (drawn as double circles).

State transition diagrams can be used to construct lexical and syntactic analysis programs. To generate analysis programs automatically, however, state transition diagrams must be formalized. This gives rise to automata theory.

4.2.1.2.2. ε-closure and a-arc transition
1) The ε-closure of a state set I, written ε-closure(I), is the state set defined by:
a) if s ∈ I, then s ∈ ε-closure(I);
b) if s ∈ I, then every state s' that can be reached from s through any number of ε arcs is in ε-closure(I).
2) The a-arc transition of a state set I, written Move(I, a), is a state set: let J = Move(I, a); then J is the set of all states that can be reached from some state in I through one a arc.
For a state set I and an arc a, we define I_a = ε-closure(J), where J = Move(I, a). That is, I_a is the ε-closure of the a-arc transition of the state set I.

4.2.1.2.3. Deterministic finite automaton (DFA)
A deterministic finite automaton (DFA) M is a five-tuple M = (S, Σ, f, s0, Z), where:
1. S is a finite set; each of its elements is called a state.
2. Σ is a finite alphabet; each of its elements is called an input character, so Σ is called the input alphabet.
3. f is a partial (single-valued) mapping from S × Σ to S. f(s, a) = s' means that when the current state is s and the input character is a, the automaton moves to the next state s'. We call s' a successor state of s.
4. s0 is an element of S. It is the unique initial state, also called the start state.
5. Z is a subset of S, the set of final states (it may be empty). A final state is also called an accepting state or end state.
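The five-tuple definition translates almost directly into code. The following Python sketch assumes the mapping f is stored as a dictionary keyed by (state, character) pairs; the example automaton `EVEN_ONES`, which accepts binary strings containing an even number of 1s, is made up for illustration and does not come from the text.

```python
def dfa_accepts(dfa, word):
    """Run the DFA (S, sigma, f, s0, Z) on word and report acceptance."""
    states, alphabet, f, s0, finals = dfa
    state = s0
    for ch in word:
        # f is a partial mapping: a missing entry means the input is rejected.
        if (state, ch) not in f:
            return False
        state = f[(state, ch)]
    return state in finals


# Hypothetical example: binary strings with an even number of 1s.
EVEN_ONES = (
    {"even", "odd"},                      # S
    {"0", "1"},                           # sigma
    {                                     # f
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
    "even",                               # s0
    {"even"},                             # Z
)

if __name__ == "__main__":
    for w in ["", "1001", "10"]:
        print(repr(w), dfa_accepts(EVEN_ONES, w))
```

Because f is single-valued, the loop keeps exactly one current state, so recognition takes time linear in the length of the input.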

A deterministic finite automaton (DFA) can be represented as a (deterministic) state transition diagram.

4.2.1.2.4. Nondeterministic finite automaton (NFA)
A nondeterministic finite automaton (NFA) M is a five-tuple M = (S, Σ, f, S0, Z), where:
1. S is a finite set; each of its elements is called a state.
2. Σ is a finite alphabet; each of its elements is called an input character, so Σ is called the input alphabet.
3. f is a mapping from S × Σ* to the subsets of S, that is, f: S × Σ* → 2^S.
4. S0 is a subset of S, the non-empty set of initial states.
5. Z is a subset of S, the set of final states (it may be empty).

A nondeterministic finite automaton (NFA) can be represented as a (nondeterministic) state transition diagram.

A DFA is a special case of an NFA. Conversely, for every NFA M there is a DFA M' such that L(M) = L(M').

4.2.1.2.5. Minimization of a deterministic finite automaton
To minimize a deterministic finite automaton M is to find a DFA M' with fewer states than M such that L(M) = L(M'). We say that a finite automaton is minimized if it has no superfluous states and no two of its states are equivalent. A finite automaton can be converted into a minimal equivalent finite automaton by eliminating superfluous states and merging equivalent states.

A superfluous state of a finite automaton is a state that cannot be reached from the start state by any input string.

Let s and t be two different states of a DFA M. We say that s and t are equivalent if the following holds: whenever reading a word α from state s ends in a final state, reading the same word α from t also ends in a final state; and conversely, whenever reading a word α from state t ends in a final state, reading the same word α from s also ends in a final state. If two states s and t of a DFA M are not equivalent, they are said to be distinguishable.

We introduce a method called the "partition method", which divides the states of a DFA M (superfluous states excluded) into disjoint subsets such that states from any two different subsets are distinguishable and any two states in the same subset are equivalent.

Steps for partitioning the state set S of a DFA M:
1) Separate the final states of S from the non-final states, giving two subsets that form the basic partition π.
2) Suppose π has m subsets, π = {I(1), I(2), ..., I(m)}, and states belonging to different subsets are distinguishable. Check whether each I(i) in π can be split further. For a given I(i) = {s1, s2, ..., sk}, if there is an input character a such that I_a(i) is not contained in a single subset I(j) of the current π, split I(i) into two parts, I(i1) and I(i2), such that the states in I(i1) and the states in I(i2) are distinguishable. This forms a new partition π.
3) Repeat 2) until the number of subsets in π no longer grows, which gives the final partition π. For each subset of this π, choose one state to represent the other states of the subset. The resulting DFA M' is equivalent to the original DFA M.

4.2.1.2.6. Conversion from NFA to DFA
Theorem: Let L be a set accepted by a nondeterministic finite automaton. Then there exists a deterministic finite automaton that accepts L.

Subset construction: an algorithm that converts an NFA into a DFA accepting the same language. It is described in detail below.

Basic idea: each state of the DFA corresponds to a set of states of the NFA. The DFA uses its state to record all the states that the NFA may reach after reading an input symbol. That is, after reading the input string a1a2...an, the DFA is in a state that represents a subset T of the states of the NFA, where T consists of the states that can be reached from the start state of the NFA along some path labeled a1a2...an.

Algorithm: for an NFA M_N = (S_N, Σ_N, f_N, S_0N, Z_N), we construct a DFA M_D = (S_D, Σ_D, f_D, s_0D, Z_D) such that L(M_N) = L(M_D), as follows:
1) The state set S_D of M_D consists of some subsets of S_N (the algorithm for constructing these subsets of S_N is given later).

We write [S_D1, S_D2, ..., S_Dj] for a state of S_D, where S_D1, S_D2, ..., S_Dj are states of S_N. The states S_D1, S_D2, ..., S_Dj are listed in a fixed order, so that, for example, the subset {S_D1, S_D2} = {S_D2, S_D1} always gives the S_D state [S_D1, S_D2].
2) M_D and M_N have the same input alphabet, that is, Σ_D = Σ_N.
3) The transition function f_D is defined by: f_D([S_D1, S_D2, ..., S_Dj], a) = ε-closure(Move([S_D1, S_D2, ..., S_Dj], a)).
4) s_0D = ε-closure(S_0N).
5) Z_D = {[S_Dp, S_Dq, ..., S_Dr] | [S_Dp, S_Dq, ..., S_Dr] ∈ S_D and {S_Dp, S_Dq, ..., S_Dr} ∩ Z_N ≠ ∅}.

The algorithm for constructing the subsets of the state set S_N of the NFA M_N is given below. Let the family of constructed subsets be C = (T1, T2, ..., Ti), where T1, T2, ..., Ti are subsets of S_N:
1. Initially, let ε-closure(S_0N) be the only member of C, and leave it unmarked.
2. while (C contains an unmarked subset T) do {
     mark T;
     for each input letter a (a ≠ ε) do {
       U := ε-closure(Move(T, a));
       if U is not in C then
         add U to C as an unmarked subset;
     }
   }
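The subset construction above can be sketched in Python as follows. The representation is an assumption of this sketch: NFA transitions are a dictionary from (state, symbol) to a set of states, ε arcs use the empty string as their label, and the example NFA (which recognizes strings over {a, b} ending in abb) is made up for illustration. One deliberate deviation from the pseudocode: an empty U is not added to C, so the resulting f_D stays a partial mapping.

```python
from collections import deque

EPS = ""  # label used for epsilon arcs in this sketch


def eps_closure(delta, states):
    """All states reachable from `states` through any number of epsilon arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(delta, states, a):
    """All states reachable from some state in `states` through one a arc."""
    result = set()
    for s in states:
        result |= delta.get((s, a), set())
    return result


def subset_construction(alphabet, delta, starts, finals):
    start = eps_closure(delta, starts)
    dfa_delta = {}
    seen = {start}                 # the family C
    unmarked = deque([start])
    while unmarked:
        T = unmarked.popleft()     # taking T off the queue marks it
        for a in sorted(alphabet):
            U = eps_closure(delta, move(delta, T, a))
            if not U:
                continue           # no a arc from T: f_D stays partial
            dfa_delta[(T, a)] = U
            if U not in seen:
                seen.add(U)
                unmarked.append(U)
    dfa_finals = {T for T in seen if T & set(finals)}
    return seen, dfa_delta, start, dfa_finals


def run_dfa(dfa_delta, start, dfa_finals, word):
    state = start
    for ch in word:
        if (state, ch) not in dfa_delta:
            return False
        state = dfa_delta[(state, ch)]
    return state in dfa_finals


# Hypothetical NFA: strings over {a, b} that end in "abb".
ALPHABET = {"a", "b"}
DELTA = {
    ("s", EPS): {"0"},
    ("0", "a"): {"0", "1"},
    ("0", "b"): {"0"},
    ("1", "b"): {"2"},
    ("2", "b"): {"3"},
}
STARTS, FINALS = {"s"}, {"3"}

if __name__ == "__main__":
    states, d, start, fin = subset_construction(ALPHABET, DELTA, STARTS, FINALS)
    print(len(states), "DFA states; accepts 'babb':", run_dfa(d, start, fin, "babb"))
```

Using `frozenset` for DFA states plays the role of the bracket notation [S_D1, ..., S_Dj]: two subsets with the same members are the same state regardless of the order in which they were discovered.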

Example: convert the NFA shown in the following figure into a DFA.

4.2.1.3. Grammar

4.2.1.3.1. Rules
A rule, also called a rewriting rule, production rule, or production, is an ordered pair (α, β) written α → β or α ::= β, where α is a string in the positive closure V+ of some alphabet V and β is a string in V*. α is called the left-hand side of the rule and β the right-hand side.

4.2.1.3.2. Definition of a grammar
A grammar G is defined as a quadruple (V_T, V_N, S, R), where:
V_T is the set of terminal symbols, a non-empty finite set. Terminals are the basic symbols that make up the language.
V_N is the set of nonterminal symbols (also called syntactic entities or variables), a non-empty finite set. Nonterminals denote syntactic categories. V_T ∩ V_N = ∅.
S is called the distinguished symbol or start symbol. It is a nonterminal and appears as the left-hand side of at least one rule.
R is the set of productions (also called rules). Each production has the form α → β with α, β ∈ (V_T ∪ V_N)*, where α contains at least one nonterminal and is not the empty string. At least one production in R has S as its left-hand side.
Usually V = V_T ∪ V_N, and V is called the alphabet or vocabulary of the grammar G.
Example: G = (V_T = {0, 1}, V_N = {S}, S, R = {S → 0S1, S → 01})
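To see what language the example grammar with the rules S → 0S1 and S → 01 generates, one can enumerate its sentences mechanically. The Python sketch below is a minimal illustration under two assumptions that are not part of the text: every symbol is a single character, and every left-hand side is a single nonterminal (as in the example). It also relies on no rule shortening a sentential form, so forms longer than `max_len` can safely be discarded.

```python
from collections import deque


def generate(rules, start, nonterminals, max_len):
    """Enumerate all sentences of length <= max_len by breadth-first
    leftmost derivation. Assumes no rule shortens a sentential form."""
    sentences = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, c in enumerate(form) if c in nonterminals), None)
        if idx is None:
            sentences.add(form)    # only terminals left: a sentence
            continue
        for lhs, rhs in rules:
            if form[idx] == lhs:
                new = form[:idx] + rhs + form[idx + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return sorted(sentences, key=lambda s: (len(s), s))


RULES = [("S", "0S1"), ("S", "01")]

if __name__ == "__main__":
    print(generate(RULES, "S", {"S"}, 6))
```

The enumeration suggests that the grammar generates exactly the strings consisting of n zeros followed by n ones, for n ≥ 1.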

A grammar has three uses:
1) Generation: generate all the sentences of the language L;
2) Decision: decide whether a string belongs to the language L;
3) Analysis: obtain the structure of a sentence.

4.2.1.4. Language

4.2.1.4.1. Direct derivation / derivation / derivability
For a grammar G = (V_T, V_N, S, R), we say that αAβ directly derives αγβ, written αAβ ⇒ αγβ, only if A → γ is a production in R and α, β ∈ (V_T ∪ V_N)*. If α1 ⇒ α2 ⇒ ... ⇒ αn, this sequence is called a derivation from α1 to αn. If there is a derivation from α1 to αn, we say that αn is derivable from α1. We write α1 ⇒+ αn to mean that αn can be derived from α1 in one or more steps, and α1 ⇒* αn to mean that αn can be derived from α1 in zero or more steps.

4.2.1.4.2. Leftmost derivation / rightmost derivation
Leftmost derivation: every step α ⇒ β replaces the leftmost nonterminal in α. Rightmost derivation: every step α ⇒ β replaces the rightmost nonterminal in α. In formal language theory, the rightmost derivation is often called the canonical derivation. A sentential form obtained by a canonical derivation is called a canonical sentential form.

4.2.1.4.3. Sentential form / sentence / language
For a grammar G = (V_T, V_N, S, R), if S ⇒* α, then α is called a sentential form. A sentential form that contains only terminals is a sentence. The set of all sentences of the grammar G is a language, written L(G):
L(G) = {α | S ⇒+ α and α ∈ V_T*}
For grammars G1 and G2, if L(G1) = L(G2), the grammars G1 and G2 are said to be equivalent.

Recursive language: if a program can be written that decides, in a finite number of steps, whether any given string is a sentence of the language, the language is recursive.
Recursively enumerable language: if a program can be written that outputs (i.e., enumerates) the sentences of a language one by one in some order, the language is recursively enumerable.

4.2.1.5. Formal languages
Chomsky established formal language theory in 1956. He divided grammars into four types: type 0, type 1, type 2, and type 3. These types differ in the restrictions they place on productions.

For a grammar G = (V_T, V_N, S, R):
0) If every production α → β satisfies α ∈ (V_T ∪ V_N)* with at least one nonterminal in α, and β ∈ (V_T ∪ V_N)*, then G is a type 0 grammar. Type 0 grammars are also called phrase structure grammars (PSG). A very important theoretical result is that the power of type 0 grammars is equivalent to that of Turing machines. In other words, every type 0 language is recursively enumerable; conversely, every recursively enumerable set is a type 0 language. However, some type 0 languages are not recursive.

1) Let G be a type 0 grammar. If every production α → β satisfies |α| ≤ |β|, with the sole exception of S → ε (in which case S must not appear on the right-hand side of any production), then G is a type 1 grammar, or context-sensitive grammar (CSG). An equivalent definition: let G be a type 0 grammar; if every production of G has the form αAβ → αγβ with A ∈ V_N, γ ≠ ε, and α, β, γ ∈ (V_T ∪ V_N)*, then G is a type 1 or context-sensitive grammar. This definition shows that A may be replaced by γ only when it appears in the context of α and β.

2) Let G be a type 0 grammar. If every production of G has the form A → β with A ∈ V_N and β ∈ (V_T ∪ V_N)*, then G is a type 2 grammar, or context-free grammar (CFG). Such grammars are commonly written in BNF (Backus-Naur Form, or Backus Normal Form). This definition shows that a nonterminal can be replaced without regard to its context. Context-free grammars correspond to nondeterministic pushdown automata.

3) Let G be a type 0 grammar. If every production of G has the form A → αB or A → α with α ∈ V_T* and A, B ∈ V_N, then G is a type 3 grammar, also called a regular grammar (RG) or right-linear grammar. An alternative definition of type 3 or regular grammars: let G be a type 0 grammar; if every production of G has the form A → Bα or A → α with α ∈ V_T* and A, B ∈ V_N, then G is a type 3 or regular grammar (RG), in this case a left-linear grammar. Clearly, for any type 3 grammar G one can design an NFA that recognizes exactly the language of G.

The restrictions in the definitions of the four types become progressively stronger, so every regular grammar is context-free, every context-free grammar is context-sensitive, and every context-sensitive grammar is type 0. The language generated by a type 0 grammar is called a type 0 language. The languages generated by context-sensitive, context-free, and regular grammars are called context-sensitive languages, context-free languages, and regular languages, respectively.
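The restrictions that separate the four grammar types can be checked mechanically on a list of productions. The following Python sketch returns the most restricted type whose conditions all productions satisfy. It is an illustration under assumptions of its own: symbols are single characters, ε is written as the empty string, and the example grammars in the demo (other than the ones quoted in the text) are made up.

```python
def is_right_linear(rhs, nonterminals):
    """rhs has the form alpha or alpha B, with alpha made of terminals only."""
    body = rhs[:-1] if rhs and rhs[-1] in nonterminals else rhs
    return all(c not in nonterminals for c in body)


def is_left_linear(rhs, nonterminals):
    """rhs has the form alpha or B alpha, with alpha made of terminals only."""
    body = rhs[1:] if rhs and rhs[0] in nonterminals else rhs
    return all(c not in nonterminals for c in body)


def chomsky_type(productions, nonterminals, start="S"):
    """Return 3, 2, 1, or 0: the most restricted type the grammar fits."""
    # Type 0 requirement: every left-hand side contains a nonterminal.
    if not all(any(c in nonterminals for c in lhs) for lhs, _ in productions):
        raise ValueError("every left-hand side needs at least one nonterminal")

    # Types 2 and 3: every left-hand side is a single nonterminal.
    if all(len(lhs) == 1 for lhs, _ in productions):
        rights = [rhs for _, rhs in productions]
        if all(is_right_linear(r, nonterminals) for r in rights) or \
           all(is_left_linear(r, nonterminals) for r in rights):
            return 3
        return 2

    # Type 1: |lhs| <= |rhs|, except S -> epsilon when S is on no right side.
    start_on_rhs = any(start in rhs for _, rhs in productions)

    def noncontracting(lhs, rhs):
        if len(lhs) <= len(rhs):
            return True
        return lhs == start and rhs == "" and not start_on_rhs

    if all(noncontracting(lhs, rhs) for lhs, rhs in productions):
        return 1
    return 0


if __name__ == "__main__":
    print(chomsky_type([("S", "0S1"), ("S", "01")], {"S"}))   # context-free
    print(chomsky_type([("S", "aA"), ("S", "a"), ("A", "dA"), ("A", "d")],
                       {"S", "A"}))                           # regular
```

The function classifies a grammar, not a language: a language generated by a type 2 grammar may still be regular if some other, type 3, grammar generates it.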

Difficulty of the decision problem for each type of grammar:
1) PSG: semi-decidable. For a sentence l that belongs to L(G_type0), the answer "yes" can always be reached in a finite number of steps; but for a sentence l' that does not belong to L(G_type0), there is no algorithm that always reaches the answer "no" in a finite number of steps.
2) CSG: decidable; complexity: PSPACE-complete.
3) CFG: decidable; complexity: polynomial.
4) RG: decidable; complexity: linear.

4.2.1.6. Equivalence of regular expressions and finite automata
The equivalence of regular expressions and finite automata is stated as follows:
1. For every NFA M over Σ, a regular expression R over Σ can be constructed such that L(R) = L(M);
2. For every regular expression R over Σ, an NFA M over Σ can be constructed such that L(M) = L(R).

Proof:
1) Constructing the regular expression R that corresponds to an NFA M over Σ. We extend the notion of state transition diagram so that each arc may be labeled with a regular expression.
Step 1: add two nodes to the state transition diagram of M, a node X and a node Y. Connect X by ε arcs to all initial nodes of M, and connect all final nodes of M by ε arcs to Y. This forms M', which has a single initial state X and a single final state Y.
Step 2: eliminate the nodes of M' one by one until only X and Y remain. During elimination, the arcs are progressively relabeled with regular expressions. The elimination rules are as follows:
[Figure: elimination rules]
The label on the arc between X and Y at the end is the regular expression R.
2) Constructing, for any regular expression R over Σ, an NFA M such that L(M) = L(R).
a) Represent the regular expression R as a generalized state transition diagram, with a special diagram when R = ∅:

b) Transform this diagram step by step, splitting R and adding nodes, until each arc is labeled with a single character of Σ or with ε. The transformation rules are the inverse of the elimination rules in 1). For example: for R = (a|b)*abb, construct an NFA N such that L(N) = L(R).

4.2.1.7. Regular grammars are equivalent to regular expressions; regular languages are equivalent to regular sets.

4.2.1.7.1. Equivalence of regular grammars and regular expressions. For every regular grammar there is a regular expression that defines the same language; conversely, for every regular expression there is a regular grammar that generates the same language.

Proof:
1) Converting a regular expression r over Σ into a grammar G = (VT, VN, S, R). Let VT = Σ, and determine the productions and the elements of VN as follows:
a) For the regular expression r, choose a nonterminal S, create the production S→r, and make S the start symbol of G.
b) If x and y are regular expressions, a production of the form A→xy is rewritten as the two productions A→xB, B→y, where B is a newly chosen nonterminal, i.e. B∈VN.
c) A production of the form A→x*y is rewritten as A→xB, A→y, B→xB, B→y, where B is a new nonterminal.
d) A production of the form A→x|y is rewritten as A→x, A→y.
These rules are applied repeatedly until every production contains at most one terminal.
2) Converting a regular grammar into a regular expression. This is essentially the reverse of the process above. In the end only one definition, that of the start symbol, is left, and its right-hand side contains no nonterminals. The conversion rules from regular grammar to regular expression are listed in the table:

Rule     Regular grammar     Regular expression
Rule 1   A→xB, B→y           A = xy
Rule 2   A→xA | y            A = x*y
Rule 3   A→x, A→y            A = x|y

Examples: 1) Convert r = a(a|d)* into the corresponding regular grammar; 2) Convert the grammar G = (VT = {a, d}, VN = {S, A}, S, R = {S→aA, S→a, A→aA, A→dA, A→a, A→d}) into the corresponding regular expression.

4.2.1.7.2. Regular languages are equivalent to regular sets.

4.2.1.8. Conversion between regular grammars and finite automata.

4.2.1.8.1. Conversion of a regular grammar to a finite automaton. An NFA M is constructed directly from a regular grammar G such that L(M) = L(G):
a) The alphabet is the same;
b) For each nonterminal of G, create a state of M (the same name may be used). The start symbol S of G becomes the start state S;
c) Add a new state Z as the final state of the NFA;
d) For each production of G of the form A→tB (where t is a terminal or ε, and A and B are nonterminals), construct a transition of M: f(A, t) = B; for each production of G of the form A→t, construct a transition f(A, t) = Z.
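The construction in steps a) to d) can be sketched in Python. This is only an illustration under stated assumptions, not part of the original notes: the names `grammar_to_nfa`, `eps_closure` and `accepts`, and the tuple encoding of productions, are hypothetical, and ε is represented by the empty string.

```python
from collections import defaultdict

FINAL = "Z"  # the new final state added in step c)


def grammar_to_nfa(productions, start="S"):
    """Build NFA transitions from right-linear productions.

    Each production is (A, t, B) for A -> tB, or (A, t, None) for A -> t,
    where t is a single terminal or "" (standing for epsilon).
    """
    delta = defaultdict(set)
    for lhs, t, rhs in productions:
        # Step d): A -> tB gives f(A, t) = B; A -> t gives f(A, t) = Z.
        delta[(lhs, t)].add(rhs if rhs is not None else FINAL)
    return delta, start, {FINAL}


def eps_closure(delta, states):
    """All states reachable from `states` through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, ""), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(nfa, word):
    """Simulate the NFA on `word` by tracking the set of reachable states."""
    delta, start, finals = nfa
    current = eps_closure(delta, {start})
    for ch in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = eps_closure(delta, moved)
    return bool(current & finals)


# Example grammar: S -> aA | a, A -> aA | dA | a | d
productions = [
    ("S", "a", "A"), ("S", "a", None),
    ("A", "a", "A"), ("A", "d", "A"),
    ("A", "a", None), ("A", "d", None),
]
nfa = grammar_to_nfa(productions)
```

With this grammar, `accepts(nfa, "add")` is true and `accepts(nfa, "da")` is false, since every sentence must begin with a.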

4.2.1.8.2. Conversion of a finite automaton to a regular grammar. A regular grammar G is constructed directly from a finite automaton (NFA) M such that L(G) = L(M):
a) The alphabet of the finite automaton is the terminal set of the grammar;
b) The initial state of the finite automaton corresponds to the start symbol of the grammar;
c) The conversion rule is very simple: for each transition f(A, t) = B, write a production A→tB; for each final state Z, add a production Z→ε.

4.2.1.9. Lexical analysis. The main task of lexical analysis is to scan the input string from left to right, character by character, and to produce a sequence of words for syntax analysis. Regular expressions are a simple and convenient way to specify the structure of words. A regular expression is then compiled (or converted) into an NFA, which is converted into the corresponding DFA; this DFA is a recognizer for the sentences of the language denoted by the regular expression.

4.2.1.10. Syntax analysis with context-free grammars. Context-free grammars are powerful enough to describe the syntactic structure of today's programming languages. At present, context-free grammars are used as the tool for describing programming language syntax.

4.2.1.10.1. Syntax tree. A syntax tree, also called a derivation tree, is an intuitive way to describe the derivation of a sentential form of a context-free grammar. For any sentential form of a context-free grammar G = (VT, VN, S, R), a syntax tree associated with it can be constructed that satisfies the following four conditions:
1. Each node has a label, which is a symbol in VT∪VN;
2. The label of the root is S;
3. If a node n has at least one direct descendant and is labeled A, then A must be in VN;
4. If the direct descendants of a node n (labeled A) are, from left to right, the nodes n1, n2, n3, ..., nk, labeled A1, A2, A3, ..., Ak, then A→A1A2A3...Ak must be a production in R.
Example: for the grammar G = ({S, A}, {a, b}, S, R), where R is (1) S→aAS (2) A→SbA (3) A→SS (4) S→a (5) A→ba, construct the syntax tree of the sentence aabbaa.

4.2.1.10.2. Ambiguity of grammars. If some sentence of a grammar corresponds to two different syntax trees, the grammar is said to be ambiguous. Equivalently, if some sentence of a grammar has two different leftmost (or rightmost) derivations, the grammar is ambiguous. Theorem: the ambiguity problem is undecidable. That is, there is no algorithm that can decide in a finite number of steps whether an arbitrary grammar is ambiguous.

4.2.1.10.3. Syntax analysis methods.
Left-to-right analysis: the input symbol string is always recognized from left to right; the leftmost symbol is recognized first, and then the symbols to its right, one at a time.
Right-to-left analysis: the input symbol string is always recognized from right to left; the rightmost symbol is recognized first, and then the symbols to its left, one at a time.
Top-down analysis: also called goal-directed analysis. Starting from the start symbol of the grammar, productions are applied repeatedly in search of a derivation that "matches" the input symbol string. Top-down analysis can be deterministic or nondeterministic. The deterministic method requires restrictions on the grammar, but because it is simple and intuitive and lends itself to manual construction or automatic generation of parsers, it is still one of the commonly used methods. The nondeterministic method, that is, analysis with backtracking, is essentially exhaustive trial and error, so its efficiency is low and its cost is high, and it is rarely used.
Bottom-up analysis: starting from the input symbol string, reduce step by step until the start symbol of the grammar is reached.

The difference between these two kinds of method can be understood in terms of the syntax tree. The top-down method starts from the start symbol of the grammar, takes it as the root of the syntax tree, and builds the tree downward step by step, so that the string of leaf symbols of the tree is exactly the input symbol string. The bottom-up method starts from the input symbol string, takes it as the string of leaf symbols of the syntax tree, and builds the tree from the bottom up. Only left-to-right analysis is discussed below.

4.2.1.10.4. Top-down analysis.

4.2.1.10.4.1. The problem. In top-down analysis, suppose the leftmost nonterminal to be expanded is V, and there are n rules: V→α1 | α2 | ... | αn. How do we decide which right-hand side should replace V? One solution is to pick one of the alternatives at random and hope that it is correct. If it later turns out to be wrong, we must go back and try another alternative; this is called backtracking. Obviously this is very costly and inefficient, and it is the problem we need to solve.

4.2.1.10.4.2. Recursive descent analysis.

4.2.1.10.5. Bottom-up analysis.

4.2.1.10.5.1. Basic concepts.

4.2.1.10.5.1.1. Phrase / direct phrase / handle. Let G be a grammar, S its start symbol, and αβδ a sentential form of G. If S =*=> αAδ and A =+=> β, then β is called a phrase of the sentential form αβδ relative to the nonterminal A. In particular, if A => β in a single step, then β is called a direct phrase (also called a simple phrase) of αβδ relative to the rule A→β. The leftmost direct phrase of a sentential form is called the handle of that sentential form.

4.2.1.10.5.1.2. Canonical reduction.

4.2.1.10.5.2. The problem. In bottom-up analysis, at every step the parser finds a substring of the current string and reduces it to a nonterminal; for now we call such a substring a "reducible string".
The question is how to determine this reducible string at each step, that is, how to choose a substring at each step so that reducing it leads to the goal rather than to a dead end.

4.2.1.10.5.3. Operator-precedence analysis.

4.2.1.10.5.4. LR parsers.

4.2.1.11. References
1) "Principles of Programming Language Compilation", National Defense Industry Press
2) "Compiler Principles", Tsinghua University computer series textbook

4.2.2. State transition networks.

4.2.2.1. Finite state transition network (FSTN). A finite state transition network (Finite State Transition Network) consists of a set of states (nodes) and a set of arcs (each connecting one state to another):
1) One of the states is designated as the start state;
2) Each arc is labeled with a grammatical symbol (a word, a word class, etc.). The label indicates that such a word must be found in the input sentence before the transition specified by the arc can be made;
3) A subset of the state set is designated as final states.
If processing starts in the start state at the beginning of the input sentence (or phrase) and, after a series of transitions, reaches a final state exactly at the end of the sentence, the sentence (or phrase) is said to be accepted (or recognized) by the transition network.
Note: a finite state transition network can only be used to generate or recognize regular (type 3) languages.
For example: the following figure shows an FSTN that recognizes "Dong Yong likes the Seventh Fairy".

4.2.2.2. State transition table (State Transition Table)
State; transition arcs N, V: Q0 Q1 Q1 Q1 Q2 Q2 Q1
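A table-driven FSTN recognizer can be sketched in Python as follows. This is a minimal illustration: the lexicon, the state names and the table `TABLE` are assumptions chosen for the example (a simple N V N chain) and are not the table from the notes.

```python
# Hypothetical lexicon mapping words to word classes (assumption).
LEXICON = {"Dong Yong": "N", "likes": "V", "the Seventh Fairy": "N"}

# Hypothetical state transition table: state -> {arc label -> next state}.
TABLE = {
    "Q0": {"N": "Q1"},
    "Q1": {"V": "Q2"},
    "Q2": {"N": "Q3"},
    "Q3": {},
}
START, FINALS = "Q0", {"Q3"}


def recognize(words):
    """Return True if the word sequence drives the network from the
    start state to a final state exactly at the end of the input."""
    state = START
    for word in words:
        category = LEXICON.get(word)          # None for unknown words
        next_state = TABLE[state].get(category)
        if next_state is None:                # no arc with this label
            return False
        state = next_state
    return state in FINALS
```

Here `recognize(["Dong Yong", "likes", "the Seventh Fairy"])` succeeds, while a sentence that stops after the verb is rejected because Q2 is not a final state.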

4.2.2.3. Recursive transition network (RTN). A recursive transition network (Recursive Transition Network, RTN for short) is an extension of the finite state transition network: in an RTN the label of an arc can be not only a terminal (a word, a word class, etc.) but also a nonterminal that names another network. Notes: (1) Any subnetwork in an RTN can call any other subnetwork, including itself. (2) In generative power, recursive transition networks are equivalent to context-free grammars. For example: (figure)

4.2.2.3.1. RTN algorithm (top-down).

4.2.2.3.1.1. Algorithm description: basic concepts.
Subnetwork names: S, NP, VP, etc.;
State nodes: Q0, Q1, Q2, ...;
Outgoing arc: an arc along which the current state transfers to the next state;
String to be analyzed: W1W2W3 ... (e.g. "Dong Yong thinks ...");
Return stack: records which subnetwork was entered from, and the state of the calling subnetwork to return to;
Current state.

4.2.2.3.1.2. Algorithm
1. Initialization: the current state is the start state of subnetwork S, the string pointer points to the first character of the string to be analyzed, and the stack is empty.
2. Loop:
2.1. If the current state node is not a final state: point the current-arc pointer at the first outgoing arc of the current state node.
2.1.1. If the current outgoing arc is labeled with a terminal, compare the label with the symbol under the string pointer.
2.1.1.1. If they are equal, the prediction is confirmed: build the subtree, set the current state node to the successor state of the current outgoing arc, and update the current outgoing arc and the string pointer.
2.1.1.2. If they are not equal, move the current-arc pointer to the next outgoing arc. If there is no next outgoing arc and there is a backtrack point, backtrack; otherwise the analysis fails.
2.1.2. If the current outgoing arc is labeled with a nonterminal, push the current subnetwork name and the successor state of the current arc onto the stack, set the current state node to the start state of the named subnetwork, and update the current outgoing arc and the successor state.
2.1.3. If several arcs can be chosen at this point, a backtrack point must be set: save the return stack, the string still to be analyzed, the current network state, and the full list of outgoing arcs, for use when backtracking.
2.2. If the current state node is a final state but not the final state of subnetwork S: pop the return stack, set the current state to the state taken from the top of the stack, and continue with step 2.
2.3. If the current state node is a final state and is the final state of subnetwork S:
2.3.1. If the return stack is empty and the string to be analyzed is exhausted, the analysis succeeds and ends;
2.3.2. Otherwise, if there is a backtrack point, backtrack;
2.3.3. If there is no backtrack point, the analysis fails.
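The top-down procedure can be sketched compactly in Python. Note the substitution: instead of an explicit return stack and saved backtrack points, this sketch relies on Python's own call stack and on generators to enumerate alternatives, which has the same effect of trying every outgoing arc. The networks, state names and lexicon below are hypothetical and serve only as an example.

```python
# Each subnetwork has a start state, final states, and outgoing arcs:
# state -> list of (label, next_state). A label that names a subnetwork
# is a nonterminal; any other label is a word class (terminal).
NETWORKS = {
    "S":  {"start": "s0", "finals": {"s2"},
           "arcs": {"s0": [("NP", "s1")], "s1": [("VP", "s2")]}},
    "NP": {"start": "n0", "finals": {"n1"},
           "arcs": {"n0": [("N", "n1"), ("PRON", "n1")],
                    "n1": [("N", "n1")]}},
    "VP": {"start": "v0", "finals": {"v1", "v2"},
           "arcs": {"v0": [("V", "v1")], "v1": [("NP", "v2")]}},
}
LEXICON = {"I": "PRON", "am": "V", "county": "N", "magistrate": "N"}


def traverse(net, state, words, pos):
    """Yield every input position at which subnetwork `net` can finish
    when it is in `state` with the string pointer at `pos`."""
    spec = NETWORKS[net]
    if state in spec["finals"]:
        yield pos                      # return to the calling subnetwork
    for label, nxt in spec["arcs"].get(state, []):
        if label in NETWORKS:
            # Nonterminal arc: call the named subnetwork, then continue
            # in this subnetwork from each position where it can end.
            for mid in traverse(label, NETWORKS[label]["start"], words, pos):
                yield from traverse(net, nxt, words, mid)
        elif pos < len(words) and LEXICON.get(words[pos]) == label:
            # Terminal arc: the label matches the current word.
            yield from traverse(net, nxt, words, pos + 1)


def recognize(words):
    """Succeed if subnetwork S can finish exactly at the end of the input."""
    start = NETWORKS["S"]["start"]
    return any(end == len(words) for end in traverse("S", start, words, 0))
```

Because no subnetwork in this example calls itself before consuming a word, the recursion terminates; a left-recursive network would need an additional guard.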

4.2.2.3.1.3. Example: analysis of "I am a county magistrate": (figure)

4.2.2.4. Augmented transition network (ATN). The augmented transition network (Augmented Transition Network, ATN for short) is a formalism proposed by W. Woods in 1970 and successfully applied in his well-known LUNAR system.

4.2.2.4.1. Basic idea. An ATN grammar is an augmented context-free grammar. Its basic idea is to keep using a context-free grammar to describe the constituent structure of a sentence, but to attach functions to individual productions of the grammar, mainly to express necessary grammatical restrictions and to build the deep structure of the sentence.

4.2.2.4.2. ATN's extensions to RTN. ATN extends and strengthens RTN in the following three respects:
(1) A set of registers is added to store intermediate results obtained during analysis (such as partial syntax trees) and related information (such as the number of a noun phrase, the semantic features of certain constituents, etc.). Which registers to set up depends entirely on the needs of the syntactic analysis; there is no fixed rule;
(2) In addition to a syntactic category (such as a word-class or phrase label), each arc may carry an arbitrary test (TEST); the arc can be traversed only if the test succeeds;
(3) Actions may also be attached to each arc. When an arc is traversed, its actions are executed in order; they are mainly used to set or modify the contents of registers.
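The three extensions (registers, tests on arcs, actions on arcs) can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the lexicon, the register names `SUBJ_NUM` and `VERB_NUM`, and the single-level network, which checks subject-verb number agreement. For brevity the sketch takes the first arc that matches and does not backtrack or call subnetworks.

```python
# Hypothetical lexicon: word -> (word class, number feature).
LEXICON = {
    "dog": ("N", "sg"), "dogs": ("N", "pl"),
    "barks": ("V", "sg"), "bark": ("V", "pl"),
}


def set_reg(name):
    """Action: store the current word's number feature in register `name`."""
    def action(regs, word, number):
        regs[name] = number
    return action


def agrees_with(name):
    """Test: the current word's number must equal the register's content."""
    def test(regs, word, number):
        return regs.get(name) == number
    return test


def always(regs, word, number):
    """Test that always succeeds."""
    return True


# state -> list of (category, test, actions, next state)
ARCS = {
    "q0": [("N", always, [set_reg("SUBJ_NUM")], "q1")],
    "q1": [("V", agrees_with("SUBJ_NUM"), [set_reg("VERB_NUM")], "q2")],
}
FINALS = {"q2"}


def parse(words):
    """Return the final register contents, or None if the input is rejected."""
    state, regs = "q0", {}
    for word in words:
        category, number = LEXICON.get(word, (None, None))
        for arc_cat, test, actions, nxt in ARCS.get(state, []):
            # An arc is taken only if the category matches AND its test succeeds.
            if arc_cat == category and test(regs, word, number):
                for act in actions:      # actions run in order on traversal
                    act(regs, word, number)
                state = nxt
                break
        else:
            return None                  # no arc could be taken
    return regs if state in FINALS else None
```

In this sketch `parse(["dogs", "bark"])` fills both registers with "pl", while `parse(["dogs", "barks"])` is rejected by the agreement test on the verb arc.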

4.2.2.4.3. ATN formalism. The following is the ATN formalism defined in BNF: ::= ( *) ::= *) ::= (CAT * (TO )) | (TST <label> * (TO ) | (PUSH * * (TO )) | (POP

* ) | (JUMP *) ::= (SENDR ) | (LIFTR ) | (ADDR ) ::= (GETR ) | (BUILDQ