Constructing a Compiler with Lex & Yacc (4. Grammar Recognition (1))
Author: tangl_99
QQ: 8664220
MSN: TANGL_99@hotmail.com
Email: tangl_99@sohu.com
I didn't expect this series of articles to be so well received by 9CBS and its readers. First of all, thank you for the appreciation and for 9CBS's recommendation; after that I have no excuse not to write a few more articles. My original plan was simple: after introducing Lex and Yacc, I would move on to the technical details of constructing the compiler. However, after recently rereading some classic foreign textbooks, I realized that grammar recognition is a problem that cannot be overlooked in compiler principles and techniques. Even though we now have the Yacc tool to help us recognize grammars, sometimes we still need to write a simple parser by hand.
1. What is grammar recognition (syntax analysis)?
First of all, let me make clear that the grammars discussed here are context-free grammars written in BNF, the same BNF we have been discussing in the previous section.
For example, suppose I write the statement:
if (a > 65) printf("OK!"); else printf("NO!");
Then it matches the following production:
if-stmt -> if expr stmt
         | if expr stmt else stmt
We usually write quite a few BNF productions for a programming language. How do we know which production a given sentence matches? That is precisely the job of the syntax analyzer (or parser). Only when we know which production a sentence matches can we interpret that sentence correctly, so grammar recognition is work that cannot be ignored. Let us now look at the grammar-recognition algorithms in common use.
2. The top-down algorithm (the LL algorithm)
The top-down parsing algorithm is very simple. The top-down algorithm is also called the LL algorithm; LL(k) means a top-down algorithm that looks ahead k symbols. However, whether in our domestic compiler tutorials or in the foreign classics, it is the LL(1) algorithm that gets discussed, because for typical programming languages LL(1) is already sufficient. We too will discuss only the LL(1) algorithm.
There is a special form of it called recursive descent. In C, since functions can call themselves recursively, this algorithm requires nothing more than writing a few simple mutually recursive functions.
Why is it called top-down? Because during the analysis we expand the parse tree step by step from the root downward, hence the name top-down algorithm.
To show how simple the top-down algorithm is, let's look at an example from <<Compilers: Principles, Techniques, and Tools>>. (This series often quotes examples from the foreign classics; I hope nobody accuses me of plagiarism, for I simply cannot find examples more classic than the masters'.)
Example 4.1
Consider the grammar for type declarations in Pascal.
Note in particular that dotdot here stands for "..".
type   -> simple | id | array [ simple ] of type
simple -> integer | char | num dotdot num
When constructing the parse tree for array [ num dotdot num ] of integer, the algorithm starts from the root node.
Below we will look at how the algorithm works in three steps.
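Throughout the three steps it helps to picture the input string the way a lexer hands it to the parser: as a sequence of token codes. The following is a small sketch of my own; the token names and their values are assumptions made for illustration, not anything from the book.
_____________________________________________________________
/* assumed token codes; values start above 255 so that single-character
   tokens such as '[' and ']' can keep their character codes */
enum { INTEGER = 256, CHAR, NUM, DOTDOT, ID, ARRAY, OF };

/* the input "array [ num dotdot num ] of integer" as the parser sees it */
int input[] = { ARRAY, '[', NUM, DOTDOT, NUM, ']', OF, INTEGER };
_____________________________________________________________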
First step of the analysis:
[Figure: the partial parse tree, with the arrow pointing at the root node type]
First we look at the first token of the input string, "array", and determine which production's FIRST set it belongs to. As the parse tree in the figure shows, the node we are currently analyzing is the root node type. (The arrow drawn in the figure marks the part currently being analyzed.)
Here a new term appears: the FIRST set. The compiler course at any university certainly covers FIRST sets, but I will still repeat the definition here.
Term explanation, the FIRST set:
When deciding which production of a grammar to apply, note that each production consists of several terminal and non-terminal symbols. Take the grammar of this example:
type   -> simple
        | id
        | array [ simple ] of type
simple -> integer
        | char
        | num dotdot num
When deciding which production of type applies, suppose we tried to match every symbol of each candidate one by one: simple; id; array, [, simple, ], of, type. We would then face a trial-and-error problem: whenever the analysis of one production ultimately fails, we must backtrack to the beginning and start judging all over again with the next production. We know that backtracking algorithms are invariably very inefficient. In fact, we can avoid such backtracking entirely and still accomplish the same work, and this is where the theory of computing FIRST sets comes in. Simply put, FIRST(X) is the set of terminal symbols that can appear first in a string derived from the non-terminal X. For example:
FIRST(simple) = {integer, char, num}
FIRST(type) = FIRST(simple) ∪ {id, array}
Because type has a production whose right-hand side begins with the non-terminal simple, a string derived from type can also begin the way simple begins; therefore FIRST(simple) is also part of FIRST(type).
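For a grammar with no epsilon-productions, like this one, the FIRST sets can be computed with a simple fixed-point iteration. The sketch below is my own encoding of the example grammar, not code from any book; since no production derives the empty string, only the leading right-hand-side symbol of each production matters, so that is all the prods table records.
_____________________________________________________________
#include <stdio.h>

enum { TYPE, SIMPLE, NONTERMS };                    /* non-terminals */
enum { INTEGER, CHAR, NUM, DOTDOT, ID, ARRAY,
       LBRACK, RBRACK, OF, TERMS };                 /* terminals */

/* each production, reduced to its left-hand side and leading symbol;
   is_term says whether that leading symbol is a terminal */
struct prod { int lhs; int first_sym; int is_term; };

static struct prod prods[] = {
    { TYPE,   SIMPLE,  0 },   /* type   -> simple                   */
    { TYPE,   ID,      1 },   /* type   -> id                       */
    { TYPE,   ARRAY,   1 },   /* type   -> array [ simple ] of type */
    { SIMPLE, INTEGER, 1 },   /* simple -> integer                  */
    { SIMPLE, CHAR,    1 },   /* simple -> char                     */
    { SIMPLE, NUM,     1 },   /* simple -> num dotdot num           */
};

static int first[NONTERMS][TERMS];  /* first[A][t] != 0 iff t in FIRST(A) */

static void compute_first(void)
{
    int changed = 1, i, t, n = sizeof prods / sizeof prods[0];
    while (changed) {                         /* repeat until nothing new */
        changed = 0;
        for (i = 0; i < n; i++) {
            struct prod *p = &prods[i];
            if (p->is_term) {                 /* A -> t ... : add t        */
                if (!first[p->lhs][p->first_sym]) {
                    first[p->lhs][p->first_sym] = 1;
                    changed = 1;
                }
            } else {                          /* A -> B ... : add FIRST(B) */
                for (t = 0; t < TERMS; t++)
                    if (first[p->first_sym][t] && !first[p->lhs][t]) {
                        first[p->lhs][t] = 1;
                        changed = 1;
                    }
            }
        }
    }
}

int main(void)
{
    static const char *name[TERMS] =
        { "integer", "char", "num", "dotdot", "id", "array", "[", "]", "of" };
    int t;
    compute_first();
    printf("FIRST(type) = {");
    for (t = 0; t < TERMS; t++)
        if (first[TYPE][t])
            printf(" %s", name[t]);
    printf(" }\n");   /* prints: FIRST(type) = { integer char num id array } */
    return 0;
}
_____________________________________________________________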
Why do we compute only the set of symbols that can appear first for each non-terminal? Because what we are considering is the LL(1) algorithm, which looks ahead only one token; so considering only the FIRST sets is enough to determine which production to use.
This may sound implausible: with so many productions, can looking at only the first symbol really tell us which production an input string matches? What if two productions begin with the same symbol, as happens in an if statement? Yet in fact the grammars of almost all programming languages can be analyzed with LL(1). The reason is that, through left factoring, we can extract the common leading terminal shared by such productions, leaving two sub-productions whose leading symbols are different.
Left factoring:
Example 4.2
Consider the grammar
A -> a b
   | a c
Here the leading terminal in both of A's productions is 'a', so from this leading terminal alone we cannot judge which production applies. We can therefore modify the grammar:
A  -> a A'
A' -> b | c
In this way one production becomes two, but the FIRST sets of the alternatives of A, and likewise of A', no longer intersect. So for both of them the leading terminal alone now determines which production applies.
This transformation looks a bit like factoring ab + ac = a(b + c) in algebra, which is why it is called left factoring.
This is just a simple example of left factoring; in practice more complex cases arise. However, every compiler textbook gives an algorithm for left-factoring an arbitrary grammar. Likewise, the algorithm for computing FIRST sets is explained in detail in the textbooks, so I will not repeat it here.
Second step of the analysis:
[Figure: the parse tree after type has been expanded by its third production]
After the first step, since the first token of the input string, "array", belongs to the FIRST set of the third production of the non-terminal type, we can determine that the input string is derived from type's third production. So in the second step, type is expanded in the tree by its third production: type -> array [ simple ] of type.
Next we continue to analyze each child node of the production type -> array [ simple ] of type in turn.
So the arrow moves to the first child of type in the tree. Because array is a terminal, we only need to check whether it equals the terminal the input pointer currently refers to; if it does, the arrow moves down to the '[' symbol. Similarly, since '[' in the tree is a terminal, we just check whether the current input token is '['; if it is, we continue the analysis. When the analysis reaches the simple node of the parse tree, since simple is a non-terminal, we must once again consider which of simple's productions to use.
Third step of the analysis:
[Figure: the parse tree after simple has been expanded by simple -> num dotdot num]
In the second step, when the analysis reached the simple child node of the parse tree, since simple is a non-terminal we had to consider simple's productions. simple has three productions. The current token of the input string is "num", which belongs to the FIRST set of the third production, so simple is expanded in the parse tree by its third production, simple -> num dotdot num. The analysis arrow then moves to simple's first child node, num, and the analysis continues. Overall, the principle of top-down analysis is essentially the process above: by computing FIRST sets, non-terminals are expanded by the appropriate productions, until finally every leaf of the parse tree is a terminal symbol. (A non-terminal cannot be judged directly; it must be expanded further.)
Having seen the principle, let's look at the pseudo-code of the implementation. The code is simple.
_____________________________________________________________
int lookahead;          /* current input token, supplied by the lexer */

int nexttoken(void);    /* returns the next token from the lexer */
void error(int code);   /* reports a syntax error */
void simple(void);      /* forward declaration */

void match(int token)
{
    if (lookahead == token)
        lookahead = nexttoken();    /* consume the token, read the next one */
    else
        error(0);
}

void type(void)
{
    if (lookahead == INTEGER || lookahead == CHAR || lookahead == NUM)
        simple();
    else if (lookahead == ID)
        match(ID);
    else if (lookahead == ARRAY)
    {
        match(ARRAY); match('['); simple(); match(']'); match(OF); type();
    }
    else
        error(0);
}

void simple(void)
{
    if (lookahead == INTEGER)
        match(INTEGER);
    else if (lookahead == CHAR)
        match(CHAR);
    else if (lookahead == NUM)
    {
        match(NUM); match(DOTDOT); match(NUM);
    }
    else
        error(0);
}
_____________________________________________________________
Note: the code here is pure syntax-analysis code; it does nothing else. In an actual implementation, the code that constructs the syntax tree is embedded within this pure parsing code.
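Finally, for readers who want to actually run the pseudo-code above, here is a sketch of the scaffolding it needs: the token codes, a hard-wired token stream standing in for a real lexer, and a main(). Everything here (nexttoken(), the token values, END) is a stand-in of my own, and the token enum must be pasted above the parsing functions so that the constants are declared before use.
_____________________________________________________________
#include <stdio.h>
#include <stdlib.h>

/* assumed token codes; END marks the end of the input */
enum { INTEGER = 256, CHAR, NUM, DOTDOT, ID, ARRAY, OF, END };

/* token stream for: array [ num dotdot num ] of integer */
static int stream[] = { ARRAY, '[', NUM, DOTDOT, NUM, ']', OF, INTEGER, END };
static int pos = 0;

int nexttoken(void)          /* a real lexer (built with Lex, say) goes here */
{
    return stream[pos++];
}

void error(int code)
{
    printf("syntax error (%d)\n", code);
    exit(1);
}

int main(void)
{
    lookahead = nexttoken(); /* prime the one-token lookahead */
    type();                  /* parse a single type declaration */
    if (lookahead == END)
        printf("input accepted\n");
    else
        error(1);
    return 0;
}
_____________________________________________________________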
2003-10-23
Chengdu, Sichuan University