Said compiler from Lex & Yacc (5. Practical Javacc)
Foreword
The purpose of this series is to let everyone write their own compilers, interpretations, or script engines. After the theory introduces to a degree, I will discuss practical problems. Theoretical, compilation principles textbooks are already enough More, and practical problems are very small.
The first few articles only discussed the word methodology and LL grammar analysis. The key LR grammar analysis is not yet, we don't want to manage complex LR literacy and algorithms, let us use the LL algorithm to actually do something. This article An LL algorithm analysis tool JAVACC is widely used in Java. (This is the Syntax Analyzer Construction Tool I can find using the LL algorithm). This section is not only for Java developers, if you are C / C developers, then please take a look at the excellent tools under this Java, maybe you have used it in the future.
The two tools of Lex and Yacc are classic lexical analysis and grammatical analysis tools, but they are all based on the tool below C language, and friends who use Java can not be used. But Java already has Lex and Yacc Alternative Javacc (Java Compiler). At the same time, JavaCC is also a tool using the LL algorithm, and we can also practice the previous LL algorithm.
First, declare that I am not a Java expert, I have just arrived in java.java may have a lot of tools like Javacc, but as far as I know, Javacc is also the most extensive, the word French grammar under the most standard Java.
Javacc acquisition
Like Lex and Yacc, Javacc is also a common tool for free, it can download the website in many Java-related tools. Of course, the disk space occupied by JavaCC is more than LEX and YACC, there are standard documents. And Examples. Relative to Lex and Yacc, Javacc do more human, easier. If you can't find Javacc, you can contact me, I have this. Now the latest is the Javacc 3.2 version.
JAVACC principle
JavaCC can complete the work of textual analysis and syntax analysis at the same time, which is quite convenient. Similarly, it is the same as Lex and Yacc, first enter a file according to its specified format, then Javacc generates according to the file you entered The lexical analysis is analyzed in grammar analysis. At the same time, the new version of Javacc provides JJTree and other tools to help us build a syntax tree in addition to routine lexical analysis and grammatical analysis. In short, Javacc is doing more than LEX and Yacc Humanization, this can also be embodied in the back input file format.
Javacc input file
JavaCC's input file format is relatively simple. Each non-ending generating function corresponds to a function in a Class, which can be embedded in the function (also called action). This is called Yacc. It is consistent.
JAVACC's input file, a series of system parameters, such as where the LookAhead can be set to be more than 1, then, it can generate the LL (K) algorithm for us (k> = 1), not a simple recursive drop Such an LL (1) algorithm. To know, LL (2) grammar determines that each non-finalization is required to see two marks in front of each non-finalization, such as a constitutional limit. There is less. However, the algorithm of LL (2) is of course slower than the LL (1) algorithm. As a general computer programming language, the LL (1) algorithm is already enough. Even if it is not an LL (1) algorithm, We can also turn it into a LL (1) grammathematically via the previous levy. However, since Javacc makes the Lookahead selection, then in some specific circumstances, we can directly adjust one. The parameters of Lookahead can do not have to correct our grammar.
Let's take a look at the example in the javacc.
Example 5.1
This example can be seen in Javacc-3.2 / Doc / Examples / SimpleExamples / Simple1.jj
Parser_Begin (Simple1)
Public class simple1 {
Public static void main (string args []) THROWS PARSEEXCEPTION {
Simple1 Parser = New Simple1 (System.in);
PARSER.INPUT ();
}
}
Parser_END (Simple1) Void Input ():
{}
{
Matchedbraces () ("/ n" | "/ r") *
}
Void matchedbraces ():
{}
{
"{" [Matchedbraces ()] "}"
}
After setting the bin directory of JAVACC, enter the command prompt.
Javacc Simple1.jj
Then Javacc will generate the following Java source code files for you.
SIMPLE1.JAVA
Simple1TokenManager.java
Simple1Constants.java
Simplecharstream.java
Token.java
Tokenmgrerror.java
Where SIMPLE1 is the object of your grammar analyzer, its constructor parameters are the input stream to be analyzed, here is System.in.
Class Simple1 is defined in tag Parser_Begin (Simple1)
PaSER_END (SIMPLE1) between.
However, it must be clear that the names in Parser_Begin and Parser_end must be the name of the lexical analyzer (here is Simple1).
Parser_end The following definition is the definition of a non-end symbol of the text method.
SIMPLE1's textual law is:
INPUT -> Matchedbraces ("/ n" | "/ r") *
Matchedbraces -> "{" Matchbraces "}"
From its definition we can see that each non-end symbol is for a process.
For example, the process of Input
Void Input ():
{}
{
Matchedbraces () ("/ n" | "/ r") *
}
Remember to remember when defined Void Input, you need to add a colon ":", then the definition of two block {}.
The code in the first {} is the code that defines the data, the initial test data. The part in the second {} is to really define the generating type of INPUT.
Each generating type is connected to "|" symbol.
Note: The generating formula here is not required to strictly bnf paradigm. Its grammar can be both BNF, and it can also be a definition method in which a regular expression is mixed. For example
INPUT -> Matchedbraces ("/ n" | "/ r") *
("/ N" | "/ r") * is a regular expression, indicating 0 to infinite repetition marks of / N or / R.
In addition to
Each non-end symbol (INPUT and MATCHEDBRACES) will form a member function of Class Simple1 in Simple1.java generated by JavaCC. When you call Simple1 Input, the syntax analyzer will start grammar analysis.
Example 5.2
In the example provided by JavaCC, there is no example in Examples provided by Javacc, and other examples are too large. Below I construct a javacc's grammatist identifier with our most common mathematical four mixed operations. This example is written by I write it yourself. It also includes a code of grammar identification to build syntax tree-trees in the simultaneous embedded building. However, due to the reason, I didn't give all the code, here I only gave Javacc. Enter the section related code. And Parse-Tree is a normal 4 fork tree, 3 Child, 1 next (parallel node), I believe that everyone should have learned when learning the data structure. So here is omitted Over the past.
Before you look at these input code, I will give it a definition that it uses, so that everyone has a clear framework.
Expression -> Term {addop term} addop -> " " | "-" term -> factor {mulop factor} mulop -> "*" | "/" factor -> id | NUM | " The literacy here may differ from the BNF paradigm. {} Means 0 times to unlimited repetition, it is the same as the "*" symbol when learning the regular expression, so when the grammat in JavaCC, The {...} part is expressed in (...) *.
In order to make the lexical analysis more easily, we usually use "(", ")" and other word symbols to represent the end symbol, but use the integer symbols such as LParen, Rparen. To represent.
Parser_Begin (grammar)
Public class grammar imports nodetype {
Public Parsetreenode GetParsetree (InputStream in) THROWS PARSEEXCEPTION
{
Grammar parser = new grammar (in);
Return Parser.Expression ();
}
}
Parser_END (Grammar)
SKIP:
{
"|" / t "|" / n "|" / r "
}
TOKEN:
{
|
|
|
|
|
|
|
}
ParsetReenode Expression ():
{
Parsetreenode Parsetree = NULL;
Parsetreenode Node;
}
{
(node = simple_expression ()
{
IF (Parsetree == Null)
Parsetree = node;
Else
{
Parsetreenode T;
T = parseetree;
While (T.Next! = NULL)
T = T.NEXT;
T.NEXT = Node;
}
}
) *
{Return Parsetree;
}
Parsetreenode Simple_Expression ():
{
Parsetreenode node;
Parsetreenode T;
INT OP;
}
{
Node = term () {}
(
Op = addop () t = term ()
{
Parsetreenode Newode = new parseetreenode ();
NewNode.NodeType = OP;
Newnode.child [0] = node; newnode.child [1] = t;
Switch (OP)
{
Case plusop:
NewNode.Name = "Operator: ";
Break;
Case minusop:
NewNode.Name = "Operator: -";
Break;
}
Node = newNode;
}
) *
{return node;}
}
Int addop (): {}
{
|
}
Parsetreenode Term ():
{
Parsetreenode node;
Parsetreenode T;
INT OP;
}
{
Node = factor () {}
(
Op = mulop () t = factor ()
{
Parsetreenode Newode = new parseetreenode ();
NewNode.NodeType = OP;
Newnode.child [0] = node;
NewNode.child [1] = T;
Switch (OP)
{
Case TiMERSOP:
NewNode.name = "operator: *";
Break;
Case Overop:
NewNode.Name = "Operator: /";
Break;
}
Node = newNode;
}
) *
{
Return node;
}
}
Int mulop (): {}
{
|
}
Parsetreenode Factor ():
{
Parsetreenode node;
Token T;
}
{
T =
{
Node = new parsetreenode ();
Node.NodeType = IDSTMT;
Node.name = T.Image;
Return node;
}
|
T =
{
Node = new parsetreenode ();
Node.NodeType = Numstmt;
Node.name = T.Image;
Node.Value = Integer.Parseint (T.Image);
Return node;
}
|
{
Return node;
}
}
The definition in SKIP is to log in the symbol. TOKEN, the words that need to be identified when the lexical analysis is performed, and the lexical marker that is identified. Of course, all this is expressed in a regular expression.
This example has multiple non-end symbols, it can be seen that we need to write a process for each non-end symbol. Different non-final symbols can be adjusted to each other during identification.
Taking the Simple_Expression () process as an example, its generating style is expression -> term {addop term}, and in the Javacc's input file format is, its identification is the name = term () {} (OP = AddOp) T = TERM () {...}) * Asaway, "*" symbols and regular expressions are the same, that is, 0 times to unlimited repetition. So Simple_Expression is equal to Teer Method Term Addop Term Add Term Add Term ... And addop is equivalent to Plus and Minus two arithmetic symbols. Here we are writing the expression of Expression, and also use assignment expressions, because this and YACC different, Javacc identifies the textual method to identify the function process. In the middle, then if we want to identify the textual method of simple_expression, it is equivalent to identifying TERM and AddOP, and identifying the grammat, which is equivalent to calling the two non-ending identification functions. It is this, I think I think JavaCC's text-gramile identification processing is close to the process of operation, we do not need to use strictly literate formats, complex system parameters as Yacc. About YACC, it is more complicated than Javacc, but also need to consider and lexic The problem of analyzer interface, I will tell you later.
As for other grammar operations, I will no longer say more. If you want to say, you will write this article to write this article. This article can only give readers a direction, as for in-depth research, please see everyone Official documentation provided by Javacc.
At last
Since the programmer using Java to do projects abroad is more than the domestic, the person who discusses Java technology is more. People who can read my article here are C / C programmers, but pay attention to other fields in the same direction. Let our knowledge are broader. About JavaCC discussions are mainly in International News Group Comp.Compilers.Tools.javacc If you encounter something when you use Javacc to do actual problems, you may wish to find an expert.
2003-11-16
Author: tangl_99
QQ: 8664220
MSN: TANGL_99@hotmail.com
Email: tangl_99@sohu.com
Chengdu, Sichuan University, Computer College