1. Preface

Why "Spirit"? If you are familiar with C++, you may know a parser library called "Spirit". It uses C++'s template programming facilities to provide a framework for recursive descent parsing within the C++ language itself.

The jparsec library I introduce here is a recursive descent parsing framework in Java. It is not, however, a Java version of Spirit. jparsec's blueprint is the Parsec library from the Haskell language. Parsec is a monadic parser combinator library.
The purpose of this library is to provide, in Java, a library similar to Parsec and Spirit. Functional programming is not Haskell's patent; Java and C# can do it too. This library is also being rewritten on Java 5.0, and it need not be inferior to the C++ version. So why "functional"? Isn't Java object-oriented? If you have used languages such as Haskell or Lisp, there is no need to explain "functional"; you already know what it is about. If you are an old C/Java programmer, then some explanation is in order. Of course, if you have no interest in these fancy-sounding terms, you can skip this chapter; not knowing what "functional" means does not affect your understanding of this library.
With the popularity of generic programming in C++, "functional", this old bottle of wine, has gradually been taken out of the corner again. The "functional" style familiar to a C++ programmer is probably the STL's for_each, transform, count_if and friends. Well, just as I cannot deny that str.length() is an OO call, I cannot say for_each and transform are not functional. However, the essence of "functional" does not lie there. Just as we summarize object orientation as polymorphism, encapsulation and inheritance, the characteristics of "functional" are generally summarized as: 1. No side effects. 2. Higher-order functions. 3. Lazy evaluation.
The most significant of these (at least in my view) is the ability to combine functions, built on higher-order functions. Some people call this "glue". In short: what makes functional programming so powerful? The ability to combine complex functions out of simple functions.

I can imagine that, put this way, you are still in a fog. "What is combination? Isn't 1 + 1 combining two 1s into 2? Isn't new A(new B(), new C()) combining B and C into A?"
To be intuitive, let's give an example. Suppose we have an interface inside a package called predicates: interface SPredicate { boolean is(String s); } We have several basic implementations: class IsEmpty implements SPredicate { public boolean is(String s) { return s.length() == 0; } } This implementation judges whether a string is empty.
class IsCapitalized implements SPredicate {...} This implementation determines whether the string starts with a capital letter.

class IsLowercase implements SPredicate {...} This implementation determines whether the string is all lowercase.

class IsEqual implements SPredicate { private final String v; public boolean is(String s) { return s.equals(v); } IsEqual(String v) { this.v = v; } } This implementation determines whether the string is equal to a given string.
There can be many more basic implementations like these. Now, suppose we want an SPredicate that judges "this string is all lowercase, or equal to 'hello'". What should we do? We could of course write: class Predicate1 implements SPredicate { public boolean is(String v) { return v.equals(v.toLowerCase()) || v.equals("hello"); } } Only, this way we did not reuse the code of the IsEqual and IsLowercase classes, even though our logic overlaps with those two classes.
Of course, we could also call into the IsEqual and IsLowercase code, like this: class Predicate1 implements SPredicate { public boolean is(String v) { return new IsEqual("hello").is(v) || new IsLowercase().is(v); } } However, this code is procedural and rigid. What if I need the logic "IsEqual or IsCapitalized"? Write yet another Predicate2 class? If you have some experience, you can see that this code does not follow the IoC (inversion of control) principle: the composition is hard-wired at the place where it is used.
OK, no more suspense. Following the IoC principle, we refactor into the following:
class OrPredicate implements SPredicate { private final SPredicate p1; private final SPredicate p2; public boolean is(String s) { return p1.is(s) || p2.is(s); } } (The constructor is routine, so I won't write it out.) With this, our Predicate1 can be written as new OrPredicate(new IsLowercase(), new IsEqual("hello")). Similarly, we can add AndPredicate, NotPredicate, XorPredicate. This gives us all the boolean operations.
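To make the idea concrete, here is a minimal, self-contained sketch of the combinable predicates described so far. The class names follow the article; the main method and the specific test strings are my own illustration.

```java
// A minimal sketch of the combinable predicates described above.
public class PredicateDemo {
    interface SPredicate { boolean is(String s); }

    static class IsEmpty implements SPredicate {
        public boolean is(String s) { return s.length() == 0; }
    }
    static class IsLowercase implements SPredicate {
        public boolean is(String s) { return s.equals(s.toLowerCase()); }
    }
    static class IsEqual implements SPredicate {
        private final String v;
        IsEqual(String v) { this.v = v; }
        public boolean is(String s) { return s.equals(v); }
    }
    // The combinator: builds a new predicate out of two existing ones.
    static class OrPredicate implements SPredicate {
        private final SPredicate p1, p2;
        OrPredicate(SPredicate p1, SPredicate p2) { this.p1 = p1; this.p2 = p2; }
        public boolean is(String s) { return p1.is(s) || p2.is(s); }
    }

    // "all lowercase, or equal to Hello" -- composed, not hand-coded.
    static final SPredicate P = new OrPredicate(new IsLowercase(), new IsEqual("Hello"));

    public static void main(String[] args) {
        System.out.println(P.is("hello"));  // true: all lowercase
        System.out.println(P.is("Hello"));  // true: equal to "Hello"
        System.out.println(P.is("World"));  // false
    }
}
```

Note that Predicate1 never had to be written as a class at all; the composition expresses it directly.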
When composing our own predicates this way, we do not have to write the is function at all; we can even forget its existence. What we have is no longer an interface with a boolean is(String) signature, but a type that can be combined by various rules.

A predicate can be as simple as new NotPredicate(p), or as complicated as: new AndPredicate(new OrPredicate(a, new XorPredicate(b, c)), d);

Rub your eyes: what we now have is a type that can be combined by certain specific rules, and SPredicate's signature is no longer important. Our client program has turned from operating on strings into operating on predicates, which is already a higher level of abstraction.
To drive this point home, let's change SPredicate into an abstract class and make the is() function package private (so that outside the package we cannot see this function at all).

Wait a moment! You may have noticed that, although we can now freely combine different SPredicate objects, what is the use of the result? With the is function invisible, what do we do with a composed predicate? Right: for a complete combination, we are still one piece short.
Let us add a utility function inside the predicates package: public static boolean runPredicate(SPredicate p, String s) { return p.is(s); } Good, the job is done: we can use this runPredicate function to run a composed SPredicate object without caring about the is function inside it. You may have a little doubt: what is the difference between runPredicate(p, s) and p.is(s)?
Well, right now there is no difference. But let's look at when this encapsulation brings a significant benefit. Suppose is is not as simple as it looks; it might be: boolean is(String s, PredicateContext ctxt); where the PredicateContext object is responsible for storing and passing along some state. Now we may not want to publish this signature at all. Because this signature is very likely to change, it is an implementation detail of the package; PredicateContext may even be a private type.
At this point it becomes necessary to hide the is function. To the outside, we only expose a runPredicate utility function:

public static boolean runPredicate(SPredicate p, String s) { final PredicateContext ctxt = new PredicateContext(); return p.is(s, ctxt); }
OK. Now the client program can combine predicate objects at will, and finally run them with the runPredicate function. As the package evolves, it is fully free to change the signature of the is function at any time as needed, to add new state.
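As a sketch of why the hidden signature helps: below, is() threads a context object through the composition, yet client code only ever touches runPredicate(). The PredicateContext contents here (an evaluation counter) are my own hypothetical example, as is the whole class.

```java
// Sketch: the package-private is() now threads a context, but callers
// only ever see runPredicate(). PredicateContext is a hypothetical
// internal type, as in the article.
public class Predicates {
    static class PredicateContext {
        int evaluations = 0;  // example internal state
    }
    abstract static class SPredicate {
        abstract boolean is(String s, PredicateContext ctxt);  // hidden signature
    }
    static class IsEqual extends SPredicate {
        private final String v;
        IsEqual(String v) { this.v = v; }
        boolean is(String s, PredicateContext ctxt) { ctxt.evaluations++; return s.equals(v); }
    }
    static class OrPredicate extends SPredicate {
        private final SPredicate p1, p2;
        OrPredicate(SPredicate p1, SPredicate p2) { this.p1 = p1; this.p2 = p2; }
        boolean is(String s, PredicateContext ctxt) { return p1.is(s, ctxt) || p2.is(s, ctxt); }
    }
    // The only public entry point: clients never see is() or the context.
    public static boolean runPredicate(SPredicate p, String s) {
        return p.is(s, new PredicateContext());
    }

    public static void main(String[] args) {
        SPredicate p = new OrPredicate(new IsEqual("a"), new IsEqual("b"));
        System.out.println(runPredicate(p, "b"));  // true
    }
}
```

If is() later needs another parameter, only runPredicate's body changes; every composed predicate keeps working.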
This is an example of a complete combination.
There is a metaphor. The procedural programming method (such as in Fortran) is like building a pyramid: you work up from the foundation, step by step, and no step may fail. Once the pyramid is built, it stands there, unmoved by wind and rain. The functional programming method is like an organism: each small function is a small organism, a cell, and different cells combine into a larger organism; this repeats until it evolves into a complex and powerful organism. The evolution of organisms never stops. The process is also comparable to an axiomatic system: Euclidean geometry starts from five simple axioms, then step by step derives lemmas, theorems, and further theorems, finally erecting a grand geometric edifice. Both the procedural and the functional styles solve problems bottom-up, starting from the smallest units. The difference is that the functional style provides a flexible ability to combine the different units.
The object-oriented method, by contrast, is a top-down way of analyzing problems. It cares not about how to solve the problem, but about how the problem decomposes into sub-problems, each part responsible for its own duty; a more "humanities", "political" methodology. The operation of human society, or of a company, may be better described by object-oriented thinking: the president or CEO may not understand programming, will not guard the door or sweep the floor, but this does not prevent them from making the whole company or society run in an orderly way through division of labor and coordination.

OK, having said all this, I am out of breath. You may still be scratching your head, or may already be seething with anger: what is all this mystical nonsense? You look like a charlatan to me!
Fine. If you understood, great; if not, it doesn't matter. Our purpose is not to sell some high-sounding metaphors. As long as the concrete examples below show how to use the parsec library to build a parser, our practical purpose is reached; who cares about the rest?

2. Background knowledge on parsers

First, let's introduce production rules. A grammar can be represented by a series of productions (be it BNF or EBNF). For example, a number can be represented as: number ::= ([0-9])+ (that is, one or more characters between 0 and 9).
A variable name in C/Java can be represented as: alphanum ::= [_a-zA-Z]([0-9_a-zA-Z])* (that is, the first character is an underscore or a letter, followed by zero or more underscores, letters or digits).

An operator of the four arithmetic operations can be represented as: op ::= '+' | '-' | '*' | '/'
And four-arithmetic expressions can be represented by a series of productions:
term ::= number | '(' expr ')'
signed ::= ('+' | '-') signed | term
muldiv ::= muldiv ('*' | '/') signed | signed
expr ::= expr ('+' | '-') muldiv | muldiv
If you have studied compiler principles, you know that parsers divide into top-down and bottom-up. Bottom-up parsers include LR, LALR, etc. They mainly analyze the productions and then, based on that analysis, build a table, a state machine. At run time the parser reads the input and looks up the next action in the table. The bottom-up method is efficient and has no problem with left recursion. However, such parsers are cumbersome to write by hand; the code's readability and maintainability are poor, and debugging is very difficult.

Top-down is the reverse: starting from each production, it recursively determines which production the currently read characters satisfy. Such a parser is called recursive descent.

One problem with recursive descent is that the characters read may match more than one production; this requires disambiguation. A more deadly problem is left recursion. For example, in the four-arithmetic grammar above, expr has a left-recursive production. If the parser is currently trying to parse expr, then in order to parse expr it must first parse the first node on the right-hand side of its production; and this node is again expr, so the parser falls into endless recursion.
The benefits of recursive descent are: the code is easy to write, easy to understand, and easy to maintain.
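As an illustration, here is a small hand-written recursive descent evaluator for the arithmetic grammar above, with the left recursion in expr and muldiv rewritten as loops (the usual manual trick). The code is my own sketch, not part of jparsec.

```java
// A hand-written recursive descent evaluator for:
//   expr   ::= muldiv (('+'|'-') muldiv)*   (left recursion turned into a loop)
//   muldiv ::= signed (('*'|'/') signed)*
//   signed ::= ('+'|'-') signed | term
//   term   ::= number | '(' expr ')'
public class RecursiveDescent {
    private final String src;
    private int pos;

    RecursiveDescent(String src) { this.src = src; this.pos = 0; }

    public static long eval(String s) { return new RecursiveDescent(s).expr(); }

    private long expr() {
        long v = muldiv();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            long r = muldiv();
            v = (op == '+') ? v + r : v - r;   // loop keeps left associativity
        }
        return v;
    }
    private long muldiv() {
        long v = signed();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            long r = signed();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }
    private long signed() {
        if (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            long v = signed();                 // right recursion is harmless
            return (op == '-') ? -v : v;
        }
        return term();
    }
    private long term() {
        if (src.charAt(pos) == '(') {
            pos++;                             // consume '('
            long v = expr();
            pos++;                             // consume ')'
            return v;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Long.parseLong(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("1-1+2"));   // 2, i.e. (1-1)+2
        System.out.println(eval("2+3*4"));   // 14
    }
}
```

Note how each production became one method; that transparency is exactly the appeal of recursive descent.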
However, in real work, people rarely write a bottom-up parser by hand (unless it is an industrial-strength parser product with very high demands). More often, people choose to use a parser generator, such as yacc, JavaCC, Antlr, etc.

Using these parser generators, you only need to write your EBNF directly following the syntax the tool provides, and the tool then generates the parser code from the grammar you supply. (This code may be bottom-up, or may be top-down.)

The benefits of using these parser generators are: they save the time of hand-writing a parser; you only need to write the EBNF declaratively, and they can generate optimized code for you. The generated code is generally only slightly slower than a carefully hand-written parser. (Note the "carefully": my own casually hand-written parser is certainly not the faster one.) And maintenance is easier than without a generator: if something changes, you only need to modify the grammar file and regenerate the target code with the generator.
However, parser generators also have shortcomings:
1. The generated code is often hard to understand. If your grammar has a bug, the grammar file itself cannot be debugged, and tracing through the generated code is about as pleasant as reading disassembly.
2. Besides providing a grammar file, you also need to learn the various rules of the particular parser generator, because how the grammar file is written is entirely up to that generator. This learning curve cannot be ignored.
3. There is an extra code-generation step, which makes building, releasing and maintaining slightly more cumbersome.
4. Sometimes you only need to handle some simple task such as evaluating expressions. java.util.StringTokenizer is not quite enough, but using a parser generator feels a bit like killing a chicken with an ox cleaver.
5. Parser generators usually handle only static grammars. All productions and precedences must be given statically in the grammar file. Tasks that modify the grammar at run time are hard to handle.
6. The EBNF and the target language (such as Java/C++) cannot be combined well. It is difficult to apply code reuse to grammars and semantic actions.
Because of these problems, a technique called "parser combinators" was proposed. A combinator differs from a generator: it is based entirely on the target language and generates no code, so there is no extra code-generation step. Moreover, because the parser is written in the target language, you can use all the features of the target language, without being limited to the restricted expressiveness a parser generator provides. The combinator is a parser technique based on recursive descent. Unlike the procedural style of traditionally hand-written recursive descent, combinators are functional. A combinator library typically provides many combinators that construct new parsers from existing parsers. For example, EBNF's *, +, ? and other symbols can each be implemented as a combinator. The combinators such a library provides can be much more powerful than the notations of EBNF.

The user program only cares about composing the various parser objects according to the logic, rather than recursively calling a parse function; that part is done by the combinator framework.
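To show the shape of the technique independently of any particular library, here is a toy character-level combinator kernel. The combinator names (isChar, charRange, plus, seq, many) are modeled on the jparsec ones discussed later, but the code itself is my own sketch, with a deliberately simplified parser representation.

```java
// A toy parser-combinator kernel. A parser is a function from
// (input, position) to the new position, or -1 on failure. Combinators
// build new parsers from existing ones, mirroring EBNF's |, sequence, *.
public class ToyCombinators {
    interface P { int parse(String s, int from); }  // next pos, or -1

    static P isChar(final char c) {
        return (s, from) -> (from < s.length() && s.charAt(from) == c) ? from + 1 : -1;
    }
    static P charRange(final char lo, final char hi) {
        return (s, from) ->
            (from < s.length() && s.charAt(from) >= lo && s.charAt(from) <= hi) ? from + 1 : -1;
    }
    // "or": try each alternative in turn.
    static P plus(final P... alts) {
        return (s, from) -> {
            for (P p : alts) {
                int to = p.parse(s, from);
                if (to >= 0) return to;
            }
            return -1;
        };
    }
    // "sequence": a then b.
    static P seq(final P a, final P b) {
        return (s, from) -> {
            int mid = a.parse(s, from);
            return mid < 0 ? -1 : b.parse(s, mid);
        };
    }
    // "zero or more": EBNF's *.
    static P many(final P p) {
        return (s, from) -> {
            int to;
            while ((to = p.parse(s, from)) >= 0) from = to;
            return from;
        };
    }

    // varname ::= [_a-zA-Z]([0-9_a-zA-Z])*  -- composed, not hand-coded.
    static final P ALPHA = plus(isChar('_'), charRange('a', 'z'), charRange('A', 'Z'));
    static final P VARNAME = seq(ALPHA, many(plus(charRange('0', '9'), ALPHA)));

    static boolean matches(P p, String s) { return p.parse(s, 0) == s.length(); }

    public static void main(String[] args) {
        System.out.println(matches(VARNAME, "_foo42"));  // true
        System.out.println(matches(VARNAME, "9lives"));  // false
    }
}
```

Real combinator libraries return results and error diagnostics as well as positions, but the composition style is the same.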
Representative parser combinator libraries are C++'s Spirit library and Haskell's Parsec library. Our library models itself on the Parsec library.

These combinator libraries provide the ability to build dynamic parsers, breaking through the restrictions of static grammars and of hand-written recursive descent.

Later we will see a very useful application of such a dynamic grammar (operator precedence).
The shortcomings of the combinator method are:
1. Low efficiency. In fact, this is an intrinsic problem of recursive descent.
2. Left recursion. This too is a problem of recursive descent.
3. Because the target language is usually not designed specifically for expressing grammars (it is a general-purpose language), some EBNF cannot be expressed very directly; the result often does not read as nicely as a grammar file.
4. Portability between languages. With a parser generator, one grammar file can often be translated into several different target languages. A parser combinator depends entirely on the target language: if you suddenly decide to switch languages halfway, your parser combinator code is wasted.
5. The combinator method is based on higher-order functions. As a higher abstraction, this functional idea is often not as intuitive and easy to understand as a parser generator.
OK, with the background knowledge on parsers introduced, let us come down from the clouds and look at this jparsec library.
3. Using jparsec

First, let's look at Scanner. jparsec's Scanner is responsible for reading the character input and recognizing whether the input conforms to the syntax.

For example, for the number we wrote earlier, number ::= ([0-9])+, we can write: Scanner number = Scanners.charRange('0', '9').many1(); Here, Scanners.charRange('0', '9') is the [0-9]: charRange() creates a Scanner object that recognizes one character, which must be between '0' and '9'. many1() is the combinator corresponding to the "+" we wrote earlier. So Scanners.charRange('0', '9').many1() represents one or more digits.
In fact, the Scanners class has a corresponding integer scanner predefined: isInteger().
Looking at Java variable names, alphanum ::= [_a-zA-Z]([0-9_a-zA-Z])*, the [_a-zA-Z] part can be written: Scanners.plus(Scanners.isChar('_'), Scanners.charRange('a', 'z'), Scanners.charRange('A', 'Z')) Scanners.isChar() recognizes a single given character; Scanners.isChar('_') recognizes the underscore. Scanners.charRange('a', 'z') and Scanners.charRange('A', 'Z') recognize lowercase and uppercase characters respectively. Scanners.plus() expresses the "or" concept: Scanners.plus(a, b, c) means either a, or b, or c. So Scanner alpha = Scanners.plus(Scanners.isChar('_'), Scanners.charRange('a', 'z'), Scanners.charRange('A', 'Z')); represents [_a-zA-Z].

And [0-9_a-zA-Z] is simply alpha with the digits added: Scanner alphanum = Scanners.plus(Scanners.charRange('0', '9'), alpha); So the variable name can be represented as: Scanner varname = alpha.seq(alphanum.many()); The seq() function is the "sequence" combinator: Scanner s = a.seq(b) is equivalent to s ::= a b, so alpha.seq(alphanum.many()) is equivalent to alpha alphanum*. Again, the isAlphaNumeric() function is predefined in the Scanners class, and for varname the isWord() function is also predefined.
So, a scanner for the four arithmetic operators is simple, right? Scanner op = Scanners.plus(Scanners.isChar('+'), Scanners.isChar('-'), Scanners.isChar('*'), Scanners.isChar('/'));
Now let's look at Parser. A Parser differs from a Scanner: a Scanner is only responsible for recognizing whether the input string matches, without returning any data. A Parser, besides recognizing the input token stream and reporting match success or failure, also returns an object computed from the input. All parser-related combinators and basic parsers are defined in the Parsers class.
Let's briefly introduce the basic parser combinators:
1. retn(Object v). retn does not recognize any input; it only returns the object v.
2. one(). one does not recognize any input, but always reports match success.
3. zero(). zero does not read any input, but always reports failure.
4. token(). token() reads one token and determines whether it meets a requirement; if it does, success is reported.
5. plus(). There are several overloads for different numbers of parameters. Like Scanners.plus(), it expresses the "or" concept; it is equivalent to the "|" symbol in EBNF.
6. seq(). Expresses sequence: Parser p = p1.seq(p2) is equivalent to p ::= p1 p2. The object returned by p2 is the return object of p.
7. map2, map3, ..., map5. These run several parser objects, pass the results those parsers return to a Map2, Map3, ..., Map5 object, and convert them into a new object.
8. optional(). optional() represents the "?" in EBNF: Parser p1 = p.optional() amounts to p1 ::= p?
9. many(). p.many() is equivalent to p*.
10. many1(). p.many1() is equivalent to p+.
Now, let's look at how to handle the four-arithmetic expressions:
term ::= number | '(' expr ')'
signed ::= ('+' | '-') signed | term
muldiv ::= muldiv ('*' | '/') signed | signed
expr ::= expr ('+' | '-') muldiv | muldiv
When writing this expression parser, there are several difficulties to handle:
1. Left recursion. The productions of muldiv and expr are left recursive. This is because addition/subtraction and multiplication/division are left associative: 1-1+2 should be (1-1)+2, not 1-(1+2). For such left recursion, the jparsec framework predefines the combinator infixl. infixl expresses an infix, left-associative operator. muldiv can be represented as: Parser muldiv = Parsers.infixl(op_muldiv, signed); and expr can be represented as: Parser expr = Parsers.infixl(op_addsub, muldiv); We will come to op_muldiv and op_addsub later. For now, just know that op_muldiv corresponds to the multiplication/division operators and op_addsub to the addition/subtraction operators.
2. Circular dependency. The production of term depends on expr, while the production of expr in turn depends on term. For this circular dependency, we need the lazy combinator to break the cycle.
3. Right recursion. The production of signed is not left recursive, but it is still recursive. We could handle this production with plain recursion, but the jparsec framework provides the predefined prefix combinator. signed can be represented as: Parser signed = Parsers.prefix(op_positive_negative, term); In other words, signed is a term with some number of positive/negative signs in front of it.
Now let's look at the first version of our four-arithmetic parser:
final Parser lazy_expr = Parsers.lazy(new ParserEval() { public Parser eval() { return expr(); } });
final Parser term = Parsers.plus(number, Parsers.between(lparen, rparen, lazy_expr));
final Parser signed = Parsers.prefix(op_positive_negative, term);
final Parser muldiv = Parsers.infixl(op_muldiv, signed);
final Parser expr = Parsers.infixl(op_addsub, muldiv);
Parser expr() { return expr; }

The lazy_expr here is introduced to break the circular dependency. The Parsers.lazy() combinator accepts a ParserEval object; the ParserEval object delays the evaluation of a Parser object until the parser actually runs.
And with that, a four-arithmetic parser is done.

However, this parser requires us to hard-code the precedence and associativity of the operators ourselves. Although the framework's infixl helps eliminate the left recursion, the programmer still needs to analyze the problem and call infixl correctly.

In fact, generating and calling infixl, infixr, infixn, prefix and postfix can be automated based on associativity and precedence.
The framework predefines an Expressions class and an OperatorTable class, which implement exactly this operator-precedence method. The Expressions class is itself user code of jparsec's core Parsers class, implemented with standard jparsec combinators. This also shows that, applying the parsec combinators, customers can build their own, more convenient value-added libraries.

Let's see how to implement the four arithmetic operations using the Expressions class.

First, define the precedence of each operator: positive/negative signs have the highest precedence, say 100; multiplication and division get 80; addition and subtraction get 50. Meanwhile, positive/negative are prefix operators, while addition, subtraction, multiplication and division are infix left-associative operators. Then create an OperatorTable object from this information:
OperatorTable ops = new OperatorTable().prefix(op_positive, 100).prefix(op_negative, 100).infixl(op_add, 50).infixl(op_sub, 50).infixl(op_mul, 80).infixl(op_div, 80);
Completely declarative programming: here we no longer worry about left recursion or circular productions; we single-mindedly declare precedence and associativity.
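For the curious, what such a precedence table buys can be sketched with the classic precedence-climbing algorithm: one generic loop driven by a declarative table replaces the hand-layered expr/muldiv productions. This is my own illustration of the idea, not how the Expressions class is actually implemented.

```java
// Precedence climbing: a generic loop driven by a declarative precedence
// table. (An illustration of the idea; not jparsec's implementation.)
import java.util.HashMap;
import java.util.Map;

public class PrecedenceClimbing {
    // The declarative part: precedence per infix operator
    // (all left-associative here).
    static final Map<Character, Integer> PREC = new HashMap<>();
    static {
        PREC.put('+', 50); PREC.put('-', 50);
        PREC.put('*', 80); PREC.put('/', 80);
    }

    private final String src;
    private int pos;
    PrecedenceClimbing(String src) { this.src = src; }

    public static long eval(String s) { return new PrecedenceClimbing(s).parse(0); }

    // Parse an expression whose operators all have precedence >= minPrec.
    private long parse(int minPrec) {
        long lhs = atom();
        while (pos < src.length() && PREC.containsKey(src.charAt(pos))
                && PREC.get(src.charAt(pos)) >= minPrec) {
            char op = src.charAt(pos++);
            long rhs = parse(PREC.get(op) + 1);  // +1 gives left associativity
            switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                default:  lhs /= rhs; break;
            }
        }
        return lhs;
    }

    private long atom() {
        if (src.charAt(pos) == '(') {
            pos++; long v = parse(0); pos++; return v;    // '(' expr ')'
        }
        if (src.charAt(pos) == '-') { pos++; return -atom(); }  // prefix minus
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Long.parseLong(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("1-1+2"));   // 2
        System.out.println(eval("2+3*4"));   // 14
    }
}
```

Adding an operator here means adding one table entry, just as with the OperatorTable above; no new production, no new method.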
Then, use the Expressions class to build the expression parser: Parser expr = Expressions.buildExpressionParser(term, ops);
The complete code is as follows:
final Parser lazy_expr = Parsers.lazy(new ParserEval() { public Parser eval() { return expr(); } });
final Parser term = Parsers.plus(number, Parsers.between(lparen, rparen, lazy_expr));
final OperatorTable ops = new OperatorTable().prefix(op_positive, 100).prefix(op_negative, 100).infixl(op_add, 50).infixl(op_sub, 50).infixl(op_mul, 80).infixl(op_div, 80);
final Parser expr = Expressions.buildExpressionParser(term, ops);
Parser expr() { return expr; }
As for number, lparen, rparen, op_add, op_div and so on, another helper class, Terms, can help generate the parsers for these terminals:
private final Terms words = Terms.getOperators(new String[] {"+", "-", "*", "/", "(", ")"});
private final Map2 add = new Map2() { public Object map(final Object o1, final Object o2) { return new Long(((Number) o1).longValue() + ((Number) o2).longValue()); } };
private final Map2 sub = new Map2() { public Object map(final Object o1, final Object o2) { return new Long(((Number) o1).longValue() - ((Number) o2).longValue()); } };
private final Map2 mul = new Map2() { public Object map(final Object o1, final Object o2) { return new Long(((Number) o1).longValue() * ((Number) o2).longValue()); } };
private final Map2 div = new Map2() { public Object map(final Object o1, final Object o2) { return new Long(((Number) o1).longValue() / ((Number) o2).longValue()); } };
private final Map positive = Maps.id();
private final Map negate = new Map() { public Object map(final Object o) { return new Long(-((Number) o).longValue()); } };
private Parser make_op(final String op, final Map2 m2) { return words.getParser(op).seq(retn(m2)); }
private Parser make_op(final String op, final Map m) { return words.getParser(op).seq(retn(m)); }
private final Parser op_add = make_op("+", add);
private final Parser op_sub = make_op("-", sub);
private final Parser op_mul = make_op("*", mul);
private final Parser op_div = make_op("/", div);
private final Parser op_positive = make_op("+", positive);
private final Parser op_negative = make_op("-", negate);
All of these Map2 and Map objects — add, sub, mul, div, positive, negate — are the semantic actions attached to the different operators.
And with that, a complete four-arithmetic calculator is born.

For the concrete code, see the TestParser class in the test code. In addition, TestSqlParser is a parser for a non-trivial SQL select statement, where you can see a more complete use of parsec. The download link is: http://sourceforge.net/project/showfiles.php?group_id=122347