Automatic scanner
3.1
Basic scanner
Let's first talk about the basic scanner. The purpose of the scanner is to group the characters of the source code into words. For example:
#include
void main ()
{
cout << "Hello World!" << endl;
}
This is a well-worn example. Let's look at what this code becomes after it passes through the scanner (from left to right, from top to bottom):
void
main
(
)
{
cout
<<
"Hello World!"
<<
endl
;
}
The preprocessing directive is not counted here, because the compiler does not know that a preprocessor exists: the code the compiler processes is actually the code already processed by the preprocessor, and the code the directive brings in is omitted here.
The code has been cut into pieces. Why have void and main become two words? Because there is a space between them.
- Whitespace is a separator in the scanner
Why are main and ( separated? The answer is obvious: ( is not a character that can form part of a word.
- Words and punctuation are distinguished
What about ( and )? They are separate because there is no "()" symbol. A similar question arises for <<. It appears as a single symbol because it has a meaning as a whole. But why is it not split into "<" and "<"? This is where the maximum matching principle comes in, and it removes the ambiguity.
- Maximum matching principle
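The three principles above can be sketched in a few lines. This is a minimal illustration in Python, not the VB.NET scanner described in this article; the operator list and the function name scan are assumptions made only for the example.

```python
# Hypothetical operator set, chosen to cover the Hello World example above.
OPERATORS = ["<<", "<", "(", ")", "{", "}", ";"]

def scan(code):
    """Split code into words: whitespace separates, words and
    punctuation are distinguished, and the longest operator wins."""
    tokens = []
    i = 0
    while i < len(code):
        ch = code[i]
        if ch.isspace():                      # whitespace is a separator
            i += 1
        elif ch.isalpha() or ch == "_":       # a word: letters, digits, underscore
            j = i
            while j < len(code) and (code[j].isalnum() or code[j] == "_"):
                j += 1
            tokens.append(code[i:j])
            i = j
        elif ch == '"':                       # a string literal, quotes included
            j = code.find('"', i + 1)
            if j < 0:
                raise ValueError("unterminated string literal")
            tokens.append(code[i:j + 1])
            i = j + 1
        else:                                 # punctuation: maximum matching
            candidates = [op for op in OPERATORS if code.startswith(op, i)]
            if not candidates:
                raise ValueError("unexpected character %r" % ch)
            best = max(candidates, key=len)   # "<<" beats "<"
            tokens.append(best)
            i += len(best)
    return tokens
```

Calling scan on the body of the example yields the twelve words listed above, with << kept as one symbol because the longest candidate is preferred.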
A typical scanner has two classes: one reads the character stream from the code file, and the other divides the characters read into words, that is, Tokens. A Token usually records the string, the line number, the column number, and an optional type for the Token. To make debugging output and display easier, our type is a string rather than a number. The disadvantage is that it takes more space and compares more slowly (a numeric type mainly pays off when the type never has to be printed, because it then needs no type-name table and saves space).
The scanner's task is to interpret the code as a Token stream. Words form a type-3 language, which is exactly what regular expressions describe. Given the regular expressions, there is a systematic way to turn the analysis into program code. This can be done by hand, and there are also tools that do it automatically. We will use both methods while designing and implementing this compiler, because our goal is to create an automatic scanner: while implementing the automatic scanner itself we can only work by hand, and once it exists it can be used to design scanners automatically. The handwritten scanner is covered in the chapter on the C3 compiler. Here we only discuss the automatic scanner implemented with the C3 compiler.
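As a rough picture of what a Token carries, here is a small Python sketch. The field names and the printed form are assumptions modelled on the token listing shown in section 3.5; the real class is written in VB.NET and may differ.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str        # the matched string
    line: int        # 1-based line number where the token starts
    col: int         # 1-based column number where the token starts
    type: str = ""   # optional type, kept as a string for readable debug output

    def __str__(self):
        # Assumed display format, imitating the listing in section 3.5.
        return "%s (Line: %d, Col: %d): %s" % (self.type, self.line, self.col, self.text)
```

Keeping type as a string means a Token can be printed directly, at the cost of slower comparisons than an integer code would give.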
3.2
Automatic scanner rule
First, we must define a format, so that C3 code written in this format can be used, once compiled, by the automatic scanner we write now without any modification. Otherwise every new scanner syntax would mean rewriting the whole scanner, and "automatic" would be empty talk. The format is defined as follows:
Operator
Operator section: all operators go here, and each rule is a Label-type rule. A string-type operator (such as the And / Or used in VB) needs the same rule in the Keyword section as well. Because C3 does not allow the same rule name to be used twice, different prefixes are used to tell them apart. The recommended prefix for the Operator section is "OP_".
Keyword
Keyword section: all keywords go here, and each rule is a Label-type rule. The recommended prefix for the Keyword section is "KW_".
Comment
Comment section: it can hold only 3 rules: 1. LineComment, the line comment. 2. BlockCommentStart, the start symbol of a block comment. 3. BlockCommentEnd, the end symbol of a block comment. All 3 rules are optional. If a rule is missing or incomplete (a block comment with only one of its two symbols), the corresponding comment rule is disabled, and if all three rules are disabled only whitespace skipping works.
Other
Other rules section: the rules here are not simple string rules. Identifier recognition, for example, cannot be expressed as a single string. The rules here use the .NET regular expression language, but to stay compatible with EBNF they are written as strings, that is, as Label-type rules.
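To make the four sections and the comment-disabling rule concrete, here is a hypothetical in-memory picture of such a rule set in Python. It is not C3 syntax; the rule names other than LineComment, BlockCommentStart and BlockCommentEnd, and all patterns, are invented for illustration.

```python
# Hypothetical rule set after loading; not the actual C3 file format.
RULES = {
    "Operator": {"OP_Shl": "<<", "OP_Lt": "<", "OP_And": "And"},
    "Keyword": {"KW_Void": "void", "KW_And": "And"},   # string operator repeated here
    "Comment": {"LineComment": "//", "BlockCommentStart": "/*"},  # no BlockCommentEnd
    "Other": {"ID": r"[A-Za-z_][A-Za-z0-9_]*"},
}

def enabled_comment_rules(comment_rules):
    """Decide which kinds of comment are usable.

    A line comment needs LineComment; a block comment needs both its
    start and its end symbol, otherwise it is disabled."""
    line_ok = bool(comment_rules.get("LineComment"))
    block_ok = bool(comment_rules.get("BlockCommentStart")) and \
        bool(comment_rules.get("BlockCommentEnd"))
    return {"line": line_ok, "block": block_ok}
```

With the sample rule set, line comments are enabled and block comments are disabled, because the block comment has only one of its two symbols.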
3.3
Structure
Let's discuss the implementation of our automatic scanner. For ease of extension and maintenance, the design separates functions. Each individual function of the scanner is an independent class behind a common interface, IManager, with a default implementation, Manager. To implement a new function you only need to inherit from Manager and add the new Manager at the appropriate position when the scanner is initialized.
The point to watch is that once an earlier Manager matches, the later ones are no longer tried, so version 1.2 naturally ran into priority problems. If no ordering produces the right priorities, two Managers have to be merged, and the advantage of separating functions is lost. So I envisioned reading characters from the CodeReader and feeding each one to every Manager. The Managers would run in parallel, and the last one still matching would take the single match slot, which gives the longest match. Under this scheme the comment problem is handled well: a Manager checks whether the input starts a comment. If it does, the Manager starts working; if not, it marks itself inactive so that the scanner stops feeding it. Operator and Keyword can be handled with a character tree, which is fast and convenient.
However, one thing was overlooked, and it is why version 1.3 of the scanner failed: it cannot solve the word boundary problem, that is, it cannot guarantee that an operator or keyword ends where the Token ends. Take Not and Nothing. In the previous scanner keywords had higher priority than identifiers, so Nothing was split into two parts, Not and hing, although Not does not end on a boundary there. The character tree guarantees that Nothing is not recognized as Not, but a word such as NothingWrong is identified as Nothing and Wrong, which is clearly wrong. The main problem is that the character tree has no lookahead character and cannot predict what follows.
Scanner rules (the rules of the Other section) can guarantee word boundaries. So the keyword check is done as in the earlier design, after a scanner rule has matched (from a language design point of view, every keyword is a legal ID): check whether the match is a keyword; if it is, modify the token, otherwise pass it through. In fact an operator may also be a string, as in VB, so that case must be considered too, and which priority such a string-type operator should have becomes a problem. The final solution is to check symbol operators first and leave string-type operators to be checked together with the keywords. Thus there are really only 3 Managers: White, which handles whitespace and comments; SymbolOperator, which handles symbol operators; and OtherRule, which handles scanner rules and keywords (including string-type operators).
OtherRule. In the earlier design of version 1.2, scanner rules were run by specially written code. The advantage was that the behavior was controllable and the speed was higher. Now that .NET standard regular expressions are used, and because it is difficult to convert the C3 language into the .NET regular expression language, a non-standard (in its semantics) C3 compiler is implemented here. It still uses standard C3 syntax and only changes the semantics of the scanner rule syntax: scanner rules also become Label-type rules, the string content is a .NET regular expression, and "\G" is automatically added in front of the regular expression so that each match continues exactly where the previous one ended.
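The keyword check that follows a scanner rule match can be sketched as below. This is a Python illustration under stated assumptions, not the article's VB.NET code: the ID pattern, the keyword table and the name match_word are hypothetical, and Python's pattern.match(text, pos) stands in for the anchoring that "\G" provides in .NET (Python's re module has no "\G").

```python
import re

# Hypothetical identifier rule from the Other section.
ID_RULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Hypothetical keyword table; lookup is case-insensitive, as in VB.
KEYWORDS = {"not": "KW_Not", "nothing": "KW_Nothing"}

def match_word(code, pos):
    """Match a whole word at pos, then decide whether it is a keyword.

    Because the regular expression consumes the entire word first,
    a keyword can never be cut out of the middle of an identifier."""
    m = ID_RULE.match(code, pos)      # anchored at pos, never searches ahead
    if m is None:
        return None
    text = m.group()
    token_type = KEYWORDS.get(text.lower(), "ID")
    return token_type, text, m.end()
```

Here NothingWrong comes back as a single ID, while Not and Nothing on their own are retyped as keywords, which is the boundary behavior the character tree could not provide.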
Because .NET regular expressions are used, the role of the CodeReader shrinks: it now only maintains the line number and column number and handles carriage returns and letter case. The Scanner class, built around functional separation, is only responsible for management and for the external interface. The actual functions are implemented by the Manager classes. Although this is slower, our purpose is not to build a commercial compiler, at least not now. It is enough that the correct result is obtained, that it is not too slow (within an order of magnitude), and that the time complexity in the length of the code is still O(n).
The figure on the left shows the basic structure of the scanner. CodeReader is an attribute of the scanner and is responsible for reading characters and counting lines and columns. The Scanner itself is a collection class that manages the Managers. GetToken, the only function the scanner exports, calls each Manager in turn to identify a token. A Manager that identifies a token returns True to indicate that a valid token has been found. When GetToken sees True returned, it returns the Token generated by that Manager. Otherwise it tries the next Manager. If no Manager matches, the code contains a lexical error.
Each Manager can have its own speed optimizations. SymbolOperator, for example, uses a symbol prediction set: it first reads one character and checks whether any operator starts with that character. If one does, it goes on matching; if not, it returns False directly, which avoids the relatively slow regular expression matching. It also keeps a lookup table of operator types so that the type of a matched operator can be found quickly.
One remaining problem is that whitespace does not really need to be skipped before every match. For example, in QID, which matches ID ("." ID)*, there is no whitespace between the dot and the identifier. In the current processing mode the skip is still attempted every time; this is harmless even when whitespace is present, but it makes scanning slower.
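The dispatch described here can be sketched as follows. This is a simplified Python model, not the article's VB.NET classes: the class and method names are assumptions, each Manager returns a token or None instead of True/False, comments and line/column counting are left out, and the rule data is invented for the example.

```python
import re

class CodeReader:
    def __init__(self, code):
        self.code = code
        self.pos = 0                      # line/column tracking omitted

class White:
    def invoke(self, reader):
        # Skip whitespace; this Manager never produces a token itself.
        while reader.pos < len(reader.code) and reader.code[reader.pos].isspace():
            reader.pos += 1
        return None

class SymbolOperator:
    def __init__(self, operators):
        self.types = dict(operators)      # operator text -> type (quick lookup table)
        self.first_chars = {op[0] for op in self.types}   # prediction set
        self.by_length = sorted(self.types, key=len, reverse=True)

    def invoke(self, reader):
        code, pos = reader.code, reader.pos
        if pos >= len(code) or code[pos] not in self.first_chars:
            return None                   # fast rejection, no matching attempted
        for op in self.by_length:         # longest operator first
            if code.startswith(op, pos):
                reader.pos += len(op)
                return (self.types[op], op)
        return None

class OtherRule:
    def __init__(self, rules):
        self.rules = [(name, re.compile(pattern)) for name, pattern in rules]

    def invoke(self, reader):
        for name, pattern in self.rules:
            m = pattern.match(reader.code, reader.pos)    # anchored at pos
            if m and m.end() > reader.pos:
                reader.pos = m.end()
                return (name, m.group())
        return None

class Scanner:
    def __init__(self, reader, managers):
        self.reader = reader
        self.managers = managers          # order is the priority order

    def get_token(self):
        for manager in self.managers:
            token = manager.invoke(self.reader)
            if token is not None:
                return token              # first Manager to match wins
        if self.reader.pos < len(self.reader.code):
            raise SyntaxError("lexical error at offset %d" % self.reader.pos)
        return None                       # end of input
```

Because White runs first on every call, the later Managers always start on a non-blank character, and SymbolOperator rejects most inputs after looking at a single character.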
3.4
Automatic scanner running
Let me explain the running process of the automatic scanner.
Since the scanner constructor requires a CodeReader as a parameter, you should first construct a CodeReader, passing it the code to be compiled. The scanner can then get all the information it needs from the CodeReader and the Grammar. The next step is to load the Managers you need. Since the defaults are White, SymbolOperator and OtherRule, the scanner provides a LoadDefault function to load these default Managers. At this point the scanner has the structure shown in the figure above. Separating functionality from management in this way makes extension easy: to add a function, you add a Manager or inherit from and override an existing one, and then load the required Managers with your own loading function. The loading code is as follows:
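Public Function InitScanner(ByVal code As String, ByVal gr As Grammar) As Scanner
    ' Wrap the source text in a reader, then build the scanner on top of it.
    Dim cr As New CodeReader(code)
    Dim sc As New Scanner(cr, gr)
    ' Load the default Managers: White, SymbolOperator, OtherRule.
    sc.LoadDefault()
    Return sc
End Function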
So how does each Manager know what to do? How does SymbolOperator, for example, know which symbols or keywords it is supposed to handle? Take a closer look at the structure of Manager. Manager has an Initialize function, which is used to initialize each Manager. In it the scanner passes all of the analyzer's grammar rules to each Manager, and each Manager picks out the information that is useful to itself and preprocesses it for later use, so that every Manager obtains the information it wants.
Another function, Invoke, is called by the scanner. The scanner passes the CodeReader to each Manager in priority order, and the Managers handle the characters read in turn, cooperating until a legal Token is obtained. At that point the scanner knows that a legal Token has been returned and no longer passes the CodeReader on, which means that this match is complete. Of course, a high-priority Manager may consume part of the input and so affect the Managers behind it. White, the default Manager with the highest priority, skips all whitespace, including comments, so the Managers behind it receive input from which all whitespace and comments have been removed (logically, these gaps are replaced by empty-string separators). The later Managers therefore do not have to deal with whitespace, which was of course the original intention of this structure.
On entering SymbolOperator, it uses the symbol table created in the initialization phase. If there is a match it returns the Token; if not, the scanner passes on to OtherRule. This implementation uses .NET regular expressions to simplify the work. After OtherRule obtains a result, it also checks whether the result is a keyword or a string-type operator and changes the type of the returned Token accordingly.
3.5
Test
At this point the scanner can produce output. We can easily test it through its single interface to the analyzer.
For the entire compiler, the only interface of the scanner is GetToken. The analyzer reads tokens from the code through the GetToken function. We can also test the correctness of the scanner through this same interface, as follows.
Public Sub TestScanner(ByVal code As String, ByVal gr As Grammar)
    Dim cr As New CodeReader(code)
    Dim sc As New Scanner(cr, gr)
    sc.LoadDefault()
    ' Pull tokens one by one until GetToken returns Nothing.
    Dim t As Token
    t = sc.GetToken()
    While Not IsNothing(t)
        Console.WriteLine(t.ToString)
        t = sc.GetToken()
    End While
End Sub
This test program prints every Token output by the scanner. Actual use is similar: initialize everything, then process the whole Token sequence in a loop; only the specific processing is determined by the analyzer. The test code also shows that this automatic scanner is very simple to use.
Here we use the test code above on a C- Hello World program. The code is as follows:
void main()
{
	printf("Hello World!");
}
The Token stream is as follows:
TN_VOID (Line: 1, Col: 1): void
ID (Line: 1, Col: 6): main
QuoteLeft (Line: 1, Col: 10): (
QuoteRight (Line: 1, Col: 11): )
BigQL (Line: 2, Col: 1): {
ID (Line: 3, Col: 2): printf
QuoteLeft (Line: 3, Col: 8): (
String (Line: 3, Col: 9): "Hello World!"
QuoteRight (Line: 3, Col: 23): )
EndLine (Line: 3, Col: 24): ;
BigQR (Line: 4, Col: 1): }
A total of 11 tokens, 3 lines.
The following picture shows the result of running ParsetreeViewer (see its detailed instructions), in which the individual Tokens and the structure of the parse tree can be seen clearly.