My compiler analyzer

xiaoxiao2021-03-05 145

Automatic analyzer

4.1 Analysis

The analyzer's task is to take the token stream produced by the scanner, build a parse tree from it, and hand the tree on to the later stages. For the tree structure, .NET offers CodeDom, but no semantic information is available during parsing, so nodes cannot be mapped onto CodeDom classes. A custom class is the only option.

The most important question after that is which parsing method to use. The common choices are LL and LR. LL parsing works top-down: each rule becomes a function, and a reference to a rule becomes a nested function call. This is why it cannot handle left recursion: the function for a rule returns only once its match is complete, so a left-recursive rule keeps calling itself without ever matching anything, and the call stack overflows. On the other hand, LL is simpler to understand and easier to implement, and its grammar rules read closer to natural language. LR parsing removes the left-recursion restriction, but complex languages produce a large number of states, which costs a lot of space. The LALR(1) parser generator YACC makes LALR(1) parsers easier to build, but it asks much more of the grammar author than EBNF does. That runs against our original intention: we want to lower the threshold and require as little as possible of the user.
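The left-recursion failure described above can be reproduced in a few lines. This is a minimal sketch in Python rather than the project's VB.NET; the grammar `Expr -> Expr '+' NUM | NUM` and the function names are hypothetical and serve only to illustrate the point.

```python
def expr_left_recursive(tokens, pos):
    """Expr -> Expr '+' NUM | NUM, translated naively into a function."""
    # The first alternative starts with Expr itself, so the function calls
    # itself before consuming any token and never reaches the code below.
    pos = expr_left_recursive(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        return pos + 2
    return pos


def expr_iterative(tokens, pos):
    """The same language rewritten without left recursion: NUM { '+' NUM }."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError("NUM expected")
    pos += 1
    while pos + 1 < len(tokens) and tokens[pos] == "+" and tokens[pos + 1].isdigit():
        pos += 2
    return pos  # index of the first token that was not consumed


def overflows(tokens):
    """Return True if the naive translation exhausts the call stack."""
    try:
        expr_left_recursive(tokens, 0)
    except RecursionError:
        return True
    return False
```

Calling `overflows(["1", "+", "2"])` shows the stack overflow, while `expr_iterative` consumes the same input normally. The rewrite is the kind of manual grammar change that an LL tool forces on its users.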

What we will build is an LL analyzer. It still has no solution for left recursion, but its simplicity makes it worth using. Let us first look at the problems our LL analyzer will run into, starting with the simple ones.

The first problem is empty productions. When a rule is called and one of its elements is optional or ends up not matching, the token at hand is left unconsumed and parsing cannot continue normally. Should empty productions simply be forbidden? Our goal is to solve such problems in a way that is transparent to the user. Should they be expanded away? That could disrupt the whole route map, and the user may well have written the empty production on purpose (there are many possible reasons, such as convenience or clearer semantics). So we have to solve it, and that requires a way to hand the unconsumed token back to the parent rule.

The second problem is useless productions. A useless production is a rule that no other rule ever calls. Its only side effect is the space it occupies. Whether it is there or not makes no difference to users who write compilers with this tool, so we can set it aside for now and eliminate it later without the user noticing at all.

The third problem is common prefixes. Other parser generators, for the sake of their parsing method or for speed, do not allow rules with a common prefix in the EBNF source, because a common prefix makes branch prediction useless or makes it produce wrong results. We could handle this transparently for the user, but eliminating common prefixes changes the original grammar structure, and the user needs the parse tree to follow the grammar exactly as written. Users with some understanding of the issue can avoid it by rewriting the grammar by hand. Our approach therefore allows common prefixes, which means several branches may be active at the same time.

Programmers will readily think of multithreading, and multithreading does solve this easily. However, because branch prediction can yield several candidate branches, a very common rule such as an expression would spawn a large number of useless threads. No matter how many branches are created, at most one can match to the end; otherwise the grammar is ambiguous. So we simulate the threads by hand instead. The drawback is complexity: we have to handle not only simulated parallel execution but also synchronization, call, return and so on. For lack of time, a version using real threads provided by the OS was not tested.
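The idea of running several branches in lockstep instead of predicting one can be sketched as follows. This is an illustration in Python, not the project's VB.NET code; `run_branches` and the representation of alternatives as plain token lists are assumptions made for the example, and rule calls are left out.

```python
def run_branches(alternatives, tokens):
    """Feed tokens to all alternatives in lockstep.

    Each simulated thread is a pair (alternative index, position).
    Returns the indexes of the alternatives that matched the whole input.
    """
    threads = [(i, 0) for i in range(len(alternatives))]
    for tok in tokens:
        survivors = []
        for alt, pos in threads:
            seq = alternatives[alt]
            if pos < len(seq) and seq[pos] == tok:
                survivors.append((alt, pos + 1))  # this branch stays alive
            # otherwise the branch simply dies: a "useless thread"
        threads = survivors
    return [alt for alt, pos in threads if pos == len(alternatives[alt])]


# Two alternatives sharing the common prefix "a b"
ALTERNATIVES = [["a", "b"], ["a", "b", "c"]]
```

With `ALTERNATIVES`, the input `a b c` leaves only the second branch alive and `a b` only the first. If two indexes ever came back for the same input, the grammar would be ambiguous.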

4.2 VMT

The figure on the left is a schematic of VMT (Virtual Multi-Thread). At the top is GrammarServices (GS), which loads the grammar rule file generated by the EBNF compiler and controls the life cycle of processes. GS plays the role of the OS in VMT. Because parsing produces a large number of invalid branches, the memory management mechanism sits behind a replaceable interface: either the default memory allocation or a manual memory pool can be selected, and their actual effect is measured in the tests later on. Below GS is Process, which represents one rule call (it is initialized with a grammar rule as its parameter, and threads are created from that rule). A process can consist of multiple threads, each of which may be executing at a different position. GS and Process are fairly simple management and collection classes. Thread is more complicated, because it is the core of VMT and is responsible for all parsing behavior. Its core consists of the function that processes tokens and the function that handles returns from rule calls. Let us look at some issues in the concrete implementation.

First, empty productions. An empty production can arise when all elements of a rule are optional. When the last element of a rule is optional and the current token does not match it, the token cannot simply be discarded, because the parent may have further elements after this rule that do match it. Since the token does not match anything at this level, it should be handed to the parent; if the parent does not match it either, the parent passes it on to its own parent. If a rule contains no optional elements, the empty production problem does not arise.
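Handing the unmatched token back to the parent can be sketched like this. It is a simplified pull-style illustration in Python, not VMT's message-passing implementation; the rule `Modifiers` and both function names are hypothetical.

```python
def match_optionals(optionals, tokens, pos):
    """Match a rule whose elements are all optional, in order.

    Returns (matched, pos). The token at `pos` was NOT consumed: it is
    handed back to the caller instead of being discarded.
    """
    matched = []
    for word in optionals:
        if pos < len(tokens) and tokens[pos] == word:
            matched.append(word)
            pos += 1
    return matched, pos


def match_class_header(tokens):
    """Parent rule: Modifiers 'class' NAME, where Modifiers may be empty."""
    mods, pos = match_optionals(["public", "static"], tokens, 0)
    # The parent now examines the token the child did not consume.
    if tokens[pos:pos + 1] != ["class"] or pos + 1 >= len(tokens):
        return None
    return mods, tokens[pos + 1]
```

For the input `class Foo` the child rule matches nothing at all, yet the parent still sees `class` and succeeds, which is the behavior the paragraph above asks for.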

As for useless productions, since they are never called they have no side effect other than the space they take. In a later implementation, useless productions will be detected and removed when C3 compiles the grammar.

Common prefixes are the main problem to solve. A common prefix means that several branches may be possible at some point, because they all start the same way, and this is why we go to the expense of VMT. We let the branches run in parallel. Sooner or later N-1 of them fail to match (if two branches both return, the grammar is ambiguous); then either the last matching branch returns, or none matches and an error is reported to the parent.

Each token in the token stream is treated much like a Windows message: it is passed down from the top (through the Dispatch function) until it reaches a process that can match it. Results produced by Fix travel back up to the Process, or even to GS, in a way similar to Windows API calls. On a rule call, the Fix function makes a copy of the current thread and lets the copy wait for the called rule, while the original continues with the matches that follow. On a return, a copy is made again: the copy performs the return while the waiting thread keeps waiting (although, strictly speaking, a second return means the grammar is ambiguous).

4.3 Testing

For simplicity, I did not use branch prediction before version 1.1. Once the first version passed debugging, I measured its speed. The detailed results are in Program Test Report 1; a brief summary follows (10 lines of code):

Parser Run Time: 0.383102451451123

Scanner Run Time: 0.0231484591495246

The scanner accounts for 0.023 seconds of the whole run and the analyzer for 0.38 seconds. That is very slow for a program of only 10 lines, so branch prediction is needed.

Branch prediction requires computing the FIRST and FOLLOW sets. After a period of development I built a namespace and the corresponding classes to handle branch prediction information. Compiler textbooks describe the algorithms for FIRST and FOLLOW sets in detail, so I will not repeat them here. The other change from the previous version was a switch from parallel synchronous execution to asynchronous execution. It looked very good at first, but I quickly ran into a problem: once I tried to add error handling, it became hard to determine when an error had really occurred and when it should be reported (because the program explores multiple paths, almost every path fails somewhere, and a failure on one path is not necessarily an error for the final report). So in the end I gave up on asynchronous execution.
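For readers without a compiler textbook at hand, the FIRST-set computation mentioned above can be sketched with the usual fixed-point iteration. This is a generic textbook-style illustration in Python, not the project's prediction namespace; the grammar encoding (a dict from nonterminal to lists of symbols) and the sample grammar are assumptions, and FOLLOW sets are not covered.

```python
EPS = ""  # marker for the empty string


def first_sets(grammar):
    """grammar: dict nonterminal -> list of alternatives (lists of symbols).

    Any symbol that is not a key of `grammar` is treated as a terminal.
    """
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                      # repeat until nothing grows any more
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                before = len(first[nt])
                nullable = True
                for sym in alt:
                    if sym in grammar:              # nonterminal
                        first[nt] |= first[sym] - {EPS}
                        if EPS not in first[sym]:
                            nullable = False
                            break
                    else:                           # terminal
                        first[nt].add(sym)
                        nullable = False
                        break
                if nullable:                        # every symbol can be empty
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first


SAMPLE = {
    "Stmt": [["Mods", "class"], ["return"]],
    "Mods": [["public"], []],           # Mods may be empty
}
```

A predictor can then skip a rule call when the current token is not in the rule's FIRST set and the rule cannot be empty, which is the same condition the Fix listing in section 4.7 tests with `Predict(T) Or EnableEmpty`.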

After switching back to synchronous mode and another rewrite, I ran the same code again. See Analyzer Test Report 2; a summary follows:

Parser Run Time: 0.0462969182990491

Scanner Run Time: 0.0115738657768816

Result analysis: the analyzer now runs in only 0.0463 seconds, roughly eight times faster than in the previous test and close to an order of magnitude. If the running time grows linearly with the input, this speed is tolerable. Branch prediction is clearly necessary.

Because the scanner exposes only one function, GetToken, the basic architecture has hardly changed since 1.0; only the namespaces in the code changed. This shows how much a good design matters for later development: the VMT framework in use since version 1.0 works with a new scanner with essentially no modification. The default build does not use the manual memory pool, because .NET memory management (which has its own internal memory pool) is much faster. See "Speed and optimization test" for details.

After the initial debugging succeeded, the analyzer turned out to be very slow: with respect to the length of the code, the time complexity was close to O(2^n). Since manual memory pool management was even slower, the problem was clearly not caused by frequent construction and destruction of Process and Thread objects. I then considered another candidate: because VMT executes in parallel, a thread must copy its own partial parse tree when it forks. On inspection, the partial parse tree buffer used the Collection class, which does a lot of internal bookkeeping and is therefore slow, and it was used heavily in the parsing code. I changed it to ArrayList and tested again. The result was unexpected: the time complexity dropped to O(n), so the earlier result was entirely due to the Collection class. I then replaced Collection with other collection classes everywhere it was not strictly needed, and the speed improved slightly further.

Speed and optimization test

Default memory management and manual memory pool

First, set aside the efficiency of the manually controlled memory pool I designed: it is written in IL code and cannot match the native code of the .NET runtime. Even so, the memory pool mechanism does work, and it can improve the running efficiency of the code. A conditional compilation option selects either the manual mechanism or the default one.

Rules in the Grammar class

The Grammar class stores all of the analyzer's rules in a Hashtable. This is much faster, because the analyzer looks rules up quite frequently. In testing it gave a 7% improvement.

Match prediction in Symboloperator

For symbol-type operators, by the odds about half of the attempts end in a mismatch. If a certain mismatch can be predicted and skipped directly, time is saved. In testing this improved the scanner by 10%.

Invalid branches in Thread

Because VMT runs multiple branches in parallel, it does not require branch prediction to yield a single result, and this is what produces invalid Thread branches. Every invalid branch makes a Thread copy itself, which takes a lot of time. By count, 90% of Threads were invalid branches, so this is a place worth optimizing. After optimizing how threads handle call and return, fewer invalid threads are created, and the share of invalid branches under test conditions dropped to 60%, which amounts to about two branches per rule (a run-time statistic).

The GetChar(Len As Integer) As String function

The implementation was originally built on GetChar() As Char, mainly because that made line and column counting easy. It was redesigned with a dedicated algorithm that counts newlines. The speed improved slightly.

For the results after optimization, see Analyzer Test Report 3.

I found a few interesting issues during debugging (the grammar file used is B-.txt):

1. MethodCallStmt is contained in SimpleStmt, which produces an ambiguous match. As a result, I commented MethodCallStmt out.

2. When handling a common prefix, elements of Terminal/Token type must be placed after elements of Reference type. Otherwise the Terminal/Token element has already added the current token to the buffer by the time the branch call is made, so the branch thread starts with that token already in its buffer. This should be noted in the C3 grammar documentation.

The one remaining problem, and an urgent one, is useless threads: there are more of them than expected. At the outset I predicted a ratio of useful to useless threads of about 1:3 to 1:4, but the actual ratio is 1:9. How to improve this has become a new problem; otherwise the useless threads not only occupy memory but also cost CPU time to create, run and destroy.

After completing the basic framework, we need to add error handling. Obviously it is not enough for the analyzer to report just "error" whatever goes wrong during compilation; we need at least a reasonably specific message so that the programmer knows where the problem is. As mentioned earlier, an error message is already generated from the grammar file, though in practice it is very simple, such as "mismatch". In many cases that is already useful, and it is better than nothing. Error recovery can be set aside for now; let us first look at how VMT reports an error.

First, a thread fails to match and returns mismatch information to its process. Here I found a problem: when the process received a return, it returned in turn without distinguishing a matching return from a mismatching one. I added code to tell them apart. A mismatch is now reported only when the last thread returns: that thread generates an error report and hands it to GrammarService. GS checks the currently active threads. If there are any, it simply drops the error, kills the process, and sends a "mismatch" message to the caller. If this was the last thread, GS obviously cannot sit by: parsing has reached a point where it cannot continue, and GS has to do something about it. That is error handling.

Since error recovery is not considered yet, we simply return the error and end the parsing process, which is enough for the test project. The code, however, cannot stop at "enough": it must leave an interface so that recovery code can be added easily when needed. For this purpose, when the final error occurs, GS raises the Recover event with the last failing thread as its parameter. Because the internal structure of VMT is exposed, an error recovery routine can hook this event, pick a suitable place to restart parsing, and record the error information.

The code changes described above also greatly reduced the number of useless threads, so the parsing speed improved somewhat.
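The reporting flow above (drop the error while other threads are alive, raise Recover on the last one) can be sketched as follows. This is an illustrative Python model, not the VB.NET GrammarService class; the attribute names, the handler list standing in for the Recover event, and the returned strings are all assumptions.

```python
class GrammarService:
    """Toy model of how GS decides what to do with a mismatch report."""

    def __init__(self):
        self.active_threads = 0
        self.recover_handlers = []   # stand-in for the Recover event
        self.errors = []

    def report_mismatch(self, thread_name, message):
        """Called by a thread that failed to match."""
        self.active_threads -= 1
        if self.active_threads > 0:
            return "ignored"         # other branches are still alive
        # This was the last thread: the error is real and must be handled.
        self.errors.append((thread_name, message))
        for handler in self.recover_handlers:
            handler(thread_name, message)
        return "recover" if self.recover_handlers else "abort"
```

A recovery routine would be registered by appending a callable to `recover_handlers`; with no handler registered the sketch just ends parsing, which corresponds to the simple behavior used for the test project.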

Total Parser Run Time: 0.0300432

Thread Count: 186

Proc Count: 77

Thread forks: 109

The speed has improved considerably and the number of useless threads is greatly reduced (77/109). So I started trying more complex code. The first requirement is a more complex grammar file. Since self-compilation is a must-have goal, I extended B- so that it can handle most VB code. As a first test I took a piece of this project's own code that the grammar fully supports.

A total of 852 tokens, 177 lines.

Code generated.

Total Use Time: 0.110144 s

The speed is acceptable and the grammar is compatible. Next, I pasted copies of the code together to produce a file of about 24,000 lines and used it as a speed test.

A total of 113858 tokens, 23922 lines.

Total Parser Run Time: 9.163176

Thread Count: 500700

Proc Count: 230472

Thread forks: 270228

At this rate, the 852 tokens above need only about 0.0685 s of parsing. Subtracting that from the measured total leaves the time required to compile the .NET regular expressions, and certainly some I/O time as well.
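The estimate above is simple proportional scaling, and the arithmetic can be checked with the figures reported in this section. The snippet is in Python for convenience, and the variable names are made up for the example.

```python
# Figures reported above for the large test file
large_tokens = 113858
large_seconds = 9.163176

# Figures reported above for the small test file
small_tokens = 852
small_total_seconds = 0.110144

seconds_per_token = large_seconds / large_tokens
small_parse_estimate = seconds_per_token * small_tokens      # about 0.0686 s
small_overhead = small_total_seconds - small_parse_estimate  # about 0.0416 s

# Thread counts reported for the large run: processes plus forks
thread_total = 230472 + 270228                               # 500700 threads
```

Under the assumption that parsing time is linear in the number of tokens, about 0.04 s of the small run is left for everything other than parsing.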

4.4 Basic principle of the automatic analyzer

At this point the automatic analyzer is basically complete. Next, let us go inside VMT and look at roughly how it runs, taking (A | (B [C])) D as an example. When the input is AD, A matches and control leaves through the match exit, skipping the B subexpression (this can be seen clearly in the figure above), and goes on to D, which matches. The match reaches a legal end, and the parse tree is generated and returned. For the input BD, A does not match, so control takes the mismatch exit to B; B matches, and its match exit leads to C; C does not match, so its mismatch exit leads to D; D matches and the match is complete.

This is only the simplest case, with no rule calls. With rule calls (no language of any complexity can do without them), a problem arises. Under VMT semantics a rule call creates a new process, and it is not known immediately whether it will match. So if the terminals that follow do not match, should the thread that made the rule call be ended? The answer is clearly no: if it ended, there would be no return point when the rule call returns, and no way to continue. So the thread does not end while a rule call is pending; does it wait, then? And what happens if it does wait? If it does not wait, where does the rule call return to? One could record the return point in a table and have the returning rule call look it up and continue from there. But if two rule calls return successfully, which thread runs the second one? And if the first return quickly triggers further rule calls, how is the return point information recorded? This scheme runs into a problem it cannot solve: multiple returns into one thread.

This is the main problem VMT solves. Whether or not a branch is actually needed, VMT makes a copy of the current thread, sets the copy as the return thread of the called rule, and then continues matching the terminals that follow.
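The walk-through of (A | (B [C])) D at the start of this section can be reproduced with a small table-driven sketch. It is written in Python and is only a model of a route map with match and mismatch exits, not the VMT code; the `ROUTE_MAP` layout and the `ACCEPT` marker are assumptions, and rule calls are not modeled.

```python
ACCEPT = "ACCEPT"

# node -> (terminal to match, match exit, mismatch exit)
ROUTE_MAP = {
    "A": ("A", "D", "B"),      # A matched: skip the B subexpression
    "B": ("B", "C", None),     # no alternative left if B fails
    "C": ("C", "D", "D"),      # C is optional: both exits lead to D
    "D": ("D", ACCEPT, None),
}


def matches(tokens, start="A"):
    """Run the tokens through the route map, one token at a time."""
    current = start
    for tok in tokens:
        node = current
        while True:
            if node is None or node == ACCEPT:
                return False            # nowhere left to try this token
            terminal, on_match, on_mismatch = ROUTE_MAP[node]
            if terminal == tok:
                current = on_match      # token consumed, take the match exit
                break
            node = on_mismatch          # same token, take the mismatch exit
    return current == ACCEPT
```

`matches(["A", "D"])` and `matches(["B", "D"])` follow exactly the two paths described above, and `matches(["B", "C", "D"])` exercises the optional C.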
Since a process normally returns to the thread that made the call, this scheme works: from the point of view of the returning process, the copy and the original are indistinguishable, that is, the copy is the calling thread just as much as the original is. So when the initial thread encounters a rule call, it simply makes a copy of itself, sets the return point, and lets the copy wait for the rule to return, while it carries on itself until it leaves through a mismatch exit. Of all these threads, at most one eventually returns: 0 means the rule did not match, 1 means it matched. As a result, many threads of the same process may be running in parallel at the same position in the input. Branch prediction is not even needed, because the analyzer walks through every possible case. It is slow, though, which is why I reintroduced branch prediction in a later version.

That solves the call/return problem. What if the called rule is an empty production? By rights the token should go back to the calling thread to be processed there. However, VMT's synchronous mode does not allow the token to be reissued in the next round, so it must be processed immediately. This is what the Redis parameter is for. Redis stands for "redistribute", that is, reissue: it tells the call source that a thread in the called process has reached the end and ended legally but did not consume the token, so the call source has to process it. The call source receives the token in the return parameter and passes it straight to the function that handles the rule call's return. This way, even an empty production returns immediately, and the original token is reissued to the calling rule. This in fact solves more than empty productions: optional elements at the end of a rule are handled properly as well.

4.5 Actual example

Take the example on the left, where C is a rule, that is, a nonterminal. In the figure below, look at the mismatch exits. If rule calls were not used, the lower route map would have to be embedded into the upper one, so that point 1 points to 4. Which exits should then point where? Clearly 2 is the legal exit, so the legal exits of rule C, namely 5, 6 and 7, should point to 2. Exit 8 is the illegal exit and should point to 3. In fact, since 2 and 3 both lead to D, exits 5, 6, 7 and 8 all lead to D. When the input is bxd, x enters rule C and matches X; the following d does not match, so C is left through exit 6 with d as the Redis parameter, and the call returns immediately; d then matches D and the input is accepted. For the input bd, after entering C neither X nor Z matches, so C is left through exit 8 with d as the Redis parameter; control passes from 3 to D and matching continues.

So when should the call return? In theory a rule call can return only through a legal exit. Should the caller keep waiting after the process has returned? No: it should continue from the return point immediately. But what if there is a common prefix, for example (AB) | (ABC)? This could be written as AB[C], but the first form cannot be called wrong; it is in fact the BNF way of writing it. If the input is ABC, the first form will definitely return twice, and the return for AB is likely to take effect immediately. If both returns let the call source return in turn, then the grammar is ambiguous: the same code has two different interpretations! After weighing this up, and in order to ask less of the user, I finally decided to accept a second return. Although the common prefix problem is thereby resolved, it remains an unpleasant matter for novice users. The choice is controlled by a parameter so that both novice and advanced users are served (versions 1.4 and earlier do not support this).

4.6 Branch

Let us look at some details of threads in the VMT model, which are also the core of VMT. In the figure on the right (assuming A is a rule, that is, a nonterminal), when a match begins, what should be matched first, and by whom? Under normal C3 semantics, one waits for A to finish and tries to match B only if A did not match. The problem is often not that simple. In VMT's mode of operation, waiting for A to end means that the tokens ahead disappear: they are consumed by A. A match that should have succeeded may then fail because of A. So in VMT's synchronous mode there is only one option: let the thread fork, so that one thread waits for A to return while the other goes on to match B. Which one should match B? Either would seem to do. On closer inspection, though, a rule call produces at most one matching end but may be made up of many branches, so the chance of leaving through a mismatch exit is much greater than the chance of leaving through the match exit. If the copy took the mismatch exit, far more copies would be made than if it took the other path, and more thread forks undoubtedly mean slower operation. So the copy takes the match exit, and the source thread itself keeps going through the mismatch exits until the last one, where it ends because of a mismatch.

As the figure below shows, the source thread creates a copy to match A, and itself skips A to match B.

If B's mismatch exit leads to further elements, the same rule applies: terminals are matched directly, and for nonterminals a copy thread is created to wait for the return.

That was the theoretical conclusion. In practice there was a surprise: the chance of leaving through the match exit is not nearly as small as expected! I found this when testing with B-. Why does it not agree with the reasoning? Checking the reasoning above carefully shows that it took as a premise a condition that does not necessarily hold, which is obviously wrong. The premise that breaks the reasoning is "the chance of leaving through a mismatch exit is much greater than the chance of leaving through the match exit". Analysis of the B- grammar file showed that most grammar rules are single-branch most of the time; with no branch there is no need for a copy. The original thread should do the matching itself rather than create a copy and then end because of a mismatch. The check is simple: add code to the original that tests whether the mismatch exit is empty. If it is empty, the thread matches by itself; if not, it creates a copy thread.

It can be shown that the parsing method used by VMT is equivalent to LL(1) parsing: when the grammar reaches a branch, the analyzer waits for the next token and selects the branch by examining it. However, even with branch prediction added, tests show that 50% of threads end right after the fork, and they waste a lot of running time. These branches do not take much memory, though, and they run well on a good memory pool: there are many useless branches overall, but not many alive at any one time. So when porting to an environment without a memory pool, it is advisable to provide one first. You can also turn on the "use manual memory pool" option in the code, but implementing a proper memory pool is still recommended, because the manual memory pool in the code is not fast.

Although VMT does not solve left recursion, it does solve the common prefix and empty production problems. In use, anyone can reach their design goals by following the pattern of the two test languages provided (C- and B-).

As of 2004-3-27, the case is not fully supported.

4.7 Key code

The most important parts of the VMT model are two functions: Fix and Wakeup.

Public Sub Fix(ByVal T As Token)
    ' Declare and initialize variables ...
    If T.Type = "__END__" Then
        ' Handle the end-of-input symbol
    End If
    While Not IsNothing(CurrentRuleMapItem)
        Select Case CurrentRuleMapItem.Type
            Case RuleMapItem.RuleMapItemType.Terminal
                If String.Compare(CurrentRuleMapItem.Text, T.Text, mIgnoreCase) = 0 Then
                    ' Match: move the current token into the buffer and advance the CurrentRuleMapItem pointer
                Else
                    ' Generate an error message; the thread may be destroyed.
                End If
            Case RuleMapItem.RuleMapItemType.Token
                If mgr.Compatible.Compatible(T, CurrentRuleMapItem.Text) Then
                    ' Match
                Else
                    ' Generate an error message; the thread may be destroyed.
                End If
            Case RuleMapItem.RuleMapItemType.Reference
                ' Find the called rule
                R = mgr.FindRule(CurrentRuleMapItem.Text)
                If IsNothing(R) Then
                    Throw New Exception("Can't find rule: " & CurrentRuleMapItem.Text & ".")
                    Exit Sub
                End If
                ' Branch prediction
                If mgr.Predict.Item(R.Name).Predict(T) Or R.EnableEmpty Then
                    ' Each call generates a copy.
                    Dim Th As Thread
                    Th = mProc.Fork(Me) ' This thread only waits, so T is not passed to it.
                    Th.SetState(RunState.Waiting) ' Waiting
                    Th.mRetPoint = CurrentRuleMapItem ' Return to this item
                    Th.CurrItem = CurrentRuleMapItem ' The current item is here
                    mProc.Service.CallRule(R, Th, T) ' T is passed to the called process here.
                Else
                    ' It certainly cannot match, so simply ignore it.
                End If
        End Select
        Opt = CurrentRuleMapItem.Option
        Ri = CurrentRuleMapItem.Dismatch
    End While
    If TerFix Then
        If Not IsNothing(iNext) Then
            ' Terminal/Token matched
            CurrItem = iNext
        Else
            [Return](Nothing) ' It matched, so nothing is reissued.
        End If
    Else
        If Opt Then
            [Return](T) ' Optional element at the end: reissue the token.
        Else
            Terminate()
        End If
    End If
End Sub

As can be seen, the most important branching logic lies in the handling of Reference-type route map items (the gray area in the original listing).

Public Sub Wakeup(ByVal Pt As ParseTree, ByVal E As Boolean, ByVal Redis As Token)
    ' On each return, make a new copy if the called process has not ended (E = False)
    If E = False Then
        Dim Th As Thread
        Th = mProc.Fork(Me)
        Th.Wakeup(Pt, True, Redis) ' A trick: the copy executes the Else branch below
    Else
        ' After the return, do not end but continue.
        If IsNothing(Pt) Then ' And IsNothing(Redis) Then
            ' No match.
            ' If the item was optional, it has already been passed over. Not optional and no match: end. Nothing needs to be reissued.
            CurrItem = mRetPoint ' Kept for error handling
            Terminate()
        Else
            ' Match
            SetState(RunState.Ready)
            Buff.Add(Pt)
            CurrItem = mRetPoint.Match
            If IsNothing(CurrItem) Then ' This rule is fully matched; letting Fix handle it would Terminate directly
                [Return](Redis)
            Else
                If Not IsNothing(Redis) Then
                    Fix(Redis)
                End If
            End If
        End If
    End If
End Sub

As can be seen, the code is very simple. The core is the handling of a rule's return (the gray area in the original listing): the waiting thread does not rush to start processing, but first checks whether it is the last recipient of a return. If not, it makes another copy and lets the copy handle the return (decided by the parameter E). In fact, if you read the VMT code carefully, these few lines are the whole core; the rest is management code. An LL analyzer that runs on a route map is evidently quite easy to build.

When reposting, please cite the original address: https://www.9cbs.com/read-33326.html
