introduction
I need a form editor during the development of a database project. After I have a poor implementation, I suddenly discovered that I need an interpreter! According to my own ideas, I wrote some code, but I quickly found that this is not a solution. I need a real interpreter. So I started to conduct a text analysis research. In the basic theory, when I came into contact with EBNF, I was surprised to discover, what I am looking for is this - as long as an EBNF compiler You can generate the interpreter of the target language by writing the EBNF code to describe the destination language. After the use of a period of use, the syntax of YACC (a tool for generating the analyzer C) is very obscured, so that the newcomers of this compiler are very uncomfortable. So, I started to create my own EBNF dialect. Constantly explore in the dark, and eventually have a little result described in this article.
brief introduction
Automatic analysis tool is used to write a language analyzer, write a 2-type syntax in eBNF language, and then generate the corresponding C code file (most of the traditional automatic analyzer tool to generate C code by automatic analysis tool, mainly because of large Some compilers are written using C / C ), then link into the source code for the other parts of the compiler, compile together, and then generate a complete compiler. Common automated analysis tools have YACC and use LEX using LR analysis, because they all generate C code, need to be compiled again, and the syntax of EBNF is biased towards C style, and for expert compiler producers, This may not be a problem, but for other non-compiler producers, use an interpreter or a miniature compiler, learning C style EBNF syntax obviously more cost-effective, and VB's approaching natural language grammar for beginners Have very attractive. So I had a compiler / interpreter that made such an EBNF close to the VB grammar. Using syntax closer to natural language to reduce the difficulty of making compiler / interpreters.
Let us first understand some of the details of the EBNF mentioned above. EBNF is based on old BNF, which is more expanded, which is closer to the logic of the program relative to BNF, such as cycling, condition. The BNF is a BACKS-NAUR parapet, which is a form language proposed by this two. Using this language to write syntax unified syntax documents not only help to communicate but also the language that constructs a description language is possible. And what we have to do is constructing such a compiler while developing a compiler using it to prove its availability.
Since YACC's grammar is very embarrassed, I don't plan to use its grammar, so we will define an eBNF dialect to achieve our goal. The first question to consider is the syntax. What kind of syntax is used? Of course, it is now nothing benefits to us without any benefits. Because we can only use manual code, a complex syntax must significantly increase the complexity of implementation, and grammar files and EBNF The compiler is the wrong way, it will be difficult to find, so it is better to define only the language that uses the functionality of all EBNF compilers, the more simple and better. Since the problem is primarily a grammar rule, it is undoubtedly a good way to join the start symbol and end symbols in simplicity. Of course, the grammar is close to VB will be helpful to learn this language. So the initial idea is of course:
Section xxxxx
Rule xxx => ....
End section
This syntax is very close to VB, and the symbol is also very obvious (for example, the derivation =>). And its syntax is recursive, and it is also very simple for constructing such a compiler.
In this way, when the scanner (the scanner, analyzer, analyzer, etc.) encountered section, and then starts the segment name and then starts the analysis of the grammar rules. For grammar rules, in the issue of using the carriage return or the semicolon as a logical row end, the same carriage return as VB is initially used as the end, but in the later use, some grammar rules are very long, so if used The carriage return as a logical row end value will make a row very long, and additional newlines will undoubtedly increase the difficulty, so in the subsequent version (1.1 ~ 1.5), the semicolon is used as the logical row end character, the same belt The problem is that space is no longer a valid element character, and it is a unclosed separator. The scanner uses a 3-type language, and the analyzer uses a type 2 language. Since the 2-type language can describe all languages that can be described in 3 languages (specifically see the textbooks of the compiler), it is better to describe the scanner directly, but only for use as some restrictions. This will greatly simplify the work of implementation (you can experience these restrictions when reading the code, so that the scanner's structure makes full use of the .NET class library makes the code greatly simplified) - therefore only need to implement a 2-type compiler .
Another design problem that cannot be ignored is how to use this compiler? Compile syntax into original code? .NET's assembly? Still compiling into an internal structure, then stored as a file by serialization? Objectively speaking, directly compiling the assembly of .NET is the best, convenient management, and the speed of use may be faster. But it is not this, but if you want to compile becoming an assembly, you need to be integrated with the scanner and analyzer that is driven. Think about it, we have n'thing to do now! I can't integrate anything, what should we do? Compiled as the source code. The last solution is to use the internal representation of the serialization. Reverse sequences are used as internal representations, and then construct a universal scanner and analyzer or add an adapter to operate the scanner and analyzer. Moreover, the grammar rules and analyzer / scanners can be debugged separately, which is also very suitable for projects that are now in research phases.
Below is the definition and self-description of C3, which is 1.4 after several modifications (the original version of the most used is referred to as version 1.0).