Say the compiler from Lex & Yacc (1. Regular expression)

zhaozj2021-02-08  264

Say the compiler from Lex & Yacc (1. Regular expression)

Author: tangl_99

QQ: 8664220

MSN: TANGL_99@hotmail.com

Email: tangl_99@sohu.com

Friends who have learned compilation principles must have come into contact with Lex this small lexical scan tool. But few people really use Lex in their own procedures. When constructing a professional compiler, they often need to use Lex and Yacc. It is because of these two tools, making we prepare a compiler, an interpreter and other tools, it is very simple. However, the person who will use Lex and Yacc will not be simple. Lex and Yacc involve a series The theoretical knowledge of compilation principle is not simply reading the book. This article simply introduces LEX and YACC how to use. Related compilation, please see the undergraduate textbook.

Inside the domestic university textbook, there is very little introduction to Lex and YACC. Some of them have not, but there are many compilation principles in foreign countries. According to the classification of the discipline, the << Compilation Principles >> tutorials opened by domestic universities are just explaining compilation The principle does not explain practices. For practical aspects, it is another discipline << Compilation Technology >>. Books on compilation technology are fewer in China. Not long, I heard that Shanghai Jiaotong University is published. Teaching materials for compilation technology. Unfortunately, we can't see it. Fortunately, the Machinery Industry Press introduces classic books in the United States Kenneth C.LOUDEN << Compilation Principles and Practices >> Introducing LEX And YACC use.

Lex belongs to the tool inside the GNU. It is usually a GCC with a tool. If you use the Linux operating system, then affirmation the system itself has Lex and Yacc, but YACC's name becomes bison. If you use the Windows operating system Then you can get it in Cygwin or GNUPRO. There are also Windows version Lex and Yacc on the Internet. You can find it yourself.

There are two articles in this article, one is introducing lex, and the other is to introduce Yacc. Lex and Yacc matching, we construct your compiler or interpreter is like a child. So I called the name of this article.

In this paper, it is flex (Fase Lex) as an example, and the two explains how to construct the scanner.

Flex can pass through an input file and then generate the C source code for the scanner.

In fact, the scanner is not only used for compiler. For example, when writing the script engine, I see that many developers are the scanner written by themselves, and their algorithms are quite backward (completely without DFA conceptualization), even many scripts development The lexical scanner is not written, but looking for token during the run. In modern computer speed, it is indeed possible that the small scripting engine is scanned in the run, but as a qualified programmer, or A qualified computer undergraduate graduate, can use compilation principles and technical practice, should be a basic requirement.

If you want to say that the scanner source code of the lexical analysis is also very simple, it will be written in the C language. But Kenneth Louden spends more than 50 pages in << Compilation principle and technology>, because the reason is from the theoretical angle Introduce the standard, scalable, efficient lexic scanner. From the regular expression to the DFA (have a poor automatic), then to NFA (non-deterministic has a poor automatic), finally arrived at the code Written. The method of compiling the scanner with the automatic machine principle is basically the standard method of the current lexical scanner, which is the method used by Lex. In Lex, we don't even need to construct the words of the words yourself, we only need to put the corresponding regular Expression input, then Lex can generate DFAs yourself, then generate source code, it can be described as convenient.

This article does not talk about DFA, Lex input is a regular expression, we can first look at the regular expression knowledge.

1. Regular expression (Regular Expression):

For friends who have learned compilation principles, this section can not look at it. However, some things still have to pay attention, because the use of regular expressions in Flex is not explained in our class.

Let's take a look at the example:

Example 1.1

Name tangl_99

This is the regular expression that defines the Name, it is equal to the string Tangl_99. So, if you have Tangl_99 in your source program, it is equal to a NAME regular expression.

Example 1.2

DIGIT 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 This expression is to say, regular expression DIGIT is a letter in 0, 1, 2, ..., 9. So, no matter 0 , 2, or 9 ... all belong to the regular expression of DIGIT.

"|" Symbol indicates "or" meaning.

Then define the regular expression name tangl_99 | running, the same, if you have TANGL_99 or Running in your source program, then it is equal to a NAME regular expression.

Example 1.3

One 1 *

"*" Symbol indicates "zero to unlimited repetition"

Then the string of one represented by ONE can be an empty string (what characters are not), 1, 11, 111, 111111111111111, 11111111 ..., etc. In short, one is composed of 0 or N 1 (N can For any natural number).

There is a " " symbol as "*". Please see the example 1.4

Example 1.4

Realone 1

" " Symbol indicates "1 to unlimited repetition"

The only thing in Realone and One is that Realone does not contain an empty string, because " " indicates at least one repetition, then Realone has at least one 1. So the string expressed by Realone is 1, 11, 111, 1111, 11111 ..., etc. Wait.

Example 1.5

Digit [0-9]

Letter [A-ZA-Z]

The DIGIT here is equal to the DIGIT in Example 1.2, that is, A | B | C is equivalent to [A-C].

Similarly, letter is equivalent to a | b | c | d | e | f | ... | y | z | a | b | c | d ... | z, point not, you can't write Letter [AZ] And must be capitalized and lowercases should be written.

Example 1.6

NOTA [^ a]

"^" Said that it is not all characters other than this character.

So NOTA is represented by all characters other than A.

Let us take a look at some of the commonly used examples in some general advanced programs.

Digit [0-9]

Number {DIGIT}

Letter [A-ZA-Z_]

Digit has been mentioned many times, that is, the Arabic number of 0-9.Number is all numeric combinations, that is, integers.

The LETTER has also mentioned, the only difference is to have a downline. Because in general our C language, there is a loop to indicate the defined variable name, so I also put the underline as an English letter to handle it.

Here Number uses the DIGIT regular expression defined above. In Lex, {DIGIT} is to represent the regular expression DIGIT.

Newline [/ n]

Whitespace [/ t]

Newline is the meaning of this. We use / n this symbol, which is the same as the prop-language. The problem is that everyone may ask why [] symbols. Because in Lex, if you use [], Then it is definitely a single word symbol, not to be understood as "/" and "n" two characters.

Whitespace is the meaning of space symbols. There are two kinds in the general advanced programming, one is a simple space, there is a / T tester. Use the " " symbol, indicating the unlimited combination of these blank symbols .

转载请注明原文地址:https://www.9cbs.com/read-3202.html

New Post(0)