Compiler (Interpreter) Writing Guide: Tools for Writing Compilers (Interpreters), Lex

zhaozj, 2021-02-08

Author: Riceball (riceballl@hotmail.com)

Keywords: Compiler, Interpreter, Lex, Yacc, Compiler Principles, Regular Expression, Pascal

Prerequisites: compiler principles, regular expressions, Pascal

This article does not aim to explain compiler principles in depth. Instead, it explains how to use tools (compiler generators) to write compilers. If you do not know what compilation is, please study compiler principles first; this article is not written for beginners.

First: what is a compiler (interpreter)?

A compiler is a program that translates one computer language into another. It takes a program written in the source language as input and translates it into an equivalent program written in the target language. The source language is generally a high-level language such as Pascal or Delphi; the target language is assembly language or the object code of the target machine, sometimes called machine code.

    Source program → Compiler → Target program

An interpreter works much like a compiler. The difference is that it executes the source program directly instead of generating target code. In principle, any programming language can be either interpreted or compiled.

(1) Scanner: the scanner reads the source program (usually as a character stream) and performs lexical analysis, translating the source program into tokens and placing them in a token table. During this process the scanner also performs simple spelling checks.
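As a rough illustration of what a scanner does, here is a minimal hand-written sketch in Free Pascal. It is not the scanner TP Lex generates; the reduced token set TSketchToken, the NextToken routine and the fixed input string are all hypothetical names chosen for this example.

```pascal
program ScanSketch;
{$mode objfpc}{$H+}

type
  { Hypothetical, reduced token set for illustration only. }
  TSketchToken = (stEOF, stName, stIntVal, stSymbol);

const
  Letters = ['A'..'Z', 'a'..'z'];
  Digits  = ['0'..'9'];

var
  Src: string = 'count >= 10';  { the "character stream" to scan }
  Cur: Integer = 1;             { index of the next unread character }

{ Reads the next token from Src; returns its kind and its text. }
function NextToken(out Lexeme: string): TSketchToken;
var
  Start: Integer;
begin
  Lexeme := '';
  { Skip blanks between tokens. }
  while (Cur <= Length(Src)) and (Src[Cur] = ' ') do
    Inc(Cur);
  if Cur > Length(Src) then
    Exit(stEOF);
  Start := Cur;
  if Src[Cur] in Letters then
  begin
    { Identifier: a letter followed by letters or digits. }
    while (Cur <= Length(Src)) and (Src[Cur] in Letters + Digits) do
      Inc(Cur);
    Result := stName;
  end
  else if Src[Cur] in Digits then
  begin
    { Integer literal: a run of digits. }
    while (Cur <= Length(Src)) and (Src[Cur] in Digits) do
      Inc(Cur);
    Result := stIntVal;
  end
  else
  begin
    { Special symbol, including the two-character forms <=, >=, := and <>. }
    Inc(Cur);
    if (Cur <= Length(Src)) and
       (((Src[Start] in ['<', '>', ':']) and (Src[Cur] = '=')) or
        ((Src[Start] = '<') and (Src[Cur] = '>'))) then
      Inc(Cur);
    Result := stSymbol;
  end;
  Lexeme := Copy(Src, Start, Cur - Start);
end;

var
  Kind: TSketchToken;
  Lexeme: string;
begin
  repeat
    Kind := NextToken(Lexeme);
    if Kind <> stEOF then
      WriteLn(Ord(Kind), ' ', Lexeme);
  until Kind = stEOF;
end.
```

Writing this by hand for a full language quickly becomes tedious, which is the motivation for generating the scanner from regular expressions instead.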

Token: a token can be one of several kinds. Typical ones are: keywords, such as IF and WHILE; identifiers, which are user-defined variable names, procedure names and so on, usually made up of letters and digits and starting with a letter; and special symbols, such as the arithmetic symbols + and *, and some multi-character symbols such as >= and <>. TTokenType is an enumerated data type, in effect an integer value, where each value stands for one kind of token.

    TTokenType = (
      ttNone, ttStrVal, ttIntVal, ttFloatVal,
      // ttNAME is an identifier or keyword
      ttNAME, ttSWITCH, ttVAR, ttCONST, ttTYPE, ttRECORD, ttARRAY,
      ttDOT, ttDOTDOT, ttOF, ttTRY, ttEXCEPT, ttRAISE, ttFINALLY, ttON,
      ttREAD, ttWRITE, ttPROPERTY, ttPROCEDURE, ttFUNCTION,
      ttCONSTRUCTOR, ttDESTRUCTOR, ttCLASS, ttNIL, ttIS, ttAS,
      ttVIRTUAL, ttOVERRIDE, ttREINTRODUCE, ttINHERITED, ttABSTRACT,
      ttEXTERNAL, ttFORWARD, ttIN, ttBEGIN, ttEND, ttBREAK, ttCONTINUE,
      ttEXIT, ttIF, ttTHEN, ttELSE, ttWHILE, ttREPEAT, ttUNTIL, ttFOR,
      ttTO, ttDOWNTO, ttDO, ttCASE, ttTRUE, ttFALSE, ttAND, ttOR, ttXOR,
      ttDIV, ttMOD, ttNOT, ttPLUS, ttMINUS, ttTIMES, ttDIVIDE, ttEQ,
      ttNOTEQ, ttGTR, ttGTREQ, ttLESS, ttLESSEQ, ttSEMI, ttCOMMA,
      ttCOLON, ttASSIGN, ttBLEFT, ttBRIGHT, ttALEFT, ttARIGHT, ttCRIGHT,
      ttDEFAULT,
      // Tokens for compatibility with Delphi
      ttPRIVATE, ttPROTECTED, ttPUBLIC, ttPUBLISHED, ttREGISTER,
      ttPASCAL, ttCDECL, ttSTDCALL, ttFASTCALL
    );

    TTokenTypes = set of TTokenType;

To represent the content of a token, we define the token record:

    TToken = record
      TokenTy: TTokenType;
      TokenValue: Variant;
    end;
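The following sketch shows how a token record and a set of token types might be used together. It assumes Free Pascal with the Variants unit, uses a deliberately reduced copy of the enumeration, and keeps the field names as they appear in the record above; the ArithmeticOps constant is a hypothetical grouping added for this example.

```pascal
program TokenSketch;
{$mode objfpc}{$H+}
uses
  Variants;

type
  { Reduced copy of the token types, for illustration only. }
  TTokenType = (ttNone, ttIntVal, ttNAME, ttPLUS, ttMINUS, ttTIMES, ttDIVIDE);
  TTokenTypes = set of TTokenType;

  TToken = record
    TokenTy: TTokenType;
    TokenValue: Variant;
  end;

const
  { Hypothetical grouping: the arithmetic operator tokens. }
  ArithmeticOps: TTokenTypes = [ttPLUS, ttMINUS, ttTIMES, ttDIVIDE];

var
  Tok: TToken;
begin
  { An integer literal carries its numeric value in the Variant field. }
  Tok.TokenTy := ttIntVal;
  Tok.TokenValue := 42;
  if not (Tok.TokenTy in ArithmeticOps) then
    WriteLn('operand with value ', VarToStr(Tok.TokenValue));

  { An operator token can be classified with a single set test. }
  Tok.TokenTy := ttPLUS;
  Tok.TokenValue := '+';
  if Tok.TokenTy in ArithmeticOps then
    WriteLn('operator ', VarToStr(Tok.TokenValue));
end.
```

A set type lets later stages ask "is this token one of these kinds?" in one test instead of a chain of comparisons.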

TP Lex

TP Lex is a generator of lexical analyzer (scanner) source code, used to create the scanner subsystem of a Pascal (Turbo Pascal, Delphi) program. TP Lex reads a Lex file (default extension .l), generates the scanner routine, and writes it to a Pascal source file. If errors are found in the Lex file during analysis, the error messages are written to the corresponding list file (extension .lst). The generated Pascal source file contains the scanner routine yylex:

    function yylex: Integer;

Call this routine from your main program to perform lexical analysis. Each call returns the type value of the token just recognized; at end of file, yylex returns 0. The code template for the yylex routine is in the file yylex.cod. TP Lex needs this file to build the generated Pascal source file, so it must be in the current directory or in the TP Lex directory. The generated source program also requires the file lexlib.pas in order to compile.
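A driver that calls yylex until it returns 0 could look roughly like the sketch below. Several things here are assumptions rather than facts from this article: the generated scanner is taken to live in a file named scanner.pas (a hypothetical name) and to be pulled in with an include directive, and the LexLib unit is assumed to expose the input file variable yyinput alongside yytext. Check your copy of lexlib.pas before relying on either.

```pascal
program LexDriver;
uses
  LexLib;              { library unit required by the generated scanner }

{$I scanner.pas}       { hypothetical name of the TP Lex output file }

var
  TokenKind: Integer;
begin
  { Assumption: LexLib reads its input from the text file yyinput. }
  Assign(yyinput, ParamStr(1));
  Reset(yyinput);
  repeat
    TokenKind := yylex;            { 0 signals end of file }
    if TokenKind <> 0 then
      WriteLn(TokenKind, ': ', yytext);
  until TokenKind = 0;
  Close(yyinput);
end.
```

This program only compiles together with a generated scanner, so treat it as a template for the calling pattern rather than a finished tool.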

Usage: lex [options] lex-file[.l] [output-file[.pas]]

Options:

    -v  "Verbose": with this option, Lex also generates a readable description file, with extension .lst, while generating the lexical analyzer.
    -o  "Optimize": Lex optimizes the DFA table to produce a minimal DFA.

How to write a Lex file (.l)

A Lex file (.l) is divided into three sections, separated by "%%":

    definitions
    %%
    rules
    %%
    auxiliary procedures

Any of the three sections may be empty; that is not a problem. The format is line-oriented: statements are separated by line breaks.
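To make the three-section layout concrete, here is a small sketch of a complete Lex file. The name Digit and the messages are made up for this example, and placing the uses clause and the main program in the file itself is an assumption about how you choose to organize the output; the generated scanner could equally be included from a separate program.

```lex
%{
(* Definitions section: Pascal code copied to the output as-is. *)
uses LexLib;
%}

Digit   [0-9]

%%

{Digit}+    writeln('number: ', yytext);
[ \t\n]     ;
.           writeln('other: ', yytext);

%%

(* Auxiliary procedures section: here, a minimal main program. *)
begin
  if yylex = 0 then ;
end.
```

Each rule sits on its own line, with the expression in the first column and the action after it, which is what the line-oriented layout means in practice.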

Definitions section

The definitions section appears before the first double percent sign. It can contain the following elements:

- Regular definitions. General format:

    name substitution

Names for regular expressions are defined in this section. A definition starts with the name in the first column of a line, followed by one or more spaces and then the regular expression it stands for. The name must be a legal identifier (the first character must be a letter; the following characters can be letters or digits). The substitution is a Lex regular expression, which may in turn refer to previously defined names, as long as each name is enclosed in curly braces ("{}"). For example, a definition of signed numbers:

    Number          [0-9]
    SignedNumber    ("+"|"-")?{Number}

- Start states. Format:

    %start name ...

These are used to specify start conditions on rules (see the rules section for details). The %start keyword can be abbreviated to %s or %S.

- Pascal declarations. Any Pascal source code placed between "%{" and "%}" is inserted into the output outside the yylex function (note the order of these characters).
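Start states are easier to follow with an example. The sketch below declares a start state named Note (a made-up name) and switches into and out of it when an "@" character is seen. Two things are assumptions here, since this article does not spell them out: that a rule is restricted to a start state by prefixing it with the state name in angle brackets, as in standard Lex, and that the library offers a start procedure for changing state, with 0 as the default state.

```lex
%start Note

%%

<Note>"@"       start(0);
<Note>[^@\n]+   writeln('note text: ', yytext);
"@"             start(Note);
```

The rules prefixed with the state name are listed first so that, inside the Note state, the closing "@" is matched by the state-specific rule rather than by the general one.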

Rules section

Each rule consists of a regular expression followed by Pascal code; when the regular expression is matched, the Pascal code (the action) that follows it is executed. The format of a rule is:

    expression    statement;

Note: the statement must be a single Pascal statement terminated by a semicolon (use begin ... end if several statements are needed). The statement can be spread over several lines, but each continuation line must begin with at least one space or tab to show that it belongs to the previous line. An action of "|" means that the expression shares the action of the next rule. For example, Pascal comments:

    "(*"  |
    "{"   begin
            repeat
              c := get_char;
              case c of
                '}': exit;
                '*': begin
                       c := get_char;
                       if c = ')' then exit else unget_char(c)
                     end;
                #0:  begin
                       commenteof;
                       exit;
                     end;
              end;
            until false
          end;

The TP Lex library unit provides a range of useful variables and procedures that you can use in your actions. For example: the yytext variable holds the matched string, and the yyleng variable holds the length of the matched string.
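As a small illustration of yytext and yyleng inside actions, consider the two rules sketched below. The token code 1 is an arbitrary placeholder, and the return procedure used to set the value that yylex hands back is an assumption about the library unit that this article does not confirm.

```lex
[A-Za-z][A-Za-z0-9]*    writeln('identifier "', yytext, '" of length ', yyleng);
[0-9]+                  begin
                          writeln('integer ', yytext);
                          return(1)
                        end;
```

The second rule shows the multi-line form: every continuation line is indented so that it is read as part of the same action.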

In the rules section, Pascal source code placed between "%{" and "%}" is treated as local declarations of the scanner routine.

Auxiliary procedures section

The auxiliary procedures section can contain any Pascal source code, such as auxiliary routines or the main program; it is simply copied to the end of the output file.
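A typical use of local declarations in the rules section is a scratch variable for an action that reads characters itself. The fragment below is a sketch of just the rules section, with a simplified brace-comment rule; the variable name c is only an example.

```lex
%%

%{
var
  c: Char;   (* scratch character used by the comment rule *)
%}

"{"     begin
          (* Skip everything up to the closing brace or end of input. *)
          repeat
            c := get_char;
          until (c = '}') or (c = #0)
        end;
```

Because the declaration is local to the scanner routine, it does not clash with identifiers in the main program placed in the auxiliary procedures section.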

When reprinting, please cite the original address: https://www.9cbs.com/read-788.html
