C++ Lexical Analysis

xiaoxiao2021-03-06  99

Lexical analysis is the process of breaking the source text into lexical tokens, and it is usually the first step of the whole compilation process. It is commonly held that most programmers, who will never write a C++ compiler, have no need to understand C++ lexical analysis. This article is not meant only to satisfy curiosity, though: I believe any rigorous C++ programmer should know how C++ is lexed, so as to avoid errors that otherwise turn up unexpectedly. (We spend years getting familiar with C++ grammar, so why not spend a day getting casually familiar with its lexical rules? ;)) The article is accompanied by a C++ lexical analyzer I wrote two years ago and by the source code of a C++ source file colorizer built on top of it.

I. The lexical analysis process

The C++ language divides lexical analysis into six steps that run in a predefined order: source character mapping, line splicing, preprocessing tokens and whitespace division, preprocessing, target character mapping, and adjacent string concatenation. The order matters, because the same source file processed in a different order may give a different result. Two simple examples follow.

Take this line of code:

    "abc\xa" "bcd"

If adjacent strings were concatenated before target character mapping, the result would be "abc\xabcd", in which the hexadecimal escape absorbs the hex digits that follow it and \xabcd is read as a single character. Because target character mapping is done first, \xa is interpreted as one character and b is simply the next character.

Or take this code:

    #define pora "poly\
    random"

Because line splicing is done first, the code becomes #define pora "polyrandom", which is a legal preprocessing directive. If preprocessing were done first and line splicing afterwards, the code would be an error.

These two examples show why the steps must be applied in a strict order. The work done in each step is explained below.

1. Source character mapping

1.1 Basic source character set

The basic source character set contains all uppercase and lowercase English letters, all digits, the space, the vertical and horizontal tabs, the newline, the form feed, and

    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

for a total of 96 characters.

1.2 Trigraphs

A trigraph is a sequence of three consecutive characters beginning with ??. The trigraphs are ??=, ??/, ??', ??(, ??), ??!, ??<, ??> and ??-, and they are replaced directly by the corresponding characters #, \, ^, [, ], |, {, } and ~. Trigraphs were introduced to make these characters easier to type, because some early keyboards did not support them.

1.3 Universal character names

Character sequences of the form \uXXXX and \UXXXXXXXX are mappings into the ISO/IEC 10646 character set. They are used to denote characters that are not in the basic source character set.
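
The trigraph table in 1.2 can be sketched as a small replacement pass over the source text. This is only a minimal illustration, not the analyzer that accompanies the article; the names trigraph_target and replace_trigraphs are hypothetical. Trigraphs were removed from the language in C++17, so a recent compiler may not perform this replacement on its own.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given the third character of a candidate trigraph,
// returns the character the trigraph stands for, or '\0' if there is none.
char trigraph_target(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Replaces every trigraph in src, scanning left to right. The replacement is
// global: it does not care whether it is inside a string literal or a comment.
std::string replace_trigraphs(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    std::size_t i = 0;
    while (i < src.size()) {
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            const char target = trigraph_target(src[i + 2]);
            if (target != '\0') {
                out += target;
                i += 3;  // consume the whole trigraph
                continue;
            }
        }
        out += src[i];
        ++i;
    }
    return out;
}
```

For example, replace_trigraphs("\?\?=define ABC") yields "#define ABC". The argument is written with \? escapes so that the calling code itself contains no trigraph.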

1.4 Mapping process

In this step, the characters of the physical file are mapped to the basic source characters of C++. For example, if the physical file uses some special end-of-line marker, this step maps it to the newline character of the basic source character set. This step also replaces trigraphs, and any character that is not in the basic source character set is mapped to a universal character name.

1.5 Some precautions

First, trigraph replacement happens in the very first step and is global (even a trigraph inside a string literal is replaced), so the trigraph in the following statement is replaced:

    char s[] = "??-";

becomes

    char s[] = "~";

To write a string that really contains ??-, use "\?\?-" or "?" "?-".

Second, trigraphs are replaced before comments are recognized: a character sequence similar to ?*/**/ is first replaced with **/, instead of its latter part being taken as a comment.

Third, a line such as ??=define ABC is a legal preprocessing directive.

In addition, you may use universal character names in your own programs, but not for characters that are in the basic source character set, nor for characters whose value is less than 0x20 or lies between 0x7f and 0x9f. Doing so is an error.

2. Line splicing

This step is very simple: every newline is checked, and if it is immediately preceded by a backslash (\), the backslash and the newline are both removed. Points to note:

First, if splicing two lines together produces a universal character name, the result is undefined.

Second, if a non-empty source file does not end with a newline, or ends with a backslash followed by a newline, the behavior of the program is undefined. (Most compilers give a warning. I have seen this question on a lot of BBSes.)

Third, if a single-line // comment ends with a backslash, the next line is treated as a continuation of the comment.

3. Preprocessing tokens and whitespace division

In this step each comment in the program is replaced with one whitespace character, and runs of consecutive whitespace characters may be replaced with a single whitespace character. The C++ standard specifies that no file may end in the middle of a preprocessing token or a comment. So if you have two source files:

    polyrandom.h:

    /*

    polyrandom.cpp:

    #include "polyrandom.h"
    */

this code is illegal. In fact it usually also goes against what the programmer intended, because comments are processed before preprocessing directives are executed. Also, because a comment is replaced with whitespace,

    unsigned int/**/i;

defines an unsigned integer variable named i, not an unsigned variable named inti.

4. Preprocessing

This step performs macro expansion and file inclusion. Every included file goes through steps one to four in turn. As before, if this step produces a universal character name, the result is undefined.

5. Target character mapping

In this step the source characters and escape sequences inside character and string literals are converted to characters of the target (execution) character set.

6. Adjacent string concatenation

All adjacent string literals are concatenated, including wide string literals.

II. Lexical elements of C++

The following statement contains all the lexical elements of C++:

    int ratio = 0.5; // the convert ratio

They are: keywords, whitespace, identifiers, operators, constants, separators and comments.

The standard C++ keywords are:

    asm auto bool break case catch char class const const_cast continue
    default delete do double dynamic_cast else enum explicit export extern
    false float for friend goto if inline int long mutable namespace new
    operator private protected public register reinterpret_cast return short
    signed sizeof static static_cast struct switch template this throw true try
    typedef typeid typename union unsigned using virtual void volatile wchar_t
    while

We also treat and, and_eq, bitand, bitor, compl, not, not_eq, or, or_eq, xor and xor_eq as reserved words.

Any discussion of C++ operators has to mention the so-called alternative tokens. Each alternative token means exactly the same as the ordinary operator it stands for, but it is not replaced by that operator in the earlier steps. They are:

    Alternative   Primary
    <%            {
    %>            }
    <:            [
    :>            ]
    %:            #
    %:%:          ##
    and           &&
    bitor         |
    or            ||
    xor           ^
    compl         ~
    bitand        &
    and_eq        &=
    or_eq         |=
    xor_eq        ^=
    not           !
    not_eq        !=

One more thing worth mentioning: in C++, besides character constants of the form 'c', you can also write 'ab' or 'abcd', which represent an integer of the corresponding length (the exact value is implementation-defined).

III. Cppdyer source code

The source code includes a complete lexical analyzer as described above (at least I think so :)), although of course preprocessing is not implemented. It was written fairly early on, so it does not use a regular expression library. Instead it contains a very crude hand-written regular expression analyzer.
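
The Cppdyer code itself is not reproduced here. As an independent illustration of steps 2 and 3 from section I, the following is a minimal sketch of line splicing and of replacing comments with whitespace. The function names splice_lines and strip_comments are hypothetical and are not taken from Cppdyer. The sketch assumes ordinary string and character literals only (raw string literals are not handled) and does not collapse runs of whitespace.

```cpp
#include <cstddef>
#include <string>

// Step 2 (sketch): remove every backslash that is immediately followed by a
// newline, together with that newline.
std::string splice_lines(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well as the backslash
            continue;
        }
        out += src[i];
    }
    return out;
}

// Step 3 (sketch, comments only): replace each comment with a single space.
// String and character literals are copied through unchanged, so text that
// merely looks like a comment inside a literal is preserved.
std::string strip_comments(const std::string& src) {
    std::string out;
    const std::size_t n = src.size();
    std::size_t i = 0;
    while (i < n) {
        const char c = src[i];
        if (c == '"' || c == '\'') {
            const char quote = c;
            out += src[i++];                    // opening quote
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) {
                    out += src[i++];            // keep the backslash of an escape
                }
                out += src[i++];
            }
            if (i < n) {
                out += src[i++];                // closing quote
            }
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            while (i < n && src[i] != '\n') {
                ++i;                            // the newline itself is kept
            }
            out += ' ';
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            i += 2;
            while (i + 1 < n && !(src[i] == '*' && src[i + 1] == '/')) {
                ++i;
            }
            i = (i + 1 < n) ? i + 2 : n;        // unterminated: consume the rest
            out += ' ';
        } else {
            out += src[i++];
        }
    }
    return out;
}
```

Running splice_lines first and strip_comments second mirrors the required order: a // comment that ends with a backslash then swallows the following line, as noted under step 2.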

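Several of the rules from sections I and II can also be checked directly against a compiler. The sketch below assumes a C++11 or later compiler that treats the alternative tokens as built-in (some compilers need a conformance option for and, not and the like). The names joined, table, pick and spaced are made up for the illustration.

```cpp
// Steps 5 and 6: escapes are mapped before adjacent literals are joined, so
// "abc\xa" "bcd" is a, b, c, 0x0a, b, c, d plus the terminating null.
constexpr char joined[] = "abc\xa" "bcd";
static_assert(sizeof(joined) == 8, "seven characters plus the terminator");
static_assert(joined[3] == '\x0a', "the escape stops at the end of its literal");
static_assert(joined[4] == 'b', "b is an ordinary following character");

// Step 3: a comment counts as whitespace, so this declares a variable named
// spaced of type unsigned int.
constexpr unsigned int/**/spaced = 1;
static_assert(spaced == 1, "comment acted as a separator");

// Alternative tokens: <: :> <% %> stand for [ ] { }, and the word forms
// stand for the usual operators.
constexpr int table<:3:> = <% 10, 20, 30 %>;
constexpr int pick(const int (&v)<:3:>, int i) <% return v<:i:>; %>
static_assert(pick(table, 1) == 20, "digraph brackets index the array");
static_assert((true and not false) == true, "and / not");
static_assert((6 bitand 3) == 2 and (6 xor 3) == 5, "bitand / xor");
```

If the file compiles, every static_assert held; nothing needs to be run.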
When reposting, please credit the original address: https://www.9cbs.com/read-96616.html
