Lexical analysis is the process of decomposing a source file into lexical tokens, and it is usually the first step of the whole compilation process. It is generally accepted that most people who know C++ will never need to write a C++ compiler, but this article is not meant only to satisfy curiosity: I believe every rigorous C++ programmer should understand C++ lexical analysis well enough to avoid errors that otherwise happen by accident (we spend years getting familiar with C++ syntax, so why not spend a day getting casually familiar with C++ lexical rules? ;)). The article ends with a C++ lexical analyzer written two years ago and the source code of a C++ source file colorizer built on top of it.
C++ lexical analysis
I. The lexical analysis process
According to the standard, C++ lexical analysis is divided into six steps that are performed in order: source character mapping, line splicing, decomposition into preprocessing tokens and whitespace, preprocessing, target character mapping, and adjacent string concatenation. The order matters, because the same source file can produce different results if the steps run in a different order. Two simple examples: suppose there is a line of code containing "abc\xa" "bcd". If we concatenated the adjacent strings first and did the target character mapping afterwards, the result would be "abc\xabcd", where \xabcd is interpreted as a single character; if we do the target character mapping first, \xa is interpreted as one character and b is the character that follows it. Or suppose one line ends with #define PORA "poly\ and the next line reads random". If we splice the lines first, the code becomes #define PORA "polyrandom", which is a legal preprocessing directive; but if we preprocessed first and spliced lines afterwards, this code would be an error. These two examples show why the order must be followed strictly. The work done in each step is explained below:
1. Source character mapping
1.1 Basic source character set
The basic source character set contains all upper-case and lower-case English letters, all digits, the space, horizontal and vertical tab, newline, form feed, and _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ' for a total of 96 characters.
1.2 Trigraphs
A trigraph is one of nine three-character sequences that start with ??: ??=, ??/, ??', ??(, ??), ??!, ??<, ??> and ??-. They are replaced with the characters #, \, ^, [, ], |, {, } and ~ respectively. Trigraphs were introduced to make these characters easy to enter, because some keyboards do not support them.
1.3 Universal character names
A universal character name, written \uXXXX or \UXXXXXXXX with hexadecimal digits, is a mapping onto the ISO/IEC 10646 character set and is used to denote a character from that set.
1.4 The mapping process
In this step, the characters in the physical file are mapped to the basic source character set of C++. For example, if the physical file uses a special end-of-line marker, this step maps it to the newline character of the basic source character set. This step also replaces trigraphs. A character that is not in the basic source character set is mapped to a universal character name.
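The trigraph part of this step can be sketched as a single left-to-right pass. This is a minimal illustration, not the code of any real compiler: the names trigraph_target and replace_trigraphs are made up for this example, and the mapping of end-of-line markers and extended characters is left out. The question marks in the self-check are written as ?\? so that the example itself is not altered by a compiler that still replaces trigraphs.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: given the third character of a candidate trigraph,
// return the replacement character, or 0 if it is not one of the nine.
char trigraph_target(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Replace every trigraph in src. The scan does not know about string
// literals or comments, which is exactly why the replacement is global.
std::string replace_trigraphs(const std::string& src) {
    std::string out;
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            char t = trigraph_target(src[i + 2]);
            if (t != 0) {
                out += t;
                i += 2;  // skip the remaining two characters of the trigraph
                continue;
            }
        }
        out += src[i];
    }
    return out;
}

// Small self-check: a trigraph at the start of a line becomes '#'.
void trigraph_example() {
    assert(replace_trigraphs("?\?=define ABC") == "#define ABC");
}
```

Because the pass runs before any token is recognized, it has no way to treat the contents of a string literal differently from the rest of the file.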
1.5 Some precautions
First, trigraph replacement happens in the very first step and is global (even a trigraph inside a string literal is replaced), so the trigraphs in the following two statements are replaced:
char s[] = "??-"; char s[] = "??
So if you want to express the three characters ??- literally, you can write "?\?-" or "?" "?-".
Second, a character sequence such as ?* / ** / is replaced with ** /, and the latter part is not treated as a comment.
Third, a line such as ??=define ABC is a legal preprocessing directive, because ??= has already been replaced with # by the time directives are recognized.
In addition, you can use universal character names in your own programs, but you cannot use one for a character that is already in the basic source character set, or for a value that is less than 0x20 or in the range 0x7F to 0x9F. Doing so is an error.
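The restriction on universal character names can be written down as a small predicate. This is a sketch under two assumptions: the function name ucn_allowed is invented for this example, and the basic source character set is taken to be encoded as ASCII, in which case the only printable characters outside it are $, @ and `. It does not check whether the value is a valid ISO/IEC 10646 code point at all.

```cpp
#include <cassert>

// Hypothetical check: may the code point cp be written as a universal
// character name? Assumes an ASCII-compatible basic source character set.
bool ucn_allowed(unsigned long cp) {
    if (cp < 0x20) return false;                 // control characters
    if (cp >= 0x7F && cp <= 0x9F) return false;  // 0x7F to 0x9F inclusive
    if (cp < 0x7F) {
        // Printable ASCII: everything except $, @ and ` is a basic
        // source character and therefore not allowed.
        return cp == 0x24 || cp == 0x40 || cp == 0x60;
    }
    return true;
}

// Small self-check: 'A' is a basic source character, U+4E2D is not.
void ucn_example() {
    assert(!ucn_allowed(0x41));
    assert(ucn_allowed(0x4E2D));
}
```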
2. Line splicing
This step is very simple: every newline is checked, and if it is immediately preceded by a backslash, the backslash and the newline are both removed. Two things need attention in this step. If a source file ends with such a line continuation, the result is undefined (most compilers give a warning; I have seen this question on many BBS boards). And if a single-line // comment ends with a line continuation, the next line is treated as a continuation of the comment.
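Line splicing is small enough to sketch in full. The function name splice_lines is made up for this example, and the sketch assumes that end-of-line markers have already been mapped to '\n' by the first step.

```cpp
#include <cassert>
#include <string>

// Remove every backslash that is immediately followed by a newline,
// together with that newline. Other backslashes are left alone.
std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well
            continue;
        }
        out += src[i];
    }
    return out;
}

// Small self-check: a // comment that ends in a continuation swallows
// the following line.
void splice_example() {
    assert(splice_lines("// note \\\nint x;\n") == "// note int x;\n");
}
```

After splicing, the declaration int x; in the self-check has become part of the comment, which is the trap described above.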
3. Preprocessing tokens and whitespace
In this step the file is decomposed into preprocessing tokens and whitespace. Each comment in the program is replaced with one space character, and a run of consecutive whitespace characters may be replaced with a single space. The C++ standard specifies that no file may end in the middle of a preprocessing token or a comment. So, if you have two source files:
polyrandom.h:
/*
polyrandom.cpp:
#include "polyrandom.h"
*/
Such code is illegal. In fact, it usually also goes against the programmer's intent, because comments are processed before preprocessing directives are executed.
In addition, because a comment is replaced with a space,
unsigned int/**/i;
defines an unsigned integer variable named i, not an unsigned variable named inti.
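The replacement of comments by a space can be sketched as follows. This is a simplified illustration with an invented name, strip_comments: it assumes the first two steps have already run, it copies string and character literals verbatim so that comment markers inside them are ignored, and it does not diagnose an unterminated comment, which a real compiler must reject.

```cpp
#include <cassert>
#include <string>

// Replace each comment with exactly one space.
std::string strip_comments(const std::string& src) {
    std::string out;
    std::string::size_type i = 0, n = src.size();
    while (i < n) {
        char c = src[i];
        if (c == '"' || c == '\'') {
            // Copy a literal verbatim up to its closing quote.
            char quote = c;
            out += src[i++];
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) out += src[i++];  // keep escape pairs together
                out += src[i++];
            }
            if (i < n) out += src[i++];  // closing quote
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            // Line comment: runs up to, but not including, the newline.
            while (i < n && src[i] != '\n') ++i;
            out += ' ';
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            // Block comment: ends at the first closing marker after the opener.
            std::string::size_type end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
            out += ' ';
        } else {
            out += src[i++];
        }
    }
    return out;
}

// Small self-check: the empty comment separates int from i.
void comment_example() {
    assert(strip_comments("unsigned int/**/i;") == "unsigned int i;");
}
```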
4. Preprocessing
This step performs macro expansion and file inclusion. Every included file goes through steps one to four in turn. Likewise, if this step ends up producing a universal character name, for example through token concatenation, the result is undefined.
5. Target character mapping
6. Adjacent string concatenation
All adjacent string literals are concatenated, including wide string literals.
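The relative order of steps 5 and 6 can be observed from inside a program. The sketch below is only an illustration, with made-up names (joined, escape_stays_separate): it relies on escape sequences being converted before the literals are joined, so the hexadecimal escape cannot absorb the letters of the second literal.

```cpp
#include <cassert>
#include <cstring>

// Two adjacent literals. \xa is converted to one character in step 5,
// and only then is "bcd" appended in step 6.
const char joined[] = "abc\xa" "bcd";

// True if the array holds a, b, c, the character with value 0x0a,
// then b, c, d and the terminating zero: eight elements in all.
bool escape_stays_separate() {
    return sizeof(joined) == 8
        && joined[3] == '\x0a'
        && std::strcmp(joined + 4, "bcd") == 0;
}

// Small self-check.
void concat_example() {
    assert(escape_stays_separate());
}
```

Splitting a literal in two like this is also the usual way to end a hexadecimal escape that is followed by a character that happens to be a hex digit.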
II. Lexical elements in C++
All of the lexical elements of C++ appear in the following statement:
int ratio = 0.5; // The Convert Ratio
They are: keywords, blank, identifiers, operators, constants, separators, and comments.
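To make the seven categories concrete, here is a deliberately tiny classifier that is just large enough to split the sample statement. It is a sketch, not the analyzer this article is about: the names Kind, Token and tokenize are invented here, it knows only the keyword int, decimal numbers, the separator ; and // comments, and it treats every other character as a one-character operator.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum Kind { Keyword, Whitespace, Identifier, Operator, Constant, Separator, Comment };

struct Token {
    Kind kind;
    std::string text;
};

// Split s into tokens, classifying each by its first character.
std::vector<Token> tokenize(const std::string& s) {
    std::vector<Token> out;
    std::string::size_type i = 0, n = s.size();
    while (i < n) {
        std::string::size_type start = i;
        unsigned char c = static_cast<unsigned char>(s[i]);
        Kind kind;
        if (std::isspace(c)) {
            while (i < n && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
            kind = Whitespace;
        } else if (std::isalpha(c) || c == '_') {
            while (i < n && (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_')) ++i;
            // Only one keyword is known to this sketch.
            kind = (s.substr(start, i - start) == "int") ? Keyword : Identifier;
        } else if (std::isdigit(c)) {
            while (i < n && (std::isdigit(static_cast<unsigned char>(s[i])) || s[i] == '.')) ++i;
            kind = Constant;
        } else if (c == '/' && i + 1 < n && s[i + 1] == '/') {
            while (i < n && s[i] != '\n') ++i;
            kind = Comment;
        } else if (c == ';') {
            ++i;
            kind = Separator;
        } else {
            ++i;
            kind = Operator;
        }
        Token t;
        t.kind = kind;
        t.text = s.substr(start, i - start);
        out.push_back(t);
    }
    return out;
}

// Small self-check: the sample statement yields ten tokens.
void tokenize_example() {
    assert(tokenize("int ratio = 0.5; // The Convert Ratio").size() == 10);
}
```

The ten tokens are int, ratio, =, 0.5, ; and the comment, plus the four runs of whitespace between them.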
The keywords in standard C++ include:
asm auto bool break case catch char class const const_cast continue default delete do double dynamic_cast else enum explicit export extern false float for friend goto if inline int long mutable namespace new operator private protected public register reinterpret_cast return short signed sizeof static static_cast struct switch template this throw true try typedef typeid typename union unsigned using virtual void volatile wchar_t while
We also treat and, and_eq, bitand, bitor, compl, not, not_eq, or, or_eq, xor, and xor_eq as reserved words.
Any discussion of C++ operators has to mention the so-called alternative tokens. Each alternative token has the same meaning as the ordinary operator or punctuator it stands for, but unlike trigraphs, alternative tokens are not replaced with their counterparts in the earlier steps. They are:
<%  {        %>  }        <:  [        :>  ]
%:  #        %:%:  ##
and  &&      or  ||       not  !       not_eq  !=
bitand  &    bitor  |     xor  ^       compl  ~
and_eq  &=   or_eq  |=    xor_eq  ^=
It is also worth mentioning that in C++, besides character constants of the form 'c', you can also write 'ab' or 'abcd'. Such a multi-character constant denotes an integer (its type is int and its value is implementation-defined).
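The alternative tokens listed above can be tried out directly. The sketch below uses invented function names (count_positive, clear_bits) and assumes a conforming compiler; some compilers accept the word-like tokens only in a standards-conforming mode or after including a compatibility header.

```cpp
#include <cassert>

// <% %> stand for braces, <: :> for brackets, and the word forms
// replace !=, && and ! without any preprocessing step.
int count_positive(const int* values, int n) <%
    int count = 0;
    for (int i = 0; i < n; ++i) <%
        if (values<:i:> not_eq 0 and not (values<:i:> < 0)) <%
            ++count;
        %>
    %>
    return count;
%>

// bitand and compl are the alternative spellings of & and ~.
int clear_bits(int x, int mask) <%
    return x bitand compl mask;
%>

// Small self-check, written with the ordinary spellings.
void alternative_token_example() {
    int v[4] = {3, 0, -2, 7};
    assert(count_positive(v, 4) == 2);
}
```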
III. Cppdyer source code