Regular expression syntax (translation)

xiaoxiao, 2021-03-06

This section describes the regular expression syntax used by the Boost.Regex library. It is a programmer's guide; the syntax actually in effect depends on the options with which the regular expression was compiled. (Translator's note: that is, the flag parameter of the regex class constructor.)

Literals

All characters are literals except the following:

".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\".

To use one of these characters literally, precede it with "\". A literal character matches itself, or matches the result of traits_type::translate(), where traits_type is the traits template parameter of the basic_regex class.
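
The escaping rule above can be tried out with a small sketch. It uses the standard library's std::regex with its default grammar as a stand-in for Boost.Regex, so that it compiles without Boost; the helper name `matches` and the two pattern constants are hypothetical and exist only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `text`.
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
bool matches(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    return std::regex_match(text, re);
}

// "\\." in C++ source is the two-character regex "\.", a literal dot.
const std::string literal_dot = "a\\.b";

// An unescaped dot is the wildcard and matches any single character.
const std::string wildcard_dot = "a.b";
```

With these definitions, `matches(literal_dot, "a.b")` holds while `matches(literal_dot, "axb")` does not, whereas `wildcard_dot` accepts both strings. Note that the backslash has to be doubled inside a C++ string literal.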

Wildcard

The dot character "." matches any single character. When the match_not_dot_null option is passed to the matching algorithms, the dot does not match a null character. When the match_not_dot_newline option is passed to the matching algorithms, the dot does not match a newline character.

Repeats

A repeat is an expression (translator's note: a regular expression) that is repeated an arbitrary number of times. An expression followed by "*" can be repeated any number of times, including zero. An expression followed by "+" can be repeated any number of times, but at least once. If the expression is compiled with the flag regex_constants::bk_plus_qm (a flag parameter of the regex class constructor), then "+" is a normal character (translator's note: it stands for itself) and "\+" represents one or more repeats. An expression followed by "?" may be repeated zero or one times only. If the expression is compiled with regex_constants::bk_plus_qm, then "?" is a normal character and "\?" represents zero or one repeats.

When the minimum and maximum number of repeats must be given explicitly, use the bounds operator "{}": "a{2}" is the letter "a" repeated exactly twice, "a{2,4}" is the letter "a" repeated between 2 and 4 times, and "a{2,}" is the letter "a" repeated at least twice with no upper limit. There must be no white space inside the {}, and there is no upper limit on the values of the lower and upper bounds. If the expression is compiled with the flag regex_constants::bk_braces, then "{" and "}" are normal characters and "\{" and "\}" delimit the bounds instead.

All repeat expressions refer to the shortest possible previous sub-expression: a single character, a character set, or a sub-expression grouped with "()".
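
As a minimal sketch of the repeat operators just described, the helper below filters a list of candidate strings down to those a pattern matches in full. It assumes std::regex in its default grammar rather than Boost.Regex, so the Boost-specific flags bk_plus_qm and bk_braces are not shown; the function name `accepted` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the candidates that `pattern` matches in full.
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::vector<std::string> accepted(const std::string& pattern,
                                  const std::vector<std::string>& candidates) {
    const std::regex re(pattern);
    std::vector<std::string> out;
    for (const auto& s : candidates) {
        // regex_match requires the entire string to match, not a substring.
        if (std::regex_match(s, re)) {
            out.push_back(s);
        }
    }
    return out;
}
```

For the candidates "b", "ba", "baa", "baaa", "baaaa" and "baaaaa", the pattern "ba?" keeps only "b" and "ba", and "ba{2,4}" keeps "baa", "baaa" and "baaaa". The repeat binds to the single preceding character "a", not to "ba" as a whole.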

Examples:

"ba*" matches "b", "ba", "baaa", etc.

"ba+" matches "ba" or "baaaa", for example, but not "b".

"ba?" matches "b" or "ba".

"ba{2,4}" matches "baa", "baaa" and "baaaa".

Non-Greedy Repeats

Whenever the "extended" regular expression syntax is in use (the default), non-greedy repeats are possible by appending a "?" after the repeat. A non-greedy repeat is one that matches the shortest possible string. For example, to match a pair of HTML tags you could use something like:

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"

Here $1 will contain the text between the tag pair, and will be the shortest possible matching string.
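
The difference between a greedy and a non-greedy repeat can be sketched as follows. The example assumes std::regex (default grammar) instead of Boost.Regex, and uses a simplified tag pattern rather than the full one above; `first_group` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns sub-expression 1 of the first match of
// `pattern` in `text`, or an empty string when nothing matches.
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::string first_group(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m[1].str();
    }
    return "";
}
```

Given the text "<b>one</b> and <b>two</b>", the greedy pattern "<b>(.*)</b>" captures everything up to the last closing tag, "one</b> and <b>two", while the non-greedy "<b>(.*?)</b>" stops at the first closing tag and captures "one".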

Parentheses

Parentheses serve two purposes: to group items together into a sub-expression, and to mark what generated the match. For example, the expression "(ab)*" matches all of the string "ababab". The matching algorithms regex_match and regex_search each take a match_results object that reports what caused the match; when these functions return, the match_results contains information both on what the whole expression matched and on what each sub-expression matched. In the example above, match_results[1] contains a pair of iterators denoting the final "ab" of the matching string. Sub-expressions are allowed to match null strings. If a sub-expression takes no part in a match, for example because it belongs to an alternative that is not taken, then both iterators returned for that sub-expression point to the end of the input string, and the matched member for that sub-expression is false. Sub-expressions are indexed from left to right starting from 1; sub-expression 0 is the whole expression. (Translator's note: "expression" and "sub-expression" above always refer to regular expressions.)
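
A short sketch of how marked sub-expressions are reported, assuming std::regex and std::smatch as stand-ins for the Boost.Regex regex and match_results types. The helper name `group_or_unmatched` and its placeholder return strings are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: matches `pattern` against the whole of `text` and
// describes sub-expression `index` (0 is the whole expression).
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::string group_or_unmatched(const std::string& pattern,
                               const std::string& text,
                               std::size_t index) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_match(text, m, re)) {
        return "<no match>";
    }
    // `matched` is false for a sub-expression that took no part in the match.
    return m[index].matched ? m[index].str() : "<unmatched>";
}
```

For the pattern "(ab)*" and the text "ababab", index 0 yields "ababab" and index 1 yields the final "ab". For "(a)|(b)" and the text "b", index 1 is reported as unmatched because the first alternative was not taken, and index 2 yields "b".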

Non-Marking Parentheses

Sometimes you need parentheses to group a sub-expression, but do not want them to produce a marked sub-expression (translator's note: a marked sub-expression is one that is reported in match_results). In this case the non-marking parentheses (?:expression) can be used. For example, the following expression creates no sub-expressions:

"(?:abc)*"

Forward Lookahead Asserts

There are two forms: one for positive forward lookahead asserts, and one for negative lookahead asserts:

"(?=abc)" matches zero characters, and only if they are followed by the expression "abc".

"(?!abc)" matches zero characters, and only if they are not followed by the expression "abc".

(Translator's note: an assertion consumes no input. For example, "(?=abc)abcdef" matches "abcdef": "(?=abc)" does not itself match "abc", it only checks that "abc" comes next, so "abc" must still be written after the assertion if it is to be part of the match.)
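
The zero-width nature of lookahead asserts can be checked with the sketch below, which reports where the first match starts and how long it is. It assumes std::regex (default grammar) in place of Boost.Regex; `first_pos` and `first_len` are hypothetical helpers.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: offset of the first match of `pattern` in `text`,
// or -1 when there is no match.
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::ptrdiff_t first_pos(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.position(0) : -1;
}

// Hypothetical helper: length of the first match, or -1 when there is none.
std::ptrdiff_t first_len(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.length(0) : -1;
}
```

In "qat quit" the pattern "q(?=u)" matches at offset 4 with length 1: the "u" is checked but not consumed. The pattern "(?=abc)def" finds nothing in "abcdef", because after the assertion succeeds the matcher is still positioned at "abc", not at "def".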

Independent Sub-Expressions

"(?>expression)" matches "expression" as an independent atom: the algorithm will not backtrack into it if a failure occurs later in the expression.

Alternatives

Alternatives occur when the expression can match either one sub-expression or another. Each alternative is separated by "|", or by "\|" if the flag regex_constants::bk_vbar is set, or by a newline character if the flag regex_constants::newline_alt is set. Each alternative is the largest possible previous sub-expression, which is the opposite of how the repeat operators behave.

Examples:

"a(b|c)" matches "ab" or "ac".

"abc|def" matches "abc" or "def".

Sets

A set is a collection of characters that can match any single character that is a member of the set. Sets are delimited by "[" and "]" and can contain literals, character ranges, character classes, collating elements and equivalence classes. A set declaration that starts with "^" contains the complement of the elements that follow.

Examples:

Character literals:

"[abc]" matches "a", "b" or "c".

"[^abc]" matches any character other than "a", "b" and "c".

Character ranges:

"[a-z]" matches any character in the range "a" to "z".

"[^A-Z]" matches any character outside the range "A" to "Z".

Note that character ranges are highly locale dependent if the flag regex_constants::collate is set: they match any character that collates between the endpoints of the range, and ranges behave according to ASCII rules only when the default "C" locale is in effect. For example, if the library is compiled with the Win32 localization model, then [a-z] matches the ASCII characters a-z and also 'A', 'B', etc., but not 'Z', which collates just after 'z'. This locale-specific behavior is disabled by default, in which case ranges follow the ordering of the ASCII character codes.

Character classes are written with the syntax "[:classname:]" inside a set declaration. For example, "[[:space:]]" is the set of all whitespace characters. Character classes are available only if the flag regex_constants::char_classes is set. The available character classes are:

alnum: any alphanumeric character.

alpha: any alphabetic character, a-z and A-Z. Other characters may be included depending on the locale.

blank: any blank character, either a space or a tab.

cntrl: any control character.

digit: any digit, 0-9.

graph: any graphical character.

lower: any lowercase character, a-z. Other characters may be included depending on the locale.

print: any printable character.

punct: any punctuation character.

space: any whitespace character.

upper: any uppercase character, A-Z. Other characters may be included depending on the locale.

xdigit: any hexadecimal digit character, 0-9, a-f and A-F.

word: any word character, that is, all alphanumeric characters plus the underscore.

Unicode: any character whose code is greater than 255. This applies to wide characters only.
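
The set forms described so far (literals, ranges, complements and character classes) can be compared with a small counting sketch. It assumes std::regex in its default grammar, where character classes inside sets are always available, rather than Boost.Regex with the char_classes flag; `count_in_set` is a hypothetical helper.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the characters of `text` that belong to the
// set declaration `set`, for example "[a-c]" or "[[:digit:]]".
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::size_t count_in_set(const std::string& set, const std::string& text) {
    const std::regex re(set);
    // Each match of a bare set is exactly one character long.
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the text "abcdef", both "[a-c]" and its complement "[^a-c]" count three characters. For "a1b22", the class "[[:digit:]]" counts three.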

When the flag regex_constants::escape_in_lists is set, the following shortcuts can be used in place of some character classes:

\w in place of [:word:]

\s in place of [:space:]

\d in place of [:digit:]

\l in place of [:lower:]

\u in place of [:upper:]

Collating elements take the general form [.tagname.] inside a set declaration, where tagname is either a single character or the name of a collating element. For example, [[.a.]] is equivalent to [a], and [[.comma.]] is equivalent to [,]. The library supports all the standard POSIX collating element names, and in addition the following digraphs: "ae", "ch", "ll", "ss", "nj", "dz", "lj", each in lowercase, uppercase and title-case variations. Multi-character collating elements can make a set match more than one character: for example, [[.ae.]] matches two characters, but note that [^[.ae.]] matches only one character.

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character or the name of a collating element, and match any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same (for example, strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentuation, and the tertiary to the case). If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no locale-independent method of obtaining the primary sort key for a character, except under Win32. On other operating systems the library will "guess" the primary sort key from the full sort key (obtained from strxfrm), so equivalence classes are probably best considered broken under any operating system other than Win32.

To include a literal "-" in a set declaration, make it the first character after the opening "[" or "[^", the endpoint of a range, or a collating element, or, if the flag regex_constants::escape_in_lists is set, precede it with an escape character as in "[\-]". To include a literal "[", "]" or "^" in a set, make it the endpoint of a range or a collating element, or precede it with an escape character if the flag regex_constants::escape_in_lists is set.

Line anchors

An anchor matches the null string at the start or end of a line: "^" matches the null string at the start of a line, and "$" matches the null string at the end of a line.

Back References

A back reference is a reference to a previous sub-expression that has already been matched. The reference is to the string that the sub-expression matched, not to the sub-expression itself. A back reference consists of the escape character "\" followed by a digit "1" to "9": "\1" refers to the first sub-expression, "\2" to the second, and so on. For example, the expression "(.*)\1" matches any string that is repeated about its mid-point, such as "abcabc" or "xyzxyz". A back reference to a sub-expression that did not participate in any match matches the null string. NB: this differs from some other regular expression matchers. Back references are available only if the expression is compiled with the flag regex_constants::bk_refs set.
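
The "(.*)\1" example can be sketched as a small predicate. It assumes std::regex in its default grammar, where back references need no extra flag, instead of Boost.Regex with bk_refs; the function name `is_doubled` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` is some string written twice in a
// row, such as "abcabc".
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
bool is_doubled(const std::string& text) {
    // "\\1" in C++ source is the regex back reference \1, which must match
    // the same characters that sub-expression 1 matched.
    static const std::regex re("(.*)\\1");
    return std::regex_match(text, re);
}
```

Here `is_doubled("abcabc")` and `is_doubled("xyzxyz")` hold, while `is_doubled("abcabd")` does not, since the second half must repeat the captured text exactly rather than merely match ".*" again.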

Characters by Code

This is an extension to the algorithm that is not available in other libraries. It consists of the escape character followed by the digit "0" followed by the octal character code. For example, "\023" represents the character whose octal code is 23. Where ambiguity could occur, use parentheses to break the expression up: "\0103" represents the character whose code is 103, whereas "(\010)3" represents the character 10 followed by "3". To match a character by its hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed in {}, for example \xf0 or \x{aff}. Note that the latter example is a Unicode character.

Word Operators

The following operators are provided for compatibility with the GNU regular expression library.

"\w" matches any character that is a member of the "word" character class; it is equivalent to "[[:word:]]".

"\W" matches any character that is not a member of the "word" character class; it is equivalent to "[^[:word:]]".

"\<" matches the null string at the start of a word.

"\>" matches the null string at the end of a word.

"\b" matches the null string at either the start or the end of a word.

"\B" matches a null string within a word.

The start of the sequence passed to the matching algorithms is considered to be a potential start of a word unless the flag match_not_bow is set. The end of the sequence passed to the matching algorithms is considered to be a potential end of a word unless the flag match_not_eow is set.
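
A brief sketch of the word operators, counting non-overlapping matches in a text. It assumes std::regex (default grammar), which provides "\w", "\W", "\b" and "\B" but not the GNU-style "\<" and "\>", so only the former are exercised; `count_matches` is a hypothetical helper.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: number of non-overlapping matches of `pattern`
// found in `text`.
// Assumption: std::regex (default ECMAScript grammar), not Boost.Regex.
std::size_t count_matches(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

In the text "cat concat cat's category", the pattern "\bcat\b" (written "\\bcat\\b" in C++ source) finds the standalone "cat" and the one before the apostrophe, while "\Bcat" finds only the occurrence inside "concat".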

Buffer Operators

The following operators are provided for compatibility with the GNU regular expression library and with Perl regular expressions:

"\`" matches the start of a buffer.

"\A" matches the start of the buffer.

"\'" matches the end of a buffer.

"\z" matches the end of a buffer.

"\Z" matches the end of a buffer, or possibly one or more newline characters followed by the end of the buffer.

A buffer is considered to consist of the whole sequence passed to the matching algorithms, unless the flags match_not_bob or match_not_eob are set.

Escape Operator

The escape character "\" has several meanings.

Inside a set declaration the escape character is a normal character, unless the flag regex_constants::escape_in_lists is set, in which case whatever follows the escape is a literal character regardless of its normal meaning.

The escape operator may introduce an operator, for example a back reference or a word operator.

The escape operator may make the following character normal, for example "\*" represents a literal "*" rather than the repeat operator.

Single Character Escape Sequences

The following escape sequences are aliases for single characters. Each entry gives the escape sequence, the character code in parentheses, and the meaning:

\a (0x07): bell character.

\f (0x0C): form feed.

\n (0x0A): newline character.

\r (0x0D): carriage return.

\t (0x09): tab character.

\v (0x0B): vertical tab.

\e (0x1B): ASCII escape character.

\0dd (0dd): an octal character code, where dd is one or more octal digits.

\xXX (0xXX): a hexadecimal character code, where XX is one or more hexadecimal digits.

\x{XX} (0xXX): a hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character.

\cZ (Z-@): an ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'.

Miscellaneous escape sequences:

The following are provided mostly for Perl compatibility, but note that the meanings of \l, \L, \u and \U differ from Perl.

\w is equivalent to [[:word:]].

\W is equivalent to [^[:word:]].

\s is equivalent to [[:space:]].

\S is equivalent to [^[:space:]].

\d is equivalent to [[:digit:]].

\D is equivalent to [^[:digit:]].

\l is equivalent to [[:lower:]].

\L is equivalent to [^[:lower:]].

\u is equivalent to [[:upper:]].

\U is equivalent to [^[:upper:]].

\C matches any single character, equivalent to '.'.

\X matches any Unicode combining character sequence, for example "a\x 0301" (a letter a with an acute accent).

\Q begins the quote operator: everything that follows is treated as a literal character until a \E end quote operator is found.

\E ends the quote operator, terminating a sequence begun with \Q.

What gets matched?

When the expression is compiled as a POSIX-compatible regex, the matching algorithms match the first possible matching string. If more than one string starting at a given location can match, the longest possible string is matched, unless the flag match_any is set, in which case the first match encountered is returned. Using the match_any option can reduce the time taken to find the match, but it is only useful if the user is less concerned about what matched; for example, it would not be suitable for search and replace operations. Where there are multiple possible matches all starting at the same location and all of the same length, the match chosen is the one with the longest first sub-expression; if that is the same for two or more matches, the second sub-expression is examined, and so on.

The following examples illustrate the main differences between Perl and POSIX regular expression matching rules:

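Expression: a|ab
Text: xaby
POSIX match: "ab"
ECMAScript depth-first search match: "a"

Expression: .*([[:alnum:]]+).*
Text: " abc def xyz "
POSIX match: $0 = " abc def xyz ", $1 = "abc"
ECMAScript depth-first search match: $0 = " abc def xyz ", $1 = "z"

Expression: .*(a|xayy)
Text: zzxayyzz
POSIX match: "zzxayy"
ECMAScript depth-first search match: "zzxa"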

These differences between Perl matching rules and POSIX matching rules mean that the two regular expression syntaxes differ not only in the features offered, but also in the form that the state machine takes and/or the algorithms used to traverse the state machine.
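
The first of the examples above can be reproduced in a rough sketch by compiling the same pattern under two grammars. This assumes std::regex rather than Boost.Regex, and further assumes that the standard library in use applies the POSIX leftmost-longest rule to alternation under std::regex::extended; implementations vary in how completely they follow that rule for more complex expressions, so only the "a|ab" case is shown. The helper name `first_match` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `pattern` in
// `text` under the given grammar, or an empty string when nothing matches.
// Assumption: std::regex stands in for Boost.Regex; std::regex::extended is
// taken to follow POSIX leftmost-longest matching for alternation.
std::string first_match(const std::string& pattern,
                        const std::string& text,
                        std::regex::flag_type grammar) {
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m[0].str() : "";
}
```

Under that assumption, `first_match("a|ab", "xaby", std::regex::ECMAScript)` yields "a", because the depth-first search reports the first alternative that succeeds, while `first_match("a|ab", "xaby", std::regex::extended)` yields the longer "ab".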

