Unveiling the mystery of regular expressions

xiaoxiao2021-04-03  207


[Original article; when reproducing it, please keep or cite this source: http://www.regexlab.com/en/regref.htm]

Introduction

A regular expression describes a pattern for matching strings. It can be used to: (1) check whether a string contains a substring that fits a certain rule, and retrieve that substring; (2) flexibly replace parts of a string according to matching rules.

Learning regular expressions is actually quite simple; only a few of the concepts are at all abstract. Many people find regular expressions complicated for two reasons. On the one hand, most documentation does not explain things step by step from simple to advanced and pays no attention to the order in which concepts are introduced, which makes it hard for readers to follow. On the other hand, the documentation shipped with each engine generally concentrates on that engine's unique features, and those features are not what we need to understand first.

Every example in the article can be clicked to open the test page and try it out. Enough preamble; let's begin.

1. Regular expression rules

1.1 Ordinary characters

Letters, digits, Chinese characters, underscores, and any punctuation that is not given a special meaning in the later sections are all "ordinary characters". An ordinary character in an expression matches the same character in the string being matched.

Example 1: The expression "c", when matching the string "abcde": the result is success; the matched content is "c"; the matched position starts at 2 and ends at 3. (Note: whether subscripts start from 0 or from 1 may differ depending on the programming language in use.)

Example 2: The expression "bcd", when matching the string "abcde": the result is success; the matched content is "bcd"; the matched position starts at 1 and ends at 4.
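
As a rough illustration, the two examples above can be checked with Python's built-in re module. Python is our choice for these sketches, not something the article prescribes; it counts positions from 0, which agrees with the positions quoted above.

```python
import re

# A literal pattern matches the same characters in the text.
m1 = re.search("c", "abcde")
print(m1.group(), m1.start(), m1.end())   # c 2 3

m2 = re.search("bcd", "abcde")
print(m2.group(), m2.start(), m2.end())   # bcd 1 4
```

Here start() is the index of the first matched character and end() is the index just past the last one.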

1.2 Simple escape characters

Some characters that are inconvenient to write are expressed by putting a "\" in front of them. We are in fact already familiar with these characters.

Expression    Matches
\r, \n        Carriage return and newline
\t            Tab
\\            The "\" character itself

Other punctuation marks that have a special use in the later sections represent the symbol itself once a "\" is put in front of them. For example, "^" and "$" have special meanings. To match the "^" and "$" characters in a string, the expression must be written as "\^" and "\$".

Expression    Matches
\^            The "^" symbol itself
\$            The "$" symbol itself
\.            The dot (.) itself

These escape sequences match in the same way as ordinary characters: each one matches the single identical character.

Example 1: The expression "\$d", when matching the string "abc$de": the result is success; the matched content is "$d"; the matched position starts at 3 and ends at 5.
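
A small Python sketch of the escape example (Python's re module is assumed here purely for illustration). The pattern is written as a raw string so that the backslash reaches the regex engine unchanged.

```python
import re

# "\$" matches a literal dollar sign.
escaped = re.search(r"\$d", "abc$de")
print(escaped.group(), escaped.span())   # $d (3, 5)

# Without the backslash, "$" means "end of string", so "$d" cannot match here.
unescaped = re.search("$d", "abc$de")
print(unescaped)   # None
```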

1.3 Expressions that can match 'multiple characters'

Some notations in regular expressions can match any one of 'multiple characters'. For example, the expression "\d" can match any digit. Although it can match any one of those characters, it matches only one, not several. This is like the joker in a card game: it can stand in for any card, but only for one card at a time.

Expression    Matches
\d            Any one digit, 0~9
\w            Any one letter, digit, or underscore, that is, A~Z, a~z, 0~9, _
\s            Any one whitespace character, including space, tab, and form feed
.             The dot matches any one character except the newline (\n)

Example 1: The expression "\d\d", when matching "abc123": the result is success; the matched content is "12"; the matched position starts at 3 and ends at 5.

Example 2: The expression "a.\d", when matching "aaa100": the result is success; the matched content is "aa1"; the matched position starts at 1 and ends at 4.
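
The same two examples, plus a quick look at "\s" and ".", sketched in Python (an assumed engine for illustration; the short test strings in the second half are ours).

```python
import re

two_digits = re.search(r"\d\d", "abc123")
print(two_digits.group(), two_digits.span())   # 12 (3, 5)

a_any_digit = re.search(r"a.\d", "aaa100")
print(a_any_digit.group(), a_any_digit.span())   # aa1 (1, 4)

# "\s" matches one whitespace character at a time.
blanks = re.findall(r"\s", "a b\tc")
print(blanks)   # [' ', '\t']

# By default "." does not match a newline.
dot_newline = re.search(r"a.b", "a\nb")
print(dot_newline)   # None
```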

1.4 Custom expressions that can match 'multiple characters'

Enclosing a series of characters in square brackets [ ] matches any one of those characters. Enclosing a series of characters in [^ ] matches any one character other than those listed. As before, although such an expression can match any one of several characters, it matches only one, not several.

Expression    Matches
[ab5@]        "a" or "b" or "5" or "@"
[^abc]        Any character other than "a", "b", "c"
[f-k]         Any one character from "f" to "k"
[^A-F0-3]     Any character other than "A" to "F" and "0" to "3"

Example 1: The expression "[bcd][bcd]", when matching "abc123": the result is success; the matched content is "bc"; the matched position starts at 1 and ends at 3.

Example 2: The expression "[^abc]", when matching "abc123": the result is success; the matched content is "1"; the matched position starts at 3 and ends at 4.
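
A brief Python sketch of bracket expressions (Python's re module is assumed for illustration; the strings used for the range checks are ours).

```python
import re

pair = re.search(r"[bcd][bcd]", "abc123")
print(pair.group(), pair.span())   # bc (1, 3)

not_abc = re.search(r"[^abc]", "abc123")
print(not_abc.group(), not_abc.span())   # 1 (3, 4)

# A range such as f-k covers every character between its endpoints.
in_range = re.findall(r"[f-k]", "efghijkl")
print(in_range)   # ['f', 'g', 'h', 'i', 'j', 'k']

# A negated set with two ranges.
outside = re.findall(r"[^A-F0-3]", "AG04")
print(outside)   # ['G', '4']
```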

1.5 Special symbols that modify the number of matches

The expressions described so far, whether they match one specific character or any one of several characters, match only once. By adding a special symbol that modifies the number of matches, an expression can be matched repeatedly without being written out repeatedly.

The usage is: the "number of matches" modifier is placed after the expression it modifies. For example, "[bcd][bcd]" can be written as "[bcd]{2}".

Expression    Effect
{n}           The expression repeats n times. For example, "\w{2}" is equivalent to "\w\w"; "a{5}" is equivalent to "aaaaa"
{m,n}         The expression repeats at least m times and at most n times. For example, "ba{1,3}" can match "ba" or "baa" or "baaa"
{m,}          The expression repeats at least m times. For example, "\w\d{2,}" can match "a12", "_456", "M12344" ...
?             The expression matches 0 times or 1 time, equivalent to {0,1}. For example, "a[cd]?" can match "a", "ac", "ad"
+             The expression appears at least once, equivalent to {1,}. For example, "a+b" can match "ab", "aab", "aaab" ...
*             The expression does not appear or appears any number of times, equivalent to {0,}. For example, "\^*b" can match "b", "^^^b" ...

Example 1: The expression "\d+\.?\d*", when matching "It costs $12.5": the result is success; the matched content is "12.5"; the matched position starts at 10 and ends at 14.

Example 2: The expression "go{2,8}gle", when matching "Ads by goooooogle": the result is success; the matched content is "goooooogle"; the matched position starts at 7 and ends at 17.
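
The quantifiers above can be tried out with a short Python sketch (Python's re module is an assumption made for illustration). The string for the second example is built in code so that the number of "o" characters, six, is explicit.

```python
import re

# "\d+\.?\d*": one or more digits, an optional dot, then zero or more digits.
price = re.search(r"\d+\.?\d*", "It costs $12.5")
print(price.group(), price.span())   # 12.5 (10, 14)

# "o{2,8}" accepts between 2 and 8 consecutive "o" characters.
ad_text = "Ads by g" + "o" * 6 + "gle"
brand = re.search(r"go{2,8}gle", ad_text)
print(brand.span())   # (7, 17)

# fullmatch() succeeds only if the whole string fits the pattern.
candidates = ["b", "ba", "baa", "baaa", "baaaa"]
ba_ok = [s for s in candidates if re.fullmatch(r"ba{1,3}", s)]
print(ba_ok)   # ['ba', 'baa', 'baaa']

# "[bcd]{2}" behaves like "[bcd][bcd]".
same = re.search(r"[bcd]{2}", "abc123").span() == re.search(r"[bcd][bcd]", "abc123").span()
print(same)   # True
```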

1.6 Other special symbols with abstract meaning

Some symbols in an expression carry a special, abstract meaning:

Expression    Effect
^             Matches the start of the string; matches no characters
$             Matches the end of the string; matches no characters
\b            Matches a word boundary, that is, the position between a word and a space; matches no characters

A description in words is still rather abstract, so here are some examples to help.

Example 1: When the expression "^aaa" is matched against "xxx aaa xxx", the match fails. Because "^" must match the start of the string, "^aaa" can match only when "aaa" is at the start of the string, as in "aaa xxx xxx".

Example 2: When the expression "aaa$" is matched against "xxx aaa xxx", the match fails. Because "$" must match the end of the string, "aaa$" can match only when "aaa" is at the end of the string, as in "xxx xxx aaa".

Example 3: The expression ".\b.", when matching "@@@abc": the result is success; the matched content is "@a"; the matched position starts at 2 and ends at 4.
Further explanation: "\b" is similar to "^" and "$" in that it matches no characters itself. It requires that, at its position in the match, one side be within the "\w" range and the other side be outside the "\w" range.

Example 4: The expression "\bend\b", when matching "weekend,endfor,end": the result is success; the matched content is "end"; the matched position starts at 15 and ends at 18.

Some symbols affect the relationship between sub-expressions within an expression:

Expression    Effect
|             "Or" relationship between the expressions on its left and right: matches the left side or the right side
( )           (1) When the number of matches is modified, the expression in parentheses is modified as a whole
              (2) When the match result is retrieved, the content matched by the expression in parentheses can be obtained separately
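
For readers who want to try the symbols of this section, here is a small Python sketch (Python's re module is our assumption, not the article's). It exercises "^", "$" and "\b" together with "|" and "( )", using the strings from this section's examples.

```python
import re

# "^" and "$" succeed only at the start and end of the string.
start_fail = re.search(r"^aaa", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")
end_ok = re.search(r"aaa$", "xxx xxx aaa")

# "\b" is a position with a word character on exactly one side.
boundary = re.search(r".\b.", "@@@abc")
whole_word = re.search(r"\bend\b", "weekend,endfor,end")

# "|" matches either the left alternative or the right one.
names = [(m.group(), m.span()) for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]

# "( )" lets a quantifier apply to a group and captures the group's text.
repeated = re.search(r"(go\s*)+", "Let's go go go!")
amount = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")

print(start_fail, start_ok.span(), end_ok.span())
print(boundary.group(), boundary.span())   # @a (2, 4)
print(whole_word.span())                   # (15, 18)
print(names)
print(repeated.group(), repeated.span())   # go go go (6, 14)
print(amount.group(), amount.group(1))     # ¥20.5 20.5
```

group(1) returns only the text captured by the first pair of parentheses, while group() returns the whole match.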

Example 5: The expression "Tom|Jack", when matching the string "I'm Tom, he is Jack": the result is success; the matched content is "Tom"; the matched position starts at 4 and ends at 7. When matching the next one, the result is success; the matched content is "Jack"; the matched position starts at 15 and ends at 19.

Example 6: The expression "(go\s*)+", when matching "Let's go go go!": the result is success; the matched content is "go go go"; the matched position starts at 6 and ends at 14.

Example 7: The expression "¥(\d+\.?\d*)", when matching "$10.9,¥20.5": the result is success; the matched content is "¥20.5"; the matched position starts at 6 and ends at 11. The content captured by the parentheses alone is "20.5".

2. Some advanced rules in regular expressions

2.1 Greedy and non-greedy matching

Several of the special symbols that modify the number of matches allow the same expression to match a varying number of times, such as "{m,n}", "{m,}", "?", "*", "+". The actual number of matches depends on the string being matched. Such an expression with a variable repeat count always matches as many times as possible. For example, for the text "dxxxdxxxd":

Expression     Matching result
(d)(\w+)       "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)    "\w+" matches all the characters between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that last "d" so that the whole expression can match

As you can see, "\w+" always matches as many qualifying characters as it can. In the second case it does not match the last "d", but only so that the whole expression can succeed. Similarly, expressions with "*" and "{m,n}" match as many times as possible, and an expression with "?" will "match" rather than "not match" whenever it has the choice. This matching principle is called "greedy" mode.

Non-greedy mode:

Adding a "?" after a special symbol that modifies the number of matches makes the expression match as few times as possible, so that an expression that could either match or not match will "not match" whenever it has the choice. This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching fewer times would cause the whole expression to fail, then, much like greedy mode, non-greedy mode matches a little more so that the whole expression succeeds. For example, for the text "dxxxdxxxd":

Expression      Matching result
(d)(\w+?)       "\w+?" matches as few characters as possible after the first "d"; the result is that "\w+?" matches only one "x"
(d)(\w+?)(d)    For the whole expression to succeed, "\w+?" has to match "xxx" so that the "d" after it can match. The result is that "\w+?" matches "xxx"
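
The four cases for the text "dxxxdxxxd" can be reproduced with a short Python sketch (Python's re module is an assumed engine, chosen only for illustration). In each case group(2) is what the "\w+" or "\w+?" part ended up matching.

```python
import re

text = "dxxxdxxxd"

# Greedy: "\w+" takes as much as it can.
greedy_open = re.search(r"(d)(\w+)", text).group(2)
# Greedy, but it must give back the final "d" so that the trailing (d) can match.
greedy_closed = re.search(r"(d)(\w+)(d)", text).group(2)

# Non-greedy: "\w+?" takes as little as it can.
lazy_open = re.search(r"(d)(\w+?)", text).group(2)
# Non-greedy, but it must extend until a "d" follows.
lazy_closed = re.search(r"(d)(\w+?)(d)", text).group(2)

print(greedy_open)     # xxxdxxxd
print(greedy_closed)   # xxxdxxx
print(lazy_open)       # x
print(lazy_closed)     # xxx
```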

More examples:

Example 1: When the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the result is success; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>". The "</td>" in the expression matches the last "</td>" in the string.

Example 2: In contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in Example 1, only "<td><p>aa</p></td>" is obtained. When the next one is matched, the second "<td><p>bb</p></td>" is obtained.
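
The difference between the greedy and the non-greedy form on markup can be sketched in Python as follows. The HTML string is an assumed stand-in for the article's table-cell example, and Python's re module is our choice of engine.

```python
import re

# Assumed sample: two table cells on one line.
html = "<td><p>aa</p></td> <td><p>bb</p></td>"

# Greedy: ".*" runs to the last "</td>", so there is a single match.
greedy_cells = re.findall(r"<td>(.*)</td>", html)

# Non-greedy: ".*?" stops at the nearest "</td>", so each cell is matched separately.
lazy_cells = re.findall(r"<td>(.*?)</td>", html)

print(greedy_cells)   # ['<p>aa</p></td> <td><p>bb</p>']
print(lazy_cells)     # ['<p>aa</p>', '<p>bb</p>']
```

With one pair of parentheses in the pattern, findall() returns the captured content rather than the whole match, which is why the "<td>" tags themselves are absent from the output.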

2.2 Backreferences \1, \2 ...

While an expression is being matched, the engine records the strings matched by the sub-expressions enclosed in parentheses "( )". When the match result is retrieved, the strings matched by those parenthesized sub-expressions can be obtained separately. This has been shown several times in the earlier examples. In practice, when a search uses some boundary to locate content, but the content to be retrieved does not include the boundary, parentheses must be used to mark out the wanted range, as in "<td>(.*?)</td>" above.

In fact, the strings matched by parenthesized sub-expressions can be used not only after matching has finished, but also during matching. A later part of the expression can refer to the string that an earlier parenthesized sub-expression has already matched. The reference is written as "\" plus a number: "\1" refers to the string matched by the first pair of parentheses, "\2" refers to the string matched by the second pair, and so on. If one pair of parentheses contains another pair, the outer pair is numbered first. In other words, whichever pair has its "(" earlier is numbered first.

Example 1: The expression "('|")(.*?)(\1)", when matching "'Hello', "World"": the result is success; the matched content is "'Hello'". When the next one is matched, ""World"" is matched.

Example 2: The expression "(\w)\1{4,}", when matching "aa bbbb abcdefg ccccc 111121111 999999999": the result is success; the matched content is "ccccc". When the next one is matched, "999999999" is obtained. This expression requires a character in the "\w" range to be repeated at least 5 times; note the difference from "\w{5,}".

Example 3: The expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" matches successfully against a string such as "<td id='td1' style="bgcolor:white"></td>". If the "<td>" is not paired with a "</td>", the match fails; if the tag is changed to another matching pair, the match also succeeds.

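The backreference examples can be tried with the following Python sketch. Python's re module is assumed for illustration, and the sample strings, including the HTML ones, are our reconstruction of the kind of input the examples describe. Triple-quoted raw strings are used because the patterns contain both quote characters.

```python
import re

# Example 1: the closing quote must be the same character as the opening quote.
quoted = [m.group() for m in re.finditer(r"""('|")(.*?)(\1)""", "'Hello', \"World\"")]
print(quoted)   # ["'Hello'", '"World"']

# Example 2: one word character followed by at least 4 more copies of itself.
runs_text = "aa bbbb abcdefg ccccc 111121111 " + "9" * 9
runs = [m.group() for m in re.finditer(r"(\w)\1{4,}", runs_text)]
print(runs)   # ['ccccc', '999999999']

# "\w{5,}" only asks for 5 word characters in a row, not 5 identical ones.
loose = re.search(r"\w{5,}", runs_text).group()
print(loose)   # abcdefg

# Example 3: the closing tag name must repeat the opening tag name (group 1).
tag_pattern = r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>"""
paired = re.search(tag_pattern, """<td id='td1' style="bgcolor:white"></td>""")
unpaired = re.search(tag_pattern, "<td id='td1'></tr>")
print(paired is not None, unpaired is None)   # True True
```
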
2.3 Lookahead and lookbehind: conditions that match no characters

The previous chapter covered several special symbols with abstract meaning: "^", "$", "\b". They have one thing in common: they match no characters themselves and only attach a condition to "the two ends of the string" or to "the gap between characters". With this concept understood, this section introduces another, more flexible way of attaching a condition to "the two ends" or to "a gap".

Lookahead: "(?=xxxxx)", "(?!xxxxx)"

Format "(?=xxxxx)": in the string being matched, the condition it attaches to a "gap" or to "the two ends" is that the right side of the gap must be able to match the xxxxx part of the expression. Because it is only a condition attached to the gap, it does not affect the expressions that follow it, which still truly match the characters after the gap. This is similar to "\b", which itself matches no characters: "\b" only examines the characters before and after the gap and does not affect what the following expressions truly match.

Example 1: The expression "Windows (?=NT|XP)", when matching "Windows 98, Windows NT, Windows 2000", matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.

Example 2: The expression "(\w)((?=\1\1\1)(\1))+", when matching the string "aaa ffffff 999999999", matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. This expression can be read as: for a letter or digit repeated 4 or more times, match the part that comes before the last 2 characters. Of course, this expression could be written differently; it is used here only for demonstration.

Format "(?!xxxxx)": the right side of the gap must not be able to match the xxxxx part of the expression.

Example 3: The expression "((?!\bstop\b).)+", when matching "fdjka ljfdl stop fjdsla fdj", matches from the beginning up to the position just before "stop". If the string contains no "stop", the whole string is matched.

Example 4: The expression "do(?!\w)", when matching the string "done, do, dog", matches only the "do". In this example, "do(?!\w)" has the same effect as "do\b".
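
A Python sketch of the four lookahead examples (Python's re module is an assumed engine for illustration). The second test string is built in code so that the run lengths, 6 and 9, are explicit.

```python
import re

# Example 1: "Windows " is matched only where "NT" or "XP" follows.
win_text = "Windows 98, Windows NT, Windows 2000"
win = list(re.finditer(r"Windows (?=NT|XP)", win_text))
followed_by = win_text[win[0].end():win[0].end() + 2]
print(len(win), followed_by)   # 1 NT

# Example 2: keep repeating while three more copies still lie ahead.
runs_text = "aaa " + "f" * 6 + " " + "9" * 9
runs = [m.group() for m in re.finditer(r"(\w)((?=\1\1\1)(\1))+", runs_text)]
print(runs)   # ['ffff', '9999999']

# Example 3: consume characters until the whole word "stop" would begin.
before_stop = re.match(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj").group()
print(repr(before_stop))   # 'fdjka ljfdl '

# Example 4: "do" not followed by a word character.
only_do = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
same_as_b = [m.span() for m in re.finditer(r"do\b", "done, do, dog")]
print(only_do, only_do == same_as_b)   # [(6, 8)] True
```

Note that in Example 3 the matched text includes the space before "stop", since only the position where "stop" itself begins is rejected.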

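Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"

These two formats mirror lookahead, except that the condition they attach applies to the left side of the gap: with "(?<=xxxxx)" the left side of the gap must be able to match the xxxxx part, and with "(?<!xxxxx)" it must not. Like lookahead, they match no characters themselves.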
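
A minimal sketch of lookbehind in Python. The sample sentence and both patterns are hypothetical, made up here for illustration rather than taken from the article; also note that Python's re module, assumed here, only accepts lookbehind conditions of fixed width.

```python
import re

# Hypothetical sample text.
text = "cost $30, tax 5"

# "(?<=\$)": the gap before the digits must have "$" on its left.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# "(?<!\$)": the gap must NOT have "$" on its left; "\b" keeps the match
# from starting in the middle of a number.
no_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(after_dollar)   # ['30']
print(no_dollar)      # ['5']
```

In both cases the "$" is only inspected, never included in the matched text.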

3. Other general rules

There are also some rules that are common to the various regular expression engines but were not mentioned in the explanation above.

3.1 In an expression, "\xXX" and "\uXXXX" can be used to represent a character ("X" stands for a hexadecimal digit)

Form       Character range
\xXX       Characters numbered in the range 0~255. For example, a space can be written as "\x20"
\uXXXX     Any character can be written as "\u" followed by its 4-digit hexadecimal number. For example, "\u4E2D"
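
Both notations can be checked quickly in Python, whose re module (assumed here for illustration) accepts them inside patterns. The test strings are ours.

```python
import re

# "\x20" is the space character (hexadecimal 20 = decimal 32).
space = re.search(r"\x20", "a b")
print(space.span())   # (1, 2)

# "\u4E2D" is the character with code point U+4E2D.
zhong = re.search(r"\u4E2D", "中文")
print(zhong.group() == "中", zhong.span())   # True (0, 1)
```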

3.2 While the expressions "\s", "\d", "\w", "\b" have special meanings, the corresponding uppercase letters have the opposite meanings

Expression    Matches
\S            All non-whitespace characters ("\s" matches each whitespace character)
\D            All non-digit characters
\W            All characters other than letters, digits, and underscores
\B            A non-word boundary, that is, a gap between characters where both sides are within the "\w" range or both sides are outside the "\w" range
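
The uppercase classes can be observed with a few one-liners in Python (an assumed engine; the test strings are made up for illustration).

```python
import re

non_space = re.findall(r"\S", "a b\t1")
non_digit = re.findall(r"\D", "a1b2")
non_word = re.findall(r"\W", "a_1-b!")

# "\B" holds inside a word, so only the "end" within "weekend" qualifies.
inner_end = [m.span() for m in re.finditer(r"\Bend", "weekend end")]

print(non_space)   # ['a', 'b', '1']
print(non_digit)   # ['a', 'b']
print(non_word)    # ['-', '!']
print(inner_end)   # [(4, 7)]
```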

3.3 Summary of the characters that have a special meaning in expressions and need a "\" in front to match the character itself

Character    Description
^            Matches the start position of the input string. To match the "^" character itself, use "\^"
$            Matches the end position of the input string. To match the "$" character itself, use "\$"
( )          Marks the start and end positions of a sub-expression. To match parentheses, use "\(" and "\)"
[ ]          Used to define an expression that can match 'multiple characters'. To match square brackets, use "\[" and "\]"
{ }          Symbols that modify the number of matches. To match braces, use "\{" and "\}"
.            Matches any character except the newline (\n). To match the dot itself, use "\."
?            Modifies the number of matches to 0 or 1. To match the "?" character itself, use "\?"
+            Modifies the number of matches to at least once. To match the "+" character itself, use "\+"
*            Modifies the number of matches to 0 or any number of times. To match the "*" character itself, use "\*"
|            "Or" relationship between the expressions on its left and right. To match "|" itself, use "\|"

3.4 For a sub-expression in parentheses "( )", if you do not want its match result recorded for later use, use the "(?:xxxxx)" format

Example 1: The expression "(?:(\w)\1)+", when matching "a bbccdd efg", gives the result "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is the group referenced by "\1".
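
A quick Python check of the non-capturing group (Python's re module is assumed for illustration). The compiled pattern reports how many groups are actually recorded.

```python
import re

pattern = re.compile(r"(?:(\w)\1)+")
doubled = pattern.search("a bbccdd efg")

print(doubled.group())   # bbccdd
print(pattern.groups)    # 1  (only "(\w)" is recorded)
print(doubled.groups())  # ('d',)  (the last character captured by "(\w)")
```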

3.5 Common expression property settings: Ignorecase, Singleline, Multiline, Global

Property      Description
Ignorecase    By default, letters in an expression are case sensitive. Setting Ignorecase makes matching case insensitive. Some expression engines extend the concept of "case" to upper and lower case across the Unicode range.
Singleline    By default, the dot "." matches any character except the newline (\n). Setting Singleline lets the dot match all characters, including the newline.
Multiline     By default, the expressions "^" and "$" match only the start (1) and end (4) positions of the string, as in:
              (1)xxxxxxxxx(2)\n
              (3)xxxxxxxxx(4)
              Setting Multiline lets "^" match (1) and also position (3), after a newline and before the start of the next line; and lets "$" match (4) and also position (2), before a newline at the end of a line.
Global        Mainly takes effect when the expression is used for replacement; setting Global replaces all matches.
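
As one concrete mapping, in Python's re module (assumed here for illustration) the first three properties correspond to the flags re.IGNORECASE, re.DOTALL and re.MULTILINE. There is no Global flag in that module: re.sub() replaces every match unless a count is given. The test strings below are ours.

```python
import re

# Ignorecase
case_default = re.search("abc", "ABC")
case_ignored = re.search("abc", "ABC", re.IGNORECASE)

# Singleline: "." also matches the newline.
dot_default = re.search("a.b", "a\nb")
dot_all = re.search("a.b", "a\nb", re.DOTALL)

# Multiline: "^" and "$" also match at line starts and line ends.
lines = "line1\nline2"
whole_only = re.findall(r"^\w+$", lines)
per_line = re.findall(r"^\w+$", lines, re.MULTILINE)

# Global-style replacement versus replacing only the first match.
replace_all = re.sub("x", "-", "xaxbx")
replace_first = re.sub("x", "-", "xaxbx", count=1)

print(case_default, case_ignored.group())   # None ABC
print(dot_default, dot_all is not None)     # None True
print(whole_only, per_line)                 # [] ['line1', 'line2']
print(replace_all, replace_first)           # -a-b- -axbx
```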

4. Comprehensive tips

4.1 If the expression is required to match the entire string, rather than to find a part of the string, put "^" at the start and "$" at the end of the expression. For example, "^\d+$" requires the whole string to consist only of digits.
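
A short Python sketch of whole-string validation. The helper name all_digits is hypothetical, and Python's re module is assumed for illustration.

```python
import re

def all_digits(s: str) -> bool:
    """Return True when the anchored pattern ^\\d+$ matches s."""
    return re.search(r"^\d+$", s) is not None

# Without anchors, a search succeeds as soon as any part of the string fits.
partial = re.search(r"\d+", "20x24").group()

print(all_digits("2024"), all_digits("20x24"), all_digits(""))   # True False False
print(partial)   # 20
```

One Python-specific detail: "$" there also matches just before a newline at the very end of the string, so all_digits("123\n") is True. Where that matters, re.fullmatch(r"\d+", s) is the stricter check.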

4.2 If the content to be matched is a complete word, rather than part of a word, put "\b" at the start and end of the expression. For example, use "\b(if|while|else|void|int……)\b" to match keywords in a program.

4.3 Do not let an expression match the empty string. Otherwise the match always succeeds, yet the result contains nothing. For example, when writing an expression meant to match "123", "123.", "123.5", ".5" and similar forms, where the integer part, the decimal point, and the decimal digits may each be omitted, do not write the expression as "\d*\.?\d*", because this expression also matches successfully when there is nothing at all. A better way to write it is "\d+\.?\d*|\.\d+".
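
Both tips can be seen in a small Python sketch (an assumed engine; the code snippet being scanned and the list of numbers are made up for illustration, and the keyword list is shortened to five entries).

```python
import re

# Tip 4.2: "\b" on both sides keeps "int" from matching inside "print" or "interval".
source = "int x; if (x) print(x); else interval();"
keywords = re.findall(r"\b(if|while|else|void|int)\b", source)
print(keywords)   # ['int', 'if', 'else']

# Tip 4.3: a pattern whose every part is optional "succeeds" on nothing.
loose = r"\d*\.?\d*"
strict = r"\d+\.?\d*|\.\d+"

empty_hit = re.search(loose, "abc")
print(repr(empty_hit.group()))   # ''  (a successful but empty match)
print(re.search(strict, "abc"))  # None

accepted = [s for s in ["12", "12.", "12.5", ".5", ".", ""] if re.fullmatch(strict, s)]
print(accepted)   # ['12', '12.', '12.5', '.5']
```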

4.4 Do not let a sub-expression that can match the empty string repeat without limit. If every part of a parenthesized sub-expression can match 0 times, and the parentheses as a whole can repeat an unlimited number of times, the situation may be more serious than the one in the previous tip: the matching process may fall into an infinite loop. Some regular expression engines, such as the .NET regular expressions, already avoid this situation, but we should still try not to create it. If you run into an infinite loop while writing an expression, this is a good place to start looking for the cause.

4.5 Choose between greedy mode and non-greedy mode sensibly; see the topic discussions.

4.6 For the left and right sides of "|", it is best that only one side can match any given character, so that the result does not change when the expressions on the two sides of "|" are swapped.
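
The effect of overlapping alternatives can be seen in a short Python sketch. The patterns "a|ab" and "ab|a" are our own illustration, and the behavior shown is that of Python's re module, which tries alternatives from left to right; other engines may resolve the overlap differently.

```python
import re

# Overlapping alternatives: both sides can match the leading "a",
# so the order decides the result.
left_first = re.search(r"a|ab", "ab").group()
right_first = re.search(r"ab|a", "ab").group()
print(left_first, right_first)   # a ab

# Non-overlapping alternatives: swapping the sides changes nothing.
text = "I'm Tom, he is Jack"
order_1 = [m.span() for m in re.finditer(r"Tom|Jack", text)]
order_2 = [m.span() for m in re.finditer(r"Jack|Tom", text)]
print(order_1 == order_2)   # True
```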

5. More on regular expressions

Visit the "Regular Expression Topics" for further discussion of regular expressions.

When reposting, please cite the original address: https://www.9cbs.com/read-131649.html
