Regular Expressions

xiaoxiao 2021-03-06

Preface

Regular Expressions (hereinafter RE) have always seemed almost magical to me. Having seen people online use RE to solve text problems with ease, I got the idea of learning RE myself. But I am rather lazy and always hope for a quick way to learn, so I asked the great god Google, and with its help I found Jim Hollenhorst's article online. After reading it I thought it was good, so I wrote up these notes to share with friends at Move-To.net, hoping they will be of some help in learning RE. The URL of Jim Hollenhorst's article is given below for anyone who wants to go straight to it.

The 30 Minute Regex Tutorial by Jim Hollenhorst
http://www.codeproject.com/UserItems/Regextutorial.asp

What is RE?

If you have ever searched for files, you have probably used the wildcard character "*". For example, to find all Word files in the Windows directory you might search for "*.doc", because "*" stands for any characters. RE does something similar, but is far more powerful.

When writing programs, you often need to check whether a string conforms to a specific pattern. Describing such patterns is the main job of RE, so you can think of an RE as a description of a specific pattern. For example, "\w+" stands for a non-empty string of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE to search and replace text, parse complex headers, and validate text.

The best way to learn RE is to work through examples yourself. Jim Hollenhorst also provides a tool, Expresso (a play on espresso, the cup of coffee), to help us learn RE. It can be downloaded from http://www.codeproject.com/UserItems/Regextutorial/expressosetup2_1c.zip.

Next, let us work through some examples.

Some simple examples

Suppose you want to find the text "elvis" followed by "alive". Using RE, you might go through the following steps (the parentheses explain each RE).

1. elvis (find elvis)

This means the characters to find are, in order, e-l-v-i-s. In .NET you can set an option to ignore case, so "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because RE 1 only cares about the sequence of characters, "pelvis" also matches. RE 2 improves on this.

2. \belvis\b (find elvis as a whole word, e.g. "elvis", or "Elvis" when case is ignored)

"\b" has a special meaning in RE: in this example it denotes a word boundary, so \belvis\b uses \b to mark the boundaries before and after elvis, that is, elvis as a whole word.

Suppose you also want "alive" to follow "elvis" on the same line. For this you need two more special characters, "." and "*". "." stands for any character except the newline character, and "*" repeats the item before it any number of times, as long as the RE as a whole can still match. So ".*" means any number of characters other than newline. To find "alive" after "elvis" on the same line, use RE 3.

3. \belvis\b.*\balive\b (find elvis followed later by alive, e.g. "elvis is alive")

Simple special characters combine into a powerful RE, but the more special characters you use, the harder the RE becomes to read. The next section looks at another example.
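The three expressions above can be tried out in a few lines of code. This is only a sketch: the article targets the .NET regex classes, so the use of Python's standard re module, the made-up sample sentence, and the re.IGNORECASE flag (standing in for the .NET ignore-case option) are assumptions made for illustration.

```python
import re

# Made-up sample sentence, not from the original article.
text = "Pelvis X-rays aside, Elvis is alive and well"

# RE 1: a plain character sequence; it also hits the "elvis" inside "Pelvis".
loose = re.findall(r"elvis", text, re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole = re.findall(r"\belvis\b", text, re.IGNORECASE)

# RE 3: "elvis", then any characters on the same line, then "alive".
alive = re.search(r"\belvis\b.*\balive\b", text, re.IGNORECASE)

print(loose)           # ['elvis', 'Elvis']
print(whole)           # ['Elvis']
print(alive.group(0))  # Elvis is alive
```

Note how RE 1 reports two hits while RE 2 reports one, which is exactly the "pelvis" problem described above.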

Matching a valid telephone number

Suppose you want to collect customers' seven-digit phone numbers from web pages, in the format xxx-xxxx where x is a digit. The RE might be written like this.

4. \b\d\d\d-\d\d\d\d (find a seven-digit phone number, e.g. 123-1234)

Each \d stands for one digit, and "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as RE 5.

5. \b\d{3}-\d{4} (a better way to find a seven-digit phone number, e.g. 123-1234)

The {3} after \d means the preceding item is repeated three times, so \d{3} is the same as \d\d\d.
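A quick way to convince yourself that RE 4 and RE 5 are equivalent is to run both over the same text. The sketch below assumes Python's re module rather than .NET, and the sample string is invented for the test.

```python
import re

# Invented sample text containing one well-formed number, one number
# without a hyphen, and one with an extra trailing digit.
page = "Call 555-1234 or 5551234, fax 555-98765."

long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)   # RE 4
short_form = re.findall(r"\b\d{3}-\d{4}", page)      # RE 5

print(long_form == short_form)  # True
print(short_form)               # ['555-1234', '555-9876']
```

Neither RE ends with \b, so both also pick up the first seven digits of "555-98765". Whether that is acceptable depends on the data being scanned.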

Expresso, a tool for learning and testing RE

Because RE is hard to read and easy to get wrong, Jim developed the tool Expresso to help users learn and test RE. Besides the URL given above, it is also available from the Ultrapico website (http://www.ultrapico.com). After you install Expresso, the Expression Library already contains the examples from Jim's article, so you can test them as you read, or modify them and see the results immediately. I find it very handy, and you may want to give it a try.

Basic concepts of RE in .NET

Some special characters have already appeared, such as "\b", ".", "*", and "\d". In addition, "\s" stands for any whitespace character, such as a space, tab, or newline, and "\w" stands for any letter or digit.

Look at some examples

6. \ba\w*\b (find words that begin with a, e.g. able)

This RE says: find the beginning of a word (\b), then the letter "a", then any number of alphanumeric characters (\w*), up to the boundary that ends the word (\b).

7. \d+ (find a string of digits)

"+" is very similar to "*", except that the preceding item must be repeated at least once. That is, there must be at least one digit.

8. \b\w{6}\b (find words of six alphanumeric characters, e.g. ab123c)
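Examples 6 to 8 can be checked against a short sentence. The sketch assumes Python's re module (the article itself uses .NET and Expresso), and the sentence is made up.

```python
import re

# Made-up sample sentence.
text = "An able analyst typed ab123c and 42 at 1030"

a_words = re.findall(r"\ba\w*\b", text)     # RE 6: words starting with "a"
numbers = re.findall(r"\d+", text)          # RE 7: runs of one or more digits
six_chars = re.findall(r"\b\w{6}\b", text)  # RE 8: words of exactly six characters

print(a_words)    # ['able', 'analyst', 'ab123c', 'and', 'at']
print(numbers)    # ['123', '42', '1030']
print(six_chars)  # ['ab123c']
```

"An" is missing from the first result because matching is case-sensitive unless an ignore-case option is set.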

The following table lists the special characters of RE seen so far.

.     Any character except the newline character
\w    Any alphanumeric character
\s    Any whitespace character
\d    Any digit
\b    Word boundary
^     Beginning of the string, e.g. "^The" matches "The" only at the beginning of the string
$     End of the string, e.g. "end$" matches "end" only at the end of the string

The special characters "^" and "$" are used when the text to find must be at the beginning or end of the string. This is particularly useful for validating that input conforms to a pattern. To validate a seven-digit phone number, for example, you might use RE 9.

9. ^\d{3}-\d{4}$ (validate a seven-digit phone number)

This is the same as RE 5, except that no other characters may come before or after: the whole string consists of just the seven-digit phone number. If you set the Multiline option in .NET, "^" and "$" are compared against each line, so a match only has to begin at the start of a line and end at the end of a line, rather than span the whole string.

Escaped characters

Sometimes you need the literal meaning of "^" or "$" rather than their special meaning. The "\" character removes the special meaning of the character that follows it, so "\^", "\.", and "\\" stand for the literal "^", ".", and "\".

Repetitions

You have already seen that "{3}" and "*" repeat the preceding character. Later we will see how subexpressions are repeated with the same syntax. The following table lists some ways to repeat the preceding item.

*       Repeat any number of times
+       Repeat at least once
?       Repeat zero or one time
{n}     Repeat n times
{n,m}   Repeat at least n times, but no more than m times
{n,}    Repeat at least n times
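The table above can be verified mechanically. The sketch below, which assumes Python's re module and a hypothetical test pattern "ab" followed by a quantifier, records for each quantifier how many copies of "b" are accepted when the count runs from 0 to 4.

```python
import re

# For each quantifier q, list the counts n (0..4) for which the
# pattern "ab" + q matches the whole string "a" followed by n b's.
quantifiers = ["*", "+", "?", "{3}", "{1,2}", "{2,}"]
accepted = {
    q: [n for n in range(5) if re.fullmatch("ab" + q, "a" + "b" * n)]
    for q in quantifiers
}

for q, counts in accepted.items():
    print(q, counts)
# *     [0, 1, 2, 3, 4]
# +     [1, 2, 3, 4]
# ?     [0, 1]
# {3}   [3]
# {1,2} [1, 2]
# {2,}  [2, 3, 4]
```

re.fullmatch is used so that the quantifier has to account for every "b" in the string, which makes the accepted counts line up with the table.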

Try some examples

10. \b\w{5,6}\b (find words of five or six alphanumeric characters, e.g. as25d, d58sdf)
11. \b\d{3}\s\d{3}-\d{4} (find a ten-digit phone number, e.g. 800 123-1234)
12. \d{3}-\d{2}-\d{4} (find a social security number, e.g. 123-45-6789)
13. ^\w* (the first word of each line, or of the whole string)

For RE 13, try it in Expresso with and without the Multiline option to see the difference.
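The validation RE 9 and examples 10 to 13 can be exercised together. This is a sketch under assumptions: Python's re module stands in for .NET, re.MULTILINE stands in for the .NET Multiline option, and is_phone and the sample text are names and data invented here.

```python
import re

def is_phone(s):
    # RE 9: ^ and $ force the whole string to be ddd-dddd.
    return re.search(r"^\d{3}-\d{4}$", s) is not None

# Invented two-line sample text.
text = "as25d d58sdf toolongword\n800 123-1234 or 123-45-6789"

five_or_six = re.findall(r"\b\w{5,6}\b", text)          # RE 10
ten_digit = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)   # RE 11
ssn = re.findall(r"\d{3}-\d{2}-\d{4}", text)            # RE 12
first_word = re.findall(r"^\w*", text)                  # RE 13, single-line
first_words = re.findall(r"^\w*", text, re.MULTILINE)   # RE 13, multiline

print(is_phone("123-1234"), is_phone("call 123-1234"))  # True False
print(first_word)   # ['as25d']
print(first_words)  # ['as25d', '800']
```

With the multiline flag, "^" matches at the start of every line, so RE 13 returns the first word of each line instead of only the first word of the string.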

Matching a range of characters

Sometimes you need to find certain specific characters. This is where square brackets "[]" come in handy. [aeiou] matches any one of the vowels "a", "e", "i", "o", "u", and [.?!] matches any one of the punctuation marks ".", "?", "!". Inside square brackets, special characters lose their special meaning and are interpreted literally. You can also specify a range of characters: "[a-z0-9]" means any lowercase letter or any digit.

Next, look at a first, more complex RE for finding phone numbers.

14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a ten-digit phone number, e.g. (080) 333-1234)

This RE finds phone numbers in several formats, such as (080) 123-4567 or 511 254 6654. "\(?" stands for zero or one left parenthesis "(", "[) ]" stands for a right parenthesis ")" or a space, and "\s?" stands for zero or one whitespace character. However, this RE also matches a number that begins with "800)" and has no opening parenthesis, that is, a number whose parentheses are not balanced. The Alternatives section shows how to solve this kind of problem.
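The bracket expressions and RE 14 can be sketched as follows, again assuming Python's re module in place of .NET. The third phone sample, "800) 456-3321", is made up here to show the unbalanced-parenthesis weakness.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")           # any single vowel
enders = re.findall(r"[.?!]", "Really? Yes. Go!")  # "." is literal inside [ ]

# Inside brackets no escape is needed; outside, \. is the literal period.
same = re.findall(r"[.]", "a.b") == re.findall(r"\.", "a.b")

phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")  # RE 14
samples = ["(080) 123-4567", "511 254 6654", "800) 456-3321"]
matched = [s for s in samples if phone.fullmatch(s)]

print(vowels)              # ['e', 'e']
print(enders)              # ['?', '.', '!']
print(same)                # True
print(matched == samples)  # True: even the unbalanced sample is accepted
```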

Negation

Sometimes you need to find a character that is not in a particular character group. The following table shows how.

\W         Any character that is not alphanumeric
\S         Any character that is not whitespace
\D         Any character that is not a digit
\B         Any position that is not a word boundary
[^x]       Any character that is not x
[^aeiou]   Any character that is not a, e, i, o, or u

15. \S+ (a string that contains no whitespace)
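A few of the negated classes, including RE 15, look like this in practice. The sketch assumes Python's re module and uses made-up input strings.

```python
import re

non_space_runs = re.findall(r"\S+", "a1 b2\tc3!")  # RE 15: runs without whitespace
non_digits = re.findall(r"\D", "a1b2")             # characters that are not digits
non_vowels = re.findall(r"[^aeiou]", "audio")      # characters that are not vowels

print(non_space_runs)  # ['a1', 'b2', 'c3!']
print(non_digits)      # ['a', 'b']
print(non_vowels)      # ['d']
```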

Alternatives

Sometimes you need to choose among several specific options. This is where the special character "|" comes in. For example, suppose you want to find both five-digit and nine-digit (with "-") postal codes.

16. \b\d{5}-\d{4}\b|\b\d{5}\b (find five-digit and nine-digit (with "-") postal codes)

When using alternatives, pay attention to their order, because RE tries the leftmost alternative first. In RE 16, if the five-digit alternative were placed first, the RE would only ever find five-digit postal codes. With alternatives understood, RE 14 can be improved.

17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (ten-digit phone number)
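The effect of alternative order in RE 16, and the fix that RE 17 brings to RE 14, can be seen in a short sketch. It assumes Python's re module, and the sample strings are invented.

```python
import re

text = "ZIP 12345-6789 and 54321"

# RE 16 as written: the nine-digit alternative is tried first.
long_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)
# Same alternatives in the opposite order: the five-digit one always wins.
short_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

print(long_first)   # ['12345-6789', '54321']
print(short_first)  # ['12345', '54321']

# RE 17: the area code is either fully parenthesized or not at all.
area = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
balanced = [bool(area.fullmatch(s))
            for s in ["(080) 123-4567", "080 123-4567", "080) 123-4567"]]
print(balanced)     # [True, True, False]
```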

Grouping

Parentheses can be used to delimit a subexpression, which can then be repeated or otherwise processed as a unit.

18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)

The first part of this RE, (\d{1,3}\.){3}, means one to three digits followed by a "." symbol, repeated three times in total. After that come one to three more digits. This matches a number like 192.72.28.1.

There is a drawback, though: each number in an IP address can be at most 255, but the RE above accepts any number of one to three digits. What is needed is a check that each number is less than 256, and RE alone cannot do that kind of numeric comparison. RE 19 uses alternatives to restrict each number to the required range, 0 to 255.

19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find an IP address)

Does it look like an alien language? Just read it as "find an IP address" and work through the RE piece by piece.
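Building RE 19 from a named piece makes it easier to read. The sketch below assumes Python's re module, and the variable name octet is introduced here for illustration only.

```python
import re

loose = re.compile(r"(\d{1,3}\.){3}\d{1,3}")            # RE 18

# One number in the range 0..255: 200-249, 250-255, or 0-199.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)   # RE 19

for addr in ["192.72.28.1", "255.255.255.255", "999.72.28.1"]:
    print(addr, bool(loose.fullmatch(addr)), bool(strict.fullmatch(addr)))
# 192.72.28.1 True True
# 255.255.255.255 True True
# 999.72.28.1 True False
```

The three alternatives inside octet do the work of a numeric comparison by listing the allowed digit patterns one range at a time.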

Expresso Analyzer View

Expresso provides a feature that turns the RE you enter into a tree, with each subexpression described separately, which makes for a good debugging environment. Other features, such as Partial Match (match using only the highlighted part of the RE) and Exclude Match (match with the highlighted part of the RE excluded), are left for you to try.

When a subexpression is grouped with parentheses, the text it matches can be used in later processing or within the RE itself. By default, each group is named with a number, starting from 1 and assigned from left to right. These automatic group names can be seen in the Skeleton View in Expresso.

A backreference is used to find the same text that a group has already captured. For example, "\1" refers to the text matched by group 1.

20. \b(\w+)\b\s*\1\b (find repeated words, meaning the same word twice with whitespace in between, e.g. dog dog)

(\w+) captures a word of at least one letter or digit and names it group 1. The RE then looks for any number of whitespace characters, followed by the same text as group 1.

If you do not like the automatically assigned group names, you can name groups yourself. In the example above, (\w+) can be rewritten as (?<Word>\w+), which names the group Word, and the backreference is rewritten as \k<Word>.

21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a named group to capture repeated words)
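Both forms of the repeated-word RE can be run side by side. One caveat for this sketch: it uses Python's re module, which spells a named group as (?P<Word>...) and a named backreference as (?P=Word) instead of the .NET forms (?<Word>...) and \k<Word> shown above. The sample sentence is made up.

```python
import re

text = "the dog dog chased the the cat"

numbered = re.compile(r"\b(\w+)\b\s*\1\b")                # RE 20
named = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")    # RE 21, Python spelling

doubles_numbered = [m.group(1) for m in numbered.finditer(text)]
doubles_named = [m.group("Word") for m in named.finditer(text)]

print(doubles_numbered)  # ['dog', 'the']
print(doubles_named)     # ['dog', 'the']
```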

Parentheses are also used in many special syntax elements. The more common ones are listed below.

Captures
(exp)          Match exp and capture it in an automatically numbered group
(?<name>exp)   Match exp and capture it in a group named name
(?:exp)        Match exp, but do not capture it

Lookarounds
(?=exp)        Match text whose suffix is exp
(?<=exp)       Match text whose prefix is exp
(?!exp)        Match text whose suffix is not exp
(?<!exp)       Match text whose prefix is not exp

Comment
(?#comment)    Comment

(?=exp) is a "zero-width positive lookahead assertion". It matches text that is followed by exp, without including exp itself.

22. \b\w+(?=ing\b) (the stem of a word ending in ing, e.g. "fill" is what matches in "filling")

(?<=exp) is a "zero-width positive lookbehind assertion". It matches text that is preceded by exp, without including exp itself.

23. (?<=\bre)\w+\b (the rest of a word beginning with re, e.g. "peated" is what matches in "repeated")
24. (?<=\d)\d{3}\b (three digits at the end of a word, preceded by a digit)
25. (?<=\s)\w+(?=\s) (an alphanumeric string set off by whitespace on both sides)
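Examples 22 to 25 can be tried as follows. The sketch assumes Python's re module, whose lookbehind accepts these patterns because each one has a fixed width. The input strings are made up.

```python
import re

stems = re.findall(r"\b\w+(?=ing\b)", "filling and falling")   # RE 22
tails = re.findall(r"(?<=\bre)\w+\b", "repeated reading")      # RE 23
last3 = re.findall(r"(?<=\d)\d{3}\b", "1234 56789 123")        # RE 24
inner = re.findall(r"(?<=\s)\w+(?=\s)", "one two three four")  # RE 25

print(stems)  # ['fill', 'fall']
print(tails)  # ['peated', 'ading']
print(last3)  # ['234', '789']
print(inner)  # ['two', 'three']
```

In the last result, "two" and "three" are both found even though they share a space, because lookarounds check the neighbouring text without consuming it.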

Negative Lookaround

Earlier we saw how to find a character that is not a specific character or not in a specific group. But what if you only want to verify that a character is absent, without matching anything in its place? For example, suppose you want to find a word that contains q where the next letter is not u. You might try the following RE.

26. \b\w*q[^u]\w*\b (a word containing q where the next letter is not u)

This RE has a problem: [^u] must match some character, so if q is the last letter of the word, [^u] matches the whitespace after it, and the result may span two words, as with the text "Iraq haha". Negative lookaround solves this problem.

27. \b\w*q(?!u)\w*\b (a word containing q where the next letter is not u)

This is a "zero-width negative lookahead assertion".

28. \d{3}(?!\d) (three digits not followed by another digit)
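The difference between RE 26 and RE 27 shows up on the very text mentioned above. The sketch assumes Python's re module, and the second test string for RE 28 is made up.

```python
import re

text = "Iraq haha"

with_class = re.findall(r"\b\w*q[^u]\w*\b", text)       # RE 26
with_lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)  # RE 27
three_digits = re.findall(r"\d{3}(?!\d)", "1234 56")    # RE 28

print(with_class)      # ['Iraq haha']
print(with_lookahead)  # ['Iraq']
print(three_digits)    # ['234']
```

RE 26 swallows the space and the following word because [^u] has to consume a character. The lookahead in RE 27 only inspects the next character and leaves it alone.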

Similarly, you can use (?<!exp), the "zero-width negative lookbehind assertion", to match text that is not preceded by exp.

29. (?<=<(\w+)>).*(?=<\/\1>) (the text between HTML tags)

This uses lookbehind and lookahead assertions to extract the text between HTML tags, without including the tags themselves.

Comments, please

Another special use of parentheses is for comments, with the syntax "(?#comment)". If the "Ignore Pattern Whitespace" option is set, whitespace in the RE is ignored when the RE is used. With this option set, any text after "#" is ignored as well.

31. The text between HTML tags, with comments

(?<=      # look for a prefix, but do not include it
<(\w+)>   # an HTML tag
)         # end of the prefix
.*        # match any text
(?=       # look for a suffix, but do not include it
<\/\1>    # the string captured in group 1, that is, the HTML tag from the earlier parentheses
)         # end of the suffix
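A commented RE can be sketched in Python with the re.VERBOSE flag, which plays the role of the "Ignore Pattern Whitespace" option. One substitution has to be named plainly: Python's re module only allows fixed-width lookbehind, so the variable-width (?<=<(\w+)>) from RE 31 is not accepted there. The sketch below swaps the lookarounds for ordinary capturing groups and reads the inner text from group 2. The sample HTML string is made up.

```python
import re

# Not RE 31 itself: capturing groups replace the lookbehind/lookahead,
# because Python's lookbehind must have a fixed width.
tag_text = re.compile(r"""
    <(\w+)>     # an opening HTML tag; the tag name is captured as group 1
    (.*)        # any text, captured as group 2
    </\1>       # the closing tag that repeats the name from group 1
    """, re.VERBOSE)

m = tag_text.search("<p>hello <b>world</b></p>")
print(m.group(1))  # p
print(m.group(2))  # hello <b>world</b>

# The inline form (?#comment) works without any option set.
area_codes = re.findall(r"\d{3}(?#three digits)", "080-1")
print(area_codes)  # ['080']
```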

Greedy and Lazy: matching the most and the fewest characters

When an RE contains a repetition that can match a range of lengths (such as ".*"), it normally matches as many characters as possible. This is called greedy matching. For example:

32. a.*b (the longest string that starts with a and ends with b)

If the string is "aabab", the match obtained with the RE above is "aabab", because the RE looks for the most characters. Sometimes you want the match with the fewest characters instead, which is lazy matching. Adding a question mark (?) after any of the repetitions in the earlier table turns it into a lazy match. So "*?" means repeat any number of times, but use as few repetitions as still produce a match. For example:

33. a.*?b (the shortest string that starts with a and ends with b)

If the string is "aabab", the first match with the RE above is "aab" and the next is "ab", because the RE looks for the fewest characters.

*?       Repeat any number of times, but as few as possible
+?       Repeat at least once, but as few times as possible
??       Repeat zero or one time, but as few as possible
{n,m}?   Repeat at least n times, but no more than m times, and as few as possible
{n,}?    Repeat at least n times, but as few as possible
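The "aabab" example from RE 32 and RE 33 can be reproduced directly. The sketch assumes Python's re module, which uses the same "?" suffix for lazy repetition.

```python
import re

greedy = re.findall(r"a.*b", "aabab")   # RE 32: as many characters as possible
lazy = re.findall(r"a.*?b", "aabab")    # RE 33: as few characters as possible

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```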

What else is there?

So far many of the elements used to build an RE have been covered, but of course many have not. The following table lists some of them. The number in the left column refers to the corresponding example in Expresso.

#    Syntax                              Description
     \a                                  Bell character
     \b                                  Normally a word boundary, but within a character group it means backspace
     \t                                  Tab
34   \r                                  Carriage return
     \v                                  Vertical tab
     \f                                  Form feed
35   \n                                  New line
     \e                                  Escape
36   \nnn                                ASCII character whose octal code is nnn
37   \xnn                                Character whose hexadecimal code is nn
38   \unnnn                              Character whose Unicode code is nnnn
39   \cN                                 Control-N character, e.g. Ctrl-M is \cM
40   \A                                  Beginning of the string (like ^, but not affected by the Multiline option)
41   \Z                                  End of the string, or before a newline at the end of the string
     \z                                  End of the string
42   \G                                  Beginning of the current search
43   \p{name}                            Any character in the Unicode character group named name, e.g. \p{LowerCase_Letter} refers to lowercase letters
     (?>exp)                             Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and does not take part in backtracking.
44   (?<name1-name2>exp) or (?<-name2>exp)   Balancing group. Complicated but useful: it lets named capture groups be manipulated on a stack. (I do not fully understand this one myself.)
45   (?im-nsx:exp)                       Change the RE options for the subexpression exp, e.g. (?-i:Elvis) turns off the ignore-case option for Elvis
46   (?im-nsx)                           Change the RE options for the rest of the enclosing group
     (?(exp)yes|no)                      The subexpression exp is treated as a zero-width positive lookahead. If it matches, the yes subexpression is the next thing to match; if not, the no subexpression is.
     (?(exp)yes)                         Same as above, but with no "no" subexpression
     (?(name)yes|no)                     If name is a valid group name, the yes subexpression is the next thing to match; if not, the no subexpression is.
47   (?(name)yes)                        Same as above, but with no "no" subexpression

Conclusion

After this series of examples, and with the help of Expresso, I trust you now have a basic understanding of RE. There are of course many more articles about RE, and if you are interested, http://www.codeproject.com has many related ones. If you prefer books, Jeffrey Friedl's Mastering Regular Expressions comes highly recommended by many people (I have not read it myself). I hope that sharing these notes helps shorten your learning curve for RE. This is my first contact with RE, so if there are any mistakes in this article, or if you have better suggestions, please write and let me know. I would be very grateful.

When reposting, please credit the original URL: https://www.9cbs.com/read-52752.html
