Quick course
Steven A. Smith
Applies to: Microsoft® .NET Framework, Microsoft® ASP.NET, Regular Expression API
Summary: Regular expressions are an extremely useful tool for working with text. Whether you need to validate user input, search for patterns within strings, or reformat text in powerful ways, regular expressions can help.
Download the source code of this article.
Introduction
The Microsoft® .NET Framework has first-class support for regular expressions, and even Microsoft® ASP.NET has controls that rely on the regular expression language. This article covers the basics and recommends where to go to learn more.
This article is written for beginners with little or no experience with regular expressions who are familiar with ASP.NET and with programming in .NET. I also hope that, together with the Regular Expression Cheat Sheet, it will serve as a handy reference or refresher for developers who have used regular expressions before. The article discusses:
1. A Brief History of Regular Expressions
2. Simple Expressions
3. Quantifiers
4. Metacharacters
5. Character Classes
6. Predefined Set Metacharacters
7. Sample Expressions in Detail
8. Validation in ASP.NET
9. Regular Expression API
10. Free Tools
11. Advanced Topics Overview
12. Summary and Additional Resources
If you have questions about this article or about regular expressions in general, please ask them on the Regex Mailing List at http://www.aspadvice.com/, which had more than 350 subscribers at the time of writing.
A Brief History of Regular Expressions
Regular expressions as they exist today were invented in the 1950s. They were originally used to describe "regular sets", which were patterns under study by neurophysiologists. First pioneered by the mathematician Stephen Kleene, regular expressions were eventually used by Ken Thompson in two very popular text utilities, qed and grep. Jeffrey Friedl goes into greater depth in his book "Mastering Regular Expressions", which is recommended for anyone who wants to learn more about the theory and history of regular expressions.
In the last 50 years, regular expressions have slowly made their way from mathematical obscurity to being a staple of many tools and software packages. Although many UNIX tools have supported regular expressions for decades, they have found their way into most Windows developers' toolkits only in the last ten years or so. Regular expression support in Microsoft® Visual Basic® 6 or Microsoft® VBScript was awkward at best. With the introduction of the .NET Framework, however, regular expression support is first-class and available to all Microsoft developers and all .NET languages.
So, what are regular expressions? Regular expressions are a language for explicitly describing patterns within strings of text. Beyond simply describing such patterns, regular expression engines can typically be used to iterate through matches, to parse strings into substrings using patterns as delimiters, or to replace or reformat text intelligently. They provide a powerful and usually very succinct way to solve many common text-handling tasks.
Regular expressions are usually discussed in terms of the text they would or would not match. This article (like the System.Text.RegularExpressions classes) refers to three players in a regular expression interaction: the pattern, the input string, and the match found within the input.
Simple Expressions
The simplest regular expression is one you already know: the literal string. A particular string can be described, literally, by itself, so a regular expression pattern like foo matches the input string foo exactly. It also matches the input The food was quite tasty, which may not be what you want if you are looking for an exact match.
Of course, matching exact strings against themselves is a trivial use of regular expressions and does not reveal their real power. What if, instead of foo, you wanted to find all words starting with the letter f, or all three-letter words? That is beyond what literal strings can reasonably do, so it is time to look more deeply at regular expressions. Below is a sample literal expression and some inputs it matches.
Pattern    Inputs (matches)
foo        foo, food, foot, "There's evil afoot."
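As a minimal sketch of the three players (pattern, input, match), the C# program below applies the literal pattern foo to the inputs from the table. The class name LiteralPatternDemo and the helper FirstMatch are hypothetical names chosen for illustration; Regex.Match, Regex.IsMatch, and the Match properties belong to System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralPatternDemo
{
    // Hypothetical helper: returns the first match of the pattern in the input.
    public static Match FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern);
    }

    public static void Main()
    {
        string pattern = "foo";                      // the pattern
        string input = "The food was quite tasty";   // the input
        Match match = FirstMatch(input, pattern);    // the match

        Console.WriteLine(match.Success); // True
        Console.WriteLine(match.Value);   // foo
        Console.WriteLine(match.Index);   // 4

        // A literal pattern matches anywhere inside the input.
        foreach (string s in new[] { "foo", "food", "foot", "There's evil afoot." })
        {
            Console.WriteLine(Regex.IsMatch(s, pattern)); // True for each input
        }
    }
}
```

The match reports where the pattern was found (index 4) rather than requiring the whole input to equal the pattern, which is why food and afoot also count as matches.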
Quantifiers
Quantifiers provide a simple way to specify how many times a particular character or set of characters may repeat within a pattern. There are three non-explicit quantifiers:
1. *, which means "0 or more occurrences".
2. +, which means "1 or more occurrences".
3. ?, which means "0 or 1 occurrence".
A quantifier always applies to the pattern immediately preceding it (to its left), which is normally a single character unless parentheses are used to create a pattern group. Below are some sample patterns and inputs they match.
Pattern    Inputs (matches)
fo*        foo, foe, food, fooot, "forget it", funny, puffy
fo+        foo, foe, food, foot, "forget it"
fo?        foo, foe, food, foot, "forget it", funny, puffy
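The difference between the three quantifiers is easiest to see in the text each one actually captures. The sketch below is illustrative only: QuantifierDemo and FirstMatchText are hypothetical names, and the inputs are taken from the table above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns the text of the first match,
    // or an empty string when there is no match.
    public static string FirstMatchText(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // "fooot" contains an f followed by three o characters.
        Console.WriteLine(FirstMatchText("fooot", "fo*")); // fooo
        Console.WriteLine(FirstMatchText("fooot", "fo+")); // fooo
        Console.WriteLine(FirstMatchText("fooot", "fo?")); // fo

        // "funny" has an f with no o after it, so only * and ? succeed.
        Console.WriteLine(Regex.IsMatch("funny", "fo*")); // True
        Console.WriteLine(Regex.IsMatch("funny", "fo+")); // False
        Console.WriteLine(Regex.IsMatch("funny", "fo?")); // True
    }
}
```

Note that the quantifier applies only to the o, not to the whole fo, because no parentheses group the two characters.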
In addition to specifying that a pattern may occur 0 or 1 time, the ? character has a second role: placed after another quantifier, it forces the pattern or subpattern to match as few characters as possible when it could match several in the input string.
Beyond the non-explicit quantifiers (usually just called quantifiers; the name is used here to distinguish them from the next group), there are explicit quantifiers. Where the non-explicit quantifiers are vague about how many occurrences of a pattern they allow, explicit quantifiers let you specify an exact number or a range. An explicit quantifier follows the pattern it applies to, just like a regular quantifier, and uses curly braces {} containing the lower and upper occurrence limits. For example, x{5} matches exactly five x characters (xxxxx). A single number is an exact count; a number followed by a comma, as in x{5,}, is a lower bound, so x{5,} matches any run of more than 4 x characters. Below are some sample patterns and inputs they match.
Pattern      Inputs (matches)
ab{2}c       abbc, aaabbccc
ab{0,2}c     ac, abc, abbc, aabbcc
ab{2,3}c     abbc, abbbc, aabbcc, aabbbcc
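The explicit quantifiers, and the minimal-match role of ?, can be sketched as follows. ExplicitQuantifierDemo and FirstMatchText are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitQuantifierDemo
{
    // Hypothetical helper: text of the first match, or "" when nothing matches.
    public static string FirstMatchText(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("abbc", "ab{2}c"));     // True
        Console.WriteLine(Regex.IsMatch("abc", "ab{2}c"));      // False
        Console.WriteLine(Regex.IsMatch("ac", "ab{0,2}c"));     // True
        Console.WriteLine(Regex.IsMatch("abbbbc", "ab{2,3}c")); // False

        // x{5,} is greedy and takes the whole run of seven x characters.
        Console.WriteLine(FirstMatchText("xxxxxxx", "x{5,}"));  // xxxxxxx

        // Appending ? makes the quantifier match as little as possible.
        Console.WriteLine(FirstMatchText("xxxxxxx", "x{5,}?")); // xxxxx
        Console.WriteLine(FirstMatchText("fooot", "fo+?"));     // fo
    }
}
```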
Metacharacters
The constructs within regular expressions that have special meaning are called metacharacters. You have already met several of them, such as the *, ?, +, and {} characters. Several other characters also have special meaning in the regular expression language: $ ^ . [ ( | ) ] and \.
The . (period or dot) metacharacter is one of the simplest and most frequently used. It matches any single character. Combined with quantifiers, it is useful for specifying that a pattern may contain any combination of characters but must fall within a certain length range.
We have also seen that an expression matches any instance of its pattern inside a longer string. What if you want to match the pattern exactly? This is often the case in validation scenarios, such as ensuring that a user has entered a postal code or telephone number in the proper format. The ^ metacharacter designates the beginning of a string (or line), and the $ metacharacter designates the end of a string (or line). By adding these characters to the beginning and end of a pattern, you force it to match only input strings that match the pattern exactly. The ^ metacharacter also has a special meaning when used at the start of a character class, which is written in square brackets []. This is covered below.
The \ (backslash) metacharacter is used to "escape" characters from their special meaning, and also to introduce the predefined set metacharacters. These, too, are covered below. To include a literal metacharacter in a regular expression, you must escape it with a backslash. For example, to match strings that begin with "c:\", you can use ^c:\\. Note that the ^ metacharacter indicates that the string must begin with this pattern, and that the literal backslash is escaped with the backslash metacharacter.
The | (pipe) metacharacter is used for alternation, essentially to specify "this OR that" within a pattern. For example, a|b matches any input containing an "a" or a "b", which makes it very similar to the character class [ab].
Finally, parentheses () are used to group patterns. Grouping allows a complete pattern to occur multiple times through a quantifier, improves readability, and lets specific pieces of the input be matched separately, perhaps for parsing or reformatting.
Some examples of metacharacter usage are listed below.
Pattern       Inputs (matches)
.             a, b, c, 1, 2, 3
.*            Abc, 123, any string, even an empty one
^c:\\         c:\windows, c:\foo.txt, c:\ followed by anything else
abc$          abc, 123abc, any string ending with abc
(abc){2,3}    abcabc, abcabcabc
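The anchoring, escaping, alternation, and grouping rules above can be sketched in C# as follows. MetacharacterDemo and its helper methods are hypothetical names; Regex.Escape is the framework method that adds the escaping backslashes to a literal string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // Hypothetical helper: true when the path begins with c:\ (case-sensitive).
    public static bool StartsWithCDrive(string path)
    {
        // The verbatim string @"..." keeps the pattern's backslashes readable.
        return Regex.IsMatch(path, @"^c:\\");
    }

    // Hypothetical helper: true when the whole input is exactly abc.
    public static bool IsExactlyAbc(string input)
    {
        return Regex.IsMatch(input, "^abc$");
    }

    public static void Main()
    {
        // Without ^, abc may appear at the end of a longer string.
        Console.WriteLine(Regex.IsMatch("123abc", "abc$")); // True
        Console.WriteLine(IsExactlyAbc("123abc"));          // False
        Console.WriteLine(IsExactlyAbc("abc"));             // True

        Console.WriteLine(StartsWithCDrive(@"c:\windows"));    // True
        Console.WriteLine(StartsWithCDrive(@"d:\c:\windows")); // False

        // Regex.Escape produces the escaped form of a literal string.
        Console.WriteLine(Regex.Escape(@"c:\")); // c:\\

        // Alternation, and a group repeated by a quantifier.
        Console.WriteLine(Regex.IsMatch("cab", "a|b"));                     // True
        Console.WriteLine(Regex.Match("abcabcabcabc", "(abc){2,3}").Value); // abcabcabc
    }
}
```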
Character Classes
Character classes are a mini-language within regular expressions, defined by enclosing square brackets []. The simplest character class is just a list of characters within the brackets, such as [aeiou]. When a character class is used in an expression, any one of its characters can appear at that position in the pattern (but only one, unless a quantifier is used). Note that a character class cannot define a word or a pattern, only an individual character.
To specify any numeric digit, you could use the character class [0123456789]. Since this quickly becomes cumbersome, ranges of characters can be defined within the brackets by using the hyphen character, -. The hyphen has special meaning within character classes, not within regular expressions in general (so, strictly speaking, it does not qualify as a regular expression metacharacter), and it has special meaning within a character class only when it is not the first character. To specify any numeric digit with a hyphen, use [0-9]. Similarly, use [a-z] for any lowercase letter and [A-Z] for any uppercase letter. The range defined by the hyphen depends on the character set in use, so the order in which the characters occur in (for example) the ASCII or Unicode table determines which characters the range includes. If you need to include a hyphen in a class, specify it as the first character. For example, [-.? ] matches any one of those four characters (note that the last character is a space). Also note that most regular expression metacharacters are not treated specially within character classes, so they do not need to be escaped. Think of character classes as a language separate from the rest of the regular expression world, with its own rules and syntax.
You can also match any character except a member of a character class by negating the class with the caret ^ as its first character. Thus, to match any non-vowel character, you could use the character class [^aAeEiIoOuU]. Note that if you want to negate a hyphen, it should be the second character of the character class, as in [^-]. Remember that ^ has a totally different meaning within a character class than it has at the start of a regular expression pattern.
Some examples of character classes in action are listed below.
Pattern        Inputs (matches)
^b[aeiou]t$    bat, bet, bit, bot, but
^[0-9]{5}$     11111, 12345, 99999
^c:\\          c:\windows, c:\foo.txt, c:\ followed by anything else
abc$           abc, 123abc, any string ending with abc
(abc){2,3}     abcabc, abcabcabc
^[^-][0-9]$    10, 42, a7: any non-hyphen character followed by one digit (does not match -0, -1, -2, etc.)
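A short sketch of the character class rules above: single-character matching, ranges, a leading literal hyphen, and negation. CharacterClassDemo and KeepVowels are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Hypothetical helper: removes every character that is not a vowel.
    public static string KeepVowels(string input)
    {
        return Regex.Replace(input, "[^aAeEiIoOuU]", "");
    }

    public static void Main()
    {
        // A class matches exactly one character unless it is quantified.
        Console.WriteLine(Regex.IsMatch("bit", "^b[aeiou]t$"));  // True
        Console.WriteLine(Regex.IsMatch("boot", "^b[aeiou]t$")); // False

        Console.WriteLine(Regex.IsMatch("12345", "^[0-9]{5}$")); // True
        Console.WriteLine(Regex.IsMatch("1234", "^[0-9]{5}$"));  // False

        // A leading hyphen is a literal hyphen, not a range operator.
        Console.WriteLine(Regex.IsMatch("-", "^[-.? ]$"));       // True

        // [^-] matches one character that is not a hyphen.
        Console.WriteLine(Regex.IsMatch("10", "^[^-][0-9]$"));   // True
        Console.WriteLine(Regex.IsMatch("-0", "^[^-][0-9]$"));   // False

        Console.WriteLine(KeepVowels("Regular"));                // eua
    }
}
```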
In the next version of the .NET Framework, code-named "Whidbey", a new feature called character class subtraction is being added to character classes. It allows one character class to be subtracted from another, which can give a more readable way to describe some patterns. The specification is available at http://www.gotdotnet.com/team/clr/bcl/techarticles/techarticles/specs/regex/characterclassSubtraction.doc. The syntax looks like [a-z-[aeiou]], which matches all lowercase consonants.
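As a sketch of how the subtraction syntax behaves, the program below assumes a runtime that supports it (the feature shipped with .NET Framework 2.0, the release code-named Whidbey). ClassSubtractionDemo and IsLowercaseConsonant are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassSubtractionDemo
{
    // Hypothetical helper: true when the input is a single lowercase consonant.
    public static bool IsLowercaseConsonant(string input)
    {
        // [a-z-[aeiou]] is the range a-z with the vowels subtracted.
        return Regex.IsMatch(input, "^[a-z-[aeiou]]$");
    }

    public static void Main()
    {
        Console.WriteLine(IsLowercaseConsonant("b")); // True
        Console.WriteLine(IsLowercaseConsonant("a")); // False
        Console.WriteLine(IsLowercaseConsonant("B")); // False

        // Remove the lowercase consonants, leaving the vowels and the space.
        Console.WriteLine(Regex.Replace("regular expression", "[a-z-[aeiou]]", "")); // eua eeio
    }
}
```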
Predefined Set Metacharacters
A great deal can be accomplished with the tools covered so far. However, it is fairly tedious to write [0-9] for every numeric digit in a pattern or, worse, [0-9a-zA-Z] for any alphanumeric character. To ease the pain of these common but lengthy patterns, a set of predefined metacharacters was defined. Different regular expression implementations define different sets of predefined metacharacters; the ones described here are supported by the System.Text.RegularExpressions API in the .NET Framework. The standard syntax for a predefined metacharacter is a backslash \ followed by one or more characters. Most of them are just one character, which makes them easy to use and an ideal replacement for lengthy character classes. Two examples are \d, which matches any numeric digit, and \w, which matches any word character (alphanumeric characters plus the underscore). The exceptions are the specific character code matches, for which you must give the code of the character to match; for example, \u000D matches the Unicode carriage return character. Some of the most common character classes and their metacharacter equivalents are listed below.
Metacharacter    Equivalent character class
\a               Matches a bell (alarm), \u0007.
\b               Matches a word boundary, except inside a character class, where it matches a backspace character, \u0008.
\t               Matches a tab, \u0009.
\r               Matches a carriage return, \u000D.
\v               Matches a vertical tab, \u000B.
\f               Matches a form feed, \u000C.
\n               Matches a new line, \u000A.
\e               Matches an escape, \u001B.
\040             Matches an ASCII character given as a 3-digit octal number. \040 represents a space (decimal 32).
\x20             Matches an ASCII character given as a 2-digit hexadecimal number. In this case, \x20 represents a space.
\cC              Matches an ASCII control character; in this example, Ctrl-C.
\u0020           Matches a Unicode character given as exactly four hexadecimal digits. In this case, \u0020 is a space.
\*               Any character that does not represent a predefined character class is simply treated as that character. Thus \* is the same as \x2A (a literal *, not the * metacharacter).
\p{name}         Matches any character in the named character class "name". Supported names are Unicode groups and block ranges, for example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency).
\P{name}         Matches text not included in the named character class "name".
\w               Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\W               The negation of \w. Equivalent to the ECMAScript-compliant set [^a-zA-Z_0-9] or to the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\s               Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note the leading space).
\S               Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note the space after ^).
\d               Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and to [0-9] for non-Unicode, ECMAScript behavior.
\D               Matches any non-digit character. Equivalent to [\P{Nd}] for Unicode and to [^0-9] for non-Unicode, ECMAScript behavior.
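A few of the table entries can be checked with a short program. This is an illustrative sketch: PredefinedSetDemo and CountWords are hypothetical names, and the Arabic-Indic digit is just one example of a non-ASCII character in the Unicode Nd category.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PredefinedSetDemo
{
    // Hypothetical helper: counts the runs of word characters in the input.
    public static int CountWords(string input)
    {
        return Regex.Matches(input, @"\w+").Count;
    }

    public static void Main()
    {
        // \w covers letters, digits, and the underscore, so the hyphen splits "baz-42".
        Console.WriteLine(CountWords("foo_bar baz-42")); // 3

        // \s matches both the space and the tab.
        Console.WriteLine(string.Join("|", Regex.Split("a b\tc", @"\s"))); // a|b|c

        // Three ways to write a space: octal, hexadecimal, and Unicode escapes.
        Console.WriteLine(Regex.IsMatch(" ", @"^\040$"));   // True
        Console.WriteLine(Regex.IsMatch(" ", @"^\x20$"));   // True
        Console.WriteLine(Regex.IsMatch(" ", @"^\u0020$")); // True

        // \d is Unicode-aware by default; the ECMAScript option restricts it to 0-9.
        string arabicIndicThree = "\u0663";
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"^\d$"));                          // True
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"^\d$", RegexOptions.ECMAScript)); // False
    }
}
```

The last two lines are the practical consequence of the Unicode and ECMAScript columns in the table: a pattern that relies on \d to mean 0 through 9 needs either the ECMAScript option or an explicit [0-9].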
Sample Expressions
Most people learn best by example, so here are a few sample expressions. For more samples, visit the online regular expression library at http://www.regexlib.com/.
Pattern                          Description
^\d{5}$                          5 numeric digits, such as a US ZIP code.
^(\d{5})|(\d{5}-\d{4})$          5 numeric digits, or 5 digits, a dash, and 4 digits. Matches a US ZIP code in 5-digit or ZIP+4 format.
^(\d{5})(-\d{4})?$               Same as the previous pattern, but more efficient. Uses ? to make the dash-and-4-digits portion of the pattern optional, rather than requiring two separate patterns to be compared (through alternation).
^[+-]?\d+(\.\d+)?$               Matches any real number with an optional sign.
^[+-]?\d*\.?\d*$                 Same as the previous pattern, but also matches the empty string.
^(20|21|22|23|[01]\d)[0-5]\d$    Matches any 24-hour time value.
/\*.*\*/                         Matches the contents of a C-style comment /* ... */
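The sample expressions can be exercised with a few inputs, as in the sketch below. SampleExpressionDemo and its IsZip, IsReal, and IsTime24 helpers are hypothetical names. The example also shows one behavior worth knowing about the alternation form of the ZIP pattern: because ^ belongs only to the first branch and $ only to the second, it accepts five digits followed by other text, whereas the optional-group form does not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SampleExpressionDemo
{
    public static bool IsZip(string s)
    {
        return Regex.IsMatch(s, @"^(\d{5})(-\d{4})?$");
    }

    public static bool IsReal(string s)
    {
        return Regex.IsMatch(s, @"^[+-]?\d+(\.\d+)?$");
    }

    public static bool IsTime24(string s)
    {
        return Regex.IsMatch(s, @"^(20|21|22|23|[01]\d)[0-5]\d$");
    }

    public static void Main()
    {
        Console.WriteLine(IsZip("12345"));      // True
        Console.WriteLine(IsZip("12345-6789")); // True
        Console.WriteLine(IsZip("12345-678"));  // False

        Console.WriteLine(IsReal("-3.14"));     // True
        Console.WriteLine(IsReal("3."));        // False

        Console.WriteLine(IsTime24("2359"));    // True
        Console.WriteLine(IsTime24("2460"));    // False

        // The alternation form anchors each branch on one side only.
        Console.WriteLine(Regex.IsMatch("12345abc", @"^(\d{5})|(\d{5}-\d{4})$")); // True
        Console.WriteLine(IsZip("12345abc"));                                     // False

        // The comment pattern returns the comment including its delimiters.
        Console.WriteLine(Regex.Match("int x; /* counter */ x++;", @"/\*.*\*/").Value); // /* counter */
    }
}
```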
Validation in ASP.NET
ASP.NET provides a suite of validation controls that make validating inputs on web forms extremely easy compared to the same task in legacy (or classic, if you prefer) ASP. One of the more powerful validators is the RegularExpressionValidator which, as you might guess, lets you validate an input by providing a regular expression that the input must match. The regular expression pattern is specified by setting the control's ValidationExpression property. An example validator for a ZIP code field is shown below:
    <asp:RegularExpressionValidator runat="server"
        ControlToValidate="ZipCodeTextBox"
        ErrorMessage="Invalid ZIP code format; format should be either 12345 or 12345-6789."
        ValidationExpression="(\d{5})(-\d{4})?" />
A few things to note about the RegularExpressionValidator:
• It is never activated by an empty string in the control it is validating. Only the RequiredFieldValidator catches empty strings.
• You do not need to specify the beginning-of-string and end-of-string matching characters (^ and $); they are assumed. Adding them does no harm and changes nothing; it is simply unnecessary.
• As with all validation controls, validation is done on the client as well as on the server. If your regular expression is not ECMAScript compliant, it will fail on the client. To avoid this, either make sure the expression is ECMAScript compliant or set the control to validate only on the server.
Regular Expression API
Outside of the ASP.NET validation controls, most uses of regular expressions in .NET go through the classes in the System.Text.RegularExpressions namespace. In particular, the main classes to become familiar with are Regex, Match, and MatchCollection.
Incidentally, there is some dispute about whether regex, the shortened form of regular expression, should be pronounced /reg-eks/ or /rej-eks/. I prefer the latter, but there are experts in both camps, so pick whichever sounds better to you.
The Regex class has a rich set of methods and properties, which can be rather daunting if you have not used it before. The most frequently used methods are summarized here:
Method               Description
Escape / Unescape    Escapes metacharacters within a string so that they are treated as literals in an expression (Unescape reverses this).
IsMatch              Returns true if the regular expression finds a match in the input string.
Match                Returns a Match object if a match is found in the input string.
Matches              Returns a MatchCollection object containing any and all matches found in the input string.
Replace              Replaces matches in the input string with a given replacement string.
Split                Returns an array of strings by splitting the input string into array elements separated by regular expression matches.
In addition to the many methods, there are a number of options that can be specified, usually in the Regex object's constructor. Since these options are bit flags, they can be combined with the bitwise OR operator (yes, both Multiline and Singleline can be turned on at the same time).
Option                     Description
Compiled                   Use this option when you will be doing many match operations in a loop. It saves the step of parsing the expression on each iteration.
Multiline                  Has nothing to do with how many lines are in the input string. Rather, it modifies the behavior of ^ and $ so that they match the beginning of a line (BOL) and the end of a line (EOL) instead of the beginning and end of the entire input string.
IgnoreCase                 Causes the pattern to ignore case when matching the search string.
IgnorePatternWhitespace    Allows the pattern to contain as much white space as desired, and also enables comments within the pattern, marked with # (the inline (?# comment) syntax can be used as well).
Singleline                 Has nothing to do with how many lines are in the input string. Rather, it causes the . (period) metacharacter to match any character, instead of any character except \n (the default).
Common uses of regular expressions include validating, matching, and replacing. In many cases, these can be accomplished with the static methods of the Regex class, without any need to instantiate the Regex class itself. To perform validation, all you need to do is create or find the right expression and apply it to the input string with the IsMatch() method of the Regex class.
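The options in the table above are easier to tell apart when the same pattern is run with and without them. The following is a minimal sketch: RegexOptionsDemo and CountLineWords are hypothetical names, and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Hypothetical helper: counts matches of ^\w+$ under the given options.
    public static int CountLineWords(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w+$", options).Count;
    }

    public static void Main()
    {
        string twoLines = "one\ntwo";

        // By default ^ and $ anchor to the whole input; Multiline anchors them to each line.
        Console.WriteLine(CountLineWords(twoLines, RegexOptions.None));      // 0
        Console.WriteLine(CountLineWords(twoLines, RegexOptions.Multiline)); // 2

        // By default . does not match \n; Singleline lets it.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                          // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True

        // Options are bit flags, so they combine with the | operator.
        Regex re = new Regex(@"
            zip \s+            # the word zip, in any case
            \d{5} (-\d{4})?    # ZIP or ZIP+4",
            RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
        Console.WriteLine(re.IsMatch("ZIP 12345-6789")); // True

        // The static Split and Replace methods need no Regex instance.
        Console.WriteLine(string.Join("|", Regex.Split("a, b ,c", @"\s*,\s*"))); // a|b|c
        Console.WriteLine(Regex.Replace("a1b22c", @"\d+", "#"));                 // a#b#c
    }
}
```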
The following event handler, for example, uses a regular expression to validate a ZIP code:
    // Requires: using System.Text.RegularExpressions;
    private void ValidateZipButton_Click(object sender, System.EventArgs e)
    {
        string zipRegex = @"^\d{5}$";
        if (Regex.IsMatch(ZipTextBox.Text, zipRegex))
        {
            ResultLabel.Text = "ZIP is valid!";
        }
        else
        {
            ResultLabel.Text = "ZIP is invalid!";
        }
    }
Similarly, the static Replace() method can be used to replace matches with a particular string, as shown below:
    string newText = Regex.Replace(inputString, pattern, replacementText);
Finally, you can iterate through the collection of matches in an input string with code like this:
    private void MatchButton_Click(object sender, System.EventArgs e)
    {
        MatchCollection matches = Regex.Matches(SearchStringTextBox.Text, MatchExpressionTextBox.Text);
        MatchCountLabel.Text = matches.Count.ToString();
        MatchesLabel.Text = "";
        foreach (Match match in matches)
        {
            // Append one line per match to the label.
            MatchesLabel.Text += "Found " + match.ToString() + " at position " + match.Index + ".<br>";
        }
    }
You typically need to create an instance of the Regex class when you want anything other than the default behavior, in particular when setting options. For example, to create a Regex instance that ignores case and pattern white space, and then retrieve the set of matches for that expression, you would use code like the following:
    Regex re = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    MatchCollection mc = re.Matches(inputString);
Complete working versions of these samples are included in the download for this article, as simple ASP.NET pages.
Free Tools
The Regulator (http://royo.isha-geek.com/iysializable/regulator/): a client-run regular expression testing tool, tightly integrated with RegexLib through web services, with support for Match, Split, Replace, and more. Includes a performance analyzer and syntax highlighting.
RegexDesigner.NET (http://www.selsbrothers.com/tools/): a powerful visual tool that helps you construct and test regular expressions. It can generate C# and/or VB.NET code and compiled assemblies to help you integrate expressions into your applications.
Regular Expression Workbench (v2.0) (http://www.gotdotnet.com/community/Usersamples/details.aspx?sampleguid=c712f2df-b026-4d58-8961-4ee2729d7322): Eric Gunnerson's tool for creating, testing, and studying regular expressions. Its "Examine-o-matic" feature lets you hover the mouse over a regular expression to decode its meaning.
Advanced Topics
Two regular expression features that really require some thought are named groups and lookaround processing. Since you will need them only on rare occasions, they are described only briefly here.
With named groups, you can name individual matching groups and then refer to those groups programmatically. This is especially powerful in combination with the Replace method, as a way of reformatting an input string by rearranging the elements within it. For example, suppose you are given a date as a string in the form MM/DD/YYYY and you want it in the form DD-MM-YYYY. You could write an expression that captures the first format, iterate through its Matches collection, parse each string, and use string manipulation to build the replacement string. This would require a fair amount of code and a fair amount of processing. With named groups, you can accomplish the same thing like this:
    string MDYToDMY(string input)
    {
        return Regex.Replace(input,
            @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})\b",
            "${day}-${month}-${year}");
    }
You can refer to groups by number as well as by name. Either way, such references are collectively called backreferences. Another common use of backreferences is within the matching expression itself, as in this expression for finding repeated letters: ([a-z])\1. It matches "aa", "bb", and "cc", and is not the same as [a-z]{2} or [a-z][a-z], which are equivalent to each other and would also match "ab", "ac", or any other two-letter combination.
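The named groups and backreferences described above can also be read and used directly from code. The sketch below is illustrative: BackreferenceDemo, DatePart, and CountDoubledLetters are hypothetical names, and the date pattern assumes the MM/DD/YYYY layout from the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Assumed pattern for an MM/DD/YYYY date with three named groups.
    private const string DatePattern =
        @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})\b";

    // Hypothetical helper: reads one named group from the first date in the input.
    public static string DatePart(string input, string groupName)
    {
        return Regex.Match(input, DatePattern).Groups[groupName].Value;
    }

    // Hypothetical helper: counts doubled lowercase letters such as "aa".
    public static int CountDoubledLetters(string input)
    {
        return Regex.Matches(input, @"([a-z])\1").Count;
    }

    public static void Main()
    {
        Console.WriteLine(DatePart("Due 03/25/2004", "year")); // 2004
        Console.WriteLine(DatePart("Due 03/25/2004", "day"));  // 25

        // \1 must repeat whatever group 1 captured, so "ab" does not count.
        Console.WriteLine(CountDoubledLetters("aa ab bb"));             // 2
        Console.WriteLine(Regex.Matches("aa ab bb", "[a-z]{2}").Count); // 3

        // A named group can be referenced with \k<name> instead of a number.
        Console.WriteLine(Regex.IsMatch("bookkeeper", @"(?<c>[a-z])\k<c>")); // True
    }
}
```

Note that a numbered backreference such as \1 needs a capturing group, written with parentheses, to refer to.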
Backreferences allow an expression to remember parts of the input string that it has already parsed and matched.
"Lookaround processing" refers to the positive and negative lookahead and lookbehind capabilities supported by many regular expression engines. Not all engines support every variation of lookaround. These constructs do not consume characters, even though they may match them. Some patterns are impossible to describe without lookaround, especially those in which the existence of one part of the pattern depends on the existence of another. The syntax for each flavor of lookaround is listed below.
Syntax      Description
(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind
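The lookaround flavors just described can each be sketched with a one-line pattern. LookaroundDemo and FirstMatchText are hypothetical names, and the sample inputs are made up for illustration. In every case the lookaround text is tested but does not become part of the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: text of the first match, or "" when nothing matches.
    public static string FirstMatchText(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Positive lookahead: digits followed by " dollars"; the suffix is not consumed.
        Console.WriteLine(FirstMatchText("100 dollars", @"\d+(?= dollars)")); // 100

        // Negative lookahead: a q that is not followed by u.
        Console.WriteLine(Regex.IsMatch("Iraq", "q(?!u)")); // True
        Console.WriteLine(Regex.IsMatch("quit", "q(?!u)")); // False

        // Positive lookbehind: digits preceded by a dollar sign.
        Console.WriteLine(FirstMatchText("cost: $42", @"(?<=\$)\d+")); // 42

        // Negative lookbehind: a whole number not preceded by a dollar sign.
        Console.WriteLine(FirstMatchText("$42 and 17", @"(?<!\$)\b\d+")); // 17
    }
}
```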
Password validation is one example where lookaround processing is necessary. Consider a restriction that a password must be between 4 and 8 characters long and must contain at least one digit. You could do this by testing for a \d match and then using string operations to test the length, but doing the whole thing in one regular expression requires lookahead, specifically positive lookahead, as this expression demonstrates: ^(?=.*\d).{4,8}$
Conclusion
Regular expressions provide a very powerful way to describe patterns in text, which makes them an excellent resource for string validation and manipulation. The .NET Framework provides first-rate support for regular expressions in its System.Text.RegularExpressions namespace, and specifically in the Regex class. Using the API is simple, but coming up with the right regular expression is often the hard part. Fortunately, regular expressions are highly reusable, and there are many resources online where you can find expressions written by others or get help with ones you are struggling to create.
Resources
Regular Expression Library: http://www.regexlib.com/
Regular Expressions Discussion List: http://aspadvice.com/login.aspx?returnurl=/signup/list.aspx?L=68&C=16&l=68&c=16
Regular Expression Forums: http://forums.regexadvice.com/
Regular Expression Web Logs: http://blogs.regexadvice.com/
Mastering Regular Expressions (O'Reilly), by Jeffrey Friedl: http://www.regex.info/
.NET Regular Expression Reference: http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemtextregulaearxpressions.asp
JScript Regular Expression Syntax: http://www.msdn.microsoft.com/library/en-us/script56/HTML/JS56JSGRPREGEXPSYNTAX.ASP
Regular Expression Info: http://www.regular-expressions.info/
About the Author
Steven A. Smith is a Microsoft Most Valuable Professional (MVP) for ASP.NET and the president and owner of ASPAlliance.com and DevAdvice.com. He is also the owner and head instructor of ASPSmith Ltd, a .NET-focused training company. He has written two books, "ASP.NET Developer's Cookbook" and "ASP.NET By Example", and has published articles in MSDN and AspNetPRO magazines. Steve speaks at conferences each year and is a member of the INETA speakers bureau. Steve has a master's degree in management and a bachelor's degree in computer science engineering. To contact Steve, send email to ssmith@aspalliance.com.