Regular Expressions
Keywords: regular expression
Original: smile
Created in: May 03, 2004, last updated:
May 04, 2004 21:12
Copyright notice: published under a Creative Commons license
Quote address: Regular Expression
Foreword
Regular expressions are tedious to learn but powerful. Once learned, they not only improve your efficiency but also give you a real sense of accomplishment. If you read this material carefully and keep a reference at hand when applying it, mastering regular expressions is not a problem.
Index
1. Introduction 2. History of regular expressions 3. Definition of regular expressions
3.1 Ordinary characters 3.2 Non-printing characters 3.3 Special characters 3.4 Quantifiers 3.5 Anchors 3.6 Alternation 3.7 Backreferences
4. Operator precedence 5. Explanation of all symbols 6. Some examples 7. Regular expression matching rules
7.1 Basic pattern matching 7.2 Character clusters 7.3 Determining repetition
1. Introduction
At present, regular expressions are widely used in many kinds of software, including operating systems such as *nix (Linux, Unix, etc.) and HP, development environments such as PHP, C# and Java, and many applications. Their shadow can be seen everywhere.
Regular expressions achieve powerful results with simple means. The price of this brevity is that regular expression code is hard to read and not easy to learn, so some effort is required. Once you have the basics and keep a reference at hand, though, using them is relatively simple and effective.
Example: ^.+@.+\\..+$
Code like this has scared me off more than once, and it probably scares many other people too. Keep reading, and this article will let you apply such code freely.
Note: Section 7 may appear to repeat earlier content. Its purpose is to describe the material in the earlier tables again so that it is easier to understand.
2. History of regular expressions
The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks.
In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
Subsequently, this work was found to be applicable to early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was in the Unix editor qed.
As they say, the rest is history. Regular expressions have been an important part of text-based editors and search tools ever since.
3. Definition of regular expressions
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract from a string the substrings that meet a certain condition.
When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there has a different meaning from the * in a regular expression.
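The difference between a shell wildcard and a regular expression can be checked directly. The following is a minimal sketch in Python, which is an assumption of this example (the article itself is not tied to one language). It uses the standard fnmatch and re modules, and the file names are made up for illustration.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
names = ["notes.txt", "a.txt", "notes.text", "txt"]

# Shell-style wildcard: "*" on its own stands for any run of characters.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regular expression: "*" repeats the item before it, so "any characters"
# is written ".*", and the literal dot must be escaped.
regex_hits = [n for n in names if re.fullmatch(r".*\.txt", n)]

# Used as a regular expression, "*.txt" is rejected by Python's re module,
# because the leading "*" has nothing to repeat.
try:
    re.compile("*.txt")
    wildcard_is_valid_regex = True
except re.error:
    wildcard_is_valid_regex = False

print(glob_hits)                # ['notes.txt', 'a.txt']
print(regex_hits)               # ['notes.txt', 'a.txt']
print(wildcard_is_valid_regex)  # False
```

Both approaches select the same files here, but only because the regular expression was rewritten; the wildcard text itself does not carry over.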
A regular expression is a text pattern composed of ordinary characters (such as the letters a through z) and special characters (called metacharacters). It serves as a template that matches a character pattern against the string being searched.
3.1 Ordinary characters
Ordinary characters consist of all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
3.2 Non-printing characters
Character    Meaning
\cx    Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
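As a quick check of the equivalences listed for the non-printing characters, here is a small sketch in Python (an assumption of this example; the table describes the JScript/VBScript dialect). Python's re module does not provide the \cx notation, so only the \x forms are compared.

```python
import re

# Each pair: (escape written in the pattern, the character it should match).
pairs = [
    (r"\f", "\x0c"),  # form feed
    (r"\n", "\x0a"),  # newline
    (r"\r", "\x0d"),  # carriage return
    (r"\t", "\x09"),  # tab
    (r"\v", "\x0b"),  # vertical tab
]
all_match = all(re.fullmatch(pattern, char) for pattern, char in pairs)

# \s covers the whole group plus the space; \S is its complement.
sample = "a b\tc\nd"
blanks = re.findall(r"\s", sample)
non_blanks = re.findall(r"\S", sample)

print(all_match)   # True
print(blanks)      # [' ', '\t', '\n']
print(non_blanks)  # ['a', 'b', 'c', 'd']
```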
3.3 Special characters
Special characters are characters that carry a special meaning, such as the * in "*.txt" above, which simply stands for any string. To find files that have a * in the file name, you need to escape the *, that is, put a \ in front of it: ls \*.txt. Regular expressions have the following special characters.
Character    Description
$    Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )    Marks the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*    Matches the preceding subexpression zero or more times. To match the * character, use \*.
+    Matches the preceding subexpression one or more times. To match the + character, use \+.
.    Matches any single character except the newline \n. To match ., use \..
[    Marks the beginning of a bracket expression. To match [, use \[.
?    Matches the preceding subexpression zero or one time, or indicates a non-greedy quantifier. To match the ? character, use \?.
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline. The sequence '\\' matches "\" and '\(' matches "(".
^    Matches the beginning of the input string, unless it is used in a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{    Marks the beginning of a quantifier expression. To match {, use \{.
|    Indicates a choice between two items. To match |, use \|.
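To see why escaping matters, the sketch below (Python, chosen as an assumption for these examples) compares an unescaped and an escaped period and then matches a few metacharacters literally. The sample text is invented for illustration; re.escape is the standard-library helper that adds the backslashes automatically.

```python
import re

# Invented sample text containing several metacharacters.
text = "Price: $5.00 (approx.) for a*b"

# An unescaped "." matches any character, so "5.00" also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None
strict = re.search(r"5\.00", "5x00") is not None

# Escaped metacharacters match themselves.
dollar = re.search(r"\$5\.00", text).group()
parens = re.search(r"\(approx\.\)", text).group()
star = re.search(r"a\*b", text).group()

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("a*b")
literal_hit = re.search(escaped, text).group()

print(loose, strict)          # True False
print(dollar, parens, star)   # $5.00 (approx.) a*b
print(literal_hit)            # a*b
```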
Constructing a regular expression is like constructing a mathematical expression: smaller expressions are combined into larger ones using various metacharacters and operators. The components of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these.
3.4 Quantifiers
Quantifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,} and {n,m}.
The *, + and ? quantifiers are greedy: they match as much text as possible. Adding a ? after them makes them non-greedy, so that they match as little as possible.
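The effect of greedy and non-greedy matching is easiest to see side by side. This is a minimal sketch in Python (an assumed language for these examples); the tagged string is invented for illustration.

```python
import re

# Invented sample string.
tagged = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", tagged)

# Non-greedy: ".+?" stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", tagged)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# The same effect on a run of o's.
greedy_o = re.search(r"o+", "oooo").group()
lazy_o = re.search(r"o+?", "oooo").group()
print(greedy_o)  # oooo
print(lazy_o)    # o
```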
The quantifiers of regular expressions are:
Character    Description
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
3.5 Anchors
Anchors describe the boundaries of strings or words: ^ and $ refer to the beginning and end of the string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary.
You cannot apply a quantifier to an anchor.
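The behaviour of word boundaries, and the rule that an anchor cannot take a quantifier, can be tried out as follows. This is a sketch in Python (an assumption of these examples); the helper name compiles is hypothetical, and other regex engines may treat a quantified anchor differently.

```python
import re

# \b matches at a word boundary, \B at any position that is not one.
er_at_end = re.search(r"er\b", "never") is not None      # "er" ends the word
er_inside = re.search(r"er\b", "verb") is not None       # "er" is followed by "b"
er_not_at_end = re.search(r"er\B", "verb") is not None
er_B_in_never = re.search(r"er\B", "never") is not None

print(er_at_end, er_inside)          # True False
print(er_not_at_end, er_B_in_never)  # True False


def compiles(pattern):
    """Hypothetical helper: report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# An anchor matches a position, not a character, so repeating it is rejected.
print(compiles(r"^*"))    # False
print(compiles(r"^a*"))   # True
```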
3.6 Alternation
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the match they enclose is captured. Placing ?: before the first alternative removes this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!. The former is a positive lookahead, which matches the search string at any position where the pattern in the parentheses begins to match; the latter is a negative lookahead, which matches the search string at any position where the pattern does not match.
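The sketch below, in Python (an assumed language for these examples), shows the capturing side effect of parentheses, how ?: removes it, and how the two lookaheads accept or reject a position without consuming the text they inspect. The sample strings are chosen for illustration.

```python
import re

# Capturing group: the alternative that matched is stored as group 1.
capturing = re.search(r"industr(y|ies)", "industries")
print(capturing.groups())       # ('ies',)

# Non-capturing group: same match, nothing stored.
non_capturing = re.search(r"industr(?:y|ies)", "industries")
print(non_capturing.groups())   # ()

# Positive lookahead: "Windows " only when a listed version follows.
ahead = re.compile(r"Windows (?=95|98|NT|2000)")
hit = ahead.search("Windows 2000")
print(hit.group())                                # 'Windows ' (version not consumed)
print(ahead.search("Windows 3.1") is None)        # True

# Negative lookahead: "Windows " only when no listed version follows.
not_ahead = re.compile(r"Windows (?!95|98|NT|2000)")
print(not_ahead.search("Windows 3.1").group())    # 'Windows '
print(not_ahead.search("Windows 2000") is None)   # True
```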
3.7 Backreferences
Placing parentheses around a regular expression pattern or part of a pattern causes that part of the match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered from left to right in the pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number identifying a specific buffer.
You can use the non-capturing metacharacters '?:', '?=' or '?!' to skip saving the related match.
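A common use of backreferences is finding words that are accidentally doubled. The following Python sketch (Python is an assumption of these examples) uses \1 both inside the pattern and in the replacement text; the sample sentence is invented.

```python
import re

# ([a-z]+) captures a word into buffer 1; \1 then requires the same word again.
doubled = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

# Invented sample sentence with three doubled words.
text = "Is is the cost of of gasoline going up up?"

found = [m.group(1) for m in doubled.finditer(text)]
print(found)  # ['Is', 'of', 'up']

# In the replacement string, \1 refers to the same captured word.
fixed = doubled.sub(r"\1", text)
print(fixed)  # Is the cost of gasoline going up?
```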
4. Operator precedence
Operators of equal precedence are evaluated from left to right; operators of different precedence are evaluated from highest to lowest. The precedence of the operators, from highest to lowest, is as follows:
Operator    Description
\    Escape
(), (?:), (?=), []    Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}    Quantifiers
^, $, \anymetacharacter    Position and order
|    "Or" operation
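Two consequences of this ordering are worth trying out: | binds more loosely than anchors and sequences, and a quantifier applies only to the single item before it. The sketch is in Python (an assumption of these examples), and the patterns are illustrative.

```python
import re

# "|" has the lowest precedence: this reads as (^chapter) or (section$).
loose = re.compile(r"^chapter|section$")
# Grouping makes both anchors apply to both alternatives.
tight = re.compile(r"^(?:chapter|section)$")

loose_hit = loose.search("chapter one") is not None
tight_hit = tight.search("chapter one") is not None
print(loose_hit, tight_hit)  # True False

# A quantifier binds tighter than sequencing: "ab+" means a followed by b+.
only_b_repeats = re.fullmatch(r"ab+", "abbb") is not None
pair_repeats = re.fullmatch(r"ab+", "abab") is not None
grouped_pair = re.fullmatch(r"(?:ab)+", "abab") is not None
print(only_b_repeats, pair_repeats, grouped_pair)  # True False True
```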
5. Explanation of all symbols
Character    Description
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and '\(' matches "(".
^    Matches the beginning of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$    Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?    When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the search string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.    Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)    Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)    Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches the "Windows" in "Windows 2000" but not the "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
(?!pattern)    Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches the "Windows" in "Windows 3.1" but not the "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
x|y    Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]    A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]    A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]    A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' through 'z'.
[^a-z]    A negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.
\b    Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B    Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx    Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
\w    Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W    Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn    Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n    Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m. If neither condition holds and n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml    Matches the octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
\un    Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
6. Some examples
Regular expression    Description
/\b([a-z]+) \1\b/gi    The position where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/    Resolves a URL into protocol, domain, port and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/    Locates the position of a chapter
/[-a-z]/    The 26 letters a to z plus a hyphen
/ter\b/    Matches the "ter" in "chapter" but not in "terminal"
/\Bapt/    Matches the "apt" in "chapter" but not in "aptitude"
/Windows (?=95|98|NT)/    Matches the "Windows" in Windows 95, Windows 98 or Windows NT; after a match is found, the search for the next match begins after "Windows"
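Some of these examples can be tried out in Python, which is an assumption of this sketch (the table uses JScript literal syntax, so the surrounding slashes and the escaped \/ are dropped here). The URL is the one quoted in the MSDN example list at the end of this article.

```python
import re

# URL pattern from the table, written as a Python raw string.
url_re = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
parts = url_re.search("http://www.contoso.com:8080/letters/readme.html").groups()
print(parts)
# ('http', 'www.contoso.com', ':8080', '/letters/readme.html')

# Chapter or section heading followed by a number from 1 to 99.
heading_re = re.compile(r"^(?:Chapter|Section) [1-9][0-9]{0,1}$")
headings = ["Chapter 12", "Section 7", "Chapter 0", "Chapter 123"]
valid = [h for h in headings if heading_re.search(h)]
print(valid)  # ['Chapter 12', 'Section 7']

# Word-boundary examples.
ter_in_chapter = re.search(r"ter\b", "chapter") is not None
ter_in_terminal = re.search(r"ter\b", "terminal") is not None
apt_in_chapter = re.search(r"\Bapt", "chapter") is not None
apt_in_aptitude = re.search(r"\Bapt", "aptitude") is not None
print(ter_in_chapter, ter_in_terminal)  # True False
print(apt_in_chapter, apt_in_aptitude)  # True False
```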
7. Regular expression matching rules
7.1 Basic pattern matching
Everything starts from the basics. A pattern is the most basic element of a regular expression: a set of characters that describes the features of a string. A pattern can be very simple, consisting of an ordinary string, or very complex, using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which means that it matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with the given pattern.
bucket$
This pattern matches "Who kept all of this cash in a bucket" but not "buckets". Using ^ and $ together indicates an exact match (the string is identical to the pattern). For example:
^bucket$
This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
and the string
There once was a man from NewYork
Who kept all of his cash in a bucket.
match. The letters in this pattern (o-n-c-e) are literal characters, that is, they stand for the letters themselves; the same goes for digits. Slightly more complex characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to test whether a string begins with a tab, use this pattern:
^\t
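The basic patterns of this section can be checked with a few lines of Python (an assumption of these examples; the patterns are written in lowercase and without stray spaces). The helper name found is hypothetical.

```python
import re


def found(pattern, text):
    """Hypothetical helper: True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None


starts_with_once = found(r"^once", "once upon a time")
once_in_middle = found(r"^once", "There once was a man from NewYork")
ends_with_bucket = found(r"bucket$", "Who kept all of this cash in a bucket")
buckets = found(r"bucket$", "buckets")
exact = found(r"^bucket$", "bucket")
unanchored = found(r"once", "There once was a man from NewYork")
leading_tab = found(r"^\t", "\tindented line")

print(starts_with_once, once_in_middle)  # True False
print(ends_with_bucket, buckets)         # True False
print(exact, unanchored, leading_tab)    # True True True
```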
Similarly, \n represents a newline and \r a carriage return. Other special symbols can be written with a preceding backslash: the backslash itself is written \\, the period is written \., and so on.
7.2 Character clusters
In Internet programs, regular expressions are usually used to validate user input. After a user submits a form, the program has to decide whether the phone number, address, email address, credit card number and so on are valid, and ordinary literal characters are not enough for that. A freer way of describing the pattern we want is needed: the character cluster. To create a character cluster that represents all vowels, put all the vowel characters in square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it represents only one character. Use a hyphen to represent a range of characters, for example:
[a-z] // matches all lowercase letters
[A-Z] // matches all uppercase letters
[a-zA-Z] // matches all letters
[0-9] // matches all digits
[0-9\.\-] // matches all digits, periods and minus signs
[ \f\r\t\n] // matches all whitespace characters
Again, each of these represents only one character, which is very important. To match a string consisting of one lowercase letter and one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] represents a range of 26 letters, here it can only match a string whose first character is a lowercase letter. It was mentioned earlier that ^ marks the beginning of a string, but it has another meaning as well. When ^ is used inside square brackets, it means "not" or "exclude", and is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7" and "-2", but not "12" or "66". Here are a few examples that exclude specific characters:
[^a-z] // all characters except lowercase letters
[^\\\/\^] // all characters except (\), (/) and (^)
[^\"\'] // all characters except double quotes (") and single quotes (')
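The character cluster examples can be verified directly. The following is a sketch in Python (an assumption of these examples), using the sample strings mentioned in the text plus an invented phrase for the vowel cluster.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")
candidates = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
letter_digit_hits = [c for c in candidates if letter_digit.search(c)]
print(letter_digit_hits)  # ['z2', 't6', 'g7']

# Inside brackets, a leading ^ negates: first character must not be a digit.
non_digit_first = re.compile(r"^[^0-9][0-9]$")
candidates = ["&5", "g7", "-2", "12", "66"]
non_digit_hits = [c for c in candidates if non_digit_first.search(c)]
print(non_digit_hits)  # ['&5', 'g7', '-2']

# A cluster stands for exactly one character, so findall returns single letters.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")
print(vowels)  # ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```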
The special character "." (dot, period) in a regular expression represents any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string consisting only of a newline.
PHP regular expressions have some built-in general character clusters, listed below:
Character cluster    Meaning
[[:alpha:]]    Any letter
[[:digit:]]    Any digit
[[:alnum:]]    Any letter or digit
[[:space:]]    Any whitespace character
[[:upper:]]    Any uppercase letter
[[:lower:]]    Any lowercase letter
[[:punct:]]    Any punctuation mark
[[:xdigit:]]    Any hexadecimal digit, equivalent to [0-9a-fA-F]
7.3 Determining repetition
So far you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits of several single digits. Curly braces ({}) following a character or character cluster specify how many times the preceding content is repeated.
Character cluster    Meaning
^[a-zA-Z_]$    Any single letter or underscore
^[[:alpha:]]{3}$    Any three-letter word
^a$    The letter a
^a{4}$    aaaa
^a{2,4}$    aa, aaa or aaaa
^a{1,3}$    a, aa or aaa
^a{2,}$    A string of two or more a's
^a{2,}    For example aardvark and aaab, but not apple
a{2,}    For example baad and aaa, but not Nantucket
\t{2}    Two tabs
.{2}    Any two characters
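A short Python sketch (Python being an assumption of these examples) confirms several rows of the repetition table. The helper name hits is hypothetical, and the POSIX cluster row is left out because Python's re module does not support the [[:alpha:]] notation.

```python
import re


def hits(pattern, candidates):
    """Hypothetical helper: the candidates in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]


runs = ["a", "aa", "aaa", "aaaa", "aaaaa"]

exactly_four = hits(r"^a{4}$", runs)
two_to_four = hits(r"^a{2,4}$", runs)
two_or_more = hits(r"^a{2,}$", runs)
print(exactly_four)  # ['aaaa']
print(two_to_four)   # ['aa', 'aaa', 'aaaa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa', 'aaaaa']

# Without "$" the run only has to start the string; without "^" it may be anywhere.
at_start = hits(r"^a{2,}", ["aardvark", "aaab", "apple"])
anywhere = hits(r"a{2,}", ["baad", "aaa", "nantucket"])
print(at_start)  # ['aardvark', 'aaab']
print(anywhere)  # ['baad', 'aaa']

two_tabs = hits(r"\t{2}", ["a\t\tb", "a\tb"])
print(two_tabs == ["a\t\tb"])  # True
```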
These examples show three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number followed by a comma, {x,}, means "the preceding content appears x or more times"; two comma-separated numbers, {x,y}, mean "the preceding content appears at least x times but no more than y times". We can extend the patterns to more words or numbers:
^[a-zA-Z0-9_]{1,}$ // all strings consisting of one or more letters, digits or underscores
^[0-9]{1,}$ // all non-negative integers
^\-{0,1}[0-9]{1,}$ // all integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // all decimal numbers
The last example is not easy to understand, is it? Read it this way: it begins with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more digits again ([0-9]{0,}), and nothing else ($). There is a simpler notation. The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding content" or "the preceding content is optional". So the example above can be simplified to:
^\-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding content". So the four examples above can be written:
^[a-zA-Z0-9_]+$ // all strings consisting of one or more letters, digits or underscores
^[0-9]+$ // all non-negative integers
^\-?[0-9]+$ // all integers
^\-?[0-9]*\.?[0-9]*$ // all decimal numbers
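It is worth testing the number patterns against a few edge cases. The sketch below is in Python (an assumption of these examples) and writes the minus sign without the optional backslash. It shows that the decimal pattern also accepts strings with no digits at all, because every part of it is optional. The stricter variant at the end is a hypothetical tightening that is not part of the original article.

```python
import re

integer = re.compile(r"^-?[0-9]+$")
decimal = re.compile(r"^-?[0-9]*\.?[0-9]*$")

# Illustrative inputs, including some degenerate ones.
samples = ["42", "-7", "3.14", "-0.5", "1.2.3", "", "-", "."]

integer_hits = [s for s in samples if integer.search(s)]
decimal_hits = [s for s in samples if decimal.search(s)]
print(integer_hits)  # ['42', '-7']
print(decimal_hits)  # ['42', '-7', '3.14', '-0.5', '', '-', '.']

# Hypothetical stricter variant (an assumption, not from the article):
# require at least one digit before or after the decimal point.
strict_decimal = re.compile(r"^-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)$")
strict_hits = [s for s in samples if strict_decimal.search(s)]
print(strict_hits)   # ['42', '-7', '3.14', '-0.5']
```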
Of course, this does not reduce the technical complexity of regular expressions, but it does make them easier to read.
Reference: JScript and VBScript regular expressions
Examples on Microsoft MSDN (English):
Scanning for HREFs
Provides an example that searches an input string and prints out all the href="..." values and their locations in the string.
Changing Date Formats
Provides an example that replaces dates of the form mm/dd/yy with dates of the form dd-mm-yy.
Extracting URL Information
Provides an example that extracts a protocol and port number from a string containing a URL. For example, "http://www.contoso.com:8080/letters/readme.html" returns "http:8080".
Cleaning an Input String
Provides an example that strips invalid non-alphanumeric characters from a string.
Confirming Valid E-mail Format
Provides an example that you can use to verify that a string is in valid e-mail format.