Learning Regular Expressions

Preface

Regular expressions are cumbersome but powerful. Once you have learned them, applying them brings not only greater efficiency but also a real sense of accomplishment. Read this material carefully and keep a reference at hand when applying it, and mastering regular expressions will not be a problem.

Index

1. Introduction
2. History of regular expressions
3. Definition of regular expressions
   3.1 Ordinary characters
   3.2 Non-printing characters
   3.3 Special characters
   3.4 Quantifiers
   3.5 Anchors
   3.6 Alternation
   3.7 Backreferences
4. Operator precedence
5. Explanation of all symbols
6. Some examples
7. Rules for regular expression matching
   7.1 Basic pattern matching
   7.2 Character clusters
   7.3 Determining repetition

--------------------------------------------------------------------------------

1. Introduction

Regular expressions are now widely used in many kinds of software. Their traces can be seen in operating systems such as *nix (Linux, Unix, and so on) and HP, in development environments such as PHP, C#, and Java, and in many applications.

Regular expressions make it possible to achieve powerful results by simple means. The price of being concise and effective without losing power is that regular expression code is hard to read and not easy to learn, so some effort is required. Once past the entry stage, and with a reference at hand, they are fairly simple and effective to use.

Example: ^.+@.+\..+$

Code like this has scared me off more than once, and it probably scares many other people too. Reading on will let you apply code like this freely.

Note: some later sections may appear to repeat earlier content. Their purpose is to describe the material in the earlier tables once more so that it is easier to understand.

2. History of regular expressions

The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works.
Two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way of describing these neural networks. In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

This work was later applied in some early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor on Unix. The rest, as they say, is history: regular expressions have been an important part of text-based editors and search tools ever since.

3. Definition of regular expressions

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract from a string the substrings that meet a certain condition.

When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there does not mean the same thing as the * in a regular expression.

A regular expression is a text pattern composed of ordinary characters (such as the characters a through z) and special characters (called metacharacters). The regular expression serves as a template that matches a character pattern against the string being searched.

3.1 Ordinary characters

Ordinary characters consist of all printing and non-printing characters that are not explicitly designated as metacharacters.
This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.

3.2 Non-printing characters

Character  Meaning
\cx        Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f         Matches a form feed. Equivalent to \x0c and \cL.
\n         Matches a newline. Equivalent to \x0a and \cJ.
\r         Matches a carriage return. Equivalent to \x0d and \cM.
\s         Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S         Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t         Matches a tab. Equivalent to \x09 and \cI.
\v         Matches a vertical tab. Equivalent to \x0b and \cK.

3.3 Special characters

Special characters are characters that carry a special meaning, such as the * in "*.txt" mentioned above, which simply stands for any string. To find files whose names contain a literal *, the * must be escaped, that is, preceded by a \: ls \*.txt. Regular expressions have the following special characters.

Character  Description
$          Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches before '\n' or '\r'. To match the $ character itself, use \$.
( )        Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*          Matches the preceding subexpression zero or more times. To match a * character, use \*.
+          Matches the preceding subexpression one or more times. To match a + character, use \+.
.          Matches any single character except the newline \n. To match a . character, use \.
[          Marks the start of a bracket expression. To match [, use \[.
?          Matches the preceding subexpression zero or one time, or marks a non-greedy quantifier. To match a ? character, use \?.
\          Marks the next character as a special character, a literal character, a backreference, or an octal escape.
           For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^          Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{          Marks the start of a quantifier expression. To match {, use \{.
|          Indicates a choice between two items. To match |, use \|.

Regular expressions are constructed in the same way as arithmetic expressions: a variety of metacharacters and operators combine small expressions into larger ones. The components of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these.

3.4 Quantifiers

Quantifiers specify how many times a given component of a regular expression must occur for the match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

The *, +, and ? quantifiers are greedy, because they match as much text as possible. Adding a ? after them makes the match non-greedy, or minimal.

The quantifiers are:

Character  Description
*          Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+          Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?          Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}        n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}       n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}      m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.

3.5 Anchors

Anchors describe the boundaries of strings or words. ^ and $ refer to the start and end of a string respectively, \b describes the leading or trailing boundary of a word, and \B denotes a non-word boundary. Quantifiers cannot be applied to anchors.

3.6 Alternation

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the match they enclose is captured. To avoid this side effect, place ?: before the first alternative.

?: is one of the non-capturing elements. There are two others, ?= and ?!. The former is a positive lookahead, which matches the search string at any position where the pattern inside the parentheses begins to match. The latter is a negative lookahead, which matches the search string at any position where that pattern does not match.

3.7 Backreferences

Placing parentheses around a regular expression pattern or part of a pattern causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered from left to right in the pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions.
Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number identifying a specific buffer. The non-capturing elements '?:', '?=', and '?!' can be used to suppress saving of the associated match.

4. Operator precedence

Operators of equal precedence are evaluated from left to right. Operators of different precedence are evaluated from highest to lowest. The operators, from highest to lowest precedence, are:

Operator                     Description
\                            Escape
(), (?:), (?=), []           Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}    Quantifiers
^, $, \anymetacharacter      Anchors and sequence
|                            Alternation ("or")

5. Explanation of all symbols

Character       Description
\               Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^               Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$               Matches the end of the input string.
                If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*               Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+               Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?               Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}             n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}            n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}           m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?               When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.               Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)       Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)     Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use.
                This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)     Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use.
                For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)     Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y             Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]           Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]          Negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]           Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]          Negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' to 'z'.
\b              Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B              Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx             Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d              Matches a digit character.
                Equivalent to [0-9].
\D              Matches a non-digit character. Equivalent to [^0-9].
\f              Matches a form feed. Equivalent to \x0c and \cL.
\n              Matches a newline. Equivalent to \x0a and \cJ.
\r              Matches a carriage return. Equivalent to \x0d and \cM.
\s              Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S              Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t              Matches a tab. Equivalent to \x09 and \cI.
\v              Matches a vertical tab. Equivalent to \x0b and \cK.
\w              Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W              Matches any non-word character.
                Equivalent to '[^A-Za-z0-9_]'.
\xn             Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num            Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n              Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm             Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml            Matches the octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
\un             Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).

6. Some examples

Regular expression                          Description
/\b([a-z]+) \1\b/gi                         Positions where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/         Breaks a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/     Locates chapter and section headings
/[-a-z]/                                    The 26 letters a to z plus the hyphen
/ter\b/                                     Matches "chapter" but not "terminal"
/\Bapt/                                     Matches "chapter" but not "aptitude"
/Windows(?=95|98|NT)/                       Matches "Windows" in "Windows95", "Windows98", or "WindowsNT"; once a match is found, the next search starts right after "Windows"

7. Rules for regular expression matching

7.1 Basic pattern matching

Everything starts with the basics.
The pattern is the most basic element of a regular expression: a set of characters that describes the features of a string. A pattern can be very simple, consisting of an ordinary string, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:

^once

This pattern contains the special character ^, which means that it matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with the given pattern.

bucket$

This pattern matches "Who kept all of this cash in a bucket" but not "buckets". When ^ and $ are used together, they denote an exact match (the string is identical to the pattern). For example:

^bucket$

matches only the string "bucket". If a pattern contains neither ^ nor $, it matches any string that contains the pattern.
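The anchoring rules above can be tried out with a short script. This is only a sketch: it assumes Python's re module (the article itself is not tied to one language), and the helper name matches is made up for illustration. re.search scans the whole string, so any anchoring comes solely from ^ and $ in the pattern.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# ^ ties the match to the start of the string.
starts = matches(r"^once", "once upon a time")
not_start = matches(r"^once", "There once was a man from NewYork")

# $ ties the match to the end of the string.
ends = matches(r"bucket$", "Who kept all of this cash in a bucket")
not_end = matches(r"bucket$", "buckets")

# ^ and $ together require the whole string to equal the pattern.
exact = matches(r"^bucket$", "bucket")
not_exact = matches(r"^bucket$", "a bucket")

# Without anchors the pattern may match anywhere inside the string.
anywhere = matches(r"once", "There once was a man from NewYork")

print(starts, not_start, ends, not_end, exact, not_exact, anywhere)
```

One Python-specific caveat: $ also matches just before a trailing newline, so a stricter end-of-string check would use \Z instead.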
For example, the unanchored pattern

once

matches the string

There once was a man from NewYork
Who kept all of his cash in a bucket.

The letters in this pattern (o-n-c-e) are literal characters; that is, they stand for the letters themselves. The same is true of digits. Slightly more complicated characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to check whether a string begins with a tab, use this pattern:

^\t

Similarly, \n stands for a newline and \r for a carriage return. Other special symbols can be written with a preceding backslash: the backslash itself is written \\, a period is written \., and so on.

7.2 Character clusters

In programs for the Internet, regular expressions are usually used to validate user input. After a user submits a form, the program has to decide whether the phone number, address, e-mail address, credit card number, and so on are valid, and ordinary literal characters are not enough for that.

A freer way of describing the pattern we want is needed, and that is the character cluster. To create a character cluster representing all vowels, put all the vowels inside square brackets:

[AaEeIiOoUu]

This pattern matches any vowel, but it stands for only one character. A hyphen can be used to denote a range of characters, for example:

[a-z]          // matches all lowercase letters
[A-Z]          // matches all uppercase letters
[a-zA-Z]       // matches all letters
[0-9]          // matches all digits
[0-9\.\-]      // matches all digits, periods, and minus signs
[ \f\r\t\n]    // matches all whitespace characters

Again, each of these stands for only a single character, which is very important. To match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] covers a range of 26 letters, here it can match only a string whose first character is a lowercase letter.
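The escape sequence and character clusters above can be checked with a few lines of code. The sketch below assumes Python's re module; the constant and function names are illustrative only and do not come from the article.

```python
import re

# A character cluster stands for exactly one character.
VOWEL = re.compile(r"[AaEeIiOoUu]")

# One lowercase letter followed by one digit, and nothing else.
LETTER_DIGIT = re.compile(r"^[a-z][0-9]$")


def is_letter_digit(text: str) -> bool:
    """Hypothetical helper: True for strings such as "z2" or "g7"."""
    return LETTER_DIGIT.search(text) is not None


samples = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
accepted = [s for s in samples if is_letter_digit(s)]

# findall returns every single-character match of the cluster.
vowels = VOWEL.findall("regular expression")

# ^\t checks whether a string begins with a tab.
starts_with_tab = re.search(r"^\t", "\tindented") is not None
no_leading_tab = re.search(r"^\t", "not indented") is not None

print(accepted, vowels, starts_with_tab, no_leading_tab)
```

Only the three two-character strings are accepted, because each cluster consumes exactly one character and the anchors leave no room for anything else.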
As mentioned earlier, ^ marks the beginning of a string, but it has another meaning as well. Used inside square brackets, ^ means "not" or "exclude", and it is often used to rule out a character. Continuing the previous example, suppose the first character must not be a digit:

^[^0-9][0-9]$

This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are a few examples that exclude specific characters:

[^a-z]         // all characters except lowercase letters
[^\\\/\^]      // all characters except (\), (/), and (^)
[^\"\']        // all characters except double quotes (") and single quotes (')

The special character "." (dot, period) in a regular expression stands for any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any non-newline character. The pattern "." matches any string except the empty string and a string consisting of only a newline.
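The negated cluster and the dot can be exercised the same way. The following is a sketch under the assumption that Python's re module is used with its default flags, in which "." does not match a newline; the variable names are illustrative.

```python
import re

# First character must NOT be a digit, second character must be a digit.
NON_DIGIT_THEN_DIGIT = re.compile(r"^[^0-9][0-9]$")

# Any single non-newline character followed by the digit 5.
ANY_THEN_FIVE = re.compile(r"^.5$")

samples = ["&5", "g7", "-2", "12", "66"]
accepted = [s for s in samples if NON_DIGIT_THEN_DIGIT.search(s)]

dot_results = {s: ANY_THEN_FIVE.search(s) is not None
               for s in ["a5", "55", "\n5", "5"]}

# A bare "." needs at least one non-newline character to match.
dot_on_empty = re.search(r".", "") is not None
dot_on_newline = re.search(r".", "\n") is not None
dot_on_text = re.search(r".", "x") is not None

print(accepted, dot_results, dot_on_empty, dot_on_newline, dot_on_text)
```

The string "\n5" is rejected because its first character is a newline, and "5" is rejected because the pattern requires two characters.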
PHP's regular expressions have some built-in general character clusters, listed below:

Character cluster   Meaning
[[:alpha:]]         Any letter
[[:digit:]]         Any digit
[[:alnum:]]         Any letter or digit
[[:space:]]         Any whitespace character
[[:upper:]]         Any uppercase letter
[[:lower:]]         Any lowercase letter
[[:punct:]]         Any punctuation mark
[[:xdigit:]]        Any hexadecimal digit, equivalent to [0-9a-fA-F]

7.3 Determining repetition

By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. Curly braces ({}) following a character or character cluster determine how many times the preceding content is repeated.

Pattern              Meaning
^[a-zA-Z_]$          Any single letter or underscore
^[[:alpha:]]{3}$     Any three-letter word
^a$                  The letter a
^a{4}$               aaaa
^a{2,4}$             aa, aaa, or aaaa
^a{1,3}$             a, aa, or aaa
^a{2,}$              A string of two or more a's
^a{2,}               Strings such as aardvark and aaab, but not apple
a{2,}                Strings such as baad and aaa, but not Nantucket
\t{2}                Two tabs
.{2}                 Any two characters

These examples show the three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times". A number followed by a comma, {x,}, means "the preceding content occurs x or more times". Two numbers separated by a comma, {x,y}, mean "the preceding content occurs at least x times but no more than y times". The patterns can be extended to more words or numbers:

^[a-zA-Z0-9_]{1,}$                      // all strings of one or more letters, digits, or underscores
^[0-9]{1,}$                             // all unsigned integers
^\-{0,1}[0-9]{1,}$                      // all integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$      // all decimal numbers

The last example is not easy to read, is it? Look at it this way: the string starts (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), and zero or more further digits ([0-9]{0,}), with nothing else after that ($). A simpler way to write this follows.

The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding content", or "the preceding content is optional". So the example above can be simplified to:

^\-?[0-9]{0,}\.?[0-9]{0,}$

The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding content". So the four examples above can be written as:

^[a-zA-Z0-9_]+$             // all strings of one or more letters, digits, or underscores
^[0-9]+$                    // all unsigned integers
^\-?[0-9]+$                 // all integers
^\-?[0-9]*\.?[0-9]*$        // all decimal numbers

Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
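To see that the long and short forms behave the same, the patterns can be compared on a handful of inputs. This is a sketch assuming Python's re module; the sample strings and the helper name accepts are invented for illustration. Python's re does not support the [[:alpha:]] style clusters, so only the brace, ?, *, and + forms are used.

```python
import re

# Long form with explicit braces, and the equivalent short form.
LONG_DECIMAL = r"^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$"
SHORT_DECIMAL = r"^\-?[0-9]*\.?[0-9]*$"
INTEGER = r"^\-?[0-9]+$"


def accepts(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if pattern matches text."""
    return re.search(pattern, text) is not None


samples = ["42", "-7", "-3.14", ".5", "7.", "1.2.3", "abc", "--1", ""]

# Both spellings should accept and reject exactly the same samples.
same = all(accepts(LONG_DECIMAL, s) == accepts(SHORT_DECIMAL, s)
           for s in samples)

decimals = [s for s in samples if accepts(SHORT_DECIMAL, s)]
integers = [s for s in samples if accepts(INTEGER, s)]

# Bounded repetition: between two and four a's, and nothing else.
two_to_four = [s for s in ["a", "aa", "aaa", "aaaa", "aaaaa"]
               if accepts(r"^a{2,4}$", s)]

# Anchored versus unanchored "two or more a's".
leading_aa = [w for w in ["aardvark", "aaab", "apple"] if accepts(r"^a{2,}", w)]
any_aa = [w for w in ["baad", "aaa", "Nantucket"] if accepts(r"a{2,}", w)]

print(same, decimals, integers, two_to_four, leading_aa, any_aa)
```

Note that the decimal pattern also accepts the empty string, because every part of it is optional. Real input validation would probably need a stricter pattern.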