Regular expression (1)

xiaoxiao2021-03-06 62

I. Introduction

Many people have heard the term "regular expression". It dates back to 1956, when the American mathematician Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets" that introduced the concept. Regular expressions were the notation used to describe what he called "the algebra of regular sets", hence the term "regular expression".

Later, this work was found to apply to early research on computational search algorithms by Ken Thompson, one of the principal inventors of UNIX. The first practical application of regular expressions was the QED editor on UNIX.

Q: What can regular expressions do for us?

A: They are an important part of text-based editors and search tools. Regular expressions let users build a matching pattern from a series of special characters and compare that pattern against targets such as data files, program input, and web page form input. Depending on whether the target contains the pattern, the corresponding code is executed.

Let's go through the use of regular expressions step by step.

II. First contact with regular expressions

Let's first get to know some basic concepts. As a notation, the regular expression language defines its own set of descriptors for a wide variety of character classes. The following table is taken from MSDN. (MS-Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpconcharacterclasses.htm)

Character class table

Character class - Meaning

. - Matches any character except \n. If modified by the Singleline option (see the regular expression options), the period matches any character.

[aeiou] - Matches any single character contained in the specified character set.

[^aeiou] - Matches any single character not in the specified character set.

[0-9a-fA-F] - Using a hyphen (-) allows a contiguous character range to be specified.

\p{name} - Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges, for example Ll, Nd, Z, IsGreek, IsBoxDrawing.

\P{name} - Matches text not included in the groups and block ranges specified in {name}.

\w - Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].

\W - Matches any nonword character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \W is equivalent to [^a-zA-Z_0-9].

\s - Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].

\S - Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v].

\d - Matches any decimal digit. Equivalent to \p{Nd} for Unicode and [0-9] for non-Unicode, ECMAScript behavior.

\D - Matches any nondigit. Equivalent to \P{Nd} for Unicode and [^0-9] for non-Unicode, ECMAScript behavior.

The table above lists the most basic syntax definitions of regular expressions. With these understood, we can already define some simple rules, for example:

1. Match all characters

. (of course, writing nothing at all also works @_@)

2. Match all English characters

a) \w

b) [a-zA-Z_0-9]

3. Match decimal digits

a) \d

b) [0-9]
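To make the character classes above concrete, here is a minimal JavaScript sketch that counts how many characters of a sample string each class matches. The helper name countMatches and the sample string are made up for illustration; JavaScript follows the ECMAScript behavior described in the table, so \w here means [a-zA-Z_0-9].

```javascript
// Count how many times a pattern matches in a piece of text.
// countMatches is a hypothetical helper, not part of any library.
function countMatches(pattern, text) {
  // The "g" flag makes match() return every match, not only the first one.
  const found = text.match(new RegExp(pattern, "g"));
  return found === null ? 0 : found.length;
}

const sample = "Room 42, bed_7!";

const digits = countMatches("\\d", sample);      // 4, 2, 7
const wordChars = countMatches("\\w", sample);   // letters, digits and _
const vowels = countMatches("[aeiou]", sample);  // o, o, e
const nonWord = countMatches("\\W", sample);     // two spaces, the comma, the !

console.log({ digits, wordChars, vowels, nonWord });
```

Note that each pattern matches exactly one character at a time; nothing so far says how many characters a match should contain.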

Looking at these examples, it is quite simple, isn't it? But the rules written so far have one big defect: they say nothing about how many characters to match.

Q: I want to match 5 English letters

A: ???

The knowledge above alone cannot solve this. How is this problem handled in regular expressions? Let's look at the following table:

(MS-Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpconquantifiers.htm)

Quantifier table

Quantifier - Description

* - Specifies zero or more matches; for example \w* or (abc)*. Same as {0,}.

+ - Specifies one or more matches; for example \w+ or (abc)+. Same as {1,}.

? - Specifies zero or one match; for example \w? or (abc)?. Same as {0,1}.

{n} - Specifies exactly n matches; for example (pizza){2}.

{n,} - Specifies at least n matches; for example (abc){2,}.

{n,m} - Specifies at least n but no more than m matches.

*? - Specifies the first match that uses as few repetitions as possible (lazy *).

+? - Specifies as few repetitions as possible, but at least one (lazy +).

?? - Specifies zero repetitions if possible, or one (lazy ?).

{n}? - Equivalent to {n} (lazy {n}).

{n,}? - Specifies as few repetitions as possible, but at least n (lazy {n,}).

{n,m}? - Specifies as few repetitions as possible between n and m (lazy {n,m}).

The table above lists the quantifiers of regular expressions. With them, we can easily write more powerful regular expressions.

For example:

1. Match zero or more of any character

.*

2. Match one or more of any character

.+

3. Match zero or more English characters

\w*

4. Match one or more English characters

[a-zA-Z0-9]+

5. Match 3 decimal digits

\d{3}

6. Match at least 3 decimal digits

\d{3,}

7. Match 3 to 6 decimal digits

\d{3,6}

Now we can answer the question above:

Q: I want to match 5 English letters

A: \w{5}
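The quantifiers can be tried out with a short JavaScript sketch like the one below. The sample strings are invented for illustration. It shows how {n}, {n,} and {n,m} differ, and how a lazy quantifier stops at the shortest possible match where the greedy form takes the longest.

```javascript
const text = "id: 12345678";

// {n}, {n,} and {n,m} control how many digits one match consumes.
const exactlyThree = text.match(/\d{3}/)[0];
const atLeastThree = text.match(/\d{3,}/)[0];
const threeToSix = text.match(/\d{3,6}/)[0];

// Adding ? makes the quantifier lazy: it takes as few repetitions as it can.
const lazyThreeToSix = text.match(/\d{3,6}?/)[0];

// Greedy versus lazy on a tag-like string.
const tagged = "<b>x</b>";
const greedy = tagged.match(/<.+>/)[0];
const lazy = tagged.match(/<.+?>/)[0];

// The answer from the text: five word characters in a row.
const hasFive = /\w{5}/.test("hello");
const tooShort = /\w{5}/.test("hi");

console.log(exactlyThree, atLeastThree, threeToSix, lazyThreeToSix);
console.log(greedy, lazy, hasFive, tooShort);
```

Keep in mind that \w also accepts digits and the underscore, so \w{5} is slightly broader than "5 English letters"; [a-zA-Z]{5} would be the stricter form.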

We have happily solved the problem above, but new problems keep coming. How do we constrain where a match may occur?

Q: I want to match strings that start with DOC

A: ???

In order to solve this problem, let's take a look at this table:

(MS-Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpconatomiczero-widthassertions.htm)

Atomic zero-width assertions

Assertion - Description

^ - Specifies that the match must occur at the beginning of the string or the beginning of a line. For more information, see the Multiline option in the regular expression options.

$ - Specifies that the match must occur at one of the following positions: the end of the string, before \n at the end of the string, or the end of a line. For more information, see the Multiline option in the regular expression options.

\A - Specifies that the match must occur at the beginning of the string (ignores the Multiline option).

\Z - Specifies that the match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).

\z - Specifies that the match must occur at the end of the string (ignores the Multiline option).

\G - Specifies that the match must occur at the point where the current search begins (usually one character past where the last search ended). For example, consider a string made up of separate character groups, each n characters long. When searching group by group, the regular expression succeeds only if the match occurs at character positions such as 0, n, 2n, 3n, that is, only when the match falls on a group boundary.

\b - Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters. The match must occur on word boundaries, that is, at the first or last character of words separated by spaces.

\B - Specifies that the match must not occur on a \b boundary.

I believe everyone has noticed the first assertion character in this table, ^ (@_@).

For example, ^ specifies that the current position is at the beginning of a line or string. Therefore, the regular expression ^ftp only returns occurrences of the string "ftp" that appear at the beginning of a line.

It looks like the problem we ran into above can now be solved, so let's solve it:

Q: I want to match strings that start with DOC

A: ^DOC
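The effect of the assertions can be checked with a small JavaScript sketch. The file names and sample text are invented for illustration. JavaScript has no \A, \Z, \z or \G, so only ^, $ and \b are shown; the "m" flag plays the role of the Multiline option mentioned in the table.

```javascript
// ^ anchors the match at the start of the string.
const startsWithDoc = /^DOC/;
const yes = startsWithDoc.test("DOC-2003.txt");
const no = startsWithDoc.test("my DOC");

// $ anchors the match at the end of the string.
const endsWithTxt = /txt$/.test("DOC-2003.txt");

// Without the "m" flag, ^ only matches at the very start of the input;
// with it, ^ also matches at the start of every line.
const lines = "ftp one\nhttp two\nftp three";
const withoutMultiline = lines.match(/^ftp/g).length;
const withMultiline = lines.match(/^ftp/gm).length;

// \b requires a word boundary on both sides, so "concat" is skipped.
const wholeWords = "cat concat cat.".match(/\bcat\b/g).length;

console.log(yes, no, endsWithTxt, withoutMultiline, withMultiline, wholeWords);
```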

We now have a first idea of what regular expressions are and know their most basic syntax. That was the warm-up (@_@). Next we get to the real topic: starting with the second article, we discuss the use of regular expressions in depth.

The previous article introduced some basic concepts of regular expressions, and I believe many readers now have a grasp of the fundamentals. Next, we use some practical programming examples to show what regular expressions can do.

Let's start with a few practical examples:

1. Verify that the input consists entirely of English characters

JavaScript:

    // In a string literal the backslash must be doubled.
    var ex = "^\\w+$";
    var re = new RegExp(ex, "i");
    return re.test(str);

VBScript:

    Dim regex, flag, ex
    ex = "^\w+$"
    Set regex = New RegExp
    regex.IgnoreCase = True
    regex.Global = True
    regex.Pattern = ex
    flag = regex.Test(str)

C#:

    // Assumes: using System.Text.RegularExpressions;
    System.String ex = @"^\w+$";
    Regex reg = new Regex(ex);
    bool flag = reg.IsMatch(str);

2. Verify an e-mail format

C#:

    // Assumes: using System.Text.RegularExpressions;
    System.String ex = @"^\w+@\w+\.\w+$";
    Regex reg = new Regex(ex);
    bool flag = reg.IsMatch(str);

3. Change the date format (replace dates of the form mm/dd/yy with dd-mm-yy)

C#:

    String MDYToDMY(String input)
    {
        return Regex.Replace(input,
            "\\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b",
            "${day}-${month}-${year}");
    }

4. Extract the protocol and port number from a URL

C#:

    String Extension(String url)
    {
        Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/",
            RegexOptions.Compiled);
        return r.Match(url).Result("${proto}${port}");
    }

The examples here are regular expressions we commonly run into in web development. For the first example in particular, implementations are given in JavaScript, VBScript, and C#. As you can easily see, the regular expression itself is the same across languages; only the way it is invoked differs. How much you can do with it also depends on the support offered by each language's classes.

(Excerpted from MSDN: The Microsoft .NET Framework SDK provides a large set of regular expression tools that let you efficiently create, compare, and modify strings, and rapidly parse large amounts of text and data to search for, remove, and replace text patterns. Ms- Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpconregularExpressionsLanguageElements.htm)

Let's analyze these examples one by one:

Examples 1 and 2 are simple: they just verify that a string conforms to the format specified by the regular expression. The syntax they use was introduced in the first article, so only a brief description is given here.

The expression of the first example: ^\w+$

^ - specifies that the match starts at the beginning of the string

\w - matches English characters

+ - the preceding character occurs 1 or more times

$ - specifies that the match ends at the end of the string

It validates strings such as AsgasDFS

The expression of the second example: ^\w+@\w+\.\w+$

^ - specifies that the match starts at the beginning of the string

\w - matches English characters

+ - the preceding character occurs 1 or more times

@ - matches the ordinary character @

\. - matches the ordinary character . (note: . is a special character, so it must be escaped with \)

$ - specifies that the match ends at the end of the string

It validates formats such as Dragontt@sina.com
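As a quick check of the two validation expressions, the following JavaScript sketch wraps them in functions. The function names isAllWordChars and looksLikeEmail are invented for illustration, and the e-mail pattern is the simplified one discussed here, not a complete address validator. The last lines show why the escaped \. matters.

```javascript
// Example 1: the whole string must consist of word characters.
function isAllWordChars(str) {
  return /^\w+$/.test(str);
}

// Example 2: word characters, "@", word characters, a literal ".", word characters.
function looksLikeEmail(str) {
  return /^\w+@\w+\.\w+$/.test(str);
}

console.log(isAllWordChars("AsgasDFS"));          // passes
console.log(isAllWordChars("abc def"));           // fails: contains a space
console.log(looksLikeEmail("Dragontt@sina.com")); // passes
console.log(looksLikeEmail("Dragontt@sina"));     // fails: no ".xxx" part

// Without the backslash, "." matches any character, so this bogus input passes.
const unescaped = /^\w+@\w+.\w+$/;
console.log(unescaped.test("Dragontt@sinaXcom"));
```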

The 3rd example uses replacement, so let's first look at how substitutions are defined in regular expressions:

(MS-Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpconsubstitudes.htm)

Substitutions

Character - Meaning

$123 - Substitutes the last substring matched by group number 123 (decimal).

${name} - Substitutes the last substring matched by a (?<name> ) group.

$$ - Substitutes a single "$" literal.

$& - Substitutes a copy of the entire match itself.

$` - Substitutes all the text of the input string before the match.

$' - Substitutes all the text of the input string after the match.

$+ - Substitutes the last group captured.

$_ - Substitutes the entire input string.
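Several of these substitution tokens can be tried directly in JavaScript, whose String.prototype.replace understands $$, $&, $`, $' and $1-style group numbers. The sketch below uses an invented sample string; the .NET-specific tokens $+ and $_ from the table are left out because they are not part of JavaScript's replacement syntax.

```javascript
const input = "price: 42 USD";

// $& inserts a copy of the whole match.
const bracketed = input.replace(/\d+/, "[$&]");

// $$ inserts a literal dollar sign, followed here by the match itself.
const dollars = input.replace(/\d+/, "$$$&");

// $` inserts the text before the match, $' the text after it.
const before = input.replace(/\d+/, "<$`>");
const after = input.replace(/\d+/, "<$'>");

// $1 and $2 insert numbered groups, counted by their opening parenthesis.
const swapped = input.replace(/(\d+) (\w+)/, "$2 $1");

console.log(bracketed, "|", dollars, "|", before, "|", after, "|", swapped);
```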

Group construct

(MS-Help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpcongroupingconstructs.htm)

Grouping construct - Definition

( ) - Captures the matched substring (or is a noncapturing group; for more information, see the ExplicitCapture option in the regular expression options). Captures using ( ) are numbered automatically starting from 1, in the order of their opening parentheses. Capture number zero is the text matched by the entire regular expression pattern.

(?<name> ) - Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and cannot begin with a number. You can use single quotes instead of angle brackets, for example (?'name').

(?<name1-name2> ) - Balancing group definition. Deletes the definition of the previously defined group name2 and stores in group name1 the interval between the previously defined name2 group and the current group. If no group name2 is defined, the match backtracks. Because deleting the last definition of name2 reveals the previous definition of name2, this construct allows the stack of captures for group name2 to be used as a counter for keeping track of nested constructs such as parentheses. In this construct, name1 is optional. You can use single quotes instead of angle brackets, for example (?'name1-name2').

(?: ) - Noncapturing group.

(?imnsx-imnsx: ) - Applies or disables the specified options within the subexpression. For example, (?i-s: ) turns on case insensitivity and disables single-line mode. For more information, see the regular expression options.

(?= ) - Zero-width positive lookahead assertion. Matching continues only if the subexpression matches at this position on the right. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.

(?! ) - Zero-width negative lookahead assertion. Matching continues only if the subexpression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un.

(?<= ) - Zero-width positive lookbehind assertion. Matching continues only if the subexpression matches at this position on the left. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.

(?<! ) - Zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match at this position on the left.

(?> ) - Nonbacktracking subexpression (also known as a "greedy" subexpression). The subexpression is fully matched once and then does not take part piecemeal in backtracking. (That is, the subexpression matches only strings that would be matched by the subexpression alone.)

Let's get a simple understanding of these two concepts:

Grouping constructs:

The most basic construct is ( ); whatever lies between the left and right parentheses forms one group.

A further form of grouping is (?<name> ). It differs from the first form in that the group is given a name, so the matched information can be retrieved through the group's name.

(There are also grouping constructs such as (?= ). They are not used in these examples, so we will cover them next time.)

Substitution:

The two basic grouping constructs ( ) and (?<name> ) were mentioned above. Through these two kinds of groups we can retrieve the matched results with forms such as $1 and ${name}. Put this way the concept may still be fuzzy, so let's go back to the example above:

The regular expression of the third example is: \\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b

(An explanation of why \\ is used here: this is a C# example, and in C# the \ character is the escape character. To get a literal \ inside a string you must either write \\ or prefix the whole string with the @ marker, which gives the equivalent

@"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b")

\b - is a special case. In a regular expression, \b denotes a word boundary (between \w and \W characters), except inside a [] character class, where it denotes a backspace. In a replacement pattern, \b always denotes a backspace.

(?<month>\d{1,2}) - constructs a group named month; this group matches a number 1 to 2 digits long

/ - matches the ordinary / character

(?<day>\d{1,2}) - constructs a group named day; this group matches a number 1 to 2 digits long

/ - matches the ordinary / character

(?<year>\d{2,4}) - constructs a group named year; this group matches a number 2 to 4 digits long

The role of these groups may not be obvious yet, so let's look at this line:

${day}-${month}-${year}

${day} - retrieves what the group named day, constructed above, matched

- - the ordinary - character

${month} - retrieves what the group named month, constructed above, matched

- - the ordinary - character

${year} - retrieves what the group named year, constructed above, matched

For example:

Applying the replacement from this example to the date 04/02/2003:

The (?<month>\d{1,2}) group matches 04; ${month} retrieves this value

The (?<day>\d{1,2}) group matches 02; ${day} retrieves this value

The (?<year>\d{2,4}) group matches 2003; ${year} retrieves this value
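The same date swap can be sketched in JavaScript, which supports named groups as well. One difference to be aware of: in a JavaScript replacement string a named group is written $<name> rather than .NET's ${name}. The function name mdyToDmy is invented for illustration, and the "g" flag is used so that every date in the input is replaced, which is how Regex.Replace behaves in the C# version.

```javascript
// Swap mm/dd/yy dates into dd-mm-yy form using named groups.
function mdyToDmy(input) {
  return input.replace(
    /\b(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{2,4})\b/g,
    "$<day>-$<month>-$<year>"
  );
}

const single = mdyToDmy("04/02/2003");
const inText = mdyToDmy("Due 04/02/2003 and 12/31/99");

console.log(single);
console.log(inText);
```

For the input 04/02/2003 the month group holds 04, the day group 02 and the year group 2003, so the output is 02-04-2003.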

With this example understood, let's look at the fourth one.

The regular expression of the fourth example is:

^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/

^ - specifies that the match starts at the beginning of the string

(?<proto>\w+) - constructs a group named proto that matches one or more letters

: - the ordinary : character

// - matches two / characters

[^/] - any character except /

+? - use as few repetitions as possible, but match at least once

(?<port>:\d+) - constructs a group named port that matches text of the form :2134 (a colon followed by one or more digits)

? - the preceding group occurs 0 times or 1 time

/ - matches the / character

Finally, ${proto}${port} retrieves the content matched by the two groups.
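Here is a rough JavaScript counterpart of the fourth example. The function name protoAndPort and the sample URLs are invented for illustration. Returning an empty string when the URL does not match at all is a choice made for this sketch and does not necessarily mirror what the C# version does in that case.

```javascript
// Extract "<protocol><:port>" from a URL using two named groups.
function protoAndPort(url) {
  const m = /^(?<proto>\w+):\/\/[^\/]+?(?<port>:\d+)?\//.exec(url);
  if (m === null) {
    return ""; // assumption for this sketch: no match yields an empty string
  }
  // The port group is optional, so it is undefined when the URL has no port.
  return m.groups.proto + (m.groups.port || "");
}

const withPort = protoAndPort("http://www.contoso.com:8080/letters/readme.html");
const withoutPort = protoAndPort("ftp://example.org/file");

console.log(withPort);
console.log(withoutPort);
```

Because [^/]+? is lazy, it stops growing as soon as the optional port group and the closing / can match, which is what lets the port be captured separately from the host name.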

(Reference for the Regex object:

MS-help: //ms.vscc/ms.msdnvs.2052/cpref/html/frlrfsystemTexTregulaRexpressionSregexmemberstopic.htm)

That wraps up these examples. I hope everyone has gained something. Next time, we will look at how regular expressions can be used to meet some more special requirements.

The previous articles covered the basic syntax of regular expressions and some simple examples. But these do not cover every problem we will run into, and we sometimes have to write more complex regular expressions to solve practical problems.

Here I will first pose a few problems, and then we will solve them with what we know about regular expressions.

1. Satisfy either one of two conditions, for example: the input is all digits or all letters

123 (true), hello (true), 234.test23 (false)

2. Get the letter combinations that are not preceded by a digit

For example, from how2234do>you234do we want how and you, but not do and do

3. Get the letter combinations that are preceded by a digit

In the example above, we get do and do

4. Get the letter combinations that are not followed by a digit

For the same input, we get ho, do, yo, do

5. Get the letter combinations that are followed by a digit

For the same input, we get how and you

6. Do not allow the characters ab to appear together

For example: nihaoma (true), above (false), agoodboy (true)

Now let's solve these problems.

Problem 1: satisfy either one of two conditions.

This is probably a very common requirement. Let's look at this table.

Alternation construct - Definition

| - Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|tiger. The leftmost successful match wins.

(?(expression)yes|no) - Matches the "yes" part if the expression matches at this position; otherwise, matches the "no" part. The "no" part can be omitted. The expression can be any valid expression, but it is turned into a zero-width assertion, so this syntax is equivalent to (?(?=expression)yes|no). Note that if the expression is the name of a named group or a capturing group number, the alternation construct is interpreted as a capture test (described in the next row of this table). To avoid confusion in these cases, you can spell out the inner (?=expression) explicitly.

(?(name)yes|no) - Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part. The "no" part can be omitted. If the given name does not correspond to the name or number of a capturing group used in this expression, the alternation construct is interpreted as an expression test (described in the previous row of this table).

(Ms-help: //ms.vscc/ms.msdnvs.2052/cpgenref/html/cpConalternationConstructs.htm)

In this table we see that | is defined to express an "or" relationship, just like an ordinary OR operator. Now let's see how to use | to solve our problem.

1. First write an expression for each of the alternatives:

a) all digits - [0-9]*

b) all letters - [a-zA-Z]*

2. Join the alternatives with | and we have what we need

^[0-9]*$|^[a-zA-Z]*$ (Here I deliberately added the ^ and $ anchors to both conditions. This is necessary when verifying that the entire string conforms. If you are curious what happens without the two anchors, try it yourself.)
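For readers who want to try the effect of leaving out the anchors, here is a minimal JavaScript sketch. It uses the three sample inputs from problem 1; the variable names are invented for illustration.

```javascript
// The anchored alternation from the text: all digits OR all letters.
const digitsOrLetters = /^[0-9]*$|^[a-zA-Z]*$/;

const allDigits = digitsOrLetters.test("123");
const allLetters = digitsOrLetters.test("hello");
const mixed = digitsOrLetters.test("234.test23");

// Without ^ and $ the pattern only has to match somewhere in the input,
// so the mixed string is accepted as well.
const unanchored = /[0-9]*|[a-zA-Z]*/;
const mixedUnanchored = unanchored.test("234.test23");

// Because * allows zero repetitions, the empty string also passes.
const emptyPasses = digitsOrLetters.test("");

console.log(allDigits, allLetters, mixed, mixedUnanchored, emptyPasses);
```

If an empty input should be rejected, replacing each * with + would be one way to do it; that is a variation on the article's expression, not the expression itself.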

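The next four problems are really of one kind, so we treat them together. Let's solve problems 2 through 5:

First, let's review the grouping constructs introduced last time:

(?= ) - Zero-width positive lookahead assertion. Matching continues only if the subexpression matches at this position on the right.

(?! ) - Zero-width negative lookahead assertion. Matching continues only if the subexpression does not match at this position on the right.

(?<= ) - Zero-width positive lookbehind assertion. Matching continues only if the subexpression matches at this position on the left.

(?<! ) - Zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match at this position on the left.

As you can see, these four rules are exactly what we need to solve our problems.

Let's solve the problems first and explain afterwards (@_@):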

Problem 2: get the letter combinations that are not preceded by a digit

(?<!\d)[a-zA-Z]{2,}

(?<!\d) - restricts the match to letters that are not preceded by a digit

[a-zA-Z]{2,} - matches 2 or more letters

(Note: requiring {2,} is a workaround. By our logic, the letter o of each do in how2234do>you234do also qualifies, but that is not what we want. There are of course other solutions that can be chosen according to the actual situation; this one is only meant to illustrate the method @_@)

Problem 3: get the letter combinations that are preceded by a digit

(?<=\d)[a-zA-Z]+

(?<=\d) - restricts the match to letters that are preceded by a digit

[a-zA-Z]+ - matches 1 or more letters

Problem 4: get the letter combinations that are not followed by a digit

[a-zA-Z]+(?!\d)

[a-zA-Z]+ - matches 1 or more letters

(?!\d) - restricts the match to letters that are not followed by a digit

Problem 5: get the letter combinations that are followed by a digit

[a-zA-Z]+(?=\d)

[a-zA-Z]+ - matches 1 or more letters

(?=\d) - restricts the match to letters that are followed by a digit

Problem 6: do not allow the characters ab to appear together

^(?!.*?ab).*$

(?!.*?ab) - rejects any input in which a and b appear next to each other

.* - any characters
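The lookahead and lookbehind solutions can be run against the sample string from the problem list with a short JavaScript sketch. The variable names are invented for illustration, and the pattern used for problem 2 assumes the negative lookbehind form (?<!\d) from the grouping construct table. Lookbehind requires a reasonably recent JavaScript engine.

```javascript
const text = "how2234do>you234do";

// Problem 2: letters not preceded by a digit (at least two, to skip the lone "o").
const notAfterDigit = text.match(/(?<!\d)[a-zA-Z]{2,}/g);

// Problem 3: letters preceded by a digit.
const afterDigit = text.match(/(?<=\d)[a-zA-Z]+/g);

// Problem 4: letters not followed by a digit.
const notBeforeDigit = text.match(/[a-zA-Z]+(?!\d)/g);

// Problem 5: letters followed by a digit.
const beforeDigit = text.match(/[a-zA-Z]+(?=\d)/g);

// Problem 6: reject any input that contains "ab".
const noAb = /^(?!.*?ab).*$/;

console.log(notAfterDigit.join(","));  // how,you
console.log(afterDigit.join(","));     // do,do
console.log(notBeforeDigit.join(",")); // ho,do,yo,do
console.log(beforeDigit.join(","));    // how,you
console.log(noAb.test("nihaoma"), noAb.test("above"), noAb.test("agoodboy"));
```

In problem 4 the match for how shrinks to ho because the engine backtracks one letter until the character that follows is no longer a digit.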

With that, our problems are solved. The examples are simple, but complex things are built on simple foundations. In fact, the key to writing regular expressions is being good at formulating the rules: describe them in the most concise and correct words, then write them down in regular expression syntax. The rest comes down to accumulating experience.
