An explanation of regular expressions in C# [reposted from the web]

xiaoxiao, 2021-03-06

For many years, many programming languages and tools have included support for regular expressions. The .NET base class library provides a namespace and a set of classes that make full use of the power of regular expressions, and they are designed to be compatible with the regular expressions of Perl 5.

In addition, the regexp classes offer some extra features, such as right-to-left matching and expression compilation.

In this article, I will briefly introduce the classes and methods in System.Text.RegularExpressions, show some string matching and replacement examples, go into the details of group structure, and finally list some common expressions you may find useful.

Prerequisite knowledge

Regular expressions may be one of those topics that many programmers learn and then forget. In this article, we assume that you already know how to use regular expressions, especially the expressions of Perl 5. The .NET regexp classes are a superset of the expression features of Perl 5, so Perl 5 is, in theory, a good starting point. We also assume that you have a basic knowledge of C# syntax and the .NET architecture.

If you have no knowledge of regular expressions, I suggest you start by learning the Perl 5 syntax. The authoritative book on regular expressions is "Mastering Regular Expressions" by Jeffrey Friedl, which we strongly recommend to readers who want to understand expressions in depth.

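The RegularExpressions assembly

The regexp classes are contained in the System.Text.RegularExpressions.dll file, and you must reference this file when compiling your application, for example:

csc /r:System.Text.RegularExpressions.dll foo.cs

This command creates a foo.exe file that references the System.Text.RegularExpressions assembly. (In released versions of the .NET Framework the same classes ship in System.dll, which csc references by default, so the explicit reference may not be needed there.)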

Namespace overview

The namespace contains only six classes and one delegate. They are:

Capture: contains the result of a single match;

CaptureCollection: a sequence of Capture objects;

Group: the result of a single group capture; it inherits from Capture;

Match: the result of a single expression match; it inherits from Group;

MatchCollection: a sequence of Match objects;

MatchEvaluator: the delegate used when performing a replace operation;

Regex: an instance of a compiled expression.

The Regex class also provides some static methods:

Escape: escapes the regex metacharacters in a string;

IsMatch: returns a Boolean value that indicates whether the expression finds a match in a string;

Match: returns a Match instance;

Matches: returns a collection of matches;

Replace: replaces the matches of the expression with a replacement string;

Split: returns an array of strings, split at the positions determined by the expression;

Unescape: unescapes the escaped characters in a string.
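
The static methods listed above can be tried out with a short sketch such as the following. The class name StaticRegexDemo, the helper names and the sample inputs are my own illustrations, not part of the original article.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticRegexDemo
{
    // Returns true when the input contains at least one digit.
    public static bool HasDigit(string input)
    {
        return Regex.IsMatch(input, @"\d");
    }

    // Splits a comma-separated list, ignoring whitespace around the commas.
    public static string[] SplitList(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    static void Main()
    {
        Console.WriteLine(Regex.Escape("1+1=2?"));                 // 1\+1=2\?
        Console.WriteLine(Regex.Unescape(@"1\+1=2\?"));            // 1+1=2?
        Console.WriteLine(HasDigit("abc123"));                     // True
        Console.WriteLine(string.Join("|", SplitList("a , b,c"))); // a|b|c
    }
}
```

Escape is mostly useful when user input has to be matched literally inside a larger pattern.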

Simple matches

Let's start with simple expressions that use the Regex and Match classes.

code:

Match m = Regex.Match("abracadabra", "(a|b|r)+");

We now have an instance of the Match class that can be tested, for example: if (m.Success) ...

If you want to use the matched string, you can convert it to a string:

code:

Console.WriteLine("match=" + m.ToString());

This example produces the following output: match=abra. This is the matched string.

Replacing strings is just as straightforward. For example, the following statement:

code:

string s = Regex.Replace("abracadabra", "abra", "zzzz");

It returns the string zzzzcadzzzz; every matched string is replaced with zzzz.

Now let's look at a more complicated example:

code:

string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");

This statement returns the string abra, with its leading and trailing spaces removed.

The pattern above is useful for removing leading and trailing whitespace from any string. In C# we also often use verbatim strings; in a verbatim string the compiler does not treat the character "\" as an escape character. This makes @"..." very convenient whenever "\" is used to write regex escapes. Also worth mentioning is the $1 in the replacement string: it stands for the text captured by the first group, which shows that a replacement string may consist of nothing but a substitution.
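
To make the point about verbatim strings and $1 concrete, here is a minimal sketch. The class name TrimDemo and the helper Trim are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class TrimDemo
{
    // Removes leading and trailing whitespace by keeping only group 1.
    public static string Trim(string input)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", "$1");
    }

    static void Main()
    {
        // A verbatim literal and a regular literal for the same pattern text.
        Console.WriteLine(@"^\s*(.*?)\s*$" == "^\\s*(.*?)\\s*$"); // True
        Console.WriteLine("[" + Trim("   abra  ") + "]");          // [abra]
    }
}
```

Without the @ prefix every backslash in the pattern has to be doubled, which quickly becomes hard to read.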

Details of the matching engine

Now let's work through a slightly more complicated example that uses a group structure. Look at the example below:

code:

// Requires: using System; using System.Text.RegularExpressions;
string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
    (       # start of the first group
      abra  # match the literal string abra
      (     # start of the second group
        cad # match the literal string cad
      )?    # end of the second group (optional)
    )       # end of the first group
    +       # match one or more times
";
// Ignore the comments and whitespace in the pattern (the x modifier)
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// Get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// First match
Match m = r.Match(text);
while (m.Success)
{
    // Start from group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        // Get this group of the current match
        Group g = m.Groups[gnums[i]];
        Console.WriteLine("Group" + gnums[i] + "=[" + g.ToString() + "]");
        // Walk the captures of this group, with their start index and length
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("    Capture" + j + "=[" + c.ToString()
                + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // Next match
    m = m.NextMatch();
}

The output of this example is as follows:

code:

Group1=[abra]
    Capture0=[abracad] Index=0 Length=7
    Capture1=[abra] Index=7 Length=4
Group2=[cad]
    Capture0=[cad] Index=4 Length=3
Group1=[abra]
    Capture0=[abracad] Index=12 Length=7
    Capture1=[abra] Index=19 Length=4
Group2=[cad]
    Capture0=[cad] Index=16 Length=3
Group1=[abra]
    Capture0=[abracad] Index=24 Length=7
    Capture1=[abra] Index=31 Length=4
Group2=[cad]
    Capture0=[cad] Index=28 Length=3

Let's start by examining the string pat, which contains the expression. The first capture group starts at the first parenthesis, and the expression then matches abra. The second capture group starts at the second parenthesis, but the first capture group has not ended yet, which means that the first group can match abracad while the second group matches only cad. Because the ? symbol makes cad an optional match, the result of the first group may be either abra or abracad. Then the first group ends, and the + symbol asks for the expression to be matched one or more times.

Now let's look at what happens during matching. First, an instance of the expression is created by calling the Regex constructor, where the options are specified. In this example the x option (RegexOptions.IgnorePatternWhitespace in released versions of .NET) is selected, because the expression contains comments and some whitespace for layout. With the x option turned on, the expression ignores comments and any whitespace that is not escaped.

Then we get the list of group numbers defined in the expression. You could of course use these numbers explicitly; here they are obtained programmatically. If named groups are used, this is also an effective way to build a quick index.

Next comes the first match. A loop tests whether the current match succeeded and then iterates over the group list, starting from group 1. Group 0 is not used in this example because group 0 is the entire matched string; if you want to collect the whole match as a single string, use group 0.

We walk through the CaptureCollection of each group. Usually a group has only one capture per match, but Group1 in this example has two: Capture0 and Capture1. If you only ask for the ToString of Group1 you get just abra, although the group also matched abracad. The ToString value of a group is the value of the last Capture in its CaptureCollection, which is exactly what we need here. If you want each match to stop after a single repetition, delete the + symbol from the expression so that the regex engine knows it should match the expression only once.
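
The relationship between a group's value and its captures can be checked with a small sketch. It uses the same expression without comments; the class name CaptureDemo and the helper GroupOneCaptures are hypothetical names of my own.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Returns every capture recorded for group 1 in the first match.
    public static string[] GroupOneCaptures(string text)
    {
        Match m = Regex.Match(text, @"(abra(cad)?)+");
        CaptureCollection cc = m.Groups[1].Captures;
        string[] values = new string[cc.Count];
        for (int i = 0; i < cc.Count; i++)
            values[i] = cc[i].Value;
        return values;
    }

    static void Main()
    {
        Match m = Regex.Match("abracadabra1", @"(abra(cad)?)+");
        Console.WriteLine(m.Groups[0].Value); // abracadabra
        Console.WriteLine(m.Groups[1].Value); // abra
        Console.WriteLine(string.Join(",", GroupOneCaptures("abracadabra1"))); // abracad,abra
    }
}
```

Group 0 holds the whole match, while group 1 reports only its last capture; the earlier captures are still available through Captures.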

Procedural versus expression-based approaches

In general, users of regular expressions fall into two groups. The first group uses regular expressions sparingly and relies on procedural code for the operations that need to be repeated; the second group makes full use of the features and power of the expression engine and uses as little procedural code as possible.

For most users, the best solution lies somewhere between the two. I hope this article explains the role of the regexp classes in the .NET languages and the trade-offs between performance and complexity.

Procedural approach

A feature we often need in programming is to match part of a string or some other pattern. The following example matches the words in a string:

code:

string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
// Match either a run of word characters or a run of non-word characters
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // Get the matched string
    string x = m.ToString();
    // If the first character is lowercase
    if (char.IsLower(x[0]))
        // convert it to uppercase
        x = char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
    // Collect all the text
    result += x;
}
System.Console.WriteLine("result=[" + result + "]");

As the example above shows, we use the C# foreach statement to process each match and do the corresponding work; in this example, a new result string is built. The output of this example is:

code:

text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]

Expression-based approach

Another way to achieve the same result is to use a MatchEvaluator. The new code is as follows:

code:

using System.Text.RegularExpressions;

class Test
{
    static string CapText(Match m)
    {
        // Get the matched string
        string x = m.ToString();
        // If the first character is lowercase
        if (char.IsLower(x[0]))
            // convert it to uppercase
            return char.ToUpper(x[0]) + x.Substring(1, x.Length - 1);
        return x;
    }

    static void Main()
    {
        string text = "the quick red fox jumped over the lazy brown dog.";
        System.Console.WriteLine("text=[" + text + "]");
        string pattern = @"\w+";
        string result = Regex.Replace(text, pattern, new MatchEvaluator(Test.CapText));
        System.Console.WriteLine("result=[" + result + "]");
    }
}

Note also that the pattern is very simple here, because only the words need to be modified and the non-word characters can be left alone.
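
In later versions of C# the evaluator can also be written inline as a lambda, which converts implicitly to the MatchEvaluator delegate. The following is a minimal sketch under that assumption; the class name CapitalizeDemo and the method CapitalizeWords are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CapitalizeDemo
{
    // Upper-cases the first character of every word in the text.
    public static string CapitalizeWords(string text)
    {
        return Regex.Replace(text, @"\w+",
            m => char.ToUpper(m.Value[0]) + m.Value.Substring(1));
    }

    static void Main()
    {
        Console.WriteLine(CapitalizeWords("the quick red fox jumped over the lazy brown dog."));
        // The Quick Red Fox Jumped Over The Lazy Brown Dog.
    }
}
```

The behaviour is the same as with a named method; the lambda only keeps the transformation next to the call that uses it.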

Common expressions

To help you understand how to use regular expressions in the C# environment, I have written some expressions that may be useful to you. These expressions have been used in other environments, and I hope they are of some help to you.

Roman numerals

code:

string p1 = "^m*(d?c{0,3}|c[dm])"
          + "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
string t1 = "vii";
Match m1 = Regex.Match(t1, p1);

Swapping the first two words

code:

string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
// The count of 1 limits the replacement to the first match
string r2 = x2.Replace(t2, "$3$2$1", 1);

Keyword = value

code:

string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);
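
Once the match succeeds, the keyword and the value are available as groups 1 and 2. A minimal sketch of reading them follows; the class name KeyValueDemo and the method TryParse are hypothetical names of my own.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeyValueDemo
{
    // Splits "key = value" into its two parts; returns false when the line does not match.
    public static bool TryParse(string line, out string key, out string value)
    {
        Match m = Regex.Match(line, @"(\w+)\s*=\s*(.*)\s*$");
        key = m.Success ? m.Groups[1].Value : string.Empty;
        // Note: (.*) is greedy, so trailing spaces end up inside the value.
        value = m.Success ? m.Groups[2].Value : string.Empty;
        return m.Success;
    }

    static void Main()
    {
        string key, value;
        if (TryParse("myval = 3", out key, out value))
            Console.WriteLine(key + " -> " + value); // myval -> 3
    }
}
```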

Matching lines of 80 characters or more

code:

// 20 + 30 + 30 = 80 asterisks
string t4 = "********************"
          + "******************************"
          + "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);

Month/day/year hour:minute:second date and time format

code:

string t5 = "01/01/01 16:10:01";
string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);

Changing a directory (Windows platforms only)

code:

string t6 = @"C:\Documents and Settings\user1\Desktop\";
// The backslash is escaped in the pattern but is literal in a .NET replacement string
string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");

Expanding hexadecimal escapes

code:

string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// HexConvert is a MatchEvaluator that is not shown in this article
string r7 = Regex.Replace(t7, p7, HexConvert);
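
The article does not show HexConvert itself. One possible shape for such an evaluator is sketched below; the implementation, the class name HexDemo and the wrapper Expand are assumptions of mine, not the author's code.

```csharp
using System;
using System.Text.RegularExpressions;

static class HexDemo
{
    // Hypothetical evaluator: turns the two hex digits in group 1 into one character.
    public static string HexConvert(Match m)
    {
        int code = Convert.ToInt32(m.Groups[1].Value, 16);
        return ((char)code).ToString();
    }

    // Expands every %nn escape in the input.
    public static string Expand(string input)
    {
        return Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])",
            new MatchEvaluator(HexConvert));
    }

    static void Main()
    {
        Console.WriteLine(Expand("%41"));   // A
        Console.WriteLine(Expand("a%20b")); // a b
    }
}
```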

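Removing C-style comments (needs improvement)

code:

string t8 = @"
/*
 * traditional comment
 */
";
string p8 = @"
    /\*   # match the delimiter that starts the comment
    .*?   # match the comment body (non-greedy)
    \*/   # match the delimiter that ends the comment
";
// x: ignore pattern whitespace and comments; s: let . match newlines
string r8 = Regex.Replace(t8, p8, "",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);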

Removing leading and trailing whitespace from a string

code:

string t9a = "     leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");
string t9b = "trailing     ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turning the character \ followed by n into a real newline

code:

string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

Matching IP addresses

code:

string t11 = "55.54.53.52";
// Each octet is 0-255; the trailing \d? also admits single-digit octets
string p11 = "^"
    + @"([01]?\d\d?|2[0-4]\d|25[0-5])\."
    + @"([01]?\d\d?|2[0-4]\d|25[0-5])\."
    + @"([01]?\d\d?|2[0-4]\d|25[0-5])\."
    + @"([01]?\d\d?|2[0-4]\d|25[0-5])"
    + "$";
Match m11 = Regex.Match(t11, p11);
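
The repeated octet expression can be built once and reused. The sketch below is one way to wrap it in a validation helper; the names IpDemo, Octet and IsIp are hypothetical, and the octet pattern assumes that single-digit octets should be accepted.

```csharp
using System;
using System.Text.RegularExpressions;

static class IpDemo
{
    // One octet in the range 0-255 (assumption: single digits are allowed).
    const string Octet = @"([01]?\d\d?|2[0-4]\d|25[0-5])";

    static readonly Regex Ip = new Regex(
        "^" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + "$");

    // Returns true when the whole string is a dotted-quad address.
    public static bool IsIp(string s)
    {
        return Ip.IsMatch(s);
    }

    static void Main()
    {
        Console.WriteLine(IsIp("55.54.53.52")); // True
        Console.WriteLine(IsIp("256.1.1.1"));   // False
    }
}
```

Creating the Regex instance once avoids rebuilding the long pattern string on every call.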

Removing the path from a file name

code:

string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");

Joining the lines of a multi-line string

code:

string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
// Replace each line break and the whitespace around it with a single space
string r13 = Regex.Replace(t13, p13, " ");

Extracting all numbers from a string

code:

string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
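
A MatchCollection is normally consumed with foreach. The following minimal sketch collects the numbers into a list; the class name NumberDemo and the method ExtractNumbers are hypothetical names of my own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NumberDemo
{
    // Returns every number found in the text, in order of appearance.
    public static List<string> ExtractNumbers(string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(\d+\.?\d*|\.\d+)"))
            found.Add(m.Value);
        return found;
    }

    static void Main()
    {
        List<string> numbers = ExtractNumbers("test 1\ntest 2.3\ntest 47\n");
        Console.WriteLine(string.Join(",", numbers)); // 1,2.3,47
    }
}
```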

Finding all-uppercase words

code:

string t15 = "This IS a Test OF ALL Caps";
// Word characters that are not lowercase letters, digits or underscores
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);

Finding lowercase words

code:

string t16 = "This is A Test of lowercase";
// Word characters that are not uppercase letters, digits or underscores
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);

Finding words with an initial capital

code:

string t17 = "This is A Test of Initial Caps";
// One uppercase letter followed by any number of lowercase letters
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

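Finding links in simple HTML

code:

// The markup in t18 and the start of p18 were lost from the source text;
// the tags and href values below are illustrative reconstructions
string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
string p18 = @"<a[^>]*?href\s*=\s*[""']?" + @"([^'"">]+?)['""]?>";
// s: let . match newlines; i: ignore case
MatchCollection mc18 = Regex.Matches(t18, p18,
    RegexOptions.Singleline | RegexOptions.IgnoreCase);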

