Regular Expressions in C#
Jeffrey E. F. Friedl wrote a book about regular expressions, "Mastering Regular Expressions". To help readers understand and master regular expressions, the author builds the material around a story. The examples in the book are written in Perl. As far as I know, the regular expressions in C# are also based on Perl 5, so the two should have a great deal in common.
I do not intend to translate the book. First, it is too long, and I am simply not up to the job. Second, if I translated it and changed the code inside to C# without the original author's permission, it could be regarded as infringement. So please treat this as a set of reading notes.
I have skipped the preface, so we can go straight to the first chapter:
Introduction to Regular Expressions
The author says this chapter is written for absolute beginners with regular expressions, and its purpose is to lay a solid foundation for the later chapters. So if you are not a beginner, you can skip this chapter.
Story scene:
The head of your archives department wants a tool that checks for repeated words (such as "this this"), a problem that often comes up when editing documents. Your job is to create a solution that does the following:
Accept any number of files to check, and report each line that contains repeated words, highlighting the repeated words and making sure the original file name and the line both appear in the report.
Check across lines: find cases where the last word of one line is repeated as the first word of the next line.
Find repeated words even when they differ in case (such as "The the"), and allow any amount of whitespace (spaces, tabs, newlines, and so on) between them.
Find repeated words even when they are separated by HTML tags (such as "it is <b>very</b> very important").
To solve these practical problems, we first write a regular expression that finds the text we want and ignores the text we do not need, and then use C# code to process the text it finds.
Even if you have never used regular expressions, you are almost certainly already familiar with the basic concept.
You know that report.txt is a specific file name, but if you have any UNIX or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. In this kind of file name pattern, some characters have special meanings: the asterisk matches any sequence of characters, and the question mark matches a single character. For example, "*.txt" means any file name that ends with .txt.
File name patterns are a form of pattern matching with a limited set of special characters. Search engines on the web also let you search content using a few special matching characters. Regular expressions use a much richer set of special characters to handle all kinds of complex problems.
First we introduce two position matching characters:
^: Represents the beginning of a line
$: Indicates the end of a line
For example, the expression "^cat" matches the word cat only when it appears at the beginning of a line. Note that "^" is a position character: it matches a location, not a character.
Similarly, the expression "cat$" matches the word cat at the end of a line.
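As a rough illustration of the two position characters, here is a minimal C# sketch. It assumes .NET 6 or later (so that top-level statements compile), and the sample strings are made up for this note. One detail worth knowing: by default .NET treats "^" and "$" as the start and end of the whole string, and RegexOptions.Multiline is needed to make them work line by line.

```csharp
using System;
using System.Text.RegularExpressions;

// "^cat" only matches when "cat" sits at the very start.
bool startsWithCat = Regex.IsMatch("cat nap", "^cat");
bool catInMiddle = Regex.IsMatch("a cat nap", "^cat");

// "cat$" only matches when "cat" sits at the very end.
bool endsWithCat = Regex.IsMatch("a fat cat", "cat$");

// Without Multiline, "^" means start of the whole string, not of each line.
string twoLines = "dog\ncat nap";
bool secondLineDefault = Regex.IsMatch(twoLines, "^cat");
bool secondLineMultiline = Regex.IsMatch(twoLines, "^cat", RegexOptions.Multiline);

Console.WriteLine($"{startsWithCat} {catInMiddle} {endsWithCat}");
Console.WriteLine($"{secondLineDefault} {secondLineMultiline}");
```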
Next, we introduce the square brackets "[]", which match any one of the characters listed inside them. For example:
The expression "[0123456789]" matches any one of the digits 0 to 9.
For example, to find all text containing gray or grey, we can write the expression "gr[ea]y". Here [ea] matches either e or a, not the whole sequence ea.
If we want to match the HTML heading tags <H1>, <H2>, and so on up to <H6>, we can write:
"<H[123456]>"
With the range symbol "-", we only need to give the boundary characters of a range. The HTML example above can then be written as: "<H[1-6]>"
And the expression "[0-9a-zA-Z]": is its meaning clear now? It matches any one digit, one of the 26 lowercase letters, or one of the 26 uppercase letters.
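The bracket expressions can be tried out with a short C# sketch like the one below. It assumes .NET 6 or later for top-level statements, and the test strings are invented for this note.

```csharp
using System;
using System.Text.RegularExpressions;

// [ea] stands for exactly one character: either "e" or "a".
var grayOrGrey = new Regex("gr[ea]y");
bool matchesGrey = grayOrGrey.IsMatch("grey");
bool matchesGray = grayOrGrey.IsMatch("gray");
bool matchesGreay = grayOrGrey.IsMatch("greay"); // two letters between r and y

// [0-9a-zA-Z] matches one digit or one ASCII letter at a time.
var alphanumeric = new Regex("[0-9a-zA-Z]");
int alphanumericCount = alphanumeric.Matches("a-1 Z!").Count;

Console.WriteLine($"{matchesGrey} {matchesGray} {matchesGreay} {alphanumericCount}");
```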
"^" Symbol in []
If you see an expression such as "[^0-9]", the "^" is no longer the position symbol described earlier. Inside brackets it is a negation symbol that means exclusion: the expression above matches any one character that is not a digit from 0 to 9.
Exercise 1: What does the expression "q[^u]" mean? Which of the following words will it match?
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
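If you want to check your answer, a small C# sketch like the following can do it. It assumes .NET 6 or later for top-level statements and uses the word list from the exercise in its usual lower-case spelling. The last two words, "Iraq" and "quit", are not in the list above; they are added here only for contrast. Keep in mind that "[^u]" still has to match a character, so a q at the very end of a word cannot satisfy the expression.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Words from the exercise, plus two extra words added for contrast.
string[] words =
{
    "Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph", "zaqqum",
    "Iraq", "quit" // not in the original list
};

// A lower-case q followed by any one character that is not u.
var qNotU = new Regex("q[^u]");

foreach (string word in words)
{
    Console.WriteLine($"{word}: {qNotU.IsMatch(word)}");
}

string[] matched = words.Where(w => qNotU.IsMatch(w)).ToArray();
Console.WriteLine($"{matched.Length} of {words.Length} words matched");
```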
Besides ranges, there is the dot character ".". A dot in an expression matches any single character.
For example, the expression "07.04.76" matches strings of the form:
07/04/76, 07-04-76, 07.04.76.
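A quick C# sketch of the dot, assuming .NET 6 or later for top-level statements. The last test string is made up to show that the dot accepts any character, including ones we probably did not have in mind when writing a date pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var looseDate = new Regex("07.04.76");

// The three forms listed above all match.
string[] dates = { "07/04/76", "07-04-76", "07.04.76" };
bool allDatesMatch = dates.All(d => looseDate.IsMatch(d));

// The dot is not limited to separators: letters match as well.
bool lettersMatch = looseDate.IsMatch("07a04b76");

Console.WriteLine($"{allDatesMatch} {lettersMatch}");
```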
If we need to choose among several alternatives, we can use the alternation character "|":
The alternation character means "or". For example, the expression "Bob|Robert" matches either Bob or Robert.
Now look at the expression from earlier, "gr[ea]y". Using alternation we can write "grey|gray", which has the same effect.
The use of parentheses: parentheses are also metacharacters in an expression. The previous expression can be written as "gr(e|a)y". The parentheses here are necessary: without them, the expression "gre|ay" would match gre or ay, which is not what we want. If this is not yet clear, look at the example below:
To find all lines in an email that start with From:, Subject:, or Date:, compare the following two expressions:
Expression 1: "^From|Subject|Date:"
Expression 2: "^(From|Subject|Date):"
Which one is what we want?
Clearly, expression 1 is not what we want: it matches From at the beginning of a line, or Subject anywhere, or Date: anywhere. Expression 2 uses parentheses to give us what we need.
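The difference between the two expressions shows up as soon as they are run against a line that merely contains one of the words. The sketch below assumes .NET 6 or later for top-level statements, and the two sample lines are invented for this note.

```csharp
using System;
using System.Text.RegularExpressions;

var ungrouped = new Regex("^From|Subject|Date:");
var grouped = new Regex("^(From|Subject|Date):");

string headerLine = "Subject: lunch on Friday";
string bodyLine = "Re: the Subject of our talk";

// Both expressions accept a real header line.
bool ungroupedHeader = ungrouped.IsMatch(headerLine);
bool groupedHeader = grouped.IsMatch(headerLine);

// Only the ungrouped expression is fooled by "Subject" in the middle of a line.
bool ungroupedBody = ungrouped.IsMatch(bodyLine);
bool groupedBody = grouped.IsMatch(bodyLine);

Console.WriteLine($"{ungroupedHeader} {groupedHeader} {ungroupedBody} {groupedBody}");
```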
Word boundary
We can already match text at the beginning and end of a line, but what if the position we want is not the start or end of a line? For that we need the word boundary symbol, "\b". The backslash cannot be omitted, otherwise the expression just matches the letter b. With the word boundary symbol, we can require that a match start or end at the edge of a word rather than in the middle of one. For example, in the string "this is a cat." the expression "\bis\b" matches the word "is" but not the "is" inside "this".
String boundary symbol
In addition to the position symbols above, if we need to match an entire string (possibly containing several words), we can use the following two symbols:
\A: Indicates the beginning of the string;
\z: Indicates the end of the string.
The expression "\Athis is a cat\z" matches the string "this is a cat".
An important concept when using boundary symbols is the word character: a character that can form part of a word, roughly any of [a-zA-Z0-9] plus the underscore. A word boundary is a position with a word character on one side and none on the other. So an expression such as "\bthis is a cat\b" also matches inside the sentence "this is a cat.", and the match does not include the final period. The expression "\Athis is a cat\z", in contrast, does not match that sentence, because the period sits between "cat" and the end of the string.
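The word and string boundary symbols can be compared side by side with a sketch like this one. It assumes .NET 6 or later for top-level statements, and it writes the patterns as verbatim strings (prefixed with @) so that the backslashes reach the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

string sentence = "this is a cat.";

// Without boundaries, "is" is found twice: inside "this" and as a word.
int plainCount = Regex.Matches(sentence, "is").Count;

// With \b on both sides, only the standalone word "is" is found.
MatchCollection wholeWords = Regex.Matches(sentence, @"\bis\b");
int wholeWordCount = wholeWords.Count;
int wholeWordIndex = wholeWords[0].Index;

// \A and \z pin the expression to the start and end of the string.
bool exactString = Regex.IsMatch("this is a cat", @"\Athis is a cat\z");
bool exactWithPeriod = Regex.IsMatch(sentence, @"\Athis is a cat\z");

// \b only needs a word edge, so the trailing period is simply left out.
string boundaryMatch = Regex.Match(sentence, @"\bthis is a cat\b").Value;

Console.WriteLine($"{plainCount} {wholeWordCount} {wholeWordIndex}");
Console.WriteLine($"{exactString} {exactWithPeriod} '{boundaryMatch}'");
```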
Repetition quantity symbols
Consider the expression "colou?r". It contains a question mark, which we have not seen in an expression so far (this question mark is different from the one in file name patterns). It is a quantity symbol: it says how many times the character before it may repeat. "?" means 0 or 1 times, so here the u may appear 0 or 1 times, and the expression matches "color" or "colour".
Here are the other repetition quantity symbols:
+: Indicates 1 or more times
*: Indicates 0 or more times
For example, to represent one or more spaces, we can write the expression " +".
What if you want to specify an exact number? We introduce the braces {}.
{n}: n is a specific number, indicating exactly n repetitions.
{n,m}: Indicates at least n and at most m repetitions.
These symbols set the number of repetitions of the single character before them. But what if you want to repeat several characters, such as a word? We use parentheses again. Earlier we used parentheses to limit the scope of alternation; here is another use, which is to form a group. In an expression such as "(this)", "this" is a group. That solves the problem: a quantity symbol placed after a group gives the number of repetitions of the whole group.
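The quantity symbols can be exercised with a sketch like the one below. It assumes .NET 6 or later for top-level statements. The "^" and "$" around some patterns and all of the sample strings are additions for this note, used only to make each result easy to read.

```csharp
using System;
using System.Text.RegularExpressions;

// "?": the u may appear 0 or 1 times.
var colour = new Regex("^colou?r$");
bool american = colour.IsMatch("color");
bool british = colour.IsMatch("colour");
bool doubleU = colour.IsMatch("colouur");

// "+" needs at least one space, "*" accepts none at all.
bool plusWithSpaces = Regex.IsMatch("a   b", "a +b");
bool plusNoSpace = Regex.IsMatch("ab", "a +b");
bool starNoSpace = Regex.IsMatch("ab", "a *b");

// {n} and {n,m} count repetitions of the preceding character class.
bool fourDigits = Regex.IsMatch("2024", "^[0-9]{4}$");
bool threeDigitsInRange = Regex.IsMatch("123", "^[0-9]{2,4}$");
bool fiveDigitsInRange = Regex.IsMatch("12345", "^[0-9]{2,4}$");

// A quantity symbol after a group repeats the whole group.
bool groupTwice = Regex.IsMatch("thisthis", "^(this){2}$");

Console.WriteLine($"{american} {british} {doubleU}");
Console.WriteLine($"{plusWithSpaces} {plusNoSpace} {starNoSpace}");
Console.WriteLine($"{fourDigits} {threeDigitsInRange} {fiveDigitsInRange} {groupTwice}");
```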
Now back to the problem of finding repeated words. If we want to find "the the", we can write an expression based on what we have learned so far:
"\bthe +the\b"
The expression allows one or more spaces between the two occurrences of the.
Similarly, we can also write:
"\b(the +){2}"
Note that this second form also requires at least one space after the second the.
But what if we want to find every possible repeated word? What we know so far is not enough. Here we introduce the concept of the backreference. We have seen that parentheses mark the boundaries of a group, and an expression can contain several groups. Groups are numbered by default according to the order in which they appear: the first group is number 1, and so on. A backreference refers to a group later in the expression using "\N", where N is the number of the referenced group. A backreference works like a variable in a program. Let's look at a concrete example. Using a backreference, the repeated-word expression above becomes:
"\b(the) +\1\b"
Now, to match any repeated word, we can rewrite the expression as: "\b([A-Za-z]+) +\1\b"
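Here is a minimal C# sketch of the repeated-word search using a backreference. It assumes .NET 6 or later for top-level statements, and the sample sentences are made up. The second regex adds RegexOptions.IgnoreCase, which goes one step beyond the expression above and is shown only as one possible way to handle the "The the" case from the story.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string line = "Paris in the the spring, and so so on";

// Group 1 captures a word; \1 demands the same text again after the spaces.
var doubled = new Regex(@"\b([A-Za-z]+) +\1\b");

List<string> repeatedWords = doubled.Matches(line)
    .Cast<Match>()
    .Select(match => match.Groups[1].Value)
    .ToList();

foreach (Match found in doubled.Matches(line))
{
    Console.WriteLine($"'{found.Value}' at index {found.Index}");
}

// With IgnoreCase, the backreference is compared without regard to case.
var doubledAnyCase = new Regex(@"\b([A-Za-z]+) +\1\b", RegexOptions.IgnoreCase);
bool strictCase = doubled.IsMatch("The the cat");
bool anyCase = doubledAnyCase.IsMatch("The the cat");

Console.WriteLine($"{strictCase} {anyCase}");
```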
The last question: what if the character we want to match is itself a symbol in regular expressions? The answer is the escape symbol "\". For example, to match a decimal point, write "\.". Also note that when you use the expression in a program, "\" must follow the rules for string literals: write it as "\\", or put @ in front of the string.
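The two ways of writing a backslash in a C# string can be compared directly, as in the sketch below (assuming .NET 6 or later for top-level statements). Regex.Escape, used at the end, is a .NET helper that is not mentioned in the text above; it is included as one possible way to escape a whole piece of literal text.

```csharp
using System;
using System.Text.RegularExpressions;

// Both literals hold the same two characters: a backslash and a dot.
string regularLiteral = "\\.";
string verbatimLiteral = @"\.";
bool sameText = regularLiteral == verbatimLiteral;

// An escaped dot matches only a real dot; a bare dot matches anything.
bool escapedMatchesDot = Regex.IsMatch("3.14", @"3\.14");
bool escapedMatchesX = Regex.IsMatch("3x14", @"3\.14");
bool bareMatchesX = Regex.IsMatch("3x14", "3.14");

// Regex.Escape adds the backslashes for us.
string escaped = Regex.Escape("1+1=2");
bool literalPlus = Regex.IsMatch("1+1=2", escaped);

Console.WriteLine($"{sameText} {escapedMatchesDot} {escapedMatchesX} {bareMatchesX}");
Console.WriteLine($"{escaped} {literalPlus}");
```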
This chapter covers only the basics of regular expressions for beginners, and only some of them. There is still a lot to learn, which the following chapters will introduce one by one. Learning regular expressions is not difficult, but mastering them takes patience and practice. Some people may say: "I don't want to know how a car works, I just want to learn how to drive." If you think that way, you will never really know how to use regular expressions to solve your problems, and you will never understand their true power.