Create a solution that meets the following requirements:
Accept any number of files to check, and report the lines in each file that contain repeated words, highlighting the duplicates and showing the original file name alongside each reported line.
Check across line boundaries: find cases where the last word of one line is repeated as the first word of the next line.
Find duplicate words regardless of case (such as "The the"), allowing any amount of whitespace (spaces, tabs, newlines, and so on) between the duplicates.
Find repeated words even when they are separated by HTML tags (such as: "it is <b>very</b> very important").
To solve these practical problems, we first write a regular expression that finds the text we want and ignores the text we do not need, and then use C# code to process the text it returns.
Before using regular expressions, you need to know what a regular expression is. Even if you have never used one, you are almost certainly familiar with the basic idea.
You know that Report.txt is a specific file name, but if you have any UNIX or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. In file name patterns of this kind, some characters have special meanings: the asterisk matches any sequence of characters, and the question mark matches a single character. For example, "*.txt" matches any file name ending in .txt.
File name patterns like this offer only a limited form of matching. Many search engines on the web also let you search content using a few special matching characters. Regular expressions use a much richer set of matching characters to handle all kinds of complex problems.
First we introduce two position-matching characters:
^: matches the beginning of a line
$: matches the end of a line
For example, the expression "^cat" matches "cat" only when it appears at the beginning of a line. Note that "^" is a position character: it matches a position, not a character.
Similarly, the expression "cat$" matches "cat" only when it appears at the end of a line.
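The two anchors can be tried out with a minimal C# sketch. The sample strings are made up for illustration, and the use of RegexOptions.Multiline to make the anchors apply to every line is an addition not discussed in the text:

```csharp
using System;
using System.Text.RegularExpressions;

// "^cat" succeeds only when "cat" sits at the very start of the input.
bool atStart = Regex.IsMatch("cat nap", "^cat");        // true
bool notAtStart = Regex.IsMatch("a cat nap", "^cat");   // false

// "cat$" succeeds only when "cat" sits at the end of the input.
bool atEnd = Regex.IsMatch("my cat", "cat$");           // true

// With RegexOptions.Multiline, ^ and $ apply to every line
// rather than only to the string as a whole.
int linesStartingWithCat =
    Regex.Matches("cat one\ncat two\nno cat", "^cat", RegexOptions.Multiline).Count; // 2

Console.WriteLine($"{atStart} {notAtStart} {atEnd} {linesStartingWithCat}");
```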
Next, we introduce square brackets "[]", which match any one of the characters listed inside them. For example:
The expression "[0123456789]" matches any single digit from 0 to 9.
As another example, to find all text containing gray or grey, we can write: "gr[ea]y"
Here [ea] matches either e or a, not the whole sequence "ea".
If we want to match the HTML heading tags <H1> through <H6>, we can write:
"<H[123456]>"
With the range symbol "-", we only need to give the boundary characters of a range, so the HTML example above can be written as: "<H[1-6]>"
What about the expression "[0-9a-zA-Z]"? Its meaning should now be clear: it matches any one character that is a digit, one of the 26 lowercase letters, or one of the 26 uppercase letters.
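A short C# sketch of bracket expressions, using made-up test strings, might look like this. It only illustrates that a bracket expression consumes exactly one character:

```csharp
using System;
using System.Text.RegularExpressions;

var gray = new Regex("gr[ea]y");

bool matchesGrey = gray.IsMatch("grey");    // true: e is in the set
bool matchesGray = gray.IsMatch("gray");    // true: a is in the set
bool matchesGruy = gray.IsMatch("gruy");    // false: u is not in the set
bool matchesGreay = gray.IsMatch("greay");  // false: [ea] stands for one character only

// A range-based set: one digit or one ASCII letter.
bool letter = Regex.IsMatch("x", "[0-9a-zA-Z]");       // true
bool punctuation = Regex.IsMatch("?!", "[0-9a-zA-Z]"); // false

Console.WriteLine($"{matchesGrey} {matchesGray} {matchesGruy} {matchesGreay} {letter} {punctuation}");
```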
"^" Symbol in []
If you see an expression such as "[^0-9]", the "^" is no longer the position symbol described earlier. Inside brackets it is a negation symbol, so the expression matches any one character that is not a digit from 0 to 9. Exercise 1: what does the expression "q[^u]" mean? Which of the following words does it match?
Iraqi
Iraqian
miqra
Qasida
Qintar
qoph
Zaqqum
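One way to check an answer to the exercise is to run the expression over the words. In this sketch the words are written in lowercase apart from Iraqi and Iraqian, which is an assumption about the intended spelling; matching is case-sensitive by default, so a capital Q would not match "q". The words "Iraq" and "quit" are extras added here for contrast and are not part of the original list:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] words =
{
    "Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph", "zaqqum", // sample words (assumed casing)
    "Iraq", "quit"                                                      // hypothetical extras for contrast
};

// "q[^u]" needs a lowercase q followed by one character that is not u.
string[] matched = words.Where(w => Regex.IsMatch(w, "q[^u]")).ToArray();
string[] unmatched = words.Except(matched).ToArray();

Console.WriteLine("match:    " + string.Join(", ", matched));
Console.WriteLine("no match: " + string.Join(", ", unmatched));
```

"Iraq" fails because "[^u]" must still match some character, and nothing follows the q. "zaqqum" succeeds because the first q is followed by another q.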
Besides ranges, there is the dot character ".". A dot in an expression matches any single character (by default, any character except a newline).
For example, the expression "07.04.76" matches strings such as:
07/04/76, 07-04-76, 07.04.76.
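Because the dot accepts any character, it can match more than intended. The following sketch compares the dotted expression with a stricter bracket-based variant; the stricter pattern and the input "07104176" are illustrative assumptions, not part of the original example:

```csharp
using System;
using System.Text.RegularExpressions;

string loose = "07.04.76";           // dot: any single character
string strict = "07[-./]04[-./]76";  // only -, . or / as the separator

string[] inputs = { "07/04/76", "07-04-76", "07.04.76", "07104176" };

foreach (string s in inputs)
{
    bool looseHit = Regex.IsMatch(s, loose);
    bool strictHit = Regex.IsMatch(s, strict);
    Console.WriteLine($"{s}: loose={looseHit}, strict={strictHit}");
}
```

Inside brackets the dot is an ordinary character, and a hyphen placed first is taken literally rather than as a range symbol.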
To choose among several alternatives, we can use the alternation character "|":
It means "or". For example, the expression "Bob|Robert" matches either Bob or Robert.
Now look again at the earlier expression "gr[ea]y". Using alternation, we can write "grey|gray", which matches the same text.
Using parentheses: parentheses are also metacharacters in an expression. The previous expression can be written as "gr(e|a)y". The parentheses are necessary here: without them, the expression "gre|ay" matches gre or ay, which is not what we want. If this is not yet clear, consider the following example:
To find all lines in an email that start with From:, Subject:, or Date:, compare the following two expressions:
Expression 1: "^From|Subject|Date:"
Expression 2: "^(From|Subject|Date):"
Which one is what we want?
Expression 1 is clearly not what we want: it matches From at the beginning of a line, or Subject anywhere, or Date: anywhere. Expression 2 uses parentheses to group the alternatives, so it meets our needs.
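The difference between the two expressions can be seen in a small C# sketch. The header names From, Subject and Date and the sample lines are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var loose = new Regex("^From|Subject|Date:");     // three unrelated alternatives
var grouped = new Regex("^(From|Subject|Date):"); // ^ and : apply to every alternative

string[] lines =
{
    "From: alice@example.com", // a real header: both match
    "Subject: lunch",          // a real header: both match
    "From here on",            // no colon: only the loose form matches
    "Re: the Subject line",    // "Subject" mid-line: only the loose form matches
    "Due Date: Friday"         // "Date:" mid-line: only the loose form matches
};

foreach (string line in lines)
    Console.WriteLine($"{line,-26} loose={loose.IsMatch(line)} grouped={grouped.IsMatch(line)}");

// The same precedence rule explains why "gre|ay" is not "gr(e|a)y".
string accidental = Regex.Match("okay", "gre|ay").Value; // "ay"
bool intended = Regex.IsMatch("okay", "gr(e|a)y");       // false
```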
Word boundary
We can already anchor a match to the beginning or end of a line. What if we want to anchor it somewhere other than the start or end of a line? For that we need the word boundary symbol, "\b". The backslash must not be omitted, otherwise the expression simply matches the letter b. With the word boundary symbol, we can require a match to begin or end at the edge of a word rather than in the middle of one. For example, in the string "this is a cat.", the expression "\bis\b" matches the word "is" but not the "is" inside "this".
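The effect of the word boundary can be checked with a minimal C# sketch on the same sentence. The pattern is written as a verbatim string (prefixed with @) so that the backslash reaches the regex engine unchanged:

```csharp
using System;
using System.Text.RegularExpressions;

string text = "this is a cat.";

// Without boundaries, "is" is also found inside "this".
MatchCollection plain = Regex.Matches(text, "is");

// With boundaries, only the standalone word is found.
MatchCollection bounded = Regex.Matches(text, @"\bis\b");

Console.WriteLine($"plain:   {plain.Count} matches");
Console.WriteLine($"bounded: {bounded.Count} match at index {bounded[0].Index}");
```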
String boundary symbol
In addition to the position symbols above, if we want to match against an entire string (which may contain several words), we can use the following two symbols:
\A: matches the beginning of the string;
\z: matches the end of the string.
The expression "\Athis is a cat\z" matches the string "this is a cat".
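A brief C# sketch shows how the string anchors behave. The comparison with "^" and "$" under RegexOptions.Multiline is an added illustration; the sample strings are made up:

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\Athis is a cat\z";

bool exact = Regex.IsMatch("this is a cat", pattern);             // true
bool trailingPeriod = Regex.IsMatch("this is a cat.", pattern);   // false: "." precedes the end
bool leadingText = Regex.IsMatch("look, this is a cat", pattern); // false: text precedes "this"

// \A and \z always refer to the whole string, even in Multiline mode,
// whereas ^ and $ then match at every line.
string twoLines = "first line\nthis is a cat";
bool stringAnchors = Regex.IsMatch(twoLines, pattern, RegexOptions.Multiline);           // false
bool lineAnchors = Regex.IsMatch(twoLines, "^this is a cat$", RegexOptions.Multiline);   // true

Console.WriteLine($"{exact} {trailingPeriod} {leadingText} {stringAnchors} {lineAnchors}");
```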
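When using boundary symbols, one concept is important: word characters. Word characters are the characters that can make up a word, that is, any of [a-zA-Z0-9] (the .NET "\w" class also includes the underscore). A word boundary is a position between a word character and a non-word character. An expression such as "\bthis is a cat\b" therefore also matches within the sentence "this is a cat.", and the matched text does not include the final period. By contrast, "\Athis is a cat\z" does not match that sentence, because the period sits between "cat" and the end of the string.
Repetition quantifiers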
Consider the expression "colou?r". It contains a question mark, which we have not met before (it is not the same as the question mark in file name patterns). It is a quantifier: it specifies how many times the character immediately before it may be repeated. "?" means 0 or 1 times, so here the u may appear 0 or 1 times and the expression matches "color" or "colour".
Here are the other quantifiers:
+: 1 or more times
*: 0 or more times
For example, to match one or more spaces, we can write the expression " +".
What if you want to specify an exact number? For that we introduce braces {}.
{n}: n is a specific number; repeat exactly n times.
{n,m}: repeat at least n times and at most m times.
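The quantifiers can be compared side by side in a short C# sketch. The test strings are made up, and each pattern is wrapped in "^" and "$" so that the whole input has to match:

```csharp
using System;
using System.Text.RegularExpressions;

// ? : the preceding character is optional (0 or 1 times)
bool color = Regex.IsMatch("color", "^colou?r$");      // true
bool colour = Regex.IsMatch("colour", "^colou?r$");    // true
bool colouur = Regex.IsMatch("colouur", "^colou?r$");  // false

// * allows zero occurrences, + requires at least one
bool starZero = Regex.IsMatch("ac", "^ab*c$");         // true
bool plusZero = Regex.IsMatch("ac", "^ab+c$");         // false
bool spaces = Regex.IsMatch("a    b", "^a +b$");       // true

// {n} and {n,m} give explicit counts
bool exactlyThree = Regex.IsMatch("123", "^[0-9]{3}$");     // true
bool fourForThree = Regex.IsMatch("1234", "^[0-9]{3}$");    // false
bool twoToFour = Regex.IsMatch("12", "^[0-9]{2,4}$");       // true
bool fiveForFour = Regex.IsMatch("12345", "^[0-9]{2,4}$");  // false

Console.WriteLine($"{color} {colour} {colouur} {starZero} {plusZero} {spaces}");
Console.WriteLine($"{exactlyThree} {fourForThree} {twoToFour} {fiveForFour}");
```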
These symbols apply to the single character immediately before them. But what if you want to repeat several characters, such as a whole word? We use parentheses again. Earlier we used parentheses to delimit alternatives; here is their other use, which is to form a group, as in the expression "(this)". With a group, the problem is solved: a quantifier placed after a group specifies how many times the whole group is repeated.
Now, returning to the problem of finding repeated words: to find "the the", we can write an expression based on what we have learned so far:
"\bthe +the\b"
This expression requires one or more spaces between the two occurrences of "the".
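A minimal C# sketch of this fixed-word expression, with made-up inputs, shows what it does and does not accept. Note that " +" means literal spaces only, so a tab between the words is not matched:

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\bthe +the\b";

bool oneSpace = Regex.IsMatch("find the the here", pattern);      // true
bool manySpaces = Regex.IsMatch("find the    the here", pattern); // true
bool partOfWord = Regex.IsMatch("the theory", pattern);           // false: no boundary after the second "the"
bool insideWord = Regex.IsMatch("bathe the cat", pattern);        // false: no boundary before the "the" in "bathe"
bool tabBetween = Regex.IsMatch("the\tthe", pattern);             // false: a tab is not a space

Console.WriteLine($"{oneSpace} {manySpaces} {partOfWord} {insideWord} {tabBetween}");
```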
Similarly, we can also write:
"\b(the +){2}"
Note that this form also requires at least one space after the second "the".
But what if we want to find every possible repeated word? What we know so far is not enough to solve this. Here we introduce the backreference. We have seen that parentheses delimit a group. An expression can contain several groups, and each is assigned a group number by default according to the order in which it appears: the first group is number 1, and so on. A backreference refers to a group later in the expression by writing "\n", where n is the number of the referenced group; it matches the same text that the group matched. A backreference works like a variable in a program. Here is a concrete example:
The repeated-word expression above can now be written with a backreference:
"\b(the) +\1\b"
To match all repeated words, we can rewrite the expression as: "\b([a-zA-Z]+) +\1\b"
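The general expression can be tried in C# as a rough sketch. The sample sentence, the asterisks used for highlighting, and the use of RegexOptions.IgnoreCase for the case requirement are illustrative choices, not necessarily the author's final solution:

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\b([a-zA-Z]+) +\1\b";
string text = "This is is a test of the the pattern.";

// Group 1 holds the word; \1 demands the same text again.
MatchCollection found = Regex.Matches(text, pattern);
foreach (Match m in found)
    Console.WriteLine($"'{m.Value}' repeats '{m.Groups[1].Value}'");

// Highlight each duplicate pair; $0 in the replacement is the whole match.
string highlighted = Regex.Replace(text, pattern, "*$0*");
Console.WriteLine(highlighted); // This *is is* a test of *the the* pattern.

// A backreference compares text exactly unless IgnoreCase is set.
bool caseSensitive = Regex.IsMatch("The the end", pattern);                        // false
bool ignoreCase = Regex.IsMatch("The the end", pattern, RegexOptions.IgnoreCase);  // true
```

Because " +" accepts only spaces, this sketch does not yet handle duplicates split across lines or separated by HTML tags.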
The last question: what if the character we want to match is itself a regular expression symbol? The answer is to use the escape symbol "\". For example, to match a literal period, write "\.". Also note that when you use an expression in program code, the string literal rules apply, so "\" must be written as "\\", or the string must be prefixed with @.
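The two ways of writing an escaped pattern in C# can be compared in a small sketch. The sample strings are made up; Regex.Escape is a .NET helper that adds the escapes for you and is mentioned here as an extra, not something the text above relies on:

```csharp
using System;
using System.Text.RegularExpressions;

string regular = "\\.";   // regular literal: the backslash is doubled
string verbatim = @"\.";  // verbatim literal: the backslash is written once
bool sameString = regular == verbatim;                   // true

bool unescaped = Regex.IsMatch("3x14", "3.14");          // true: dot matches any character
bool escaped = Regex.IsMatch("3x14", @"3\.14");          // false: a literal period is required
bool literalPeriod = Regex.IsMatch("3.14", @"3\.14");    // true

// Regex.Escape inserts the backslashes for text that should be matched literally.
string auto = Regex.Escape("3.14");                      // same string as @"3\.14"

Console.WriteLine($"{sameString} {unescaped} {escaped} {literalPeriod} {auto}");
```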