Regular expression


Original: Steve Mansour Sman@scruznet.com Revised: June 5, 1999 (copied by JM / At / Jmason.org from http://www.scruz.net/%7ssman/regexp.htm, after the original disappeared!)

Translation: Neo Lee neo.lee@gmail.com October 16, 2004

Translator's note: The original text is in English. Many of its links have expired (mainly links to introductions and manuals for vi, sed and similar tools), so those links have been removed; if you need them, consult the original text. Apart from this, the translation follows the original, and passages marked "translator's note" are additions by the translator. If you have comments on the content, please contact Steve Mansour directly. Of course, if you only write Chinese, you can contact me.

Contents

What is a regular expression?
Examples
    Simple
    Intermediate (magic incantations)
    Hard (magical hieroglyphics)
Regular expressions in different tools

What is a regular expression?

A regular expression is a formula for matching a class of strings that follow some pattern. Many people are afraid to use regular expressions because they look odd and complicated, and unfortunately this article cannot change that. However, with a little practice I came to find these complicated expressions quite easy to write. Once you understand them, you can compress hours of laborious, error-prone text editing into a few minutes (or even a few seconds). Regular expressions are widely supported by text editors, class libraries (such as Rogue Wave's Tools.h++), scripting tools (such as awk, grep and sed), and increasingly by interactive IDEs such as Microsoft Visual C++.

The following sections explain the use of regular expressions through examples. Most of the examples are written as vi substitution commands or grep file search commands, but they are typical examples, and the concepts apply equally to sed, awk, Perl and other programs that support regular expressions. See the section "Regular expressions in different tools" for examples of regular expressions in other tools. A short description of the vi substitution command (s) is given at the end of this article for reference.

Regular expression basics

Regular expressions are made up of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits. Metacharacters have special meanings, which are explained below.

In the simplest case, a regular expression looks like an ordinary search string. For example, the regular expression "testing" contains no metacharacters. It matches the strings "testing" and "123testing", but it does not match "Testing".

To make really good use of regular expressions, a correct understanding of the metacharacters is essential. The following table lists the metacharacters with a short description of each.

Metacharacter    Description

.

Matches any single character. For example, the regular expression r.t matches the strings rat, rut and r t, but does not match root.

$

Matches the end of a line. For example, the regular expression weasel$ matches the end of the string "He's a weasel", but does not match the string "They are a bunch of weasels.".

^

Matches the beginning of a line. For example, the regular expression ^When in matches the beginning of the string "When in the courts", but does not match "What and When in the".

*

Matches zero or more occurrences of the character immediately before it. For example, the regular expression .* matches any number of any characters.

\

This is the quoting character. It makes the metacharacter that follows it match as an ordinary character. For example, the regular expression \$ matches the dollar sign rather than the end of a line. Similarly, the regular expression \. matches the period character rather than any single character.

[ ]

[c1-c2]

[^c1-c2]

Matches any one of the characters between the brackets. For example, the regular expression r[aou]t matches rat, rot and rut, but does not match ret. A hyphen inside the brackets specifies a range of characters: the regular expression [0-9] matches any digit. Several ranges can be given together: the regular expression [A-Za-z] matches any upper- or lower-case letter. Another important use is exclusion. To match any character except those in the range (the complement), put a caret (^) as the first character after the opening bracket. For example, the regular expression [^269A-Z] matches any character other than 2, 6, 9 and the upper-case letters.

\< \>

Matches the beginning (\<) or end (\>) of a word. For example, the regular expression \<...

\( \)

Defines the expression between \( and \) as a group and saves the characters it matches in a temporary holding area. Up to nine groups can be saved in one regular expression, and they can be referenced as \1 through \9.

|

Combines two match conditions with a logical OR. For example, the regular expression (him|her) matches the line "it belongs to him" and the line "it belongs to her", but does not match the line "it belongs to them.". Note: this metacharacter is not supported by all software.

+

Matches one or more occurrences of the character immediately before it. For example, the regular expression 9+ matches 9, 99, 999 and so on. Note: this metacharacter is not supported by all software.

?

Matches zero or one occurrence of the character immediately before it. Note: this metacharacter is not supported by all software.

\{i\}

\{i,j\}

Matches a specific number of occurrences of the character before it. For example, the regular expression A[0-9]\{3\} matches the character "A" followed by exactly three digits, such as A123 or A348, but not A1234. The regular expression [0-9]\{4,6\} matches any sequence of four, five or six digits. Note: this metacharacter is not supported by all software.
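As a rough check of the table above, the same patterns can be tried in Python's re module. This is only a sketch and not part of the original article: the helper name matches is hypothetical, and Python writes a count as {3} where vi and grep write \{3\}. The $ appended to the A[0-9]{3} pattern is also an addition made here, because without an anchor the pattern would find A123 inside A1234.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text (grep-style search)."""
    return re.search(pattern, text) is not None


# (pattern, text, expected result) triples taken from the table's examples.
checks = [
    (r"r.t", "r t", True),         # . matches any single character
    (r"r.t", "root", False),       # two characters sit between r and t
    (r"weasel$", "He's a weasel", True),
    (r"weasel$", "They are a bunch of weasels.", False),
    (r"^When in", "What and When in the", False),
    (r"r[aou]t", "ret", False),    # e is not in the bracket list
    (r"[^269A-Z]", "2A9", False),  # every character is in the excluded set
    (r"\$", "cost: $5", True),     # \ quotes the metacharacter
    (r"A[0-9]{3}$", "A123", True),
    (r"A[0-9]{3}$", "A1234", False),
]

for pattern, text, expected in checks:
    assert matches(pattern, text) == expected, (pattern, text)
```

Each triple is one claim from the table, so a failing assertion points directly at the pattern and string that disagree.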

The simplest metacharacter is the dot. It matches any single character (but not a newline). Suppose there is a file test.txt containing the following lines:

he is a rat

he is in a rut

the food is Rotten

I like root beer

We can use the grep command to test our regular expressions. grep tries to match the regular expression against each line of the given file and prints every line that contains at least one match. The command

grep r.t test.txt

searches each line of test.txt for the regular expression r.t and prints the matching lines. The regular expression r.t matches an r followed by any single character followed by a t. It therefore matches rat and rut in the file, but not the Rot in Rotten, because regular expressions are case sensitive. To match both upper- and lower-case letters, use a character range (square brackets). The regular expression [Rr] matches either R or r. So, to match an upper- or lower-case r followed by any character followed by t, use the expression [Rr].t.

To match characters at the beginning of a line, use the circumflex (^), sometimes called a caret. For example, to find the lines in test.txt that begin with "he", you might first try the simple expression he, but that would also match the "the" in the third line. Use the regular expression ^he instead, which matches only an h at the beginning of a line.

Sometimes it is easier to say what should not be matched. When the circumflex (^) is the first character inside square brackets, it means "any character except these". For example, to match he when it is not preceded by t or s (that is, to skip the and she), use [^st]he.

Several character ranges can be given inside the square brackets. For example, the regular expression [A-Za-z] matches any letter, upper or lower case. The regular expression [A-Za-z][A-Za-z]* matches a letter followed by zero or more letters (upper or lower case). The + metacharacter does the same thing: [A-Za-z]+ is exactly equivalent to [A-Za-z][A-Za-z]*. Note, however, that the + metacharacter is not supported by every program that supports regular expressions. See the section on regular expression syntax support for details.

To specify a number of occurrences, use braces (which must be escaped with a backslash). For example, to match 100 and 1000 but not 10, use 10\{2,3\}. This regular expression matches the digit 1 followed by two or three 0s.
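The grep walk-through above can be reproduced with a short Python sketch. It is an illustration only: the function name grep and the list TEST_LINES are hypothetical stand-ins for the grep command and the file test.txt, and the exact spelling of the four lines is assumed from the description in the text. Python writes the brace count as {2,3} rather than \{2,3\}.

```python
import re

# Assumed contents of test.txt (spelling taken from the description in the text).
TEST_LINES = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]


def grep(pattern: str, lines: list[str]) -> list[str]:
    """Return the lines that contain at least one match, as grep would print them."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


print(grep(r"r.t", TEST_LINES))      # the rat and rut lines
print(grep(r"[Rr].t", TEST_LINES))   # additionally the Rotten line
print(grep(r"^he", TEST_LINES))      # only lines that begin with he
print(grep(r"[^st]he", TEST_LINES))  # no line: every he is at line start or after t

# Brace counts: 1 followed by two or three zeros.
print(re.findall(r"10{2,3}", "10 100 1000"))
```

Note that [^st]he needs some character in front of he, so it does not match he at the very start of a line.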
A useful variation on the \{i,j\} form is to leave out the second number. For example, the regular expression 0\{3,\} matches three or more consecutive 0s.

Simple examples

Here are some representative, relatively simple examples.

vi command    What it does

:%s/  */ /g
Replaces one or more spaces with a single space (the pattern is two spaces followed by *).

:%s/ *$//
Removes all spaces at the end of each line.

:%s/^/ /
Inserts a space at the beginning of every line.

:%s/^[0-9][0-9]*//
Removes all digits at the beginning of each line.

:%s/b[aeio]g/bug/g
Changes every bag, beg, big and bog to bug.

:%s/t\([aou]\)g/h\1t/g
Changes every tag, tog and tug to hat, hot and hut respectively (note the use of a group and of \1 to refer to the character that was matched).

Intermediate examples (magic incantations)

Example 1

Change every instance of foo(a,b,c) to foo(b,a,c), where a, b and c can be any arguments passed to the method foo(). That is, we want to perform this conversion:

Before                     After
foo(10,7,2)                foo(7,10,2)
foo(x+13,y-2,10)           foo(y-2,x+13,10)
foo(bar(8),x+y+z,5)        foo(x+y+z,bar(8),5)

The following substitution command performs this magic:

:%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g

Now let's take it apart and analyze it. The basic idea of the expression is to identify foo() and the positions of the three arguments inside its parentheses. The first argument is identified by the expression \([^,]*\), which we can analyze from the inside out:

[^,]            any character except a comma
[^,]*           zero or more non-comma characters
\([^,]*\)       tags these non-comma characters as \1 so that they can be referenced in the replacement part of the command
\([^,]*\),      we must find zero or more non-comma characters followed by a comma; the non-comma characters are the ones that are tagged

This is a good moment to point out one of the most common problems people have with regular expressions. Why did we use the expression [^,]*, rather than something simpler and more direct such as .*, to match the first argument? Imagine applying the pattern .*, to the string "10,7,2". Should it match "10," or "10,7,"? To resolve this ambiguity, a regular expression always matches the longest possible string. In this example that is "10,7,", which obviously picks up two arguments instead of the one we wanted. So we use [^,]*, to force the match to stop at the first comma.

The expression analyzed so far is foo(\([^,]*\), which can be read as "after finding foo(, tag all characters up to the first comma as \1". We tag the second argument as \2 in the same way. The third argument is tagged in the same way too, except that we search for all characters up to the right parenthesis. Searching for the third argument may seem unnecessary, because we do not move it, but this pattern guarantees that we only change calls to foo() that have three arguments. Being explicit in this way is the safer choice when foo() is an overloaded method. Finally, in the replacement part, we write out foo() again using the tagged parts, with the first and second arguments swapped.
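The substitution analyzed above, and the longest-match behaviour it works around, can be sketched in Python. This is an illustration under assumptions: the names SWAP and swap_first_two are hypothetical, and Python's re syntax reverses vi's convention, so literal parentheses are escaped while grouping parentheses are not.

```python
import re

# vi:  :%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g
# Python: \( and \) are the literal parentheses, ( and ) are the groups.
SWAP = re.compile(r"foo\(([^,]*),([^,]*),([^)]*)\)")


def swap_first_two(source: str) -> str:
    """Swap the first two arguments of three-argument foo() calls."""
    return SWAP.sub(r"foo(\2,\1,\3)", source)


print(swap_first_two("foo(10,7,2)"))       # foo(7,10,2)
print(swap_first_two("foo(x+13,y-2,10)"))  # foo(y-2,x+13,10)
print(swap_first_two("foo(1,2)"))          # unchanged: only two arguments

# Longest match: .*, runs to the last comma, [^,]*, stops at the first one.
print(re.match(r".*,", "10,7,2").group())     # 10,7,
print(re.match(r"[^,]*,", "10,7,2").group())  # 10,
```

The last two lines show why [^,]* is used for each argument: the dot version swallows two arguments at once.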
Example 2

Suppose we have a CSV (comma separated value) file that contains information we need, but in the wrong format. The columns are currently in this order: name, company name, state abbreviation, zip code. We want to reorganize the data so that it can be used by one of our programs, which needs this format: name, state abbreviation and zip code, company name. In other words, we want to reorder the columns and also merge two of them into a new column. In addition, our software does not accept any whitespace (spaces or tabs) before or after a comma, so we must also remove all whitespace around the commas.

Here is our current data:

Bill Jones, HI-TEK Corporation, CA, 95011

Sharon Lee Smith, Design Works Incorporated, CA, 95012

B. Amos, Hill Street Cafe, CA, 95013

Alexander Weatherworth, The Crafts Store, CA, 95014

...

We want to turn it into this:

Bill Jones,CA 95011,HI-TEK Corporation

Sharon Lee Smith,CA 95012,Design Works Incorporated

B. Amos,CA 95013,Hill Street Cafe

Alexander Weatherworth,CA 95014,The Crafts Store

...

We will use two regular expressions to solve this problem. The first moves and merges the columns, and the second removes the whitespace. Here is the first substitution command:

:%s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/

The method is essentially the same as in Example 1. The first column (the name) is matched by the expression \([^,]*\), that is, all characters before the first comma, and the name is tagged as \1. The company name and state abbreviation fields are tagged as \2 and \3 in the same way, and the last field is matched by \(.*\) ("match all characters up to the end of the line") and tagged as \4. The replacement part builds the new line from these tagged pieces. The following substitution command removes the whitespace:

:%s/[ \t]*,[ \t]*/,/g
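The two substitutions for this example can be sketched in Python as follows. The function name reorder is hypothetical, and the sketch makes one deliberate change that should be read as an assumption: it runs the whitespace pass first, so that the merged column contains exactly one space. The article applies the two commands in the opposite order.

```python
import re


def reorder(line: str) -> str:
    """Trim whitespace around commas, then emit name, 'state zip', company."""
    # vi: :%s/[ \t]*,[ \t]*/,/g   (run first here; the article runs it second)
    line = re.sub(r"[ \t]*,[ \t]*", ",", line)
    # vi: :%s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/
    return re.sub(r"([^,]*),([^,]*),([^,]*),(.*)", r"\1,\3 \4,\2", line, count=1)


print(reorder("Bill Jones, HI-TEK Corporation, CA, 95011"))
print(reorder("B. Amos, Hill Street Cafe, CA, 95013"))
```

With count=1 the second call mirrors a vi substitution without the g flag, which changes only the first match on a line.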

Breaking the whitespace pattern down: [ \t] matches a space or a tab; [ \t]* matches zero or more spaces or tabs; [ \t]*, matches zero or more spaces or tabs followed by a comma; and finally [ \t]*,[ \t]* matches zero or more spaces or tabs, then a comma, then zero or more spaces or tabs. In the replacement part we simply replace everything that was found with a single comma. The optional g flag at the end means that every match on each line is replaced (instead of only the first match).

Example 3

Suppose a multi-word phrase is repeated several times, for example:

Billy tried really hard

Sally tried really really hard

Timmy tried really really really hard

Johnny tried really really really really hard

Suppose you want to change "really", "really really", and any number of consecutive "really" strings into a simple "very" (simple is good!). The following command:

:%s/\(really \)\(really \)*/very /

changes the text above to:

Billy tried very hard

Sally tried very hard

Timmy tried very hard

Johnny tried very hard
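A minimal Python sketch of this substitution follows. The function name simplify is hypothetical; the pattern is the vi pattern with the backslashes removed from the group parentheses, as Python's re module requires.

```python
import re


def simplify(line: str) -> str:
    """Collapse one or more consecutive 'really ' into a single 'very '."""
    # vi: :%s/\(really \)\(really \)*/very /
    return re.sub(r"(really )(really )*", "very ", line)


for sentence in [
    "Billy tried really hard",
    "Sally tried really really hard",
    "Johnny tried really really really really hard",
]:
    print(simplify(sentence))
```

In Python the same effect could also be written as (really )+, which plays the role that + plays in tools that support it.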

The expression \(really \)* matches zero or more consecutive occurrences of "really " (note the trailing space), and \(really \)\(really \)* matches one or more consecutive occurrences of "really ".

Hard examples (magical hieroglyphics)

Coming soon.

Regular expressions in different tools

OK, you are ready to use REs (regular expressions), but you do not use vi. So here are some examples of using REs in other tools. I will also summarize the differences between the REs in different programs.

You can also use REs in the Visual C++ editor. Choose Edit->Replace and check the "Regular expression" box. The Find What field corresponds to the pat1 part of the vi command :%s/pat1/pat2/g described above, and the Replace field corresponds to the pat2 part. To get the effect of vi's range and g option, use Replace All, or an appropriate combination of Find Next and Replace. (Translator's note: you can select a range of text in VC and then replace within it, but this is still not as flexible or elegant as vi.)

sed

sed is short for Stream EDitor. It is a UNIX editing tool that works on files and pipes; full details are in the sed manual. Here are some interesting sed scripts, assuming that we are processing a file called price.txt. Note that these edits do not change the source file: sed processes each line of the source file and writes the result to standard output (which is of course easy to redirect to a file).

sed script    Description

sed '/^$/d' price.txt
Deletes all empty lines.

sed '/^[ \t]*$/d' price.txt
Deletes all lines that contain only spaces or tabs.

sed 's/"//g' price.txt
Deletes all quotation marks.

awk

awk is a programming language that can be used for complex processing of text data. Detailed information about awk is in its manual. The odd name is an abbreviation of its authors' surnames (Aho, Weinberger and Kernighan). The book by Aho, Weinberger and Kernighan contains many very good awk examples, so please do not let the trivial scripts below limit your view of what awk can do. We again assume that we are processing the file price.txt. Like sed, awk only displays its results on the terminal.

awk script    Description

awk '$0 !~ /^$/' price.txt
Deletes all empty lines.

awk 'NF > 0' price.txt
A better way to delete all empty lines in awk.

awk '$2 ~ /^[JT]/ {print $3}' price.txt
For every line whose second field begins with 'J' or 'T', prints the third field.

awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt
For every line whose second field does not contain 'Misc' or 'misc', prints the sum of columns 3 and 4 (assumed to be numbers).

awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt
Prints every line whose third field is not a number. Here a number means digits followed by a decimal point and optional further digits (d.d or d.), where d is any digit from 0 to 9.

awk '$2 ~ /John|Fred/ {print $0}' price.txt
Prints the entire line if the second field contains 'John' or 'Fred'.

grep

grep is a program that searches for REs in one or more files or in an input stream. It can be used on files and pipes. Full information about grep is in its manual. The equally odd name comes from a vi command, g/re/p, which means global regular expression print.

In the examples below we assume that the file phone.txt contains the following text. Each line is a last name followed by a comma, then a first name, then a tab, then a phone number:

Francis, John 5-3871

Wong, Fred 4-4123

Jones, Thomas 1-4122

Salazar, Richard 5-2522

grep command    Description

grep '\t5-...1' phone.txt
Prints all lines with a phone number that begins with 5 and ends with 1. Note that the tab is written as \t.

grep '^S[^ ]* R' phone.txt
Prints all lines where the last name begins with S and the first name begins with R.

grep '^[JW]' phone.txt
Prints all lines where the last name begins with J or W.

grep ', ....\t' phone.txt
Prints all lines where the first name is four characters long. Note the use of \t for the tab.

grep -v '^[JW]' phone.txt
Prints all lines that do not begin with J or W.

grep '^[M-Z]' phone.txt
Prints all lines where the last name begins with a letter from M to Z.

grep '^[M-Z].*[12]$' phone.txt
Prints all lines where the last name begins with a letter from M to Z and the phone number ends with 1 or 2.

egrep

egrep is an extended version of grep that supports more metacharacters in its regular expressions. In the examples below we assume that the file phone.txt contains the following text. Each line is a last name followed by a comma, then a first name, then a tab, then a phone number:

Francis, John 5-3871

Wong, Fred 4-4123

Jones, Thomas 1-4122

Salazar, Richard 5-2522

egrep command    Description

egrep '(John|Fred)' phone.txt
Prints all lines that contain the name John or Fred.

egrep 'John|22$|^W' phone.txt
Prints all lines that contain John, or end with 22, or begin with W.

egrep 'net(work)?s' report.txt
Finds all lines in report.txt that contain networks or nets.

Regular expression syntax support

Command or environment    .  [ ]  ^  $  \( \)  \{ \}  ?  +  |  ( )

vi        x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
Tcl       X X X X X X X X X
ex        X X X X X X
grep      X X X X X X
egrep     X X X X X X X X X
fgrep     X X X X X
perl      X X X X X X X X X
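As a cross-check of the grep and egrep examples in this section, the same patterns can be run against the phone.txt data in Python. This is a sketch under assumptions: the names PHONE and grep are hypothetical, the data lines are rebuilt from the description (last name, comma, first name, tab, phone number), and Python's re syntax is closest to egrep, so | and ( ) need no backslashes.

```python
import re

# phone.txt rebuilt from the description; \t is the tab before the number.
PHONE = [
    "Francis, John\t5-3871",
    "Wong, Fred\t4-4123",
    "Jones, Thomas\t1-4122",
    "Salazar, Richard\t5-2522",
]


def grep(pattern, lines, invert=False):
    """Return lines containing a match; invert=True mimics grep -v."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


def last_names(lines):
    """Helper for compact output: keep only the text before the comma."""
    return [line.split(",")[0] for line in lines]


print(last_names(grep(r"\t5-...1", PHONE)))            # number starts 5, ends 1
print(last_names(grep(r"^S[^ ]* R", PHONE)))           # last name S, first name R
print(last_names(grep(r", ....\t", PHONE)))            # four-letter first names
print(last_names(grep(r"^[JW]", PHONE, invert=True)))  # not starting with J or W
print(last_names(grep(r"(John|Fred)", PHONE)))         # egrep alternation
```

Because each pattern is searched for anywhere in the line, an anchor such as ^ or $ is what ties a condition to the start or end of the line.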

Introduction to the vi substitution command

The vi substitution command has this form:

:range s/pat1/pat2/g

where

: is vi's command prompt.

range is the range of lines on which the command operates. Use the percent sign (%) for all lines, a period (.) for the current line, and the dollar sign ($) for the last line. You can also use line numbers, for example

10,20 means lines 10 through 20,

.,$ means from the current line to the last line,

.+2,$-5 means from two lines after the current line to the fifth line from the end, and so on.

s indicates that this is a substitution command.

pat1 is the regular expression to search for. There are many examples in this article.

pat2 is the replacement pattern that the matched string is changed into. There are many examples in this article.

g is an optional flag. With this flag, every match on a line is replaced; without it, only the first match on each line is replaced.
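The roles of range and the g flag can be sketched in Python. This is an illustration only, not how vi is implemented: the function substitute and its parameters start, end and every are hypothetical names, the range is given as 1-based line numbers, and pat1 is written in Python's re syntax rather than vi's.

```python
import re


def substitute(lines, pat1, pat2, start=1, end=None, every=False):
    """Replace pat1 with pat2 on lines start..end (1-based, inclusive).

    every=True plays the role of vi's g flag; otherwise only the first
    match on each line in the range is replaced.
    """
    end = len(lines) if end is None else end
    count = 0 if every else 1  # for re.sub, count=0 means "replace all"
    result = []
    for number, line in enumerate(lines, start=1):
        if start <= number <= end:
            line = re.sub(pat1, pat2, line, count=count)
        result.append(line)
    return result


text = ["a-a", "b-b", "a-a"]
print(substitute(text, "a", "x"))                       # like :%s/a/x/
print(substitute(text, "a", "x", every=True))           # like :%s/a/x/g
print(substitute(text, "a", "x", start=3, every=True))  # like :3,$s/a/x/g
```

The default range covers every line, which corresponds to % in the vi command.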

Many online manuals are available on the Internet; consult them for more complete information.
