Unveiling the mystery of regular expressions

xiaoxiao, 2021-03-06

Regular expressions (REs) are often mistakenly thought of as a mysterious language that only a few people understand. On the surface they do look messy: if you don't know the grammar, a regular expression is just a jumble of garbage text. In fact, regular expressions are quite simple and easy to understand. After reading this article, you will know their general syntax.

Multi-platform support

Regular expressions were first proposed by the mathematician Stephen Kleene in 1956, building on the growing body of research into natural language. As a complete syntax for pattern-matching characters, they were later applied to information technology. Since then, regular expressions have gone through several stages of development, and the current standard has been ratified by ISO (the International Organization for Standardization) and is defined by The Open Group.

Regular expressions are not a language in their own right, but they can be used to find and replace text in a file or a string. The standard defines two flavors: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the functionality of BRE plus some additional concepts.

Regular expressions are used in many programs, including xsh, egrep, sed, vi, and other programs on UNIX platforms. They have also been adopted by many languages, such as HTML and XML, which usually implement only a subset of the full standard.

More common than you think

As programs that use regular expressions are ported from one platform to another, regular expression support is becoming more and more complete. Search engines on the web use them, e-mail programs use them, and even if you are not a UNIX programmer, you can use regular expressions to simplify your programs and shorten your development time.

Regular expressions 101

Much of regular expression syntax will look familiar even if you have never studied it before. Wildcards, for example, are one kind of RE construct, namely a repetition operator. Let's take a look at the most common basic syntax elements of the ERE standard. To give concrete examples, I will use several different programs.

Character matching

The key to a regular expression is deciding what you want to search for; without a clear idea of that, REs are useless.

Each expression contains instructions describing what you need to find, as shown in Table A.

Table A: Character-matching regular expressions

Operator  Explanation                                              Example                        Result
.         Match any one character                                  grep ".ord" sample.txt         Matches "ford", "lord", "2ord", etc. in the file sample.txt
[ ]       Match any one character listed between the brackets      grep "[cng]ord" sample.txt     Matches only "cord", "nord", and "gord"
[^ ]      Match any one character not listed between the brackets  grep "[^cn]ord" sample.txt     Matches "lord", "2ord", etc., but not "cord" or "nord"
                                                                   grep "[a-zA-Z]ord" sample.txt  Matches "aord", "bord", "Aord", "Bord", etc.
                                                                   grep "[^0-9]ord" sample.txt    Matches "Aord", "aord", etc., but not "2ord", etc.
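The character-matching operators in Table A can also be tried outside a shell. The following is a minimal sketch that assumes Python's `re` module as a stand-in for grep (for these operators its syntax agrees with ERE). The word list and the helper name `matching_words` are made up for illustration.

```python
import re

# Hypothetical contents of sample.txt, one word per line.
WORDS = ["ford", "lord", "2ord", "cord", "nord", "gord", "Aord", "aord"]

def matching_words(pattern, words=WORDS):
    """Return the words in which the pattern is found, much as grep prints matching lines."""
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]

print(matching_words(r".ord"))         # any one character before "ord"
print(matching_words(r"[cng]ord"))     # only c, n or g before "ord"
print(matching_words(r"[^cn]ord"))     # any character except c or n
print(matching_words(r"[a-zA-Z]ord"))  # any ASCII letter
print(matching_words(r"[^0-9]ord"))    # any character except a digit
```

With this word list, `[cng]ord` selects only "cord", "nord" and "gord", while `[^0-9]ord` selects everything except "2ord".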

Repetition operators

Repetition operators, or quantifiers, describe how many times a particular element should be matched. They are often combined with the character-matching syntax to find strings of several characters; see Table B.

Table B: Regular expression repetition operators

Operator  Explanation                                                      Example                          Result
?         Match the declared element zero or one time                      egrep ".?erd" sample.txt         Matches "berd", "herd", etc. and "erd"
*         Match the declared element zero or more times                    egrep "n.*rd" sample.txt         Matches "nerd", "nrd", "neard", etc.
+         Match the declared element one or more times                     egrep "[n]+erd" sample.txt       Matches "nerd", "nnerd", etc., but not "erd"
{n}       Match the declared element exactly n times                       egrep "[a-z]{2}erd" sample.txt   Matches "cherd", "blerd", etc., but not "nerd", "erd", "buzzerd", etc.
{n,}      Match the declared element at least n times                      egrep ".{2,}erd" sample.txt      Matches "cherd" and "buzzerd", but not "nerd"
{n,m}     Match the declared element at least n but not more than m times  egrep "n[e]{1,2}rd" sample.txt   Matches "nerd" and "neerd"

The "but not" results assume that the pattern has to cover the whole word, for example with one word per line and the -x option. An unanchored search would also find "[a-z]{2}erd" inside "buzzerd".
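The repetition operators in Table B are easiest to check when the pattern has to cover the whole word. The sketch below assumes Python's `re` module, whose quantifiers behave like those of ERE, and uses `re.fullmatch` to require a whole-word match. The helper `whole_word` and the `CASES` table are illustrative names, not part of any tool mentioned in this article.

```python
import re

def whole_word(pattern, word):
    """True if the pattern matches the entire word, not just a part of it."""
    return re.fullmatch(pattern, word) is not None

# pattern -> (words that should match, words that should not)
CASES = {
    r".?erd":       (["erd", "berd", "herd"], ["cherd"]),
    r"n.*rd":       (["nrd", "nerd", "neard"], ["erd"]),
    r"n+erd":       (["nerd", "nnerd"], ["erd"]),
    r"[a-z]{2}erd": (["cherd", "blerd"], ["erd", "nerd", "buzzerd"]),
    r".{2,}erd":    (["cherd", "buzzerd"], ["nerd"]),
    r"ne{1,2}rd":   (["nerd", "neerd"], ["nrd", "neeerd"]),
}

for pattern, (good, bad) in CASES.items():
    assert all(whole_word(pattern, w) for w in good), pattern
    assert not any(whole_word(pattern, w) for w in bad), pattern
    print(f"{pattern}: ok")
```

Note that an unanchored search is more permissive: `re.search(r"[a-z]{2}erd", "buzzerd")` succeeds because "zzerd" occurs inside the word.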

Anchors

Anchors specify where in the text a pattern has to match, as shown in Table C. They make it easy to pin down where a common string of characters occurs. For the examples I use the vi line editor command :s, which stands for substitute. The basic syntax of this command is:

:s/pattern_to_match/pattern_to_substitute/

Table C: Regular expression anchors

Operator  Explanation                              Example                    Result
^         Match at the beginning of a line         :s/^/blah/                 Inserts "blah" at the beginning of the line
$         Match at the end of a line               :s/$/blah/                 Inserts "blah" at the end of the line
\<        Match at the beginning of a word         :s/\</blah/                Inserts "blah" at the beginning of the word
                                                   egrep "\<blah" sample.txt  Matches words that begin with "blah"
\>        Match at the end of a word               :s/\>/blah/                Inserts "blah" at the end of the word
                                                   egrep "blah\>" sample.txt  Matches "soupblah", etc.
\b        Match at the beginning or end of a word  egrep "\bblah" sample.txt  Matches "blahcake", etc.
                                                   egrep "blah\b" sample.txt  Matches "countblah", etc.
\B        Match in the middle of a word            egrep "\Bblah" sample.txt  Matches "sublahper", etc.

The word anchors \<, \>, \b and \B are extensions offered by tools such as GNU grep and vi; they are not part of the POSIX ERE standard.
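Anchors match positions rather than characters, which is why substituting on an anchor inserts text instead of replacing it. The sketch below illustrates this with Python's `re` module, used here as an assumed stand-in for vi and egrep. Python supports `^`, `$`, `\b` and `\B` but not `\<` and `\>`, so the word-start and word-end cases are written with `\b`. The sample line and word list are made up.

```python
import re

line = "soup cake"

# ^ and $ match empty positions, so the substitution inserts text.
print(re.sub(r"^", "blah ", line))  # blah soup cake
print(re.sub(r"$", " blah", line))  # soup cake blah

words = ["blahcake", "countblah", "sublahper", "blah"]

# \b before the text anchors it to the start of a word,
# \b after the text anchors it to the end of a word.
starts = [w for w in words if re.search(r"\bblah", w)]
ends = [w for w in words if re.search(r"blah\b", w)]

# \B on both sides requires the text to sit strictly inside a word.
middle = [w for w in words if re.search(r"\Bblah\B", w)]

print(starts)  # ['blahcake', 'blah']
print(ends)    # ['countblah', 'blah']
print(middle)  # ['sublahper']
```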

Alternation

Another feature of REs is the alternation (or interval) operator. In effect it is an OR statement, written with the | symbol. The following statement returns the occurrences of "nerd" and "merd" in the file sample.txt:

egrep "(n|m)erd" sample.txt

Alternation is very powerful, especially when you are looking for different spellings of a word, but in this case you can get the same result with the following example:

egrep "[nm]erd" sample.txt

The real value of alternation shows when you combine it with the more advanced features of REs.
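To make the comparison concrete, here is a small sketch that assumes Python's `re` module in place of egrep. It checks that the alternation and the bracket expression select the same words, and then shows a case where only alternation works because the alternatives are longer than one character. The word list is invented for the example.

```python
import re

words = ["nerd", "merd", "herd", "center", "centre"]

# Single-character alternatives: both forms are equivalent.
alternation = [w for w in words if re.fullmatch(r"(n|m)erd", w)]
char_class = [w for w in words if re.fullmatch(r"[nm]erd", w)]
assert alternation == char_class == ["nerd", "merd"]

# Alternatives longer than one character have no bracket-expression equivalent.
spellings = [w for w in words if re.fullmatch(r"cent(er|re)", w)]

print(alternation)  # ['nerd', 'merd']
print(spellings)    # ['center', 'centre']
```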

Some reserved characters

The last important feature of REs is reserved characters (also known as special characters). For example, suppose you want to find the literal strings "ne*rd" and "ni*rd". The pattern "n[ei]*rd" matches "neeeeerd" and "nieieierd", but those are not the strings you are looking for. Because '*' (asterisk) is a reserved character, you must escape it with a backslash, that is: "n[ei]\*rd". Other reserved characters include:

^ (caret)

. (period)

[ (left bracket)

$ (dollar sign)

( (left parenthesis)

) (right parenthesis)

| (pipe)

* (asterisk)

+ (plus symbol)

? (question mark)

{ (left curly bracket, or left brace)

\ (backslash)
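The effect of escaping the asterisk can be checked directly. This is a minimal sketch that assumes Python's `re` module, where the backslash escape works the same way for `*`. The word list is illustrative, and `re.escape` is shown as one way to let the library add the backslashes for you.

```python
import re

words = ["ne*rd", "ni*rd", "neeeeerd", "nieieierd"]

# Unescaped: * is a quantifier applied to the bracket expression.
unescaped = [w for w in words if re.fullmatch(r"n[ei]*rd", w)]

# Escaped: \* stands for a literal asterisk.
escaped = [w for w in words if re.fullmatch(r"n[ei]\*rd", w)]

print(unescaped)  # ['neeeeerd', 'nieieierd']
print(escaped)    # ['ne*rd', 'ni*rd']

# re.escape backslash-escapes reserved characters in a literal string.
print(re.escape("ne*rd"))  # ne\*rd
```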

Once you start escaping the characters above, REs undoubtedly become very hard to read. For example, the eregi search pattern in the following PHP code is hard to read.

eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$", $sendto)

As you can see, the intent of the code is difficult to grasp. But if you leave out the escapes on the reserved characters, the pattern will often mean something quite different from what you intended.
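One way to make such a pattern easier to follow is to spread it over several lines with comments. The sketch below does this with the verbose mode of Python's `re` module, which is an assumption on my part, since the article's example is PHP. The pattern is my reading of the eregi example, with a `+` after each bracket expression, and `re.IGNORECASE` stands in for the case-insensitive matching of eregi. The name `ADDRESS` is hypothetical.

```python
import re

# Assumed reading of the eregi pattern, split up and commented.
# In verbose mode, whitespace outside brackets and text after # are ignored.
ADDRESS = re.compile(
    r"""
    ^ [_a-z0-9-]+          # first part of the name
    (\. [_a-z0-9-]+ )*     # optional further parts, each led by a literal dot
    @                      # the separator
    [a-z0-9-]+             # first label of the domain
    (\. [a-z0-9-]+ )*      # optional further labels, each led by a literal dot
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

for candidate in ["first.last@example.com", "USER@Example.COM", "no-at-sign.example.com"]:
    print(candidate, bool(ADDRESS.match(candidate)))
```

The escaped dots stay exactly as they were; only the layout changes, so the pattern matches the same strings as the one-line version.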

Summary

When reposting, please cite the original address: https://www.9cbs.com/read-59593.html
