Unveiling the mystery of the regular expression syntax

zhaozj, 2021-02-08

Regular expressions (REs) are often mistakenly regarded as a mysterious language that only a few people understand. On the surface they do look messy: if you don't know the grammar, the code looks like a pile of garbage text. In fact, regular expressions are quite simple and can be understood. After reading this article, you will know the general syntax of regular expressions.

Support multiple platforms

Regular expressions were first proposed in 1956 by the mathematician Stephen Kleene, as an outgrowth of his ongoing research on natural language. Because they provide a complete syntax for pattern-matching character strings, they were later applied to the emerging field of information technology. Since then, regular expressions have gone through several stages of development, and the current standard has been approved by ISO (the International Organization for Standardization) and defined by The Open Group.

Regular expressions are not a dedicated language, but they can be used to find and replace text in a file or string. There are two standards: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the functionality of BRE plus additional concepts.

Regular expressions are used in many programs, including xsh, egrep, sed, vi, and other programs on UNIX platforms. They have also been adopted by many languages, such as HTML and XML, which usually support only a subset of the full standard.

More common than you think

As regular expressions have been ported to more platforms, their functionality has become increasingly complete. Search engines on the web use them, and so do e-mail programs. Even if you are not a UNIX programmer, you can use regular expressions to simplify your programs and shorten your development time.

Regular expression 101

Much of the regular expression syntax will look familiar even if you have never studied it, because you have probably used something similar already: a wildcard is one kind of RE construct, namely a repetition operation. Let's take a look at the most common basic syntax of the ERE standard. To give examples with concrete uses, I will use several different programs.

Character match

The key to a regular expression is deciding what you want to search for; without a clear idea of that, REs are useless. Each expression contains instructions describing what to find, as shown in Table A.

Table A: Character-matching regular expressions

Operator  Explanation                                              Example                          Result
.         Match any one character                                  grep ".ord" sample.txt           Will match "ford", "lord", "2ord", etc. in the file sample.txt
[ ]       Match any one character listed between the brackets      grep "[cng]ord" sample.txt       Will match only "cord", "nord", and "gord"
[^ ]      Match any one character not listed between the brackets  grep "[^cn]ord" sample.txt       Will match "lord", "2ord", etc., but not "cord" or "nord"
                                                                   grep "[a-zA-Z]ord" sample.txt    Will match "aord", "bord", "Aord", "Bord", etc.
                                                                   grep "[^0-9]ord" sample.txt      Will match "Aord", "aord", etc., but not "2ord", etc.
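The character classes in Table A can also be tried outside grep. The following is a minimal sketch that uses Python's re module in place of grep; the word list is a hypothetical stand-in for sample.txt, and the helper name matching_words is an assumption made only for illustration.

```python
import re

# Hypothetical stand-in for the contents of sample.txt, one word per line.
WORDS = ["ford", "lord", "2ord", "cord", "nord", "gord", "Aord", "word"]

def matching_words(pattern, words=WORDS):
    """Return the words containing a match for pattern, as grep does per line."""
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]

any_char = matching_words(r".ord")        # any single character before "ord"
listed = matching_words(r"[cng]ord")      # only c, n, or g before "ord"
not_listed = matching_words(r"[^cn]ord")  # any character except c or n
letters = matching_words(r"[a-zA-Z]ord")  # any ASCII letter, either case
non_digit = matching_words(r"[^0-9]ord")  # any character except a digit

print(listed)      # ['cord', 'nord', 'gord']
print(not_listed)  # ['ford', 'lord', '2ord', 'gord', 'Aord', 'word']
```

Like grep, re.search looks for the pattern anywhere in the text, so each word only has to contain a match.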

Repetition operators

Repetition operators, or quantifiers, describe how many times to look for a particular character. They are often combined with the character-matching syntax to find strings of several characters; see Table B.

Table B: Regular expression repetition operators

Operator  Explanation                                                  Example                         Result
?         Match the preceding element zero or one time                 egrep ".?erd" sample.txt        Will match "berd", "herd", etc. and "erd"
*         Match declared element zero or more times                    egrep "n.*rd" sample.txt        Will match "nerd", "nrd", "neard", etc.
+         Match declared element one or more times                     egrep "[n]+erd" sample.txt      Will match "nerd", "nnerd", etc., but not "erd"
{n}       Match declared element exactly n times                       egrep "[a-z]{2}erd" sample.txt  Will match "cherd", "blerd", etc. but not "nerd", "erd", "buzzerd", etc.
{n,}      Match declared element at least n times                      egrep ".{2,}erd" sample.txt     Will match "cherd" and "buzzerd", but not "nerd"
{n,N}     Match declared element at least n but not more than N times  egrep "n[e]{1,2}rd" sample.txt  Will match "nerd" and "neerd"

The results above describe which whole words fit each pattern. Because egrep reports a line whenever the pattern matches anywhere in it, an unanchored "[a-z]{2}erd" would still find "buzzerd" through its substring "zzerd".
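The quantifiers in Table B can be checked word by word. This is a minimal sketch in Python rather than egrep; it assumes the intent is to test whether a whole word fits the pattern, so it uses re.fullmatch. The helper name whole_word and the word lists are hypothetical.

```python
import re

def whole_word(pattern, word):
    """True if the entire word matches pattern (grep would search inside the line)."""
    return re.fullmatch(pattern, word) is not None

def keep(pattern, words):
    """Filter words down to those that match pattern in full."""
    return [w for w in words if whole_word(pattern, w)]

optional = keep(r".?erd", ["erd", "herd", "cherd"])                   # zero or one
zero_or_more = keep(r"n.*rd", ["nrd", "nerd", "neard", "erd"])        # zero or more
one_or_more = keep(r"n+erd", ["erd", "nerd", "nnerd"])                # one or more
exactly_two = keep(r"[a-z]{2}erd", ["erd", "nerd", "cherd", "buzzerd"])
at_least_two = keep(r".{2,}erd", ["nerd", "cherd", "buzzerd"])
one_to_two = keep(r"ne{1,2}rd", ["nrd", "nerd", "neerd", "neeerd"])

print(exactly_two)   # ['cherd']
print(at_least_two)  # ['cherd', 'buzzerd']

# An unanchored search behaves like grep and finds "zzerd" inside "buzzerd".
found_inside = re.search(r"[a-z]{2}erd", "buzzerd") is not None
print(found_inside)  # True
```

The last line shows why the difference matters: a substring search and a whole-word match can disagree for the {n} form.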

Anchors

Anchors describe where the pattern has to match, as shown in Table C. They make it easy to find common combinations of characters. For example, I use the vi line-editor command :s, which stands for substitute. The basic syntax of this command is:

s/pattern_to_match/pattern_to_substitute/

Table C: Regular expression anchors

Operator  Explanation                              Example                    Result
^         Match at the beginning of a line         s/^/blah/                  Inserts "blah" at the beginning of the line
$         Match at the end of a line               s/$/blah/                  Inserts "blah" at the end of the line
\<        Match at the beginning of a word         s/\</blah/                 Inserts "blah" at the beginning of the word
                                                   egrep "\<blah" sample.txt  Matches "blahfield", etc.
\>        Match at the end of a word               s/\>/blah/                 Inserts "blah" at the end of the word
                                                   egrep "blah\>" sample.txt  Matches "soupblah", etc.
\b        Match at the beginning or end of a word  egrep "\bblah" sample.txt  Matches "blahcake"
                                                   egrep "blah\b" sample.txt  Matches "countblah"
\B        Match in the middle of a word            egrep "\Bblah" sample.txt  Matches "sublahper", etc.
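The anchors in Table C behave the same way in most regex engines. The sketch below uses Python's re module as an illustration; Python does not support the separate word-start and word-end anchors, so only ^, $, \b, and \B are shown. The sample line and word list are hypothetical.

```python
import re

line = "soup field"

# Substituting at an anchor inserts text, like the vi :s command.
at_line_start = re.sub(r"^", "blah", line)
at_line_end = re.sub(r"$", "blah", line)
print(at_line_start)  # blahsoup field
print(at_line_end)    # soup fieldblah

words = ["blahcake", "countblah", "sublahper"]

# \b is a word boundary: before "blah" it requires a word start,
# after "blah" it requires a word end.
starts_with = [w for w in words if re.search(r"\bblah", w)]
ends_with = [w for w in words if re.search(r"blah\b", w)]

# \B is the opposite: "blah" must have word characters on both sides.
in_middle = [w for w in words if re.search(r"\Bblah\B", w)]

print(starts_with)  # ['blahcake']
print(ends_with)    # ['countblah']
print(in_middle)    # ['sublahper']
```

Where the anchor sits relative to "blah" decides which side of the word is constrained.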

Alternation

Another feature of REs is the alternation (or interleave) symbol. In effect, it is equivalent to an OR statement and is written with the | symbol. The following statement returns the lines containing "nerd" and "merd" in the file sample.txt:

egrep "(n|m)erd" sample.txt

Alternation is very powerful, especially when you are looking for different spellings of a word, but in this case you can get the same result with the following example:

egrep "[nm]erd" sample.txt

Alternation shows its real value when you combine it with the more advanced features of REs.
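To compare the two forms side by side, here is a minimal sketch in Python rather than egrep. The word list and the colour/color spelling example are hypothetical additions for illustration; they show a case where the alternatives differ in length, which a bracket expression cannot express.

```python
import re

words = ["nerd", "merd", "herd", "colour", "color"]

# Single-character alternatives: alternation and a bracket expression agree.
with_alternation = [w for w in words if re.search(r"(n|m)erd", w)]
with_class = [w for w in words if re.search(r"[nm]erd", w)]
print(with_alternation)  # ['nerd', 'merd']
print(with_class)        # ['nerd', 'merd']

# Alternatives of different lengths need alternation.
spellings = [w for w in words if re.fullmatch(r"col(or|our)", w)]
print(spellings)         # ['colour', 'color']
```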

Reserved characters

The last important feature of REs is reserved characters (also known as special characters). For example, if you want to find the strings "ne*rd" and "ni*rd", the pattern "n[ei]*rd" matches "neeeeerd" and "nieieierd", but those are not the strings you are looking for. Because '*' (asterisk) is a reserved character, you must escape it with a backslash, namely: "n[ei]\*rd". Other reserved characters include:

^ (caret)

. (period)

[ (left bracket)

$ (dollar sign)

( (left parenthesis)

) (right parenthesis)

| (pipe)

* (asterisk)

+ (plus symbol)

? (question mark)

{ (left curly bracket, or left brace)

\ (backslash)
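The effect of escaping the asterisk can be checked directly. This is a minimal sketch in Python, with a hypothetical list of test strings; re.escape is shown as one way to escape reserved characters automatically when the text to find is known in advance.

```python
import re

texts = ["ne*rd", "ni*rd", "neeeeerd", "nieieierd"]

# Unescaped: * means "zero or more of e or i", so the literal asterisk never matches.
unescaped = [t for t in texts if re.fullmatch(r"n[ei]*rd", t)]

# Escaped: \* stands for a literal asterisk character.
escaped = [t for t in texts if re.fullmatch(r"n[ei]\*rd", t)]

print(unescaped)  # ['neeeeerd', 'nieieierd']
print(escaped)    # ['ne*rd', 'ni*rd']

# re.escape adds the backslashes for a literal string.
auto = re.escape("ne*rd")
matches_literal = re.fullmatch(auto, "ne*rd") is not None
print(matches_literal)  # True
```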

Once you start escaping the characters above, REs undoubtedly become very hard to read. For example, the following PHP eregi call is hard to read:

eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$", $sendto)

As you can see, the intent of the code is difficult to grasp. But if you leave out the escape characters, the code will often be interpreted in a way you did not intend.
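One way to make such a pattern easier to follow is to spread it over several commented lines. The sketch below is written in Python, not PHP, and assumes the pattern is meant to read ^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$ with + quantifiers. The function name looks_like_address is hypothetical, and re.IGNORECASE stands in for eregi's case-insensitive matching.

```python
import re

EMAIL = re.compile(
    r"""
    ^[_a-z0-9-]+        # first piece of the local part
    (\.[_a-z0-9-]+)*    # further pieces, each led by a literal dot
    @                   # the at sign
    [a-z0-9-]+          # first domain label
    (\.[a-z0-9-]+)*     # further labels, each led by a literal dot
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

def looks_like_address(sendto):
    """True if sendto fits the pattern; this is a shape check, not full validation."""
    return EMAIL.match(sendto) is not None

print(looks_like_address("first.last@example.com"))  # True
print(looks_like_address("no-at-sign.example.com"))  # False
```

The escaped \. is what keeps the dots literal; an unescaped dot would accept any character in that position.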

Summary

In this article, we unveiled the mystery of regular expressions and listed the general syntax of the ERE standard. If you want to read The Open Group's full description of the rules, see: Regular Expressions. You are welcome to post your questions or views in the discussion area.

When reposting, please cite the original address: https://www.9cbs.com/read-668.html
