Getting started 4

xiaoxiao2021-03-06  19

Core point 3 is the expression rs, that is, concatenation. Concatenation looks trivial: write single characters one after another and you get a string of characters. In Java there are String and CharSequence; both represent a sequence of characters, but they are not the same thing: String is a class, CharSequence is an interface. But we are not here to discuss the Java API.

Keep the following in mind when using concatenation:

1. Concatenation has higher precedence than only one metacharacter, the alternation operator |.

boy is a regex: it means b, followed by o, followed by y. When metacharacters are included, as in b[ao]y, it means b, followed by either a or o, followed by y.

2. A bracket expression (unlike a group such as (ba|oy)) matches only a single character, for example:

a(sd|f)g matches asdg and afg in iasdgbbasgbbafgbb, whereas a[sd|f]g matches asg, afg, a|g and adg in asdgoaaasgooafgoa|gooadg, because inside brackets | is an ordinary character. [There is still a lot to learn.]
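The difference between a group and a bracket expression can be checked with a short Java sketch. The class name GroupVsBracketDemo and the helper findAll are made up for illustration; only the standard java.util.regex package is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupVsBracketDemo {
    // Collects every match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Group: "sd" or "f" between a and g.
        System.out.println(findAll("a(sd|f)g", "iasdgbbasgbbafgbb"));        // [asdg, afg]
        // Bracket: exactly one of s, d, | or f between a and g.
        System.out.println(findAll("a[sd|f]g", "asdgoaaasgooafgoa|gooadg")); // [asg, afg, a|g, adg]
    }
}
```

Note that asg is found only by the bracket form and asdg only by the group form.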

3. About the dot (.)

A typical example is matching dates, where the dot should be used with caution. The usual date format is YYYY-MM-DD, and YYYY.MM.DD also occurs. If you use \d\d\d\d.\d\d.\d\d, it does match whatever separator the user chose, but it also matches 2005a02b02 and 9876543210, which is not what we want. Replacing the dot with [- /.] is better: it allows a dash, space, dot or forward slash as the separator. It is still not perfect, because it matches 3005/13/50 and 0000/00/00, and it does not match a date written as 05/2/2. How a regex should be built depends on the goal. If we force users to use the YYYY/MM/DD format, we can use:

(19|20)\d\d/(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])
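The three date patterns discussed above can be compared side by side. This is only a sketch: the class name DateRegexDemo and the constant names are invented for illustration, and whole-string matching via Matcher.matches() is my choice here, not something the text prescribes.

```java
import java.util.regex.Pattern;

class DateRegexDemo {
    // Dot as separator: the dot matches any character, so this is far too loose.
    static final Pattern DOT = Pattern.compile("\\d\\d\\d\\d.\\d\\d.\\d\\d");
    // Separator restricted to dash, space, slash or dot.
    static final Pattern LOOSE = Pattern.compile("\\d\\d\\d\\d[- /.]\\d\\d[- /.]\\d\\d");
    // Strict YYYY/MM/DD form from the text: years 1900-2099, months 01-12, days 01-31.
    static final Pattern STRICT =
            Pattern.compile("(19|20)\\d\\d/(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])");

    // True if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(DOT, "2005a02b02"));    // true  (unwanted)
        System.out.println(matches(DOT, "9876543210"));    // true  (unwanted)
        System.out.println(matches(LOOSE, "2005a02b02"));  // false
        System.out.println(matches(LOOSE, "3005/13/50"));  // true  (still too loose)
        System.out.println(matches(LOOSE, "05/2/2"));      // false
        System.out.println(matches(STRICT, "3005/13/50")); // false
        System.out.println(matches(STRICT, "2005/02/02")); // true
    }
}
```

Even the strict pattern only checks the shape of the date: it still accepts 2005/02/31, so calendar validity has to be checked separately.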

4. Other forms of concatenation:

As we know, a{3} is shorthand for aaa, and a{1,3} is shorthand for a|aa|aaa. These forms are usually discussed together with the closure operation.

§6 Closure

Core point 4 is the expression r*. We first review the various ways of writing it, and then look at the rules the regular expression engine follows.

L(r*) = {ε, r, rr, ...} is an infinite set: r* matches any finite concatenation of strings matched by r.

In Java:

r? is the regular expression for {ε, r};

r+ is the regular expression for L(r*) - {ε};

r{3,} is the regular expression for L(r*) - {ε, r, rr};

Closure brings several important concepts with it: greediness, lookahead and lookbehind. Let's start with a few examples:

Regex | String | Result (each match replaced with ⊙)
a+ | saaaaasgaaafga | s⊙sg⊙fg⊙
ab?\w | abc aaabc gabbbf gbbaaaag |
a? | far [there are four matches] | ⊙f⊙⊙r⊙
[ab] | back about bar bbb aaa bac |
[ab][rc] | back about bar bbb aaa bac |
[ab]* | back about bar bbb aaa bac |
[0-9]+ | 123456654321999ok | ⊙ok
[3-6]+ | 111333555888 |
([3-9])\1 | 12355555551999ok |
[ab]{3,} | abc aaabc gabbbf gbbaaag |

Explanation: using a? on its own is just as troublesome as a*, because both also match the empty string.

([3-6])\1 [A regex typed into a regex exerciser is not written the same way as in Java source code, where each backslash must be doubled!]
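A few rows of the kind shown above can be reproduced with String.replaceAll. This is a sketch under assumptions: the class name ReplaceDemo and the helper mark are invented, and the input 123345599 for the backreference is my own sample. The marker is written as a Unicode escape so the source stays plain ASCII.

```java
class ReplaceDemo {
    // U+2299, the circled-dot marker used to show where a match was replaced.
    static final String MARK = "\u2299";

    // Replaces every match of regex in text with the marker.
    static String mark(String regex, String text) {
        return text.replaceAll(regex, MARK);
    }

    public static void main(String[] args) {
        System.out.println(mark("a+", "saaaaasgaaafga"));        // three runs of a replaced
        System.out.println(mark("[0-9]+", "123456654321999ok")); // the digit run replaced
        // a? also matches the empty string, so "far" yields four matches.
        System.out.println(mark("a?", "far"));
        // Backreference: in Java source the regex ([3-9])\1 needs a doubled backslash.
        System.out.println(mark("([3-9])\\1", "123345599"));     // 33, 55 and 99 replaced
    }
}
```

The a? case shows why the explanation calls it troublesome: the empty matches before f, before r and at the end of the text are all replaced as well.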

§7 How the regex engine works

A regex engine is software that processes regular expressions and tries to match a given string against the pattern. We normally do not call the engine directly but use it through an API. Engines differ between languages and development environments; the Perl 5 flavor is the basis for most of them and is also the most widely used. Java's regex flavor differs from Perl 5's in some details, but the mechanism is the same.

1. Two ways to match:

There are two kinds of regex engine: text-directed engines and regex-directed engines.

2. The regex engine is impatient: it always returns the leftmost match

It is important to remember that a regex-directed engine always returns the leftmost match. I have been careful to say only what a regex can match, because with the Java methods we use, the engine always starts matching from the head of the character sequence, and as soon as it finds a match it rushes to report: "I found a match." It looks no further unless you ask it again.

For the text good and god and the regex go{1,2}d, the engine returns the leftmost match, good. Here is the process in detail.

The engine starts at the head of the character sequence. (1) g obviously matches No. 0, g. (2) For o{1,2}, the engine first works out the meaning, one or two o's, so it matches No. 1, o, and then finds that No. 2, o, also matches, which completes o{1,2}. (3) Now the engine tries d, and No. 3 is d. The engine has finished its work and does not go on matching.

What if we ask it to continue? It treats the road already travelled as done, so the new start is the old No. 4, the space, and the remaining text is " and god", renumbered from 0. (1) g does not match the space. (2) g does not match a, n, d or the second space either. (3) g matches the new No. 5, g. The engine again applies the meaning of o{1,2}: it matches No. 6, o; when it reaches d it knows that o{1,2} is finished, and it finds that d matches No. 7. The engine has finished its work.
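The walk-through above corresponds to calling Matcher.find() repeatedly: each call asks the engine to continue from where the previous match ended. A small sketch, with a made-up class name LeftmostDemo and helper spans:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostDemo {
    // Returns each match as "start-end:text", in the order the engine reports them.
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) { // every call resumes after the previous match
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Leftmost match first, then the engine is asked to continue.
        System.out.println(spans("go{1,2}d", "good and god")); // [0-4:good, 9-12:god]
    }
}
```

Unlike the renumbering used in the explanation, Matcher reports positions relative to the original text, so the second match starts at index 9.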

3. Why do we say the regex engine is impatient?

Suppose we want to match the method name setName and the regex is get|set|setName. What match does it return? Just set. Let's see how the engine works:

(1) The engine first studies the regex and works out the possible alternatives. Then it starts matching: it tries g against s and fails, so it knows get has failed, but it still has other alternatives. (2) It tries s against s, which succeeds, and goes on until set has matched set. (3) It reports the match. Having matched set, it does not go on to look for a "better" alternative. Clearly we did not find the setName we hoped for. How do we fix this? The easiest way is to change the regex to get|setName|set. By the same process it now matches setName, which is what we wanted.
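The effect of the order of alternatives is easy to confirm. A minimal sketch; the class name AlternationOrderDemo and the helper first are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Returns the first (leftmost) match, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Alternatives are tried left to right; the first one that succeeds wins.
        System.out.println(first("get|set|setName", "setName")); // set
        System.out.println(first("get|setName|set", "setName")); // setName
    }
}
```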

4. Why do we say the regex engine is a little smart?

It is a little smart not only because it studies the regex before it starts matching, but even more because it has a little memory. Suppose the regex get|setName|set has to match setGame. Skipping the preliminaries: when N is tried against G, it fails. The engine then goes back to No. 0, s, and tries the next alternative, set; it remembers that the attempt started at No. 0. Now think: what is the result of matching setGamesetName, set or setName?
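The question above can be answered by running the regex. The sketch below lists every match in order; the class name AlternativeMemoryDemo and the helper findAll are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativeMemoryDemo {
    // Collects every match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // setName fails at G, so the engine returns to index 0 and tries set.
        System.out.println(findAll("get|setName|set", "setGame"));        // [set]
        // The leftmost match is reported first, so set comes before setName.
        System.out.println(findAll("get|setName|set", "setGamesetName")); // [set, setName]
    }
}
```

So the first answer the engine gives for setGamesetName is set; setName appears only when it is asked to continue.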

5. Greedy matching:

get|setName|set can be rewritten as get|set(Name)?, which also matches setName; the two regexes are equivalent. With ba?, the engine first tries to match ba and only then settles for b. This is how a greedy quantifier behaves: greedy, or maximal, matching.

Note: the JDK documentation also lists X{n} and X{n,m} under greedy quantifiers. X{n} is just a fixed concatenation and has nothing to do with maximal matching; X{n,m}, however, is greedy and tries as many repetitions as possible, up to m, first.

X?, X*, X+ and X{n,} are maximal matches.

Maximal matching often causes trouble. Matching the text I love "java" and "C++" with ".*" returns "java" and "C++" as a single match, when we may have hoped for just "java". The usual fix is a negated character class, such as "[^"\r\n]*", which matches "java" as we hoped.
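The quoted-string example can be reproduced as follows. A sketch only: the class name GreedyQuoteDemo and the helper first are hypothetical, and the quotes and backslashes are escaped as Java string literals require.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyQuoteDemo {
    static final String TEXT = "I love \"java\" and \"C++\"";

    // Returns the first (leftmost) match, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy .* runs on to the last quote in the text.
        System.out.println(first("\".*\"", TEXT));            // "java" and "C++"
        // The negated class cannot cross a quote, so the match stops at the next one.
        System.out.println(first("\"[^\"\\r\\n]*\"", TEXT));  // "java"
    }
}
```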

How do we check that a Java identifier is valid (ignoring keywords and Unicode)? We can write [a-zA-Z_$][0-9a-zA-Z_$]*, or equivalently [a-zA-Z_$][\w$]*.
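Both spellings of the pattern can be compared on a few inputs. This is a minimal sketch; the class name IdentifierDemo and its method names are made up, and, as in the text, keywords and Unicode letters are deliberately ignored.

```java
import java.util.regex.Pattern;

class IdentifierDemo {
    // Explicit character ranges.
    static final Pattern EXPLICIT = Pattern.compile("[a-zA-Z_$][0-9a-zA-Z_$]*");
    // Same set written with \w, which by default means [a-zA-Z_0-9].
    static final Pattern WITH_W = Pattern.compile("[a-zA-Z_$][\\w$]*");

    static boolean isIdentifier(String s) {
        return EXPLICIT.matcher(s).matches();
    }

    static boolean isIdentifierW(String s) {
        return WITH_W.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"setName", "_x1", "$tmp", "9lives", "a-b", ""}) {
            // Both patterns give the same answer for every input.
            System.out.println("'" + s + "' " + isIdentifier(s) + " " + isIdentifierW(s));
        }
    }
}
```

The first three samples are accepted and the last three rejected by both patterns.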

6. Backtracking (going back the way it came)

Let's look at how <.+> matches the HTML tags in a <tr> java </tr> cb. Following the process discussed earlier: (1) < matches the first <. (2) .+ is a maximal match, so it runs past both > characters and keeps matching until it has consumed the last b of the text, after which it can match nothing more. (3) Now > tries to match, but the maximal match before it has already reached the last b of the text, so > fails. (4) The engine understands that .+ matched too much. It does not report failure, though; it goes back one character the way it came: it lets .+ match tr> java </tr> c and tries > against b, which fails again; it goes back one more character and tries > against c, which fails; and so on, until finally > is matched against >, and the job is done. The engine is impatient and does not want to go back any further. Now work out for yourself how <.+>aa matches a <tr>aava</tr> abb; I have talked my mouth dry and will not spell it out. And matching a <tr>aava</tr> abb with <.+>aa.+a is more painful still.

String | Regex
a <tr>aava</tr> abb | <.+>aa
a <tr>aava</tr> abb | <.+>aa.+a
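The result of backtracking can be observed directly, even though the intermediate steps cannot. The sketch below is built on assumptions: the HTML tags of the sample texts did not survive in this copy, so the two texts with tr tags used here are my own stand-ins, and the class name BacktrackDemo and helper first are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackDemo {
    // Assumed sample texts; the tr tags are a stand-in for the lost originals.
    static final String T1 = "a <tr> java </tr> cb";
    static final String T2 = "a <tr>aava</tr> abb";

    // Returns the first (leftmost) match, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // .+ first consumes everything, then gives back characters until > fits.
        System.out.println(first("<.+>", T1));      // <tr> java </tr>
        // Here the last > is not followed by aa, so .+ backs off to the first >.
        System.out.println(first("<.+>aa", T2));    // <tr>aa
        // The second .+ also overshoots and backs off to the last a.
        System.out.println(first("<.+>aa.+a", T2)); // <tr>aava</tr> a
    }
}
```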

7. Minimal matching

If you want to match each HTML tag separately, <.+> obviously will not do. So we need non-greedy matching.

X?, X*, X+ and X{n,} are maximal matches. Add one more ? and they become lazy matches: X??, X*?, X+? and X{n,}? all match minimally. X{n,m}? exists as well, while X{n}? is redundant, because X{n} matches exactly n times either way.

String | Regex
a <tr>aava</tr> abb | <.+?>
a <tr>aava</tr> abb | <.+>a?
a <tr>aava</tr> abb | <.*?>
a <tr>aava</tr> abb | a{1,2}?
a <tr>aava</tr> abb | a{2}?

Minimal matching means that after .+? has matched one character, the engine immediately tries the rest of the pattern, here >; if that fails, .+? matches one more character and the engine tries > again, and so on until it matches. The JDK documentation uses greedy and reluctant, a metaphor of eating, so translating them as greedy and (reluctant) anorexic would be the most apt. But I prefer to speak of maximal and minimal matching.
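Greedy and reluctant quantifiers can be compared on the same input. As before this is only a sketch: the sample text with tr tags is an assumed stand-in because the original tags were lost, and the class name LazyDemo and helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDemo {
    // Assumed sample text; the tr tags are a stand-in for the lost original.
    static final String TEXT = "a <tr>aava</tr> abb";

    // Collects every match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("<.+>", TEXT));    // [<tr>aava</tr>]
        System.out.println(findAll("<.+?>", TEXT));   // [<tr>, </tr>]
        System.out.println(findAll("a{1,2}", TEXT));  // [a, aa, a, a]
        System.out.println(findAll("a{1,2}?", TEXT)); // [a, a, a, a, a]
    }
}
```

The reluctant form stops at the first > it can reach, so each tag is returned separately.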

8. Possessive matching

Besides maximal matching there is another form of matching: X?+, X*+, X++, X{n,}+ and so on, called possessive matching. Like maximal matching, it always consumes as many characters as it can, up to the end of the text if possible, but it never goes back the way it came. In other words, it either matches in one go or not at all. Simple, and I like it.

String | Regex
a <tr>aava</tr> abb [no match found] | a.++b
[an empty line also matches, as ε] | .*+
aabb [matches aabb and ε] | .*+
a <tr>aava</tr> abb | a{1,2}?
a <tr>aava</tr> abb | a{2}?

Use a possessive quantifier for situations where you want to seize all of something without ever backing off; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
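The difference between a greedy and a possessive quantifier shows up as soon as something has to follow the quantified part. A minimal sketch, assuming a sample text with tr tags of my own choosing (the original tags were lost); the class name PossessiveDemo and its helpers are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Assumed sample text; the tr tags are a stand-in for the lost original.
    static final String TEXT = "a <tr>aava</tr> abb";

    // True if regex matches anywhere in text.
    static boolean hasMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Collects every match of regex in text, including empty matches.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Greedy .+ gives back the final b, so the match succeeds.
        System.out.println(hasMatch("a.+b", TEXT));  // true
        // Possessive .++ keeps everything, leaving nothing for b.
        System.out.println(hasMatch("a.++b", TEXT)); // false
        // .*+ matches the whole text and then the empty string at the end.
        System.out.println(findAll(".*+", "aabb").size()); // 2
        // On an empty input it matches the empty string once.
        System.out.println(findAll(".*+", "").size());     // 1
    }
}
```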

Please credit the original address when reposting: https://www.9cbs.com/read-47368.html
