Advanced features of regular expressions

zhaozj, 2021-02-16

Non-greedy

Regular-expression quantifiers are greedy. That is, if several matches are possible in a string (all of them starting at the same position), the regular expression picks the longest one. Because "." can match both "<" and ">", ".+" does not stop at the first ">" but carries on to a later one. So here you need a non-greedy quantifier [1]. Non-greedy means that the regular expression selects the shortest match instead.

>>> import re
>>> txt = "<tag>hello world</tag>"  # the tag name is illustrative
>>> pat = r'<.+?>'
>>> re.match(pat, txt).group()
'<tag>'

Let's take another example:

>>> txt = "abcdddd"
>>> code = """m = re.match(pat, txt)
... for s in m.groups(): print(s)"""
>>> pat = r"(a\w+)"
>>> exec(code)
abcdddd
>>> pat = r"(a\w+?)"
>>> exec(code)
ab
>>> pat = r"(a\w*?)"
>>> exec(code)
a

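With non-greedy quantifiers, "a\w*?" ends up equivalent to "a", and "a\w+?" is no different from "a\w", so a non-greedy quantifier at the very end of a pattern adds nothing. Its main purpose is to match text enclosed by a pair of delimiters, which is why the pattern in the first example should be "<.+?>".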
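As a minimal sketch of the difference, the snippet below runs a greedy and a non-greedy pattern over the same string. The sample string and the variable names are illustrative assumptions, not taken from the original example.

```python
import re

# Illustrative input; the tag names are an assumption made for this sketch.
txt = "<b>hello</b> <i>world</i>"

greedy = re.match(r"<.+>", txt).group()   # runs on to the last ">"
lazy = re.match(r"<.+?>", txt).group()    # stops at the first ">"

print(greedy)
print(lazy)
```

The greedy pattern swallows the whole string, while the non-greedy one returns only the first tag.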

Single-line and multi-line

Single-line and multi-line modes are both about "\n". How "\n" is handled also affects how "^" and "$" match. In short, in single-line mode "." can match "\n":

>>> match = lambda pat, str: re.match(pat, str).group()
>>> match(".", "\n")
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'group'
>>> match("(?s).", "\n")
'\n'
>>> match("(?m).", "\n")
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'group'
>>> search = lambda pat, str: re.search(pat, str).group()
>>> txt = """
... 123abc
... def456
... """
>>> search(r"(?s)\w$", txt)
'6'
>>> pat = "(?s)^.+$"
>>> search(pat, txt)
'\n123abc\ndef456\n'
>>> search(r"(?m)\w$", txt)
'c'
>>> search('(?m)^.+$', txt)
'123abc'

The main function of multi-line mode is to let "\n" mark the beginning and end of a line. By default, "\n" is treated as an ordinary character, one that "\s" can match.

>>> search(r"^[a-zA-Z].*$", "Hello World!\n")
'Hello World!'
>>> search(r"^[a-zA-Z].*$", "\nHello World!\n")
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'group'
>>> search(r"^\s*[a-zA-Z].*$", "\nHello World!\n")
'\nHello World!'
>>> search(r"(?m)^[a-zA-Z].*$", "\nHello World!\n")
'Hello World!'

In short, multi-line and single-line modes both concern "\n": in multi-line mode, "\n" counts as a line boundary where "^" and "$" can match; in single-line mode, "." can match "\n". Each one changes how the regular expression works, but multi-line and single-line are not an either-or pair, and neither is on by default.

>>> match('.', '\n')
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'group'
>>> match('(?s).', '\n')
'\n'
>>> search('(?m)^.+$', '\nHello\nWorld\n')
'Hello'
>>> search('^.+$', '\nHello\nWorld\n')
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'group'

The first pair proves that a regular expression is not in single-line mode by default; the second pair proves that it is not in multi-line mode by default.
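The same modes can also be switched on with the re module's flags (re.S and re.M) instead of the inline "(?s)" and "(?m)" markers. The sketch below uses an illustrative string and a pattern that only matches when both modes are active, which is one way to see that the two are independent switches and can be combined.

```python
import re

txt = "a\nb\nc\nd"   # illustrative sample
pat = r"^b.c$"       # needs "^"/"$" at line boundaries AND "." matching "\n"

results = {
    "default": re.search(pat, txt),
    "re.M": re.search(pat, txt, re.M),
    "re.S": re.search(pat, txt, re.S),
    "re.M | re.S": re.search(pat, txt, re.M | re.S),
}

for name, m in results.items():
    print(name, "->", repr(m.group()) if m else None)
```

Only the last combination finds a match, because "^" has to anchor after a newline and "." has to cross one.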

Backreference

Once parentheses have captured part of a regex, you can use \i (where i is a number) later in the pattern to refer back to what that group matched.

>>> txt = "<tag>Hello World</tag>"
>>> pat = r"<(.+)>(.+)</\1>"
>>> re.match(pat, txt)
<re.Match object; span=(0, 22), match='<tag>Hello World</tag>'>
>>> m = re.match(pat, txt)
>>> code = """for s in m.groups(): print(s)"""
>>> exec(code)
tag
Hello World

As another example, to capture the same letter appearing three times in a row:

>>> pat = r'([a-zA-Z])\1\1'

Note that when using backreferences you must write the pattern as a raw string; otherwise the backslash will be consumed as a string escape.
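To make the raw-string point concrete, here is a small sketch that compiles the same pattern with and without the r prefix. The sample text and variable names are assumptions made for illustration.

```python
import re

raw_pat = r"([a-zA-Z])\1\1"    # "\1" reaches the regex engine as a backreference
plain_pat = "([a-zA-Z])\1\1"   # Python turns "\1" into the character chr(1) first

txt = "baaad cccp"  # illustrative sample

raw_hits = [m.group() for m in re.finditer(raw_pat, txt)]
plain_hit = re.search(plain_pat, txt)

print(raw_hits)
print(plain_hit)
```

Without the r prefix the pattern looks for a letter followed by two control characters, so it finds nothing in ordinary text.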

If an expression contains several groups, backreferences can become very confusing. You can use (?:...) to exclude a group from the numbering that backreferences use.

>>> txt = "<h1>Title</h1>"
>>> pat = r'<(.+?)>(?:.+?)</\1>'
>>> m = re.match(pat, txt)
>>> m.groups()
('h1',)
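The effect of (?:...) on group numbering can be seen by comparing two otherwise identical patterns. This is only a sketch; the date string and the variable names are illustrative assumptions.

```python
import re

txt = "2021-02-16"  # illustrative sample

# Every parenthesised part captures: three numbered groups.
all_groups = re.match(r"(\d+)-(\d+)-(\d+)", txt).groups()

# The middle part is grouped but not captured, so the last part becomes group 2.
some_groups = re.match(r"(\d+)-(?:\d+)-(\d+)", txt).groups()

print(all_groups)
print(some_groups)
```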

If the expression is really complicated, you can also name the groups. The syntax for named groups is a little more involved: define a group with "(?P<name>...)", refer back to it inside the pattern with "(?P=name)", and use "\g<name>" in a replacement string. In addition, after a successful match you can look up the matched substrings by group name.

>>> txt = "<h1>title</h1>"
>>> pat = r"<(?P<tag>.+)>(?P<content>.+)</(?P=tag)>"  # the group name "content" is assumed
>>> m = re.match(pat, txt)
>>> print(m.group())
<h1>title</h1>
>>> print(m.group("tag"))
h1
>>> re.sub(pat, r'\g<content>', txt)
'title'

Assertion

Python's re has two kinds of lookaround assertion: lookahead assertions and lookbehind assertions. "Lookahead" means that the assertion serves the first half of the regular expression, so what it asserts is the text that comes after. Lookbehind works the same way in the other direction [1].

>>> def re_show(pat, s):
...     print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()), '\n')
...
>>> txt = "Micheal Jordan and Micheal Jackson"
>>> pat = "(Micheal)(?= Jordan)"
>>> re_show(pat, txt)
{Micheal} Jordan and Micheal Jackson
>>> pat = r"(Micheal)(?i:(?=\s+jordan))"
>>> re_show(pat, txt)
{Micheal} Jordan and Micheal Jackson

As this example shows, a lookahead assertion takes no part in the match itself, nor in the match of any group. In practice it is mainly used for replacement.

>>> txt = "Micheal Jordan and Micheal Jackson"
>>> pat = "(Micheal)(?= Jackson)"
>>> re.sub(pat, "Phil", txt)
'Micheal Jordan and Phil Jackson'
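That the lookahead is checked but not consumed can be confirmed from the span of the match. The following is a small sketch; the variable names are illustrative.

```python
import re

txt = "Micheal Jordan and Micheal Jackson"
pat = r"Micheal(?= Jackson)"

m = re.search(pat, txt)
matched = m.group()   # only "Micheal"; " Jackson" is asserted, not consumed
span = m.span()       # covers the second "Micheal" and nothing after it
replaced = re.sub(pat, "Phil", txt)

print(matched, span)
print(replaced)
```

Because the asserted text stays outside the match, re.sub replaces the first name and leaves " Jackson" untouched.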

Lookahead assertions come in positive and negative forms. The examples so far are positive; a negative lookahead is very similar, except that the symbol becomes "?!".

>>> txt = "Bill Clinton, Bill Joy, Bill Gates"
>>> pat = "Bill(?! Joy)"
>>> re_show(pat, txt)
{Bill} Clinton, Bill Joy, {Bill} Gates
>>> re.sub(pat, "William", txt)
'William Clinton, Bill Joy, William Gates'

A lookbehind assertion is very similar to a lookahead assertion, but it has one major restriction: its pattern must have a fixed length, so it cannot contain a variable-length quantifier. Its positive symbol is "?<="; its negative symbol is "?<!".

>>> txt = "cpython, jpython, and python.net"
>>> pat_positive = "(?<=j)python"
>>> pat_negative = "(?<!j)python"
>>> re_show(pat_positive, txt)
cpython, j{python}, and python.net
>>> re_show(pat_negative, txt)
c{python}, jpython, and {python}.net
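The fixed-length restriction on lookbehind can be checked directly. In this sketch, the third pattern is a deliberately invalid example made up for illustration; Python's re rejects it when compiling.

```python
import re

txt = "cpython, jpython, and python.net"

after_j = re.findall(r"(?<=j)python", txt)       # positive lookbehind
not_after_j = re.findall(r"(?<!j)python", txt)   # negative lookbehind

# A lookbehind containing a variable-length quantifier fails at compile time.
try:
    re.compile(r"(?<=j+)python")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(len(after_j), len(not_after_j), variable_width_ok)
```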

Tips

Let's take a few small examples.

tag = lambda s: r"(?s)<(?P<tag>%s)\b(?P<attribute>.*?)>(?P<content>.*?)</%s>" % (s, s)

tag is used specifically to generate regular expressions that match an element. If you want a regex that matches the h1 element, you can call tag('h1'). Let's demonstrate it:

>>> tg = tag('anytag')
>>> txt = r"""<anytag a="yes" b="no">hello world</anytag>"""
>>> re.match(tg, txt)
<re.Match object; span=(0, 43), match='<anytag a="yes" b="no">hello world</anytag>'>
>>> m = re.match(tg, txt)
>>> m.group('tag')
'anytag'
>>> m.group('attribute')
' a="yes" b="no"'
>>> m.group('content')
'hello world'
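Since the helper relies on named groups, all three parts can be pulled out at once with groupdict(). The sketch below is a reconstruction under assumptions: the exact quantifiers in tag are guessed, and the attribute-splitting regex is a hypothetical addition that only handles simple double-quoted values.

```python
import re

# Reconstructed helper; the exact quantifiers are an assumption.
tag = lambda s: r"(?s)<(?P<tag>%s)\b(?P<attribute>.*?)>(?P<content>.*?)</%s>" % (s, s)

txt = '<anytag a="yes" b="no">hello world</anytag>'
fields = re.match(tag("anytag"), txt).groupdict()

# Illustrative follow-up step: split the attribute string into a dict.
attrs = dict(re.findall(r'(\w+)="(.*?)"', fields["attribute"]))

print(fields)
print(attrs)
```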

tag2 = lambda s: r"(?s)<%s\b.*?>(.+?)</%s>" % (s, s)
grabentries = lambda s, tg: re.findall(tag2(tg), s)

Here is an example. A table contains many tr elements, and each tr contains many td elements. Suppose you want a function that takes a table element and returns a list of its tr entries; grabentries is that function. However, this blog has trouble displaying real table markup, so the example uses different tag names:

>>> txt = ("<row>\n\t<cell>123</cell>\n\t<cell>ABC</cell>\n</row>\n"
...        "<row>\n\t<cell>Red</cell>\n\t<cell>White</cell>\n</row>\n")
>>> grabentries(txt, 'row')   # 'row' and 'cell' are stand-in tag names
['\n\t<cell>123</cell>\n\t<cell>ABC</cell>\n', '\n\t<cell>Red</cell>\n\t<cell>White</cell>\n']
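Applying the same helper twice, once per nesting level, turns the markup into a list of rows. This is a sketch under assumptions: tag2 and grabentries are reconstructed with guessed quantifiers, and "row" and "cell" are hypothetical tag names.

```python
import re

# Reconstructed helpers; the exact quantifiers are an assumption.
tag2 = lambda s: r"(?s)<%s\b.*?>(.+?)</%s>" % (s, s)
grabentries = lambda s, tg: re.findall(tag2(tg), s)

# "row" and "cell" are hypothetical tag names used only for this sketch.
txt = (
    "<row><cell>123</cell><cell>ABC</cell></row>"
    "<row><cell>Red</cell><cell>White</cell></row>"
)

# Outer pass extracts each row's content; inner pass extracts its cells.
table = [grabentries(row, "cell") for row in grabentries(txt, "row")]

print(table)
```

Like any regex-based approach, this breaks down when elements of the same name are nested inside each other.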

[1] It seems my original understanding here was off. As I now read it, "assertion" means that this part of the regular expression only asserts something, and "lookahead" means that it serves the first half of the expression. That is to say, what it asserts is the part that follows.

Please credit the original address when reposting: https://www.9cbs.com/read-15588.html
