Non-greedy
Regular expression quantifiers are greedy by default. That is, if a string contains several possible matches (all starting at the same position), the regular expression matches the longest one. Because "." can match both "<" and ">", ".*" does not stop at the first ">" but runs on to a later one. So here you need a non-greedy quantifier [1]. Non-greedy means that the regular expression selects the shortest match instead.
>>> txt = "
Let's take another example:
>>> import re
>>> txt = "abcdddd"
>>> code = """m = re.match(pat, txt)
... for s in m.groups():
...     print(s)"""
>>> pat = r"(a\w+)"
>>> exec(code)
abcdddd
>>> pat = r"(a\w+?)"
>>> exec(code)
ab
>>> pat = r"(a\w*?)"
>>> exec(code)
a
With non-greedy matching, "a\w*?" matches only "a", and "a\w+?" behaves no differently from "a\w" when nothing follows it in the pattern. So the main use of non-greedy quantifiers is to match delimiters that appear in pairs. So this regular expression should be:
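To make the difference concrete, here is a minimal sketch of greedy versus non-greedy matching between paired delimiters. The sample string and both patterns are my own illustration, not the expression from the original example.

```python
import re

# Hypothetical sample text: two elements on one line.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as it can, so the match ends at the LAST ">".
greedy = re.match(r"<.*>", html).group()

# Non-greedy: ".*?" grabs as little as it can, so the match ends at the FIRST ">".
lazy = re.match(r"<.*?>", html).group()

print(greedy)
print(lazy)
```

The greedy pattern swallows the whole line, while the non-greedy one stops at the first closing bracket and returns only the opening tag.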
Single-line and Multi-line
Single-line and multi-line modes both concern "\n". How "\n" is handled also affects what "^" and "$" match. In short, in single-line mode "." can match "\n":
>>> match = lambda pat, s: re.match(pat, s).group()
>>> match(".", "\n")
Traceback (most recent call last):
  ...
>>> match("(?s).", "\n")
'\n'
>>> match("(?m).", "\n")
Traceback (most recent call last):
  ...
>>> search = lambda pat, s: re.search(pat, s).group()
>>> txt = """
... 123abc
... def456
... """
>>> search(r"(?s)\w$", txt)
'6'
>>> pat = "(?s)^.+$"
>>> search(pat, txt)
'\n123abc\ndef456\n'
>>> search(r"(?m)\w$", txt)
'c'
>>> search('(?m)^.+$', txt)
'123abc'
The main function of multi-line mode is to let "\n" mark the beginning and end of a line. By default, "\n" is treated as an ordinary character, one that "\s" can match.
>>> search("^[a-zA-Z].*$", "Hello World!\n")
'Hello World!'
>>> search("^[a-zA-Z].*$", "\nHello World!\n")
Traceback (most recent call last):
  ...
>>> search(r"^\s*[a-zA-Z].*$", "\nHello World!\n")
'\nHello World!'
>>> search("(?m)^[a-zA-Z].*$", "\nHello World!\n")
'Hello World!'
In short, multi-line and single-line modes both concern "\n": in multi-line mode, "^" and "$" also match right after and right before a "\n"; in single-line mode, "." can match "\n". Each mode changes how the regular expression works, but the two are independent of each other; they are not an either-or choice.
>>> match('.', '\n')
Traceback (most recent call last):
  ...
>>> match('(?s).', '\n')
'\n'
>>> search('(?m)^.+$', '\nHello\nWorld\n')
'Hello'
>>> search('^.+$', '\nHello\nWorld\n')
Traceback (most recent call last):
  ...
The first pair shows that, by default, the regular expression is not in single-line mode; the second pair shows that it is not in multi-line mode by default either.
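Since the two modes are independent, they can also be switched on together. Here is a small sketch using the re.M and re.S flag arguments rather than the inline "(?m)" and "(?s)" forms; the sample text, the pattern, and the helper name found are made up for illustration.

```python
import re

# Hypothetical sample: the phrase we want starts on the second line
# and spans a line break.
text = "say\nHello\nWorld\n"
pat = r"^Hello.World$"

def found(flags=0):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pat, text, flags)
    return m.group() if m else None

results = {
    "none": found(),             # "^" only matches at the very start
    "M": found(re.M),            # "^" matches at line 2, but "." rejects "\n"
    "S": found(re.S),            # "." accepts "\n", but "^" is stuck at the start
    "M|S": found(re.M | re.S),   # both conditions are satisfied
}
print(results)
```

Only the combination of both flags matches here, which is a quick way to see that neither mode implies or excludes the other.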
Backreference
Once part of a regex has been captured with parentheses, you can refer back to the text it matched with \i, where i is the number of the group.
>>> str = "
As another example, to capture the same letter repeated three times in a row:
>>> pat = r'([a-zA-Z])\1\1'
Note that when writing a backreference you must use a raw string; otherwise Python treats the backslash as the start of an escape sequence before the regex engine ever sees it.
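A quick sketch of why the raw string matters. The test word "baaad" is an arbitrary sample of mine; the pattern is the three-identical-letters one from above, written once as a raw string and once as an ordinary string.

```python
import re

raw_pat = r"([a-zA-Z])\1\1"    # "\1" reaches the regex engine as a backreference
plain_pat = "([a-zA-Z])\1\1"   # here Python turns "\1" into the character chr(1)

hit = re.search(raw_pat, "baaad")
miss = re.search(plain_pat, "baaad")

print(hit.group())             # the tripled letter
print(miss)                    # no match: the pattern now looks for chr(1) twice
print(len(r"\1"), len("\1"))   # two characters versus one
```

The ordinary string does not raise an error; it silently becomes a different pattern, which makes this mistake easy to miss.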
If an expression contains several groups, backreferences quickly become confusing. You can use (?:...) to make a group non-capturing, which takes it out of the numbering that backreferences use.
>>> txt = "
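As a rough illustration of how (?:...) changes the group numbering, here is a sketch with a made-up sentence. It is not the original example, just the same idea.

```python
import re

# Hypothetical sample: each surname is accidentally written twice.
text = "Mr Smith Smith and Ms Jones Jones"

# Two capturing groups: the title is group 1, so the surname is group 2.
numbered = re.findall(r"(Mr|Ms) (\w+) \2", text)

# With (?:...) the title is grouped but not captured,
# so the surname becomes group 1.
non_capturing = re.findall(r"(?:Mr|Ms) (\w+) \1", text)

print(numbered)
print(non_capturing)
```

With only one capturing group left, re.findall returns plain strings instead of tuples, which is often what you wanted in the first place.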
If the expression is really complicated, you can also name the groups. The syntax for named groups is a little more involved: define a group with "(?P<name>...)", refer to it inside the pattern with "(?P=name)", and refer to it in a replacement string with "\g<name>". In addition, after a successful match, you can look up the matched substrings by group name.
>>> txt = "
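Here is a short sketch that exercises all three pieces of named-group syntax. The date string, the sentence, and the group names (year, month, day, word) are assumptions chosen for the illustration.

```python
import re

date = "2024-01-15"   # hypothetical sample
pat = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# Define groups with (?P<name>...), then look them up by name.
m = re.match(pat, date)
year = m.group("year")
parts = m.groupdict()

# Refer to a named group inside the pattern with (?P=name):
# find a word that is immediately repeated.
doubled = re.search(r"(?P<word>\b\w+) (?P=word)\b", "it is is fine")

# Refer to named groups in a replacement string with \g<name>.
swapped = re.sub(pat, r"\g<day>/\g<month>/\g<year>", date)

print(year, parts, doubled.group("word"), swapped)
```

m.groupdict() returns every named group at once, which is handy when a pattern has more than two or three of them.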
Python's re has two kinds of assertions: lookahead assertions and lookbehind assertions. In a lookahead, the assertion serves the first part of the regular expression, so the assertion itself is written after it and checks the text that follows, without consuming it. Lookbehind works the same way in the other direction [1].
>>> def re_show(pat, s):
...     print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()))
...
>>> txt = "Micheal Jordan and Micheal Jackson"
>>> pat = "(Micheal)(?= Jordan)"
>>> re_show(pat, txt)
{Micheal} Jordan and Micheal Jackson
>>> pat = r"(?i)(Micheal)(?=\sjordan)"
>>> re_show(pat, txt)
{Micheal} Jordan and Micheal Jackson
As this example shows, a lookahead assertion takes no part in the match itself: the text it checks is not included in the overall match or in any group. In practice it is mainly used in replacements.
>>> txt = "Micheal Jordan and Micheal Jackson"
>>> pat = "(Micheal)(?= Jackson)"
>>> re.sub(pat, "Phil", txt)
'Micheal Jordan and Phil Jackson'
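To see that the lookahead text really is left out of the match, here is a small sketch that inspects the match span before doing a replacement. The price string and the pattern are made up for the illustration.

```python
import re

# Hypothetical sample: the same number appears with two different units.
text = "price: 100 USD, weight: 100 kg"

# Match digits only when " USD" follows; " USD" itself is not consumed.
pat = r"\d+(?= USD)"

m = re.search(pat, text)
matched = m.group()
span = m.span()

# Because " USD" is not part of the match, it survives the replacement.
replaced = re.sub(pat, "250", text)

print(matched, span)
print(replaced)
```

Only the first number is touched, and the unit after it stays in place without having to be repeated in the replacement string.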
Lookahead assertions come in positive and negative forms. The examples above are positive. The negative form is very similar; you just change the symbol from "?=" to "?!".
>>> txt = "Bill Clinton, Bill Joy, Bill Gates"
>>> pat = "(Bill)(?! Joy)"
>>> re_show(pat, txt)
{Bill} Clinton, Bill Joy, {Bill} Gates
>>> re.sub(pat, "William", txt)
'William Clinton, Bill Joy, William Gates'
Lookbehind assertions are very similar to lookahead assertions, but they have one major restriction: the pattern inside must have a fixed length, so it cannot contain a variable-length quantifier such as "*" or "+". The positive symbol is "?<="; the negative symbol is "?<!".
>>> txt = "cpython, jpython, and python.net"
>>> pat_positive = "(?<=j)python"
>>> pat_negative = "(?<!j)python"
>>> re_show(pat_positive, txt)
cpython, j{python}, and python.net
>>> re_show(pat_negative, txt)
c{python}, jpython, and {python}.net
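The fixed-length restriction is easy to check for yourself. Below is a minimal sketch with a made-up string; the last part assumes nothing beyond re.compile raising re.error when a lookbehind contains a variable-length pattern.

```python
import re

text = "cost $30, count 12"   # hypothetical sample

# Positive lookbehind: digits that come right after a "$".
positive = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that do NOT come right after a "$".
# The \b keeps the engine from matching the "0" inside "$30".
negative = re.findall(r"(?<!\$)\b\d+", text)

# A variable-length pattern inside a lookbehind is rejected at compile time.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(positive, negative, variable_width_ok)
```

Note the \b in the negative pattern: without it, the engine would happily match the trailing "0" of "$30", because that digit is preceded by "3" rather than "$".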
Tips
Let's take a few small examples.
Tag = lambda s: r "(? s) <(? p
tag is a helper that generates regular expressions matching an element. If you want a regex that matches the h1 element, you can call tag('h1'). Let's demonstrate it:
>>> tg = tag('anytag')
>>> txt = r"""
tag2 = lambda s: r"(?s)<%s\b.*?>(.*?)</%s>" % (s, s)
grabentries = lambda s, tg: re.findall(tag2(tg), s)
Let's look at an example. A table contains many tr elements, and each tr contains many td elements. Suppose you want a function that, given a table element, returns a list of its tr elements. This is that function. However, this blog has a small problem and cannot show a real table here, so let's change it a little:
>>> txt = """
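For readers who want to try the idea on a real table, here is a rough sketch. The pattern is my reconstruction of tag2 as an ordinary function, grab_entries is a renamed stand-in for grabentries, and the table text is made up. It also assumes that elements with the same name are not nested inside each other, which a regex like this cannot handle.

```python
import re

def tag2(name):
    """Build a regex that captures the content of a <name ...>...</name> element."""
    return r"(?s)<%s\b.*?>(.*?)</%s>" % (name, name)

def grab_entries(text, name):
    """Return the inner content of every <name> element found in text."""
    return re.findall(tag2(name), text)

# Hypothetical sample table.
table = """<table>
<tr><td>1</td><td>2</td></tr>
<tr class="odd"><td>3</td></tr>
</table>"""

rows = grab_entries(table, "tr")                     # content of each tr
cells = [grab_entries(row, "td") for row in rows]    # td contents per row

print(rows)
print(cells)
```

The "\b" after the element name keeps "tr" from matching a longer name that merely starts with the same letters, and the non-greedy ".*?" stops each match at the nearest closing tag.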