java.util.regex.Pattern

xiaoxiao2021-03-06  63

For the past two days I have wanted to learn regular expressions, but I could not find anything useful online, so this morning I had to read the API documentation.

I have translated part of it. There are surely plenty of mistakes, so please do not hesitate to correct me:

java.util.regex.Pattern

A compiled representation of a regular expression.

A regular expression is usually written as a string, and it must first be compiled into an instance of the Pattern class.

The resulting pattern can then be used to create a Matcher, which can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.

Here is a typical call order:

Pattern p = Pattern.compile("a*b");

Matcher m = p.matcher("aaaaab");

boolean b = m.matches();

For convenience, the Pattern class also defines a static matches() method, for the case where a regular expression is used only once.

In a single call, this method compiles the expression and then matches the input sequence against it.

The statement

boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the three statements above. However, it does not allow the compiled pattern to be reused, so the expression is compiled again on every call.

For repeated matches it is therefore less efficient than the approach above.
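The difference between the two styles can be sketched as follows. This is only an illustration: the class name ReusePatternDemo and the helper countMatches are made up for this example and are not part of the API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusePatternDemo {
    // Compiled once and reused for every input.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match the whole pattern, reusing the compiled Pattern.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            Matcher m = A_STAR_B.matcher(input);
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"aaaaab", "b", "ab", "ba", "aaa"};
        System.out.println(countMatches(inputs)); // prints 3

        // One-shot form: the expression is compiled again on each call.
        System.out.println(Pattern.matches("a*b", "aaaaab")); // prints true
    }
}
```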

Instances of the Pattern class are immutable and safe for use by multiple concurrent threads. Note that instances of the Matcher class are not thread-safe.
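One way to respect this rule is to share a single Pattern and let every task create its own Matcher. The sketch below assumes a small fixed thread pool; SharedPatternDemo and countAllDigit are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

class SharedPatternDemo {
    // Safe to share between threads: a compiled Pattern is immutable.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Each task creates its own Matcher; no Matcher is shared between threads.
    static int countAllDigit(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String input : inputs) {
                Callable<Boolean> task = () -> DIGITS.matcher(input).matches();
                results.add(pool.submit(task));
            }
            int count = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    count++;
                }
            }
            return count;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAllDigit(Arrays.asList("123", "12a", "007", ""))); // prints 2
    }
}
```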

Summary of regular-expression constructs:

Characters:

x    The character x

\\    The backslash character

\0n    The character with octal value 0n (0 <= n <= 7)

\0nn    The character with octal value 0nn (0 <= n <= 7)

\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh    The character with hexadecimal value 0xhh

\uhhhh    The character with hexadecimal value 0xhhhh

\t    The tab character ('\u0009')

\n    The newline (line feed) character ('\u000A')

\r    The carriage-return character ('\u000D')

\f    The form-feed character ('\u000C')

\a    The alert (bell) character ('\u0007')

\e    The escape character ('\u001B')

\cx    The control character corresponding to x

Character classes

[abc]    a, b, or c (simple class)

[^abc]    Any character except a, b, or c (negation)

[a-zA-Z]    a through z or A through Z, inclusive (range)

[a-d[m-p]]    a through d, or m through p: [a-dm-p] (union)

[a-z&&[def]]    d, e, or f (intersection)

[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)

[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)
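The union, intersection and subtraction forms in the table above can be tried out with a few one-character inputs. This is a minimal sketch; CharClassDemo and its matches helper are names invented for the example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole input matches the given regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[^abc]", "d"));        // true: negation
        System.out.println(matches("[a-d[m-p]]", "n"));    // true: union
        System.out.println(matches("[a-d[m-p]]", "e"));    // false
        System.out.println(matches("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false: subtraction
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true
    }
}
```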

Predefined character classes

.    Any character (may or may not match line terminators)

\d    A digit: [0-9]

\D    A non-digit: [^0-9]

\s    A whitespace character: [ \t\n\x0B\f\r]

\S    A non-whitespace character: [^\s]

\w    A word character: [a-zA-Z_0-9]

\W    A non-word character: [^\w]
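A short sketch of the predefined classes in everyday use follows. The helper names (digitsOnly, splitOnWhitespace, isWord) are made up for the illustration; the String methods replaceAll, split and matches all accept the same regular-expression syntax as Pattern.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Removes every non-digit character (\D).
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] splitOnWhitespace(String s) {
        return s.trim().split("\\s+");
    }

    // True if the input consists only of word characters (\w).
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("2021-03-06"));                  // prints 20210306
        System.out.println(splitOnWhitespace(" a \t b\nc ").length);   // prints 3
        System.out.println(isWord("foo_bar9"));                        // prints true
        System.out.println(isWord("foo-bar"));                         // prints false
    }
}
```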

POSIX character classes (US-ASCII only)

\p{Lower}    A lower-case alphabetic character: [a-z]

\p{Upper}    An upper-case alphabetic character: [A-Z]

\p{ASCII}    All ASCII: [\x00-\x7F]

\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]

\p{Digit}    A decimal digit: [0-9]

\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]

\p{Punct}    Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]

\p{Print}    A printable character: [\p{Graph}]

\p{Blank}    A space or a tab: [ \t]

\p{Cntrl}    A control character: [\x00-\x1F\x7F]

\p{XDigit}    A hexadecimal digit: [0-9a-fA-F]

\p{Space}    A whitespace character: [ \t\n\x0B\f\r]
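The POSIX classes are written with \p{...} and, with default flags, only cover US-ASCII. A rough sketch, with invented helper names isHex and stripPunctuation:

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // True if the input consists only of hexadecimal digits.
    static boolean isHex(String s) {
        return Pattern.matches("\\p{XDigit}+", s);
    }

    // Removes ASCII punctuation characters.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(isHex("CAFE01"));                           // prints true
        System.out.println(isHex("xyz"));                              // prints false
        System.out.println(stripPunctuation("Hello, world!"));         // prints Hello world
        System.out.println(Pattern.matches("\\p{Blank}", "\t"));       // prints true
        // With default flags \p{Lower} is ASCII-only, so an accented e does not match.
        System.out.println(Pattern.matches("\\p{Lower}", "\u00e9"));   // prints false
    }
}
```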

Classes for Unicode blocks and categories

\p{InGreek}    A character in the Greek block (simple block)

\p{Lu}    An uppercase letter (simple category)

\p{Sc}    A currency symbol

\P{InGreek}    Any character except one in the Greek block (negation)

[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
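The block and category classes can be checked against single characters. In this sketch the class name UnicodeClassDemo is made up, and the non-ASCII test characters are written as Unicode escapes (Greek small alpha and the euro sign) so the source stays ASCII.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha
        String euro = "\u20ac";  // euro sign

        System.out.println(matches("\\p{InGreek}", alpha));        // true: Greek block
        System.out.println(matches("\\P{InGreek}", "a"));          // true: not in the Greek block
        System.out.println(matches("\\p{Lu}", "A"));               // true: uppercase letter
        System.out.println(matches("\\p{Sc}", euro));              // true: currency symbol
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));  // true: letter, not uppercase
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false
    }
}
```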

Boundary matchers

^    The beginning of a line

$    The end of a line

\b    A word boundary

\B    A non-word boundary

\A    The beginning of the input

\G    The end of the previous match

\Z    The end of the input but for the final terminator, if any

\z    The end of the input
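A few of the boundary matchers in action, as a minimal sketch. BoundaryDemo, found and countWholeWord are hypothetical names; the second helper uses Pattern.quote so that the word is always treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // True if the regular expression matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Counts occurrences of a word that are delimited by word boundaries (\b).
    static int countWholeWord(String word, String text) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat", "concat cat catalog")); // prints 1
        System.out.println(found("\\Bcat", "concat"));  // true: "cat" starts inside a word
        System.out.println(found("line\\Z", "line\n")); // true: \Z ignores the final terminator
        System.out.println(found("line\\z", "line\n")); // false: \z is the very end of input
    }
}
```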

Greedy quantifiers

X?    X, once or not at all

X*    X, zero or more times

X+    X, one or more times

X{n}    X, exactly n times

X{n,}    X, at least n times

X{n,m}    X, at least n but not more than m times

Reluctant quantifiers

X??    X, once or not at all

X*?    X, zero or more times

X+?    X, one or more times

X{n}?    X, exactly n times

X{n,}?    X, at least n times

X{n,m}?    X, at least n but not more than m times

Possessive quantifiers

X?+    X, once or not at all

X*+    X, zero or more times

X++    X, one or more times

X{n}+    X, exactly n times

X{n,}+    X, at least n times

X{n,m}+    X, at least n but not more than m times
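The three families of quantifiers accept the same repetition counts but differ in how much input they take and whether they give any back. The sketch below uses one input string to show the difference; QuantifierDemo and firstMatch are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: takes everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", input));  // prints xfooxxxxxxfoo

        // Reluctant: takes as little as possible.
        System.out.println(firstMatch(".*?foo", input)); // prints xfoo

        // Possessive: takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // prints null

        System.out.println(firstMatch("a{2,3}", "aaaa"));  // prints aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa")); // prints aa
    }
}
```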

Logical operators

XY    X followed by Y

X|Y    Either X or Y

(X)    X, as a capturing group

Back references

\n    Whatever the nth capturing group matched
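A back reference matches the same text that an earlier group captured, which makes it handy for spotting repetitions. A small sketch with made-up helper names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Finds the first doubled word character, such as "ll" in "hello".
    static String firstDoubledChar(String text) {
        Matcher m = Pattern.compile("(\\w)\\1").matcher(text);
        return m.find() ? m.group() : null;
    }

    // Finds the first word that is immediately repeated, or returns null.
    static String firstRepeatedWord(String text) {
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledChar("hello"));                     // prints ll
        System.out.println(firstRepeatedWord("Paris in the the spring"));  // prints the
        System.out.println(firstRepeatedWord("no repeats here"));          // prints null
    }
}
```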

Quotation

\    Nothing, but quotes the following character

\Q    Nothing, but quotes all characters until \E

\E    Nothing, but ends quoting started by \Q
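Quoting is useful when text that contains metacharacters has to be matched literally. The sketch below shows \Q...\E written by hand and the Pattern.quote method (available since Java 5), which returns a literal pattern string for any input; the class name QuotingDemo is made up.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // True if the input is exactly the given literal text, metacharacters included.
    static boolean equalsLiteral(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, "+" is a quantifier, so the text does not match itself.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));          // prints false
        // Between \Q and \E every character is taken literally.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));    // prints true

        System.out.println(equalsLiteral("a.b", "a.b")); // prints true
        System.out.println(equalsLiteral("a.b", "axb")); // prints false
    }
}
```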

Special constructs (non-capturing)

(?:X)    X, as a non-capturing group

(?idmsux-idmsux)    Nothing, but turns match flags on - off

(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags on - off

(?=X)    X, via zero-width positive lookahead

(?!X)    X, via zero-width negative lookahead

(?<=X)    X, via zero-width positive lookbehind

(?<!X)    X, via zero-width negative lookbehind

(?>X)    X, as an independent, non-capturing group
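The special constructs are easier to follow with concrete inputs. This is only a sketch under my own choice of sample strings; SpecialConstructDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructDemo {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lookahead and lookbehind assert context without consuming it.
        System.out.println(firstMatch("\\d+(?= USD)", "pay 100 USD or 90 EUR")); // prints 100
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));             // prints 42
        System.out.println(firstMatch("q(?!u)", "quit"));                        // prints null
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "$5 and 7"));           // prints 7

        // Embedded flag: (?i) turns on case-insensitive matching.
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));               // prints true

        // (?:...) groups without capturing, so only (c) is counted.
        System.out.println(Pattern.compile("(?:ab)+(c)").matcher("").groupCount()); // prints 1

        // An independent group never gives back what it has matched.
        System.out.println(Pattern.matches("a+ab", "aaab"));      // prints true
        System.out.println(Pattern.matches("(?>a+)ab", "aaab"));  // prints false
    }
}
```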

Backslashes, escapes, and quoting

The backslash character ('\') introduces the escaped constructs defined in the table above, and it also quotes characters that would otherwise be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash, and \{ matches a single left brace.

It is an error to put a backslash in front of an alphabetic character that does not denote an escaped construct; these are reserved for future extensions of the regular-expression language. A backslash may be placed in front of any non-alphabetic character, whether or not that character is part of an unescaped construct, without causing an error.

The Java Language Specification requires that backslashes inside string literals in Java source code be interpreted either as Unicode escapes or as other character escapes. Therefore, to protect a regular expression from interpretation by the Java compiler, backslashes must be doubled in the Java string literal. For example, the string literal "\b", when interpreted as a regular expression, matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
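The doubling rule can be checked directly. In this sketch (class name invented), every regular expression is written as a Java string literal, so each backslash meant for the regex engine appears twice in the source.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Four backslashes in source = regex of two backslashes = one literal backslash.
        System.out.println(matches("\\\\", "\\"));              // prints true

        // A quoted left brace.
        System.out.println(matches("\\{", "{"));                // prints true

        // A single-backslash b in a Java literal is the backspace character itself.
        System.out.println(matches("\b", "\u0008"));            // prints true

        // A doubled backslash before b reaches the regex engine as a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));    // prints true

        // Quoted parentheses match literal parentheses.
        System.out.println(matches("\\(hello\\)", "(hello)"));  // prints true
    }
}
```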

Character classes

Character classes may appear inside other character classes, and may be composed with the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes.

The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of the character-class operators is as follows, from highest to lowest:

1 Literal escape \x

2 Grouping [...]

3 Range a-z

4 Union [a-e][i-u]

5 Intersection [a-z&&[aeiou]]

Note that a different set of metacharacters is in effect inside a character class than outside it. For example, inside a character class the regular expression . loses its special meaning, while - becomes a range-forming metacharacter.
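The change of metacharacters inside a class can be seen with a few one-character inputs. A minimal sketch with an invented class name:

```java
import java.util.regex.Pattern;

class ClassMetacharDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Outside a class the dot matches any character; inside it is a literal dot.
        System.out.println(matches(".", "a"));      // prints true
        System.out.println(matches("[.]", "a"));    // prints false
        System.out.println(matches("[.]", "."));    // prints true

        // Inside a class the hyphen forms a range unless it is escaped.
        System.out.println(matches("[a-c]", "b"));     // prints true
        System.out.println(matches("[a\\-c]", "b"));   // prints false
        System.out.println(matches("[a\\-c]", "-"));   // prints true

        // Quantifier characters are ordinary characters inside a class.
        System.out.println(matches("[*+?]", "+"));  // prints true
    }
}
```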

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character ("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').

If UNIX_LINES mode is activated, the only line terminator recognized is the newline character.

Unless the DOTALL flag is specified, the regular expression . matches any character except a line terminator.

By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated, ^ matches at the beginning of the input and after any line terminator, except at the end of the input.

In MULTILINE mode, $ matches just before a line terminator or at the end of the input sequence.
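The effect of the three flags mentioned above can be sketched like this. The flag constants Pattern.MULTILINE, Pattern.DOTALL and Pattern.UNIX_LINES are part of the API; the class and helper names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // Counts how many times the regex is found in the input under the given flags.
    static int countFinds(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";

        // Without MULTILINE, ^ and $ only match at the ends of the whole input.
        System.out.println(countFinds("^\\w+$", 0, text));                  // prints 0
        System.out.println(countFinds("^\\w+$", Pattern.MULTILINE, text));  // prints 3

        // The dot skips line terminators unless DOTALL is set.
        System.out.println(matches(".*", 0, "a\nb"));               // prints false
        System.out.println(matches(".*", Pattern.DOTALL, "a\nb"));  // prints true

        // Under UNIX_LINES a carriage return is no longer a line terminator.
        System.out.println(matches("a.b", 0, "a\rb"));                  // prints false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // prints true
    }
}
```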

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))) there are four such groups:

1 ((A)(B(C)))

2 (A)

3 (B(C))

4 (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression through a back reference.

It may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, its previously captured value, if any, is retained if the second evaluation fails.

For example, matching the string "aba" against the expression (a(b)?)+ leaves group two set to "b". All captured input is discarded at the beginning of each match.

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
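Both examples from this section can be reproduced with Matcher.group and Matcher.groupCount. The sketch below collects group zero and all numbered groups into an array; GroupDemo and groups are names chosen for the illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group 0 followed by every capturing group, or an empty array if
    // the whole input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // Group two keeps "b" although its last evaluation matched nothing.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));
        // prints [aba, a, b]
    }
}
```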

Unicode Support

This class follows Unicode Technical Report #18: Unicode Regular Expression Guidelines, implementing its second level of support, though with a slightly different concrete syntax.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification.

The regular-expression parser also implements such escape sequences directly, so that Unicode escapes can be used in expressions read from a file or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
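The two spellings can be compared directly. In this sketch (names invented), the first constant is translated by the Java compiler into a single character, while the second reaches the regular-expression parser as six characters and is translated there.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Escape handled by the Java compiler: the string holds one character.
    static final String COMPILER_ESCAPE = "\u2014";

    // Escape handled by the regex parser: the string holds six characters.
    static final String REGEX_ESCAPE = "\\u2014";

    static boolean matchesDash(String regex) {
        return Pattern.matches(regex, "\u2014");
    }

    public static void main(String[] args) {
        System.out.println(COMPILER_ESCAPE.length());             // prints 1
        System.out.println(REGEX_ESCAPE.length());                // prints 6
        System.out.println(COMPILER_ESCAPE.equals(REGEX_ESCAPE)); // prints false
        System.out.println(matchesDash(COMPILER_ESCAPE));         // prints true
        System.out.println(matchesDash(REGEX_ESCAPE));            // prints true
    }
}
```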

As in Perl, Unicode blocks and categories are written with the \p and \P constructs. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian.

Categories may be specified with the optional prefix Is: both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside a character class.

The supported blocks and categories are those of The Unicode Standard, Version 3.0. The block names are those defined in Chapter 14 and in the file Blocks-3.txt of the Unicode Character Database, except that the spaces are removed. For example, "Basic Latin" becomes "BasicLatin". The category names are those defined in Table 4-5 of the Standard (p. 88).

Comparison to Perl 5

Perl constructs not supported by the Pattern class:

The conditional constructs (?{X}) and (?(condition)X|Y),

The embedded code constructs (?{code}) and (??{code}),

The embedded comment syntax (?#comment), and

The preprocessing operations \l, \u, \L, and \U.

Pattern class constructs not supported by Perl:

Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.

Character-class union and intersection, as described above.

Notable differences from Perl:

When reposting, please credit the original address: https://www.9cbs.com/read-115006.html