Java Regular Expressions in Detail (Cactus Studio, 01-7-31 04:13:03)
If you have used Perl or any other language with built-in regular expression support, you know how easy regular expressions make text processing and pattern matching. If you are not familiar with the term, a "regular expression" is a string of characters that defines a pattern used to search for matching strings.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, support text processing with regular expressions, and some text editors use them to implement advanced search-and-replace features. So what about Java? At the time of writing, a Java Specification Request covering text processing with regular expressions has been approved, and you can expect it in the next version of the JDK.
But what if you need regular expressions now? You can download the open-source Jakarta-ORO library from Apache.org. The rest of this article first gives a brief introduction to regular expressions and then shows how to use them through the Jakarta-ORO API.
First, regular expression basics
Let's start simple. Suppose you want to search for strings containing the characters "cat"; the regular expression for the search is simply "cat". If the search is case-insensitive, the words "catalog", "Catherine", and "sophisticated" all match.
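A minimal sketch of this first search, written with the JDK's java.util.regex package rather than Jakarta-ORO so that it runs without extra libraries. The class and method names are illustrative assumptions, not part of the article.

```java
import java.util.regex.Pattern;

public class CatSearch {
    // Case-insensitive search for the literal pattern "cat" anywhere in the word.
    static boolean containsCat(String word) {
        return Pattern.compile("cat", Pattern.CASE_INSENSITIVE).matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"catalog", "Catherine", "sophisticated", "dog"}) {
            System.out.println(w + " -> " + containsCat(w));
        }
    }
}
```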
1.1 The period symbol
Suppose you are playing an English spelling game and want to find three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire content you can search with regular expressions. To construct the regular expression, you can use a wildcard: the period symbol ".". The complete expression is then "t.n", which matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any character, including spaces, tab characters, and even newlines.
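The wildcard behaviour can be tried out with a short sketch. It uses java.util.regex instead of Jakarta-ORO, and the class name is made up for illustration. One assumption is worth flagging: in java.util.regex the period does not match line terminators unless the DOTALL flag is set, so the flag is enabled here to mirror the description above.

```java
import java.util.regex.Pattern;

public class DotWildcard {
    // "t.n": a 't', then any single character, then an 'n'.
    // DOTALL makes '.' match line terminators as well.
    private static final Pattern T_ANY_N = Pattern.compile("t.n", Pattern.DOTALL);

    static boolean matches(String s) {
        return T_ANY_N.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tan", "t#n", "t n", "tn", "toon"}) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```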
1.2 The square bracket symbol
To keep the period from matching too much, you can list the meaningful characters inside square brackets ("[]"). Only the characters listed in the brackets then take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". It does not match "toon", because square brackets match exactly one character.
1.3 The "or" symbol
If you also want to match "toon" in addition to all the words matched above, you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use square brackets here, because square brackets only match a single character; parentheses "()" must be used instead. Parentheses can also be used for grouping, as described later.
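The difference between square brackets and alternation can be checked with a small sketch. It relies on java.util.regex rather than Jakarta-ORO, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class BracketsAndAlternation {
    // Square brackets: exactly one of a, e, i, o between 't' and 'n'.
    static boolean bracketMatch(String s) {
        return Pattern.matches("t[aeio]n", s);
    }

    // Alternation in parentheses: one of the listed alternatives, including "oo".
    static boolean alternationMatch(String s) {
        return Pattern.matches("t(a|e|i|o|oo)n", s);
    }

    public static void main(String[] args) {
        System.out.println("toon with brackets:    " + bracketMatch("toon"));
        System.out.println("toon with alternation: " + alternationMatch("toon"));
    }
}
```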
1.4 Symbols indicating the number of matches
Table 1 shows the symbols that indicate the number of matches; each determines how many times the symbol immediately to its left must occur.
Suppose we want to search a text file for American Social Security numbers. The format of this number is 999-99-9999, and the regular expression used to match it is shown in Figure 1. In regular expressions the hyphen ("-") has a special meaning: it denotes a range, such as 0 to 9. Therefore, to match the hyphens in a Social Security number literally, precede each one with the escape character "\".
Figure 1: Matches all Social Security numbers of the form 123-12-1234
Suppose that, when searching, you want the hyphens to be optional, i.e., both 999-99-9999 and 999999999 are valid formats. In that case, add the "?" quantifier after each hyphen, as shown in Figure 2:
Figure 2: Matches all Social Security numbers of the form 123-12-1234 or 123121234
Let's look at another example. One format of US car license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[a-z]{2}". Figure 3 shows the complete regular expression.
Figure 3: Matches typical US car license plate numbers, such as 8836KV
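The figures for these quantifier examples are not reproduced here, so the patterns in the following sketch are reconstructions from the surrounding text, not the article's exact expressions. The sketch uses java.util.regex in place of Jakarta-ORO. The names are illustrative, and the case-insensitive flag on the license pattern is an assumption made so that upper-case letters are accepted.

```java
import java.util.regex.Pattern;

public class QuantifierExamples {
    // {n} repeats the element to its left exactly n times; "\\-" is an escaped hyphen.
    static final Pattern SSN = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // "?" makes the hyphen to its left optional (zero or one occurrence).
    static final Pattern SSN_OPTIONAL_HYPHEN =
        Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");
    // Four digits followed by two letters.
    static final Pattern LICENSE =
        Pattern.compile("[0-9]{4}[a-z]{2}", Pattern.CASE_INSENSITIVE);

    static boolean isSsn(String s) { return SSN.matcher(s).matches(); }
    static boolean isSsnLoose(String s) { return SSN_OPTIONAL_HYPHEN.matcher(s).matches(); }
    static boolean isLicense(String s) { return LICENSE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234"));
        System.out.println(isSsnLoose("123121234"));
        System.out.println(isLicense("8836KV"));
    }
}
```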
1.5 The "not" symbol
The "^" symbol is called the "not" symbol. When used inside square brackets, "^" indicates characters that should not be matched. For example, the regular expression of Figure 4 matches all words except those that start with the letter "X".
Figure 4: Matches all words except those starting with "X"
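A negated character class can be sketched as follows. The pattern is a guess at what Figure 4 might contain and is not taken from the article. It uses java.util.regex instead of Jakarta-ORO, and it lists both cases of the letter explicitly to avoid relying on case-insensitive flags.

```java
import java.util.regex.Pattern;

public class NotStartingWithX {
    // First character: anything except 'X' or 'x'; then one or more letters.
    private static final Pattern WORD = Pattern.compile("[^Xx][A-Za-z]+");

    static boolean accepts(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println("apple     -> " + accepts("apple"));
        System.out.println("xylophone -> " + accepts("xylophone"));
    }
}
```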
1.6 Parentheses and whitespace symbols
Suppose you want to extract the month from a birthday date such as "June 26, 1951". A regular expression that matches such dates is shown in Figure 5:
Figure 5: Matches all dates in Month DD, YYYY format
The new "\s" symbol is the whitespace symbol; it matches all whitespace characters, including tabs. If the string matches, how do you get the month out of it? Simply put a pair of parentheses around the month part, then extract its value with the ORO API (described in detail later). The modified regular expression is shown in Figure 6:
Figure 6: Matches all dates in Month DD, YYYY format and defines the month as the first group
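Group extraction can be illustrated with a short sketch. The date pattern is an assumed reconstruction of Figures 5 and 6, and java.util.regex stands in for the ORO API. The helper name extractMonth is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonthExtractor {
    // Group 1 (the parentheses) captures the month name; \s matches whitespace.
    private static final Pattern DATE = Pattern.compile(
        "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month part of a "Month DD, YYYY" date, or null if the input does not match.
    static String extractMonth(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```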
1.7 Other symbols
For simplicity, you can use shortcut symbols defined for common regular expressions, as shown in Table 2:
Table 2: Commonly used symbols
For example, in the earlier Social Security number example, we can use "\d" everywhere "[0-9]" appears. The modified regular expression is shown in Figure 7:
Figure 7: Matches all Social Security numbers of the form 123-12-1234
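The equivalence of "\d" and "[0-9]" can be checked with the sketch below. Both patterns are reconstructed from the text, and java.util.regex is used rather than Jakarta-ORO. Note that in Java source code each backslash of the regular expression is written twice.

```java
import java.util.regex.Pattern;

public class DigitShorthand {
    static final Pattern WITH_CLASS =
        Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // Same expression with the \d shortcut in place of [0-9].
    static final Pattern WITH_SHORTHAND =
        Pattern.compile("\\d{3}\\-\\d{2}\\-\\d{4}");

    static boolean isSsn(String s) {
        return WITH_SHORTHAND.matcher(s).matches();
    }

    // True when both spellings give the same verdict for the input.
    static boolean sameResult(String s) {
        return WITH_CLASS.matcher(s).matches() == WITH_SHORTHAND.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234") + " " + sameResult("123-12-1234"));
    }
}
```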
Second, the Jakarta-ORO library
Many open-source regular expression libraries are available to Java programmers, and many of them support Perl 5 compatible regular expression syntax. Here I have chosen the Jakarta-ORO regular expression library, one of the most comprehensive regular expression APIs, which is fully compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs.
The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese donated it to the Jakarta Project. You can download it using the reference at the end of this article.
I will first briefly introduce the objects you must create and access when using the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.
▲ PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface that lets you compile regular expressions into Pattern objects used for matching.
▲ Pattern object
To compile a regular expression into a Pattern object, call the compiler object's compile() method and pass the regular expression as the argument. For example, you can compile the regular expression "t[aeio]n" this way.
By default, the compiler creates a case-sensitive pattern. The pattern compiled above therefore matches "tin", "tan", "ten", and "ton" but not "Tin" or "taN". To create a case-insensitive pattern, pass an extra parameter, Perl5Compiler.CASE_INSENSITIVE_MASK, when calling the compiler.
After creating the Pattern object, you can use it for matching through the PatternMatcher class.
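For readers without the Jakarta-ORO jar at hand, the same idea can be sketched with java.util.regex, where Pattern.compile() plays the role of the compiler and Pattern.CASE_INSENSITIVE is the rough counterpart of the case-insensitive mask. This is an analogy, not the ORO API itself, and the class name is illustrative.

```java
import java.util.regex.Pattern;

public class CompileFlags {
    // Default compilation: case-sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");
    // Extra flag argument: case-insensitive.
    static final Pattern INSENSITIVE =
        Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean matchesSensitive(String s) {
        return SENSITIVE.matcher(s).matches();
    }

    static boolean matchesInsensitive(String s) {
        return INSENSITIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("Tin, sensitive:   " + matchesSensitive("Tin"));
        System.out.println("Tin, insensitive: " + matchesInsensitive("Tin"));
    }
}
```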
▲ PatternMatcher object
The PatternMatcher object performs matching based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a PatternMatcher interface object. Perl5Matcher is an implementation of the PatternMatcher interface that matches according to Perl 5 regular expression syntax.
The PatternMatcher object offers several matching methods; the first parameter of each is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): use when the input string and the regular expression must match exactly; in other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): use when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): use when the regular expression must match some part of the input string (i.e., a substring).
In addition, in all three method calls you can pass a PatternMatcherInput object instead of a String; matching can then continue from the position of the last match in the string. This is useful when a string may contain several substrings that match a given regular expression. With a PatternMatcherInput object as the parameter, the three methods have the following signatures:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
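The three matching modes have close counterparts in java.util.regex, which the following sketch uses in place of Jakarta-ORO: Matcher.matches() for a whole-string match, lookingAt() for a prefix match, and find() for a substring match. Calling find() repeatedly on one Matcher resumes after the previous match, which is roughly what a PatternMatcherInput provides. The mapping is offered as an analogy only, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchModes {
    private static final Pattern WORD = Pattern.compile("t[aeio]n");

    // Whole input must match (compare matches()).
    static boolean whole(String s) { return WORD.matcher(s).matches(); }

    // Input must start with a match (compare matchesPrefix()).
    static boolean prefix(String s) { return WORD.matcher(s).lookingAt(); }

    // A match anywhere in the input is enough (compare contains()).
    static boolean anywhere(String s) { return WORD.matcher(s).find(); }

    // Each find() call continues from the end of the previous match.
    static List<String> all(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(all("tan ten tin"));
    }
}
```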
Third, application examples
Let's take a look at some application examples of the Jakarta-Oro library.
3.1 Log file processing
Task: analyze a web server log file to determine how much time each user spends on the website. In a typical BEA WebLogic log file, each record starts with the client IP address, followed by "- -" and a timestamp enclosed in square brackets.
Looking at such a record, there are two items to extract from the log file: the IP address and the page access time. You can extract both from the log record with grouping symbols (parentheses).
Let's look at the IP address first. An IP address consists of 4 bytes, each with a value between 0 and 255 and separated from the next by a period. Each byte of the IP address therefore has at least one and at most three digits. Figure 8 shows the regular expression written for the IP address:
Figure 8: Matches an IP address
The period characters in the IP address must be escaped (preceded by "\"), because here a period has its literal meaning rather than the special meaning it has in regular expression syntax, which was described earlier.
The time part of the log record is surrounded by a pair of square brackets. You can extract everything inside the brackets as follows: first search for the opening square bracket character ("["), then extract all content that is not a closing square bracket character ("]"), moving forward until the closing square bracket is found. Figure 9 shows the regular expression for this part.
Figure 9: Matches at least one character until "]" is found
Now combine the two regular expressions above into a single expression with grouping symbols (parentheses), so that the IP address and the time can be extracted from the log record. Note that "\s-\s-\s" is added in the middle of the regular expression to match "- -" (without extracting it). The complete regular expression is shown in Figure 10.
Figure 10: Matches the IP address and the timestamp
Now that the regular expression is written, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed.
The regular expression used here is almost identical to the one in Figure 10, with one exception: in Java, you must escape every backslash ("\"). Figure 10 is not written in Java notation, so we have to put a "\" in front of each "\" to avoid compile errors. Unfortunately, escaping is error-prone, so be careful. You can first type the unescaped regular expression and then replace each "\" with "\\" from left to right. To double-check, print the string to the screen.
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object.
Now create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match.
Next, use the MatchResult object returned by the PatternMatcher interface to output the matched groups. Because the logEntry string contains matching content, the output shows the IP address and the timestamp.
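The log example can be sketched end to end with java.util.regex (used here instead of Jakarta-ORO so the code runs on a plain JDK). The combined pattern is reconstructed from the description of Figures 8 to 10, and the sample log record is a hypothetical line in the described shape, not one quoted from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogEntryParser {
    // Hypothetical sample record: IP address, "- -", then a bracketed timestamp.
    static final String SAMPLE =
        "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] \"GET /IsAlive.htm HTTP/1.0\" 200 15";

    // Group 1: dotted IP address; group 2: everything between '[' and ']'.
    // Every regex backslash is doubled because this is a Java string literal.
    private static final Pattern ENTRY = Pattern.compile(
        "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the record does not match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse(SAMPLE);
        System.out.println("IP: " + parts[0]);
        System.out.println("Timestamp: " + parts[1]);
    }
}
```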
3.2 HTML processing example 1
The next task is to analyze all attributes of the font tags in an HTML page. A typical font tag in an HTML page looks like <font face="Arial, Serif" size="+2" color="red">.
The program outputs the attributes of each font tag as name-value pairs.
In this case, I suggest using two regular expressions. The first, shown in Figure 11, extracts face="Arial, Serif" size="+2" color="red" from the font tag.
Figure 11: Matches all attributes of the font tag
The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.
Figure 12: Matches a single attribute and splits it into a name-value pair
The result of the split is one name-value pair per attribute: face and "Arial, Serif", size and "+2", color and "red".
Now let's look at the Java code for this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case-sensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose there is a String variable html that holds one line of the HTML file. If the html string contains a font tag, the matcher returns true. You can then get the first group, which contains all attributes of the font tag, from the MatchResult object returned by the matcher object.
Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, so it is well suited for extracting the name-value pairs of the attributes inside the font tag. Create the PatternMatcherInput object by passing the string to be matched as a parameter. Then use the matcher instance to extract each attribute of the font tag: call the contains() method of the PatternMatcher object repeatedly, passing the PatternMatcherInput object (rather than a String) as the parameter. Each iteration moves the internal pointer of the PatternMatcherInput object, so the next search starts right after the previous match.
The output of this example lists each attribute of the font tag as a name-value pair.
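The two-step approach (first isolate the attribute text, then iterate over the attributes) might look like the following sketch. It uses java.util.regex rather than Jakarta-ORO, with repeated find() calls standing in for the PatternMatcherInput loop. Both patterns are assumed reconstructions of Figures 11 and 12, and the class name is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontAttributes {
    // Step 1: group 1 captures everything between "<font" and ">".
    private static final Pattern FONT_TAG =
        Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Step 2: group 1 is the attribute name, group 2 the quoted value.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first font tag in document order (empty if none).
    static Map<String, String> parse(String html) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {   // resumes after the previous attribute
                attrs.put(attr.group(1), attr.group(2));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        parse(html).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```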
3.3 HTML processing example 2
Let's look at another example of processing HTML. This time, assume the web server has moved from widgets.acme.com to newserver.acme.com, and you now have to modify the links in some pages.
The regular expression that performs this search is shown in Figure 13:
Figure 13: Matches the link before modification
If this regular expression matches, you can replace the link with one that points to the new server.
Note that $1 is added after the # character in the replacement. Perl regular expression syntax uses $1, $2, etc. to refer to groups that have been matched and extracted. The expression in Figure 13 appends everything that was matched and extracted as a group to the link.
Now back to Java. As before, you have to create the test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object.
Next, call the static substitute() method of the Util class in the com.oroinc.text.regex package and output the result string.
The Util.substitute() method takes five parameters.
The first two parameters are the previously created PatternMatcher and Pattern objects. The third is a Substitution object that determines how the replacement is performed; this example uses a Perl5Substitution object, which performs Perl 5 style substitution. The fourth parameter is the string on which the substitution is performed, and the last parameter specifies whether to replace all matching substrings (Util.SUBSTITUTE_ALL) or only a given number of them.
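A comparable substitution can be sketched with java.util.regex, whose Matcher.replaceAll() also understands $1 in the replacement string and roughly corresponds to substituting with Util.SUBSTITUTE_ALL. This is not the ORO call itself. The page name interface.html and the anchor in the sample link are hypothetical, since the article's link text is not reproduced here.

```java
import java.util.regex.Pattern;

public class LinkRewriter {
    // Group 1 captures the fragment after '#' so it can be reused as $1.
    private static final Pattern OLD_LINK = Pattern.compile(
        "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\"\\s*>",
        Pattern.CASE_INSENSITIVE);

    // $1 is replaced by whatever group 1 matched.
    private static final String NEW_LINK =
        "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Replaces every old-server link in the input; other text is left untouched.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">Buy</a>"));
    }
}
```

To replace only the first occurrence, Matcher.replaceFirst() could be used in place of replaceAll().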
[Conclusion] In this article, I introduced the power of regular expressions. Used properly, regular expressions are a great help in string extraction and text modification. I also showed how to use regular expressions in Java programs through the Jakarta-ORO library. Whether you ultimately stick with old-fashioned string processing (StringTokenizer, charAt, and substring) or use regular expressions is up to you.
The site I work for receives a lot of user data through forms, and all of it must be checked before being sent to the database. I knew that the regular expression functions of PHP3 could solve my problem, but I did not know how to build the regular expressions. I needed examples. Naturally I went to the PHP3 manual and the POSIX 1003.2 specification, but they offered little in the way of examples. So I spent a lot of time looking for material on the subject online, and in the end I worked it out mainly through experiments.
Since material on this topic is scarce, I decided to write down what I know about the syntax and about how to build regular expressions that validate money amounts and e-mail addresses. I hope it clears away the fog for you and your colleagues.
Basic syntax of the regular expression:
First, let's look at two special symbols: '^' and '$'. They mark the beginning and the end of a string, respectively. For example:
· "^The": matches any string that starts with "The";
· "of despair$": matches any string that ends with "of despair";
· "^abc$": matches a string that starts and ends with "abc", which can only be "abc" itself;
· "notice": matches any string that contains "notice".
As the last example shows, if you use neither symbol, the pattern may occur anywhere in the string; it is not tied to the beginning or the end.
The symbols '*', '+', and '?' specify how many times a character or sequence of characters may occur. They mean "zero or more", "one or more", and "zero or one". Some examples:
· "ab*": matches an a followed by zero or more b's ("a", "ab", "abbb", etc.);
· "ab+": the same, but with at least one b ("ab", "abbb", etc.);
· "ab?": an a followed by zero or one b;
· "a?b+$": an optional a followed by one or more b's at the end of the string.
You can also use braces; the numbers inside give the allowed number of occurrences of the preceding character:
· "ab{2}": an a followed by exactly two b's ("abb");
· "ab{2,}": at least two b's ("abb", "abbbb", etc.);
· "ab{3,5}": three to five b's ("abbb", "abbbb", or "abbbbb").
Note that you must always specify the first number of a range (for example "{0,2}", not "{,2}"). As you may have noticed, '*', '+', and '?' do the same as "{0,}", "{1,}", and "{0,1}".
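The anchors and quantifiers above can be tried out with a small helper. The sketch is in Java with java.util.regex, not the PHP3 functions this part of the text has in mind, on the assumption that these particular constructs behave the same way in both. The helper name found is hypothetical.

```java
import java.util.regex.Pattern;

public class AnchorsAndQuantifiers {
    // True when the regular expression matches somewhere in the input.
    // Without '^' or '$' the match may start and end anywhere.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^The", "The end"));
        System.out.println(found("of despair$", "the pit of despair"));
        System.out.println(found("ab{3,5}", "abbbb"));
    }
}
```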
To quantify a sequence of characters, put the sequence in parentheses:
· "a(bc)*": an a followed by zero or more copies of "bc";
· "a(bc){1,5}": an a followed by one to five copies of "bc".
The '|' character works like OR and is used to choose between alternatives:
· "hi|hello": a string containing "hi" or "hello";
· "(b|cd)ef": a string containing "bef" or "cdef";
· "(a|b)*c": a string with a mixed sequence of a's and b's followed by a c.
A period ('.') stands for any single character:
· "a.[0-9]": a string with an a followed by any one character and a digit;
· "^.{3}$": a string of exactly 3 characters.
Bracket expressions specify which characters are allowed at a single character position:
· "[ab]": a single a or b (equivalent to "a|b");
· "[a-d]": one lowercase letter from 'a' to 'd' (equivalent to "a|b|c|d" and to "[abcd]");
· "^[a-zA-Z]": a string that starts with a letter;
· "[0-9]%": a string with a digit directly before a percent sign;
· ",[a-zA-Z0-9]$": a string that ends with a comma followed by a digit or letter.
You can also list the characters you do not want: just put a '^' in the first position inside the square brackets (for example, "%[^a-zA-Z]%" matches a string in which a character that is not a letter sits between two percent signs).
In addition, remember that a special character must be preceded by a backslash to be taken literally. For example, the regular expression "(\$|¥)[0-9]+" can be used in the call ereg("(\$|¥)[0-9]+", $str) (which strings does it validate?).
Don't forget that inside square brackets all special characters lose their special meaning, including the backslash (the exceptions are '^' and '-'); for example, "[*+?{}.]" matches any one of these symbols. The regex man pages tell us that to include a ']' in the list you can put it in the first position; you can also put a backslash in front of it.
For completeness I should also mention collating sequences, character classes, and equivalence classes. I will not go into their details, because they are beyond the scope of this article; you can find more in the regex man pages.
Pattern      Result
.            matches any character
x?           matches 0 or 1 occurrence of x
x*           matches 0 or more occurrences of x
x+           matches 1 or more occurrences of x
.*           matches 0 or more of any character
.+           matches 1 or more of any character
{m}          matches exactly m occurrences of the specified string
{m,n}        matches at least m and at most n occurrences of the specified string
{m,}         matches m or more occurrences of the specified string
[]           matches any character listed inside []
[^]          matches any character not listed inside []
[0-9]        matches all digit characters
[a-z]        matches all lowercase letters
[^0-9]       matches all non-digit characters
[^a-z]       matches all characters that are not lowercase letters
^            matches at the beginning of the string
$            matches at the end of the string
\d           matches one digit, same as [0-9]
\d+          matches a string of digits, same as [0-9]+
\D           matches one non-digit, same as [^0-9]
\D+          matches a string of non-digits, same as [^0-9]+
\w           matches one letter or digit, same as [a-zA-Z0-9]
\w+          same as [a-zA-Z0-9]+
\W           matches one character that is not a letter or digit, same as [^a-zA-Z0-9]
\W+          same as [^a-zA-Z0-9]+
\s           matches one whitespace character, same as [\n\t\r\f]
\s+          same as [\n\t\r\f]+
\S           matches one non-whitespace character, same as [^\n\t\r\f]
\S+          same as [^\n\t\r\f]+
\b           matches at a boundary between letters or digits and other characters
\B           matches where there is no such boundary
a|b|c        matches a or b or c
abc          matches a string containing abc
(pattern)    remembers the matched string; a very practical piece of syntax