Reprinted:
http://buy.ccw.cn/htm/app/aprog/01_7_31_4.asp
If you have used Perl or any other language with built-in regular expression support, you know how simple regular expressions make text processing and pattern matching. If you are not familiar with the term, a "regular expression" is a string of characters that defines a pattern used to search for matching strings.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, support text processing with regular expressions, and some text editors use them to implement advanced search-and-replace features. So what about Java? At the time of writing, a Java Specification Request that includes a regular expression library for text processing has been approved, and you can expect to see it in the next version of the JDK.
But what if you need regular expressions now? You can download the open source Jakarta-ORO library from apache.org. This article first gives a brief introduction to regular expressions and then shows how to use them with the Jakarta-ORO API.
1. Regular expression basics
Let's start simple. Suppose you want to search for strings that contain the characters "cat"; the regular expression for the search is simply "cat". If the search is case-insensitive, the words "catalog", "Catherine", and "sophisticated" all match, because each contains the letters "cat" in sequence.
1.1 Period symbol
Suppose you are playing an English word game and want to find three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire content you can search with regular expressions. To construct this regular expression, you can use a wildcard, the period symbol ".". The complete expression is then "t.n", which matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any character, including spaces and tabs (and, in single-line mode, line breaks as well).
1.2 Square bracket symbol
To solve the problem of the period matching too broadly, you can list the meaningful characters inside square brackets ("[]"). Only the characters listed in the brackets then take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a pair of square brackets matches exactly one character.
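As a minimal sketch of the difference between the period and square brackets, the following compares "t.n" with "t[aeio]n" on a few words. It uses java.util.regex from the standard library rather than Jakarta-ORO, and the class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Compares the wildcard "t.n" with the character class "t[aeio]n".
// java.util.regex is used here as a stand-in; it is not the Jakarta-ORO API.
class WildcardVsClass {
    // True when the whole word matches the expression.
    static boolean matchesWhole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        String[] words = {"tan", "ten", "tin", "ton", "t#n", "tpn", "t n", "toon"};
        for (String w : words) {
            System.out.println(w + "  t.n=" + matchesWhole("t.n", w)
                    + "  t[aeio]n=" + matchesWhole("t[aeio]n", w));
        }
    }
}
```

The period accepts "t#n" and "t n", while the bracket form rejects them; neither form accepts "toon", because both match exactly one character between "t" and "n".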
1.3 "Or" symbol
If you also want to match "toon" in addition to the words matched above, you can use the "|" operator, which means "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use square brackets here, because square brackets match only a single character; parentheses "()" must be used instead. Parentheses can also be used for grouping, as described later.
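The alternation can be tried out with a short sketch such as the one below. It again relies on java.util.regex as a stand-in for Jakarta-ORO, and the class name is a hypothetical choice.

```java
import java.util.regex.Pattern;

// Demonstrates alternation inside parentheses: t(a|e|i|o|oo)n.
// java.util.regex is used here as a stand-in; it is not the Jakarta-ORO API.
class Alternation {
    static final Pattern WORD = Pattern.compile("t(a|e|i|o|oo)n");

    // True when the whole word is one of tan, ten, tin, ton, toon.
    static boolean matches(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "ton", "toon", "tun", "tooon"}) {
            System.out.println(w + " -> " + matches(w));
        }
    }
}
```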
1.4 Symbols for the number of matches
Table 1 shows the symbols that indicate the number of matches; each determines how many times the symbol immediately to its left must occur:
Suppose you want to search a text file for US Social Security numbers. The format of the number is 999-99-9999, and the regular expression that matches it is shown in Figure 1. Inside square brackets, the hyphen ("-") has a special meaning: it denotes a range, such as 0 to 9. The expression therefore escapes the literal hyphens in the Social Security number with the escape character "\"; outside square brackets this escape is harmless but not strictly required.
Figure 1: Matches all Social Security numbers of the form 123-12-1234
Suppose that when searching you want the hyphens to be optional, so that both 999-99-9999 and 999999999 are valid formats. You can then add the "?" quantifier after each hyphen, as shown in Figure 2:
Figure 2: Matches all Social Security numbers of the forms 123-12-1234 and 123121234
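The sketch below shows one way the two Social Security number patterns could be written. The exact expressions of Figure 1 and Figure 2 are assumed here from the surrounding description ("[0-9]" digits, "{n}" counts, escaped hyphens, and "?"), and java.util.regex stands in for Jakarta-ORO.

```java
import java.util.regex.Pattern;

// Social Security number patterns with required and optional hyphens.
// Both expressions are assumptions based on the text, not copies of the figures.
class SsnPatterns {
    // Hyphens required: 123-12-1234
    static final Pattern STRICT =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // "?" after each hyphen makes it optional: 123-12-1234 or 123121234
    static final Pattern LENIENT =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    static boolean isStrict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLenient(String s) {
        return LENIENT.matcher(s).matches();
    }
}
```

Note that the backslashes are doubled only because the patterns appear inside Java string literals.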
Let's look at another example. One format for US car license plates is four digits followed by two letters. Its regular expression consists of the numeric part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.
Figure 3: Matches typical US car license plate numbers, such as 8836KV
1.5 "Not" symbol
The "^" symbol is called the "not" symbol. When used inside square brackets, "^" indicates characters that you do not want to match. For example, the regular expression in Figure 4 matches all words except those that start with the letter "x".
Figure 4: Matches all words except those starting with "x"
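A negated character class can be sketched as follows. The pattern "[^x][a-z]+" is only a guess at what Figure 4 might contain, and java.util.regex is used in place of Jakarta-ORO.

```java
import java.util.regex.Pattern;

// Negated character class: the first character may be anything except "x".
// The expression is an assumed reconstruction, not the article's Figure 4.
class NotX {
    static final Pattern WORD = Pattern.compile("[^x][a-z]+");

    // True when the word has at least two characters, does not start with
    // "x", and continues with lowercase letters only.
    static boolean matches(String word) {
        return WORD.matcher(word).matches();
    }
}
```

Because "[^x]" accepts any character other than "x", this sketch also accepts a leading digit or punctuation mark; a stricter variant would restrict the first character to letters.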
1.6 Parentheses and whitespace symbols
Suppose you want to extract the month from a birth date such as "June 26, 1951". A regular expression that matches such dates is shown in Figure 5:
Figure 5: Matches all dates in Month DD, YYYY format
The new "\s" symbol is the whitespace symbol; it matches any whitespace character, including tabs. Once the string matches, how do you extract the month? Simply put a pair of parentheses around the month part to create a group, then retrieve its value with the ORO API (described in detail later). The modified regular expression is shown in Figure 6:
Figure 6: Matches all dates in Month DD, YYYY format and defines the month value as the first group
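To make the grouping idea concrete, here is a small sketch that extracts the month as group 1. The date expression is an assumption modeled on the description of Figure 6, and the code uses java.util.regex rather than the ORO API that the article covers later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the month from a "Month DD, YYYY" date through a capturing group.
// The expression is assumed from the text; it is not copied from Figure 6.
class MonthExtractor {
    static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month name, or null when the input is not in the format.
    static String month(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("June 26, 1951"));
    }
}
```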
1.7 Other symbols
For simplicity, you can use shortcut symbols defined for common character classes, as shown in Table 2:
Table 2: Commonly used symbols
For example, in the earlier Social Security number example, every "[0-9]" can be replaced with "\d". The modified regular expression is shown in Figure 7:
Figure 7: Matches all Social Security numbers of the form 123-12-1234
2. The Jakarta-ORO library
Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5-compatible regular expression syntax. I chose the Jakarta-ORO library here because it is one of the most comprehensive regular expression APIs and is fully compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs.
The Jakarta-ORO library was formerly known as OROMatcher; Daniel Savarese donated it to the Jakarta Project. You can download it by following the reference at the end of this article.
I first briefly introduce the objects you must create and access when using the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.
▲ PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface that lets you compile regular expressions into the Pattern objects used for matching.
▲ Pattern object
To compile a regular expression into a Pattern object, call the compiler object's compile() method and pass the regular expression as a parameter. For example, you can compile the regular expression "t[aeio]n" as follows:
By default, the compiler creates a case-sensitive pattern. The code above therefore produces a pattern that matches "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, specify an extra parameter when calling the compiler:
After creating the Pattern object, you can use it for pattern matching through the PatternMatcher class.
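The effect of the extra case-insensitivity parameter can be illustrated with the standard java.util.regex package, which follows the same compile-then-match model. This is a sketch with a stand-in library, not Jakarta-ORO code; Pattern.CASE_INSENSITIVE plays the role that the text assigns to the compiler's extra parameter.

```java
import java.util.regex.Pattern;

// Case-sensitive versus case-insensitive compilation of "t[aeio]n".
// java.util.regex is used here as a stand-in; it is not the Jakarta-ORO API.
class CaseSensitivity {
    // Default compilation is case-sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");
    // An extra flag argument requests case-insensitive matching.
    static final Pattern INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }
}
```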
▲ PatternMatcher object
The PatternMatcher object performs matching based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type. Perl5Matcher is an implementation of the PatternMatcher interface that matches according to Perl 5 regular expression syntax:
The PatternMatcher object offers several methods for matching. The first parameter of each is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): Use when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): Use when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): Use when the regular expression must match some part of the input string (that is, a substring).
In all three method calls you can also pass a PatternMatcherInput object instead of a String object. Matching can then continue from the position of the last match in the string. This is useful when a string may contain several substrings that match a given regular expression. With a PatternMatcherInput object in place of the String, the three methods have the following syntax:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
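The three matching modes have close counterparts in java.util.regex, which makes them easy to experiment with: Matcher.matches() for a whole-string match, lookingAt() for a prefix match, and find() for a substring match. Repeated find() calls on one Matcher continue after the previous match, similar in spirit to passing a PatternMatcherInput. The sketch below uses that stand-in library; the class and method names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Whole-string, prefix, and substring matching with java.util.regex,
// used as an analogue of matches(), matchesPrefix(), and contains().
class MatchModes {
    static final Pattern P = Pattern.compile("t[aeio]n");

    // The expression must describe the entire input.
    static boolean whole(String input) {
        return P.matcher(input).matches();
    }

    // The expression must match at the start of the input.
    static boolean prefix(String input) {
        return P.matcher(input).lookingAt();
    }

    // The expression must match somewhere in the input.
    static boolean anywhere(String input) {
        return P.matcher(input).find();
    }

    // Counts all matches; each find() resumes after the previous match.
    static int count(String input) {
        Matcher m = P.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```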
3. Application examples
Let's look at some application examples of the Jakarta-ORO library.
3.1 Log file processing
Task: analyze a web server log file to determine how much time each user spends on the website. In a typical BEA WebLogic log file, a log record has the following format:
Analyzing this log record shows that two items need to be extracted from the log file: the IP address and the page access time. You can extract the IP address and the timestamp from the log record with groups (parentheses).
Let's look at the IP address first. An IP address consists of 4 bytes, each with a value between 0 and 255 and separated from the next by a period. Each byte in the IP address therefore has at least one and at most three digits. Figure 8 shows the regular expression written for the IP address:
Figure 8: Matches the IP address
The periods in the IP address must be escaped (preceded by "\"), because here the period has its literal meaning rather than the special meaning it has in regular expression syntax, which was described earlier.
The time part of the log record is surrounded by a pair of square brackets. You can extract everything inside the brackets as follows: first find the opening square bracket ("["), then extract all content that is not a closing square bracket ("]"), scanning forward until the closing square bracket is found. Figure 9 shows this part of the regular expression.
Figure 9: Matches at least one character up to "]"
Now combine the two regular expressions into a single expression, adding grouping symbols (parentheses), so that the IP address and the time can be extracted from the log record. Note that "\s-\s-\s" is added in the middle of the regular expression in order to match the "- -" sequence (without extracting it). The complete regular expression is shown in Figure 10.
Figure 10: Matches the IP address and the timestamp
Now that the regular expression is written, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed:
The regular expression used here is almost identical to the one in Figure 10, with one exception: in Java, you must escape every backslash ("\"). Figure 10 is not written in Java notation, so you have to add a "\" in front of each "\" to avoid compile errors. Unfortunately, this escaping is error-prone, so be careful. You can first type the regular expression without escaping, then go through it from left to right and replace each "\" with "\\". To check the result, you can print it to the screen.
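The doubling rule can be checked with a few lines of Java. In this sketch, the helper toLiteralSource() is a hypothetical name, and the Social Security number pattern is the assumed "\d" form from section 1.7; printing the string shows the expression exactly as the regular expression engine receives it.

```java
// Shows how a raw regular expression relates to its Java string literal.
class JavaEscaping {
    // Raw expression: \d{3}\-\d{2}\-\d{4}
    // Inside a Java literal every backslash is written twice.
    static final String SSN = "\\d{3}\\-\\d{2}\\-\\d{4}";

    // Doubles each backslash, producing the text you would type between
    // the quotes of a Java string literal for the given raw expression.
    static String toLiteralSource(String rawRegex) {
        return rawRegex.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        System.out.println(SSN);                  // the expression as the engine sees it
        System.out.println(toLiteralSource(SSN)); // the text to type inside the quotes
    }
}
```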
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object:
Now create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:
Next, use the MatchResult object returned by the PatternMatcher interface to output the matched groups. Because the logEntry string contains matching content, the program prints the extracted IP address and timestamp.
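The whole log example can be approximated with java.util.regex as a stand-in for Jakarta-ORO. In this sketch the combined expression is an assumed reconstruction of Figure 10 from the description above, and the sample log record in main() is a hypothetical entry in the common access-log style, not data from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the IP address (group 1) and timestamp (group 2) from a log record.
// The expression is assumed from the text; the sample record is hypothetical.
class LogEntryParser {
    static final Pattern ENTRY = Pattern.compile(
            "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})"  // IP address
            + "\\s-\\s-\\s"                                  // the "- -" filler
            + "\\[([^\\]]+)\\]");                            // [timestamp]

    // Returns {ip, timestamp}, or null when the record does not match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String sample = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] fields = parse(sample);
        System.out.println("IP: " + fields[0]);
        System.out.println("Timestamp: " + fields[1]);
    }
}
```

Here find() plays the role of contains(), and Matcher.group() the role of the MatchResult object.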
3.2 HTML processing example 1
The next task is to analyze all attributes of the font tags in an HTML page. A typical font tag in an HTML page looks like this:
The program outputs the attributes of each font tag in the following form:
In this case, I suggest using two regular expressions. The first, shown in Figure 11, extracts the attribute string face="Arial, Serif" size="+2" color="red" from the font tag.
Figure 11: Matches all attributes of the font tag
The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.
Figure 12: Matches a single attribute and splits it into a name-value pair
The result of the split is:
Now let's look at the Java code that performs this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling the regular expressions, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case-sensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose there is a String variable html that holds one line of the HTML file. If the html string contains a font tag, the matcher returns true. You can then use the MatchResult object returned by the matcher to get the first group, which contains all the attributes of the font tag:
Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, so it is well suited to extracting the name-value pairs of the attributes inside the font tag. Create the PatternMatcherInput object by passing the string to be matched as a parameter. Then use the matcher instance to extract each attribute of the font tag by repeatedly calling the contains() method of the PatternMatcher object with the PatternMatcherInput object (not a String object) as the parameter. Each iteration moves the internal pointer of the PatternMatcherInput object, so the next search starts just after the previous match.
The program then prints each attribute of the font tag as a name-value pair.
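The two-step approach (first isolate the attribute string, then loop over name-value pairs) can be sketched with java.util.regex, where repeated find() calls take the place of the PatternMatcherInput loop. Both expressions are assumed reconstructions of Figures 11 and 12, and the class name and sample tag used in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits the attributes of the first font tag into name-value pairs.
// Both expressions are assumptions based on the text, not the article's figures.
class FontAttributes {
    // Group 1 captures everything between "<font " and ">".
    static final Pattern FONT_TAG = Pattern.compile(
            "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Group 1 is the attribute name, group 2 the quoted value.
    static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attribute = ATTRIBUTE.matcher(tag.group(1));
            // Each find() continues just after the previous match.
            while (attribute.find()) {
                attributes.put(attribute.group(1), attribute.group(2));
            }
        }
        return attributes;
    }
}
```

For example, calling FontAttributes.parse() on a line containing a font tag with face, size, and color attributes returns a map with those three names as keys, in document order.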
3.3 HTML processing example 2
Let's look at another example of HTML processing. This time, assume that the web server has moved from widgets.acme.com to newserver.acme.com. You now have to modify the links in some pages:
The regular expression that performs this search is shown in Figure 13:
Figure 13: Matches the link before modification
If this regular expression matches, you can replace the link with the following content:
Note the $1 after the # character. Perl regular expression syntax uses $1, $2, and so on to refer to groups that have been matched and extracted. The replacement appends the text that the expression in Figure 13 matched and extracted as a group to the new link.
Now back to Java. As before, create the test string, the objects needed to compile the regular expression into a Pattern object, and a PatternMatcher object:
Next, use the substitute() static method of the Util class in the com.oroinc.text.regex package to perform the replacement and output the resulting string:
The syntax of the Util.substitute() method is as follows:
The first two parameters of this call are the previously created PatternMatcher and Pattern objects. The third parameter is a Substitution object, which determines how the replacement is performed. This example uses a Perl5Substitution object, which performs Perl 5-style substitution. The fourth parameter is the string on which the replacement is to be performed, and the last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a specified number of them.
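A comparable replacement can be sketched with java.util.regex, where Matcher.replaceAll() and replaceFirst() accept the same "$1" group reference in the replacement text. This is a stand-in for Util.substitute(), not ORO code; the page path interface.html, the anchor names, and the exact expression are hypothetical, since the article's link and Figure 13 are not reproduced here.

```java
import java.util.regex.Pattern;

// Rewrites links from widgets.acme.com to newserver.acme.com, carrying the
// anchor name over through the $1 group reference.
// The page path and the expression are illustrative assumptions.
class LinkRewriter {
    static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\">",
            Pattern.CASE_INSENSITIVE);

    // $1 is replaced by the text captured in the first group.
    static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Replaces every matching link.
    static String rewriteAll(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    // Replaces only the first matching link.
    static String rewriteFirst(String html) {
        return OLD_LINK.matcher(html).replaceFirst(NEW_LINK);
    }
}
```

rewriteAll() corresponds to replacing all matching substrings, while rewriteFirst() illustrates limiting the number of replacements to one.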