If you have used Perl or any other language with built-in regular expression support, you know how simple it is to process text and match patterns with regular expressions. If you are not familiar with the term, a "regular expression" is a string of characters that defines a pattern used to search for matching strings.

Many languages, including Perl, PHP, Python, JavaScript, and JScript, support text processing with regular expressions, and some text editors use regular expressions to implement advanced "search and replace" features. So what about Java? At the time of writing, a Java Specification Request covering text processing with regular expressions has been approved, and you can expect to see it in the next version of the JDK.

However, what should you do if you need regular expressions now? You can download the open source Jakarta-ORO library from apache.org. The rest of this article first gives a brief introduction to regular expressions and then describes how to use them through the Jakarta-ORO API.

1. Regular expression basics

Let's start simple. Suppose you want to search for strings containing the characters "cat"; the regular expression for the search is "cat". If the search is case-insensitive, the words "catalog", "Catherine", and "sophisticated" all match. In other words, the pattern matches any string that contains "cat".

1.1 The period symbol

Suppose you are playing an English word game and want to find three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire content you can search with regular expressions. To construct the regular expression, you can use a wildcard, the period symbol ".". The complete expression is then "t.n", which matches "tan", "ten", "tin", and "ton", but it also matches other three-character strings that are not words, because the period is not limited to letters.

1.2 The square bracket symbol

To narrow the period's overly broad match, you can list the characters you want inside square brackets ("[]").
In that case, only the characters listed inside the square brackets take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a pair of square brackets matches only a single character.

1.3 The "or" symbol

If, in addition to all the words matched above, you also want to match "toon", you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use square brackets here, because they match only a single character; you must use parentheses "()". Parentheses can also be used for grouping, as described later.

1.4 Symbols indicating the number of matches

Table 1 lists the symbols that indicate the number of matches. Each one determines how many times the symbol immediately to its left must occur.

Suppose we are searching a text file for US Social Security Numbers. The format of the number is 999-99-9999, and the regular expression used to match it is shown in Figure 1. In regular expressions the hyphen ("-") has a special meaning: inside square brackets it denotes a range, such as 0 to 9. To be safe, each hyphen in the Social Security Number is therefore preceded by the escape character "\".

Figure 1: Matches all Social Security Numbers of the form 123-12-1234

Suppose that, when searching, you want the hyphens to be optional, that is, both 999-99-9999 and 999999999 are valid formats.
In that case, you can add the "?" quantifier after each hyphen, as shown in Figure 2.

Figure 2: Matches all Social Security Numbers of the form 123-12-1234 and 123121234

Let's look at another example. One format of US car license plate is four digits followed by two letters. Its regular expression starts with the digit part "[0-9]{4}", followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.

Figure 3: Matches a typical US license plate number, such as 8836KV

1.5 The "not" symbol

The "^" symbol is called the "not" symbol. When used inside square brackets, "^" indicates characters that you do not want to match. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".

Figure 4: Matches all words except those beginning with "X"

1.6 Parentheses and whitespace symbols

Suppose you want to extract the month from dates in the format "June 26, 1951". The regular expression used to match such dates is shown in Figure 5.

Figure 5: Matches all dates in Month DD, YYYY format

The newly introduced "\s" symbol is the whitespace symbol; it matches all whitespace characters, including the Tab character. Once the string has been matched, how do you extract the month? Simply put a pair of parentheses around the month to create a group, then retrieve its value with the ORO API (described in detail later). The modified regular expression is shown in Figure 6.

Figure 6: Matches all dates in Month DD, YYYY format and defines the month as the first group

1.7 Other symbols

For convenience, you can use shorthand symbols for some common regular expressions, as shown in Table 2.

Table 2: Common symbols

For example, in the earlier Social Security Number example, every "[0-9]" can be replaced with "\d". The revised regular expression is shown in Figure 7.

Figure 7: Matches all Social Security Numbers of the form 123-12-1234

2. The Jakarta-ORO library

Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5-compatible regular expression syntax.
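Before going further into the library, the notation from section 1 is easy to try out. The following is only a sketch: it uses the JDK's java.util.regex package instead of Jakarta-ORO so that it runs without an extra download, the patterns are my own reconstructions of the kind of expression Figures 2, 3, 4, 6, and 7 describe (the figures themselves are not reproduced here), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original article.
class RegexBasics {
    // True when the regex describes the entire input string.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Returns the first parenthesized group of a full match, or null if no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // 1.2 and 1.3: square brackets match one character, alternation can match "oo"
        System.out.println(fullMatch("t[aeio]n", "toon"));        // false
        System.out.println(fullMatch("t(a|e|i|o|oo)n", "toon"));  // true

        // 1.4: "?" makes each (escaped) hyphen optional
        String ssn = "[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}";
        System.out.println(fullMatch(ssn, "123-12-1234"));        // true
        System.out.println(fullMatch(ssn, "123121234"));          // true

        // 1.4: four digits followed by two letters
        System.out.println(fullMatch("[0-9]{4}[A-Z]{2}", "8836KV")); // true

        // 1.5: "^" inside brackets excludes a character
        System.out.println(fullMatch("[^X][a-z]+", "Xylophone")); // false

        // 1.6: parentheses define a group that can be extracted
        String date = "([A-Za-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}";
        System.out.println(firstGroup(date, "June 26, 1951"));    // June

        // 1.7: "\d" is shorthand for "[0-9]"
        System.out.println(fullMatch("\\d{3}\\-\\d{2}\\-\\d{4}", "123-12-1234")); // true
    }
}
```

Note that every backslash is doubled because the patterns are written as Java string literals.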
I chose the Jakarta-ORO regular expression library here because it is one of the most comprehensive regular expression APIs and is fully compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs. The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese donated it to the Jakarta Project. You can download it using the reference at the end of this article.

I will first briefly introduce the objects you must create and access when using the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.

▲ PatternCompiler object

First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface that lets you compile a regular expression into a Pattern object used for matching.

▲ Pattern object

To compile a regular expression into a Pattern object, call the compiler object's compile() method and pass the regular expression as the argument. For example, you can compile the regular expression "t[aeio]n" by passing it to compile(). By default, the compiler creates a case-sensitive pattern. The resulting pattern therefore matches "tin", "tan", "ten", and "ton", but not "Tin" or "taN".
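The default case-sensitive behaviour can be checked with a small sketch. It is an illustration under an assumption, not the article's original listing: it uses Pattern.compile() from the JDK's java.util.regex package as a stand-in for compiling with Perl5Compiler, and the class name CompileDemo is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class CompileDemo {
    // Compile once and reuse; no flags means the pattern is case-sensitive.
    // (With Jakarta-ORO the corresponding step is calling compile() on a Perl5Compiler.)
    static final Pattern WORD = Pattern.compile("t[aeio]n");

    // True when the whole word is described by the pattern.
    static boolean matchesWord(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tin", "tan", "ten", "ton", "Tin", "taN"};
        for (String w : words) {
            // The four lower-case words print true; "Tin" and "taN" print false.
            System.out.println(w + " -> " + matchesWord(w));
        }
    }
}
```

Be aware that both libraries have a type called Pattern, so a program that mixes them needs fully qualified names.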
To create a case-insensitive pattern, specify an extra parameter when calling the compiler. After creating the Pattern object, you can use it for matching through the PatternMatcher class.

▲ PatternMatcher object

The PatternMatcher object performs matching based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a PatternMatcher interface variable. Perl5Matcher is an implementation of the PatternMatcher interface that performs matching according to Perl 5 regular expression syntax.

With a PatternMatcher object you can match in several ways. The first parameter of each of the following methods is the string to be matched against the regular expression:

· boolean matches(String input, Pattern pattern): Use when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): Use when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): Use when the regular expression must match some part of the input string (that is, a substring).

In all three method calls, you can also pass a PatternMatcherInput object instead of a String. In that case, matching can continue from the position of the previous match in the string. This is useful when a string may contain several substrings that match a given regular expression. With a PatternMatcherInput parameter in place of the String, the three methods have the following syntax:

· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)

3. Application examples

Let's take a look at some application examples of the Jakarta-ORO library.
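Before the examples, it may help to see the three matching modes from section 2 side by side. This is a rough sketch that assumes the JDK's java.util.regex as a stand-in for Jakarta-ORO: Matcher.matches() plays the role of matches(), Matcher.lookingAt() the role of matchesPrefix(), and Matcher.find() the role of contains(). Calling find() repeatedly on one Matcher resembles passing a PatternMatcherInput, because each call resumes after the previous match. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class MatchModes {
    static final Pattern WORD = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean whole(String s) { return WORD.matcher(s).matches(); }

    // The pattern must match at the start of the input.
    static boolean prefix(String s) { return WORD.matcher(s).lookingAt(); }

    // The pattern must match some substring of the input.
    static boolean anywhere(String s) { return WORD.matcher(s).find(); }

    // Collects every match; each find() continues after the previous one.
    static List<String> allMatches(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(whole("tank"));     // false: "k" is left over
        System.out.println(prefix("tank"));    // true: starts with "tan"
        System.out.println(prefix("stan"));    // false: "tan" is not at the start
        System.out.println(anywhere("stan"));  // true: contains "tan"
        System.out.println(allMatches("a tin ton of tan")); // [tin, ton, tan]
    }
}
```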
3.1 Log file processing

Task: analyze a web server log file to determine how long each user spends on the website. A typical BEA WebLogic log record begins with the client's IP address, followed by "- -" and a timestamp enclosed in square brackets.

Analyzing such a log record, you can see that two items need to be extracted from the log file: the IP address and the page access time. You can use grouping symbols (parentheses) to extract the IP address and the timestamp from the log record.

First, look at the IP address. An IP address consists of 4 bytes, each with a value between 0 and 255, separated by periods. Each byte of the IP address therefore has at least one and at most three digits. Figure 8 shows the regular expression written for the IP address.

Figure 8: Matches an IP address

The period characters in the IP address must be escaped (preceded by "\"), because here the period has its literal meaning rather than the special meaning it has in regular expression syntax. The special meaning of the period in regular expressions was described earlier.

The time portion of the log record is enclosed in a pair of square brackets.
You can extract everything inside the brackets as follows: first search for the opening square bracket character ("["), then extract everything that is not a closing square bracket character ("]"), moving forward until the closing square bracket is found. Figure 9 shows the regular expression for this part.

Figure 9: Matches at least one character, up to "]"

Now combine the two regular expressions into a single expression, adding grouping symbols (parentheses), so that the IP address and the time can be extracted from the log record. Note that "\s-\s-\s" is added in the middle of the regular expression in order to match "- -" (but not extract it). The complete regular expression is shown in Figure 10.

Figure 10: Matches the IP address and the timestamp

Now that the regular expression has been written, you can write the Java code that uses the regular expression library.

To use the Jakarta-ORO library, first create the regular expression string and the log record string to be analyzed. The regular expression in the Java code is almost identical to that of Figure 10, with one exception: in Java, you must escape each backslash ("\"). Figure 10 is not in Java form, so you have to add a "\" in front of each "\" to avoid compilation errors. Unfortunately, this escaping is error-prone, so be careful. You can first type the unescaped regular expression and then, working from left to right, replace each "\" with "\\". To check the result, try printing it to the screen.

After initializing the strings, instantiate the PatternCompiler object and compile the regular expression with it. Then create a PatternMatcher object and call the contains() method of the PatternMatcher interface. Finally, use the MatchResult object returned by the PatternMatcher interface to output the matched groups.
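The steps above can be sketched as follows. Treat it as an illustration under stated assumptions rather than the original listing: it uses the JDK's java.util.regex instead of Jakarta-ORO so that it runs without the library, the pattern is my reconstruction of the expression Figure 10 describes, the sample log record is invented in the format described in the text, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class LogEntryParser {
    // Group 1: IP address. Group 2: everything between "[" and "]".
    // Every regex backslash is doubled because this is a Java string literal.
    static final Pattern ENTRY = Pattern.compile(
        "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the record does not contain a match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {   // find() looks for the pattern anywhere in the record
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Invented sample record, for illustration only.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                        + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        if (parts != null) {
            System.out.println("IP: " + parts[0]);        // IP: 172.26.155.241
            System.out.println("Timestamp: " + parts[1]); // Timestamp: 26/Feb/2001:10:56:03 -0500
        }
    }
}
```

Printing the pattern with System.out.println(ENTRY.pattern()) shows the unescaped form, which is a convenient way to check the doubling of backslashes.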
Since the logEntry string contains matching content, the program prints the extracted IP address and timestamp.

3.2 HTML processing example 1

The next task is to analyze all attributes of the font tags in an HTML page and to output the attributes of each font tag as name-value pairs.

In this case, I suggest using two regular expressions. The first, shown in Figure 11, extracts 'face="Arial, Serif" size="+2" color="red"' from the font tag.

Figure 11: Matches all attributes of the font tag

The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.

Figure 12: Matches a single attribute and splits it into a name-value pair

Now let's look at the Java code for this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling the regular expressions, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case-sensitive. Next, create a Perl5Matcher object to perform the matching.

Assume there is a String variable html that holds one line of the HTML file. If the html string contains a font tag, the matcher returns true.
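The whole task, including the attribute loop explained in the rest of this section, might look like the sketch below. It rests on several assumptions: it uses the JDK's java.util.regex rather than Jakarta-ORO, both patterns are my reconstructions of what Figures 11 and 12 describe, the sample tag is invented, and the class and method names are hypothetical. Pattern.CASE_INSENSITIVE stands in for Perl5Compiler.CASE_INSENSITIVE_MASK, and repeated find() calls stand in for the PatternMatcherInput loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class FontAttributes {
    // Group 1 captures everything between "<font " and ">".
    static final Pattern FONT_TAG =
        Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);

    // Group 1 is the attribute name, group 2 the quoted value.
    static final Pattern ATTRIBUTE =
        Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first font tag in the line, in document order.
    static Map<String, String> attributesOf(String html) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {          // each call resumes after the previous match
                attrs.put(attr.group(1), attr.group(2));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Invented sample line, for illustration only.
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (Map.Entry<String, String> e : attributesOf(html).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The sketch only handles double-quoted attribute values and only the first font tag on a line, which keeps it close to the two expressions described in the text.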
At this point, you can obtain the first group from the MatchResult object returned by the matcher; it contains all attributes of the font tag.

Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, so it is well suited to extracting the name-value pairs of the attributes inside the font tag. Create the PatternMatcherInput object by passing the string to be matched as a parameter. Then use the matcher instance to extract each attribute of the font tag in turn, by repeatedly calling the contains() method of the PatternMatcher object with the PatternMatcherInput object (rather than a String object) as the parameter. Each iteration moves the internal pointer of the PatternMatcherInput object, so the next test starts just after the previous match position.

3.3 HTML processing example 2

Let's look at another HTML processing example. This time, assume that the web server has moved from widgets.acme.com to newserver.acme.com, and you need to update the links in some pages so that they point to the new server. The regular expression that performs this search is shown in Figure 13.

Figure 13: Matches the link before modification

If this regular expression matches, you can replace the matched link with the new one.

Now, back to Java. As before, create the test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object. Next, use the static substitute() method of the Util class in the com.oroinc.text.regex package to perform the replacement and output the resulting string.

The first two parameters of the Util.substitute() call are the PatternMatcher and Pattern objects created earlier. The third parameter is a Substitution object, which determines how the replacement is performed. This example uses a Perl5Substitution object, which supports Perl 5-style substitution.
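A replacement of this kind can be approximated with the sketch below. It is not the article's listing: it assumes the JDK's java.util.regex in place of Jakarta-ORO, with Matcher.replaceAll() standing in for Util.substitute(); the pattern and the sample link are my own guesses at what a link to the old server might look like, and the class and method names are hypothetical. The "$1" in the replacement string refers to the first group, in the same Perl 5 style that a Perl5Substitution uses.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class LinkRewriter {
    // Group 1 keeps the opening part of the anchor tag; the old host name follows it.
    static final Pattern OLD_LINK = Pattern.compile(
        "(<a\\s+href\\s*=\\s*\")http://widgets\\.acme\\.com/",
        Pattern.CASE_INSENSITIVE);

    // Replaces every link to the old host; "$1" re-inserts the captured tag start.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll("$1http://newserver.acme.com/");
    }

    public static void main(String[] args) {
        // Invented sample link, for illustration only.
        String html = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">";
        System.out.println(rewrite(html));
        // <a href="http://newserver.acme.com/interface.html#How_To_Buy">
    }
}
```

replaceAll() always replaces every match; to limit the number of replacements, as the last parameter of Util.substitute() allows, you would have to loop over the matches yourself.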
The fourth parameter of Util.substitute() is the string on which the replacement is performed, and the last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.

[Conclusion] In this article I have introduced the power of regular expressions. Used properly, regular expressions can be of great help in string extraction and text modification. I have also described how to use regular expressions in Java programs through the Jakarta-ORO library. Whether you ultimately use old-fashioned string processing (StringTokenizer, charAt, and substring) or regular expressions is up to you.

(Web page editor: wind wings)

Java regular expressions in detail, cactus studio, 01-7-31 04:13:03