Because of my work I regularly have to deal with large amounts of electronic text, so I have paid close attention to how regular expressions are used in Java and have gained some understanding of them. I hope this article is useful to readers with similar needs.

Regular expressions:

A regular expression is a powerful tool for pattern matching and replacement. It is a text pattern made up of ordinary characters (such as the letters a to z) and special characters (called metacharacters), and it describes one or more strings to look for when searching a body of text. The regular expression acts as a template against which the searched string is matched.

Regular expressions play a very important role in processing character data. They can handle most data analysis tasks, such as checking whether a string is a number, checking whether it is a valid e-mail address, or extracting valuable data from a large mass of text. Without regular expressions, the code for such tasks can become very long and error-prone. I have learned this first hand: faced with tidying the electronic files of a large number of reference books, not knowing how to apply regular expressions would make the job very painful.

Since this article is about how to use regular expressions in Java and space is limited, readers who are new to regular expressions should consult other material for the basics.

Java support for regular expressions:

JDK 1.3 and earlier versions did not include a regular expression library for Java programmers, so we generally used libraries provided by third parties. Some of these libraries are open source and some must be purchased. The JDK 1.4 beta now also includes a regular expression library, java.util.regex. As a result, a good number of regular expression libraries are available for Java today.
Below I introduce two of the more representative ones, Jakarta-ORO and java.util.regex, starting of course with the one I have been using all along, Jakarta-ORO.

The Jakarta-ORO regular expression library

1. Overview:

Jakarta-ORO is one of the most comprehensive and best optimized regular expression APIs. The library was formerly known as OROMatcher and was written by Daniel F. Savarese, who later donated it to the Jakarta Project. Readers can download the API package from the apache.org website.

Many open source regular expression libraries support Perl 5 compatible syntax, and Jakarta-ORO is no exception: it is fully compatible with Perl 5 regular expressions.

2. Objects and their methods:

★ PatternCompiler object:

When using the Jakarta-ORO API, the first thing to do is create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface; it compiles regular expressions into the Pattern objects used for matching.

    PatternCompiler compiler = new Perl5Compiler();

★ Pattern object:

To compile a regular expression into a Pattern object, call the compiler object's compile() method and pass the regular expression as the argument.
For example, you can compile the regular expression "s[ahkl]y" like this:

    Pattern pattern = null;
    try {
        pattern = compiler.compile("s[ahkl]y");
    } catch (MalformedPatternException e) {
        e.printStackTrace();
    }

By default, the compiler creates a case-sensitive pattern. The pattern compiled above therefore matches only "say", "shy", "sky", and "sly", and does not match "Say" or "skY". To create a case-insensitive pattern, pass an extra argument when calling the compiler:

    pattern = compiler.compile("s[ahkl]y", Perl5Compiler.CASE_INSENSITIVE_MASK);

Once the Pattern object has been created, it can be used for matching through the PatternMatcher class.

★ PatternMatcher object:

A PatternMatcher object checks for matches based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type. Perl5Matcher is an implementation of the PatternMatcher interface that matches according to Perl 5 regular expression syntax:

    PatternMatcher matcher = new Perl5Matcher();

PatternMatcher provides several methods for matching. Each takes the string to be matched as its first parameter and the compiled pattern as its second:

boolean matches(String input, Pattern pattern): use this method when the input string and the regular expression must match exactly. It returns true only when the regular expression matches the entire input string.

boolean matchesPrefix(String input, Pattern pattern): use this method when the regular expression must match the beginning of the input string. It returns true when the start of the input string matches the regular expression.

boolean contains(String input, Pattern pattern): use this method when the regular expression only needs to match part of the input string. It returns true when some substring of the input string matches the regular expression.
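The difference between the three kinds of match can be tried out without installing anything. The sketch below uses java.util.regex rather than Jakarta-ORO, so it is only an analogy: matches(), lookingAt() and find() are the java.util.regex calls that behave roughly like the whole-string, prefix and substring checks described above. The class name and the helper names are made up for this illustration.

```java
import java.util.regex.Pattern;

class MatchModesSketch {
    // Same expression as in the text, compiled with and without case sensitivity.
    static final Pattern CASE_SENSITIVE = Pattern.compile("s[ahkl]y");
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile("s[ahkl]y", Pattern.CASE_INSENSITIVE);

    // True only if the whole input matches the pattern.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    // True if the input starts with a match of the pattern.
    static boolean matchesPrefix(Pattern p, String input) {
        return p.matcher(input).lookingAt();
    }

    // True if any substring of the input matches the pattern.
    static boolean contains(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(CASE_SENSITIVE, "sky"));        // true
        System.out.println(matchesWhole(CASE_SENSITIVE, "Sky"));        // false
        System.out.println(matchesWhole(CASE_INSENSITIVE, "Sky"));      // true
        System.out.println(matchesPrefix(CASE_SENSITIVE, "skyline"));   // true
        System.out.println(contains(CASE_SENSITIVE, "a blue sky"));     // true
    }
}
```

Keeping the three checks as separate helpers mirrors the way the ORO interface exposes them as separate methods.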
However, these three methods only find the first match of the regular expression in the input string. If the string may contain several substrings that match the given regular expression, call the same three methods with a PatternMatcherInput object as the argument instead of a String. Matching can then resume from the position of the last match in the string, which is much more convenient.
With a PatternMatcherInput object in place of the String argument, the three methods have the following signatures:

    boolean matches(PatternMatcherInput input, Pattern pattern)
    boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
    boolean contains(PatternMatcherInput input, Pattern pattern)

★ Util.substitute() method:

To replace text after finding it, use the Util.substitute() method. Its signature is:

    public static String substitute(PatternMatcher matcher, Pattern pattern, Substitution sub, String input, int numSubs)

The first two parameters are the PatternMatcher and Pattern objects. The third is a Substitution object, which determines how the replacement is carried out. The fourth is the target string in which the replacement is made. The last specifies whether to replace every matching substring (Util.SUBSTITUTE_ALL) or only a given number of them.

The third parameter, the Substitution object, deserves some explanation, because it decides how the replacement is done.

Substitution:

Substitution is an interface that gives you control over how Util.substitute() performs the replacement. It has two standard implementations: StringSubstitution and Perl5Substitution. You can of course also write your own implementation to get whatever special replacement behaviour you need.

StringSubstitution:

StringSubstitution implements simple plain-text replacement. It has two constructors:

    StringSubstitution() -> the default constructor; initializes a substitution containing a zero-length string.
    StringSubstitution(java.lang.String substitution) -> initializes a substitution with the given string.
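To make the roles of the replacement text and the count parameter concrete, here is a minimal sketch of plain-text replacement with a limit on the number of replacements. It uses java.util.regex, not Jakarta-ORO, and the substitute() helper and the SUBSTITUTE_ALL constant below are local stand-ins invented for this illustration; they only imitate the behaviour described above for Util.substitute() with a StringSubstitution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralSubstituteSketch {
    // Local marker meaning "replace every match"; an assumption of this sketch.
    static final int SUBSTITUTE_ALL = -1;

    // Replaces up to numSubs matches of pattern in input with the literal text
    // replacement. The text is inserted as-is, with no group variables.
    static String substitute(Pattern pattern, String replacement,
                             String input, int numSubs) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        int done = 0;
        while ((numSubs == SUBSTITUTE_ALL || done < numSubs) && m.find()) {
            // quoteReplacement keeps "$" and "\" in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            done++;
        }
        m.appendTail(out); // copy whatever follows the last replaced match
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("s[ahkl]y");
        System.out.println(substitute(p, "???", "sky say shy", SUBSTITUTE_ALL)); // ??? ??? ???
        System.out.println(substitute(p, "???", "sky say shy", 2));              // ??? ??? shy
    }
}
```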
Perl5Substitution:

Perl5Substitution is a subclass of StringSubstitution. Besides plain-text replacement, it also allows Perl 5 style variables that refer to each matched group in the MatchResult, so it offers more varied ways of replacing than its direct parent class StringSubstitution. It has three constructors:

    Perl5Substitution()
    Perl5Substitution(java.lang.String substitution)
    Perl5Substitution(java.lang.String substitution, int numInterpolations)

The first two work in the same way as the corresponding StringSubstitution constructors; the third is described further below.

In the replacement string of a Perl5Substitution you can include variables that stand for the groups enclosed in parentheses in the regular expression. These variables are written $1, $2, and so on.
An example shows how the substitution variables are used. Suppose the regular expression is b\d+: (that is, b[0-9]+:), and in every matching string we want to change "b" to "a" and ":" to "-" while leaving the rest unchanged. If the input string is "EXAMPLE b123:", it should become "EXAMPLE a123-" after replacement.

To do this, first enclose the part to be kept in grouping parentheses, so that the regular expression becomes "b(\d+):". The replacement string for the Perl5Substitution object is then "a$1-", so the object is constructed as Perl5Substitution("a$1-"). When Util.substitute() is called, every substring of the target string that matches "b(\d+):" is replaced with the replacement string, and the variable $1 means that whatever matched the first group of the regular expression is inserted unchanged at that position. In "EXAMPLE b123:", the part matching the regular expression is "b123:" and the part matching the first group "(\d+)" is "123", so the final result is "EXAMPLE a123-".

One point needs to be made clear. If the numInterpolations parameter of the constructor Perl5Substitution(java.lang.String substitution, int numInterpolations) is set to INTERPOLATE_ALL, the contents of the variables ($1, $2, and so on) are updated from the current match every time a match is found. If numInterpolations is set to a positive integer N, the variables are updated from the current match only for the first N matches; every later replacement reuses the variable contents of the Nth match.
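The interpolation rule just described can be imitated in a few lines. This is a sketch in java.util.regex, which has group variables such as $1 but no built-in equivalent of numInterpolations, so the substitute() helper and the INTERPOLATE_ALL constant below are inventions of this example and not the ORO API. The helper expands the template against the current match for the first N matches and afterwards keeps reusing the text produced for the Nth match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InterpolationSketch {
    // Local marker meaning "expand variables on every match"; not the ORO constant.
    static final int INTERPOLATE_ALL = 0;

    static String substitute(String regex, String template,
                             String input, int numInterpolations) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        String frozen = "";   // expanded replacement from the most recent interpolation
        int seen = 0;         // number of matches found so far
        int lastEnd = 0;      // end offset of the previous match
        while (m.find()) {
            seen++;
            if (numInterpolations == INTERPOLATE_ALL || seen <= numInterpolations) {
                // appendReplacement copies the text between matches and then the
                // expanded template; remember only the expanded part.
                int gap = m.start() - lastEnd;
                int before = out.length();
                m.appendReplacement(out, template);
                frozen = out.substring(before + gap);
            } else {
                // Past the limit: reuse the last expansion verbatim.
                m.appendReplacement(out, Matcher.quoteReplacement(frozen));
            }
            lastEnd = m.end();
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("b(\\d+):", "a$1-", "EXAMPLE b123:", INTERPOLATE_ALL)); // EXAMPLE a123-
        System.out.println(substitute("b(\\d+):", "a$1-", "b1: b2: b3:", INTERPOLATE_ALL));   // a1- a2- a3-
        System.out.println(substitute("b(\\d+):", "a$1-", "b1: b2: b3:", 2));                 // a1- a2- a2-
    }
}
```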
An example makes this easier to understand. Using the regular expression from the example above, let the target string be "Tank b123: 85 Tank b256: 32 Tank b78: 22". With numInterpolations set to INTERPOLATE_ALL and the numSubs parameter of Util.substitute() set to SUBSTITUTE_ALL (see the description of Util.substitute() above), the result is:

    Tank a123- 85 Tank a256- 32 Tank a78- 22

If numInterpolations is set to 2 and numSubs is still SUBSTITUTE_ALL, the result is:

    Tank a123- 85 Tank a256- 32 Tank a256- 22

Note that in the last replacement the variable $1 stands for "256", the same as in the second replacement, and not the expected "78". This is because the variable is updated from the match only for the first two matches and afterwards keeps the value from the second match. It follows that with numInterpolations set to 1 the result would be:

    Tank a123- 85 Tank a123- 32 Tank a123- 22

3. Application example:

Some time ago my company was producing an interactive English-learning title called "Aesop's Fables", and part of the work was tidying up its electronic source material. Taking that as an example, let us see how Jakarta-ORO combined with the JDBC 2.0 API can do simple extraction and tidying of data in a database.
Assume the electronic material is stored in an MS SQL Server 7 database with the following table structure (note: some DBMSs may offer their own regular expression facilities, but that is outside the scope of this article).

Table name: aesop. Each record in the table has three columns:

    id (int): index number of the word
    word (varchar): the word
    content (varchar): explanations and example sentences related to the word

The content column has the following format:

    [phonetic][part of speech](explanation){(example sentence 1/translation of sentence 1/part of speech of the word in the sentence:meaning of the word in the sentence)(example sentence 2/translation of sentence 2/part of speech of the word in the sentence:meaning of the word in the sentence)}

For example, for the word "kevin" the content is as follows (in the real data the explanations and translations are Chinese text):

    ['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

Our examples mainly deal with the string in the content column.
★ Finding a single match:

First, let us list the [phonetic] field of the content column. Every record has this item and it always sits at the start of the string, so the lookup is relatively simple.

Determine the regular expression: \[[^]]+\]

This is a very simple regular expression. It says that the matching string must be enclosed in a pair of square brackets, such as ['kevin] or [noun], and must not itself contain a "]" character, so something like "[][]" is never taken as a single match (for the basics of regular expressions please consult other material; they are not covered in detail here).

Note that in a Java string literal every backslash ("\") must be escaped. So we put another "\" in front of each "\" to avoid a compile error:

    String restring = "\\[[^]]+\\]";

Also, do not insert spaces between the symbols of the expression, otherwise the pattern will not behave as intended.

Instantiate a PatternCompiler and create the Pattern object:

    PatternCompiler compiler = new Perl5Compiler();
    Pattern pattern = compiler.compile(restring);

Create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:

    PatternMatcher matcher = new Perl5Matcher();
    if (matcher.contains(content, pattern)) {
        // handle the match
    }

Here content is a String variable read from the database. This method finds only the first match, but because the phonetic item is at the start of the content string, it is enough to find the phonetic item in every record. A more direct and reasonable approach, though, is to use boolean matchesPrefix(String input, Pattern pattern), which checks whether the target string starts with a match of the regular expression.

The complete program is as follows:

    package regularexpressions;

    // import ...
    import java.sql.*;
    import org.apache.oro.text.regex.*;
    // Before using the Jakarta-ORO regular expression library, add it to the
    // CLASSPATH. If you use an IDE such as JBuilder, you can define it as a
    // new library there instead.

    public class yisuo {
        public static void main(String[] args) {
            try {
                // Connect to the DBMS through a JDBC driver. I use a third-party
                // driver here; Microsoft offers a free JDBC driver for
                // SQL Server 7/2000, but I found its performance disappointing.
                Class.forName("com.jnetdirect.jsql.JSQLDriver");
                Connection con = DriverManager.getConnection(
                        "jdbc:JSQLConnect://kevin:1433", "kevin chen", "re");
                Statement stmt = con.createStatement(
                        ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);

                // Create the Jakarta-ORO objects.
                String rsstring = "\\[[^]]+\\]";
                PatternCompiler orocom = new Perl5Compiler();
                Pattern pattern = orocom.compile(rsstring);
                PatternMatcher matcher = new Perl5Matcher();

                ResultSet uprs = stmt.executeQuery("SELECT * FROM aesop");
                while (uprs.next()) {
                    String word = uprs.getString("word");
                    String content = uprs.getString("content");
                    if (matcher.contains(content, pattern)) {
                        // getMatch() returns the result of the most recent match.
                        MatchResult result = matcher.getMatch();
                        String pure = result.toString();
                        System.out.println(word + "'s phonetic is " + pure);
                    }
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

The output is:

    kevin's phonetic is ['kevin]

Here I used the toString() method to obtain the result. If the regular expression uses grouping parentheses, you can also call group(int gid) to get the text matched by a given group. For example, if the regular expression is changed to "(\\[[^]]+\\])", the result can be obtained with:

    pure = result.group(0);

Verified by running the program, the output is again:

    kevin's phonetic is ['kevin]

If the regular expression is "(\\[[^]]+\\])(\\[[^]]+\\])", it finds two consecutive bracketed items, that is, [phonetic][part of speech], and the two are available separately as groups:

    result.group(0) -> returns [phonetic][part of speech], the string that matches the whole regular expression, here ['kevin][noun]
    result.group(1) -> returns the [phonetic] item, here ['kevin]
    result.group(2) -> returns the [part of speech] item, here [noun]

Further testing showed that the output was not always correct, mainly when the content contained Chinese text. Suspecting that my version of the Jakarta-ORO library did not support Chinese, I checked and found that I had been using the old 2.0.1 release all along. I downloaded the latest release, 2.0.4, from jakarta.org, and the results were then correct as expected.

★ Finding multiple matches:

After this first attempt at using Jakarta-ORO, we know how to use the API to find one matching substring in the target string. Next, let us see how to handle a target string that contains more than one matching substring.
We start with a simple case. Suppose we want to find every string enclosed in square brackets in the content field. Clearly there are only two such items: [phonetic] and [part of speech]. We have in fact already found both, but we did so with groups: "[phonetic][part of speech]" was matched as a whole by one regular expression, and the two items were then picked out by group. Now we want [phonetic] and [part of speech] to be treated as separate matches of the same regular expression, finding one and then the next. In other words, the expression used to be (\[[^]]+\])(\[[^]]+\]) and should now be \[[^]]+\].

As noted earlier, matching can continue from the position of the last match only if a PatternMatcherInput object is passed instead of a String to the three matching methods. The program fragment is:

    PatternMatcherInput input = new PatternMatcherInput(content);
    while (matcher.contains(input, pattern)) {
        MatchResult result = matcher.getMatch();
        System.out.println(result.group(0));
    }

The output is:

    ['kevin]
    [noun]

Next comes something more complicated. From the content

    ['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

we first pick out the whole example-sentence part (the part inside the braces), then find each example sentence in turn, and finally list the items of each sentence (English sentence, Chinese translation, part of speech, meaning) separately.

The first step is of course to define the regular expressions. Two are needed to locate the text: one matches the whole example-sentence part (the part enclosed in braces), "\{.+\}", and the other matches each individual example sentence (the part enclosed in parentheses), "\([^)]+\)". In addition, because the items within each sentence have to be separated, a further expression uses groups to match them: "([^(]+)/(.+)/(.+):([^)]+)".
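The idea of resuming the search after the previous match, which this section relies on, can be sketched with java.util.regex, where calling find() repeatedly on one Matcher plays the role that PatternMatcherInput plays in Jakarta-ORO. This is an analogy rather than the ORO API; the findAll() helper and the class name are made up for this illustration, and the closing bracket inside the character class is escaped to keep the expression unambiguous in java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllSketch {
    // Returns every substring of input that matches regex, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each find() call continues from the end of the previous match.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String content = "['kevin][noun](the name Kevin)"
                + "{(Kevin loves comic./Kevin loves comics/noun:Kevin)"
                + "(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}";

        // Every bracketed item, one match at a time.
        for (String item : findAll("\\[[^\\]]+\\]", content)) {
            System.out.println(item);
        }

        // The part in braces first, then each parenthesized sentence inside it.
        for (String braces : findAll("\\{.+\\}", content)) {
            for (String sentence : findAll("\\([^)]+\\)", braces)) {
                System.out.println(sentence);
            }
        }
    }
}
```

Returning a list keeps the sketch short; for very large inputs one would normally process each match inside the loop instead of collecting them.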
For simplicity we no longer read from the database but build a String variable with the same content. The program fragment is:

    try {
        String content = "['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}";
        String ps1 = "\\{.+\\}";
        String ps2 = "\\([^)]+\\)";
        String ps3 = "([^(]+)/(.+)/(.+):([^)]+)";
        String sentence;
        PatternCompiler orocom = new Perl5Compiler();
        Pattern pattern1 = orocom.compile(ps1);
        Pattern pattern2 = orocom.compile(ps2);
        Pattern pattern3 = orocom.compile(ps3);
        PatternMatcher matcher = new Perl5Matcher();
        // First pick out the whole example-sentence part.
        if (matcher.contains(content, pattern1)) {
            MatchResult result = matcher.getMatch();
            String example = result.toString();
            PatternMatcherInput input = new PatternMatcherInput(example);
            // Then find each example sentence in turn.
            while (matcher.contains(input, pattern2)) {
                result = matcher.getMatch();
                sentence = result.toString();
                // Finally separate the items of the sentence using groups.
                if (matcher.contains(sentence, pattern3)) {
                    result = matcher.getMatch();
                    System.out.println("English sentence: " + result.group(1));
                    System.out.println("Chinese translation: " + result.group(2));
                    System.out.println("Part of speech: " + result.group(3));
                    System.out.println("Meaning: " + result.group(4));
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }

The output is:

    English sentence: Kevin loves comic.
    Chinese translation: Kevin loves comics
    Part of speech: noun
    Meaning: Kevin
    English sentence: Kevin is living in ZhuHai now.
    Chinese translation: Kevin now lives in Zhuhai
    Part of speech: noun
    Meaning: Kevin

★ Find and replace:

The two applications above only look up matching strings. Now let us see how to replace part of the target string after finding it. Suppose we want to change the second example sentence to:

    Kevin has seen 'LEON' several times, because it is a good film./Kevin has seen 'This Killer Is Not Too Cold' several times, because it is a good film./noun:Kevin.

That is, the content

    ['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

is changed to:

    ['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin has seen 'LEON' several times, because it is a good film./Kevin has seen 'This Killer Is Not Too Cold' several times, because it is a good film./noun:Kevin.)}
We have already covered the Util.substitute() method, the Substitution interface, and its two implementations StringSubstitution and Perl5Substitution. Let us now see how Util.substitute() together with Perl5Substitution meets the requirement above.

Determine the regular expression: we need to match the whole example-sentence part and put each of the two sentences in its own group, so the regular expression is "\{(\([^)]+\))(\([^)]+\))\}". Written with substitution variables, the matched text can be seen as "\{$1$2\}", which makes the relationship between variables and groups easier to see.

Based on this regular expression, the Perl5Substitution is constructed as:

    Perl5Substitution("{$1(Kevin has seen 'LEON' several times, because it is a good film./Kevin has seen 'This Killer Is Not Too Cold' several times, because it is a good film./noun:Kevin.)}")

Util.substitute() can then perform the replacement with this Perl5Substitution. The code fragment is:

    try {
        String content = "['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}";
        String ps1 = "\\{(\\([^)]+\\))(\\([^)]+\\))\\}";
        PatternCompiler orocom = new Perl5Compiler();
        Pattern pattern1 = orocom.compile(ps1);
        PatternMatcher matcher = new Perl5Matcher();
        // $1 keeps the first example sentence; the second one is replaced.
        String result = Util.substitute(matcher, pattern1,
                new Perl5Substitution("{$1(Kevin has seen 'LEON' several times, because it is a good film./Kevin has seen 'This Killer Is Not Too Cold' several times, because it is a good film./noun:Kevin.)}", 1),
                content, Util.SUBSTITUTE_ALL);
        System.out.println(result);
    } catch (Exception e) {
        System.out.println(e);
    }

The output is correct:

    ['kevin][noun](the name Kevin){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin has seen 'LEON' several times, because it is a good film./Kevin has seen 'This Killer Is Not Too Cold' several times, because it is a good film./noun:Kevin.)}

As for the constructor that takes the numInterpolations parameter, readers can experiment with it along the lines of the description above; it is not covered further here.