Applying Regular Expressions in Java (1): Jakarta-ORO

xiaoxiao2021-03-06  98

Because of my work I often have to deal with large amounts of electronic text, so I have spent some time on the use of regular expressions in Java and have gained some understanding of it. I hope this article is useful to readers working on similar problems.

Regular expressions: A regular expression is a powerful tool for pattern matching and replacement. It is a text pattern composed of ordinary characters (such as the letters a to z) and special characters (called metacharacters) that describes one or more strings to be matched when searching a body of text. The regular expression serves as a template against which the searched string is matched.

Regular expressions play a very important role in processing character data. We can use them for most data analysis tasks, such as determining whether a string is a number or a valid email address, or extracting valuable data from a large volume of text. Without regular expressions, a program that does the same work is likely to be long and error-prone. I learned this first-hand: when tidying up the electronic files of a large number of reference books, not knowing how to apply regular expressions makes the work very painful.

Since this article is about how to use regular expressions in Java, readers who are new to regular expressions should consult other material; space here is limited.

Regular expression support in Java: JDK 1.3 and earlier versions do not include a regular expression library for Java programmers, so we generally used regular expression libraries provided by third parties. Some of these are open source and some must be purchased. The JDK 1.4 beta also introduced a regular expression library of its own, java.util.regex. As a result, there are now many regular expression libraries available to Java programmers.
Below I introduce two of the more representative ones, Jakarta-ORO and java.util.regex, starting with the one I have been using, Jakarta-ORO.

The Jakarta-ORO regular expression library

1. Summary:

Jakarta-ORO is one of the most comprehensive and best optimized regular expression APIs. The library was formerly called OROMatcher and was written by Daniel F. Savarese, who later donated it to the Jakarta Project. Readers can download the API package from the apache.org website.

Many open-source regular expression libraries support Perl 5 compatible regular expressions, and Jakarta-ORO is no exception: it is fully compatible with Perl 5 regular expressions.

2. Objects and their methods:

★ PatternCompiler object:

When using Jakarta-ORO, the first thing to do is create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface variable. Perl5Compiler is an implementation of the PatternCompiler interface that compiles a regular expression into the Pattern object used for matching.

    PatternCompiler compiler = new Perl5Compiler();

★ Pattern object:

To compile a regular expression into a Pattern object, call the compiler's compile() method and pass the regular expression as the argument.

For example, you can compile the regular expression "s[ahkl]y" in this way:

    Pattern pattern = null;
    try {
        pattern = compiler.compile("s[ahkl]y");
    } catch (MalformedPatternException e) {
        e.printStackTrace();
    }

By default the compiler creates a case-sensitive pattern. The pattern compiled above therefore matches only "say", "shy", "sky" and "sly", and does not match "Say" or "skY". To create a case-insensitive pattern, pass an extra argument when calling the compiler:

    pattern = compiler.compile("s[ahkl]y", Perl5Compiler.CASE_INSENSITIVE_MASK);

Once the Pattern object has been created, it can be used for matching through the PatternMatcher class.

★ PatternMatcher object:

A PatternMatcher object performs the matching of a Pattern object against a string. Instantiate the Perl5Matcher class and assign the result to a PatternMatcher interface variable. Perl5Matcher is the implementation of the PatternMatcher interface that matches according to Perl 5 regular expression syntax:

    PatternMatcher matcher = new Perl5Matcher();

The PatternMatcher object provides several methods for matching. In each of them the first parameter is the string to be matched against the regular expression:

boolean matches(String input, Pattern pattern): Use this method when the input string and the regular expression must match exactly. It returns true only when the regular expression describes the whole input string.

boolean matchesPrefix(String input, Pattern pattern): Use this method when the regular expression must match the beginning of the input string. It returns true when the start of the input string matches the regular expression.

boolean contains(String input, Pattern pattern): Use this method when the regular expression only needs to match some part of the input string. It returns true when a substring of the input string matches the regular expression.
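The difference between the three matching modes is easy to check in isolation. The sketch below is a stand-in written with the JDK's own java.util.regex rather than Jakarta-ORO, so that it runs without any extra jar; it assumes that matches(), lookingAt() and find() play the roles of ORO's matches(), matchesPrefix() and contains() respectively. The class and method names are made up for this illustration.

```java
import java.util.regex.Pattern;

// Stand-in using java.util.regex (not Jakarta-ORO) to show the three match modes.
final class MatchModes {
    // Case-sensitive by default, like the compiled pattern in the text.
    static final Pattern SKY = Pattern.compile("s[ahkl]y");
    // Counterpart of compiling with a case-insensitive flag.
    static final Pattern SKY_ANY_CASE =
            Pattern.compile("s[ahkl]y", Pattern.CASE_INSENSITIVE);

    // The whole input must match (the role of matches()).
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    // The input must start with a match (the role of matchesPrefix()).
    static boolean matchesPrefix(Pattern p, String input) {
        return p.matcher(input).lookingAt();
    }

    // A match may occur anywhere in the input (the role of contains()).
    static boolean containsMatch(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(SKY, "sky"));          // true
        System.out.println(matchesWhole(SKY, "skyline"));      // false
        System.out.println(matchesPrefix(SKY, "skyline"));     // true
        System.out.println(matchesPrefix(SKY, "blue sky"));    // false
        System.out.println(containsMatch(SKY, "blue sky"));    // true
        System.out.println(matchesWhole(SKY, "SKY"));          // false
        System.out.println(matchesWhole(SKY_ANY_CASE, "SKY")); // true
    }
}
```

The same input can therefore pass one test and fail another: "skyline" is not an exact match but does have a matching prefix, and "blue sky" only passes the substring test.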
However, these three methods only find the first occurrence that matches the regular expression in the input string. If the string may contain several substrings that match the given regular expression, you can call the three methods with a PatternMatcherInput object instead of a String object as the argument. Matching then continues from the position where the last match ended, which is much more convenient.
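The idea of resuming after the previous match can be sketched without Jakarta-ORO as well. The example below again uses java.util.regex as a stand-in: its Matcher object remembers the current position much as a PatternMatcherInput does, so calling find() in a loop walks through every match. The class name and the sample sentence are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in using java.util.regex (not Jakarta-ORO): collect every match in turn.
final class FindAllMatches {
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each find() resumes where the previous match ended.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("s[ahkl]y", "say it, shy sky")); // [say, shy, sky]
    }
}
```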

With a PatternMatcherInput object in place of the String, the three methods have the following signatures:

    boolean matches(PatternMatcherInput input, Pattern pattern)
    boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
    boolean contains(PatternMatcherInput input, Pattern pattern)

★ Util.substitute() method:

To replace text after finding it, use the Util.substitute() method. Its signature is:

    public static String substitute(PatternMatcher matcher, Pattern pattern, Substitution sub, String input, int numSubs)

The first two parameters are the PatternMatcher and Pattern objects. The third is a Substitution object, which determines how the replacement is performed. The fourth is the target string in which replacements are made, and the last specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.

The third parameter, the Substitution object, deserves an explanation because it decides how the replacement is done.

Substitution: Substitution is an interface that lets you control how replacement is performed when using Util.substitute(). It has two standard implementations, StringSubstitution and Perl5Substitution. You can of course also write your own implementation to customise the replacement you need.

StringSubstitution: StringSubstitution implements simple plain-text replacement. It has two constructors:

    StringSubstitution() -> default constructor; initialises a substitution object containing a zero-length string.
    StringSubstitution(java.lang.String substitution) -> initialises a substitution object with the given string.
Perl5Substitution: Perl5Substitution is a subclass of StringSubstitution. In addition to plain-text replacement, it allows Perl 5 variables that refer to the groups saved in a match to be used in the replacement, so it offers more varied replacement than its direct parent class StringSubstitution. It has three constructors:

    Perl5Substitution()
    Perl5Substitution(java.lang.String substitution)
    Perl5Substitution(java.lang.String substitution, int numInterpolations)

The first two are used in the same way as the StringSubstitution constructors; the third is described below.

In the replacement string of a Perl5Substitution you can include variables that stand for the groups enclosed in parentheses in the regular expression. These variables are written $1, $2, $3 and so on.

An example shows how replacement variables work. Suppose the regular expression is b\d+: (that is, b[0-9]+:) and we want to change the "b" in every matching string to "a" and the ":" to "-", leaving the rest unchanged. If the input string is "Example b123:", it should become "Example a123-" after replacement.

To do this, first enclose the part to be kept in grouping parentheses, so that the regular expression becomes "b(\d+):". Then construct the Perl5Substitution object with the replacement string "a$1-", that is, Perl5Substitution("a$1-"). When Util.substitute() is called, every substring of the target string that matches "b(\d+):" is replaced with the replacement string, and the variable $1 is replaced with the text matched by the first group of the regular expression. In "Example b123:" the part that matches the regular expression is "b123:" and the part matched by the first group "(\d+)" is "123", so the final result is "Example a123-".

One more point needs to be clear. If the numInterpolations parameter of the constructor Perl5Substitution(java.lang.String substitution, int numInterpolations) is set to INTERPOLATE_ALL, then each time a match is found the variables ($1, $2 and so on) refer to the content of the current match. If numInterpolations is set to a positive integer N, the variables follow the current match only for the first N matches; from then on, every replacement uses the content the variables had at the N-th match.
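The "b(\d+):" to "a$1-" replacement worked through above can be reproduced with java.util.regex, used here only as a stand-in that runs without the Jakarta-ORO jar. As far as I know java.util.regex has no counterpart of numInterpolations: its replaceAll() always fills $1 from the current match, which corresponds to the INTERPOLATE_ALL behaviour. The class and method names are invented for the illustration.

```java
import java.util.regex.Pattern;

// Stand-in using java.util.regex (not Jakarta-ORO): replacement with a group variable.
final class GroupReplace {
    // In Java source the regex b(\d+): is written with a doubled backslash.
    static final Pattern TAG = Pattern.compile("b(\\d+):");

    // Replace every match with "a", the digits captured by group 1, and "-".
    static String rewrite(String input) {
        return TAG.matcher(input).replaceAll("a$1-");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Example b123:")); // Example a123-
    }
}
```

Text that does not match the pattern, such as "Example", passes through unchanged.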

An example makes this easier to understand. Using the regular expression from the example above, let the target string be "Tank b123: 85 Tank b256: 32 Tank b78: 22". If numInterpolations is set to INTERPOLATE_ALL and the numSubs parameter of Util.substitute() is set to SUBSTITUTE_ALL (see the description of the Util.substitute() method above), the result is:

    Tank a123- 85 Tank a256- 32 Tank a78- 22

If numInterpolations is set to 2 and numSubs is still SUBSTITUTE_ALL, the result is:

    Tank a123- 85 Tank a256- 32 Tank a256- 22

Note that in the last replacement the variable $1 stands for "256", not the expected "78". The replacement variable is updated from the matched content only for the first two matches and afterwards keeps the value from the second match. It follows that if numInterpolations is set to 1, the result is:

    Tank a123- 85 Tank a123- 32 Tank a123- 22

3. Application example:

Some time ago my company was producing an interactive English-learning textbook, "Aesop's Fables", and the work included tidying up its electronic source material. Using this as an example, let us see how Jakarta-ORO and the JDBC 2.0 API can be combined to do simple extraction and tidying of data in a database.
Assume the electronic text is stored in an MS SQL Server 7 database with the following table structure. (Note: different DBMSs may offer their own regular expression facilities, but that is outside the scope of this article.)

Table name: aesop. Each record in the table has three columns:

    id (int): index number of the word
    word (varchar): the word
    content (varchar): explanations and example sentences for the word

The content column has the following format:

    [phonetic symbol][part of speech](explanation){(example sentence 1/translation of the sentence/part of speech:meaning of the word in the sentence)(example sentence 2/translation of the sentence/part of speech:meaning of the word in the sentence)}

For the word kevin, for example, the content is:

    ['kevin][noun](Kevin, a man's name){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

Our examples mainly deal with the string in the content column.

★ Finding a single match:

First, let us list the [phonetic symbol] item in the content column. Every record has this item and it is always at the start of the string, so this search is relatively simple.

Determine the regular expression: \[[^]]+\]

This is a very simple regular expression. It means that the matching string must be enclosed in a pair of square brackets, such as ['kevin] or [noun], and that the text between the brackets must not contain a "]" character, so a string such as "[[]]" is not matched as a whole. (For the basics of regular expressions, please consult other material; they are not covered in detail here.)

Note that in Java source code every backslash ("\") must be escaped. We therefore put another "\" in front of each "\" in the regular expression above to avoid a compilation error, so the Java declaration of the regular expression string is:

    String restring = "\\[[^]]+\\]";

Do not put spaces between the symbols in the expression either: a space is a literal character in the pattern and changes what it matches.

Instantiate a PatternCompiler and create the Pattern object:

    PatternCompiler compiler = new Perl5Compiler();
    Pattern pattern = compiler.compile(restring);

Create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:

    PatternMatcher matcher = new Perl5Matcher();
    if (matcher.contains(content, pattern)) {
        // processing code
    }

The content argument of matcher.contains(content, pattern) is a String variable read from the database. This method finds only the first matching substring, but since the phonetic item is at the start of the content string, it is guaranteed to find the phonetic item in each record. A more direct and appropriate choice, however, is boolean matchesPrefix(String input, Pattern pattern), which checks whether the target string starts with text that matches the regular expression.
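The backslash-doubling rule and the prefix check can be tried on their own with java.util.regex, again only as a stand-in for Jakarta-ORO. One deliberate difference: the character class is written [^\]] with the "]" escaped, which is the unambiguous spelling for java.util.regex, whereas the Perl 5 style pattern in the text writes it as [^]]. The class name and the shortened sample string are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in using java.util.regex (not Jakarta-ORO): match a bracketed item at the start.
final class PhoneticPrefix {
    // The Java literal "\\[" is the regex \[ ; [^\\]]+ means one or more characters other than "]".
    static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]+\\]");

    // Returns the bracketed item at the very start of content, or null if there is none.
    static String leadingBracket(String content) {
        Matcher m = BRACKETED.matcher(content);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingBracket("['kevin][noun](Kevin, a man's name)")); // ['kevin]
        System.out.println(leadingBracket("no brackets first [noun]"));            // null
    }
}
```

Because lookingAt() is anchored at the start, the second call returns null even though a bracketed item appears later in the string.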
The complete program is as follows:

    package regularexpressions;

    import java.sql.*;
    // The Jakarta-ORO library must be added to the classpath before it can be used.
    // If your IDE is JBuilder, you can also create a new library for it directly in JBuilder.
    import org.apache.oro.text.regex.*;

    public class yisuo {
        public static void main(String[] args) {
            try {
                // Connect to the DBMS through a JDBC driver. A third-party driver is used here.
                // Microsoft has a free JDBC driver for SQL Server 7/2000, but its
                // performance is too poor to be worth using.
                Class.forName("com.jnetdirect.jsql.JSQLDriver");
                Connection con = DriverManager.getConnection(
                        "jdbc:JSQLConnect://kevin:1433", "kevin chen", "re");
                Statement stmt = con.createStatement(
                        ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);

                // Create the regular expression objects.
                String rsstring = "\\[[^]]+\\]";
                PatternCompiler orocom = new Perl5Compiler();
                Pattern pattern = orocom.compile(rsstring);
                PatternMatcher matcher = new Perl5Matcher();

                ResultSet uprs = stmt.executeQuery("SELECT * FROM aesop");
                while (uprs.next()) {
                    String word = uprs.getString("word");
                    String content = uprs.getString("content");
                    // or: if (matcher.matchesPrefix(content, pattern)) {
                    if (matcher.contains(content, pattern)) {
                        MatchResult result = matcher.getMatch();
                        String pure = result.toString();
                        System.out.println(word + "'s phonetic symbol is: " + pure);
                    }
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

The output is:

    kevin's phonetic symbol is: ['kevin]

This program uses the toString() method to get the result. If the regular expression contains grouping parentheses, the result of each group can be obtained with the group(int gid) method. For example, if the regular expression is changed to "(\[[^]]+\])", the result can be obtained as follows:

    pure = result.group(0);

Running the program confirms that the output is again:

    kevin's phonetic symbol is: ['kevin]

If the regular expression is (\[[^]]+\])(\[[^]]+\]), it finds the content of two consecutive pairs of square brackets, that is, [phonetic symbol][part of speech], with the two results in separate groups. They are obtained with the following calls:

    result.group(0) -> returns [phonetic symbol][part of speech], the string matched by the whole regular expression; here it is ['kevin][noun]
    result.group(1) -> returns the [phonetic symbol] item; here it is ['kevin]
    result.group(2) -> returns the [part of speech] item; here it is [noun]

When I went on to verify this with the program, the output was wrong, mainly when the content was in Chinese. Suspecting that my version of the Jakarta-ORO library did not support Chinese, I checked and found that I had been using the old version 2.0.1. I downloaded the latest version, 2.0.4, and with it the results were correct as expected.

★ Finding multiple matches:

After this first attempt with Jakarta-ORO, we know how to use the API to find one matching substring in the target string. Now let us see how to handle a target string that contains more than one matching substring.

We start with a simple case: finding every string enclosed in square brackets in the content field. Clearly there are only two such matches, [phonetic symbol] and [part of speech]. We have in fact already found both of them, but by grouping: "[phonetic symbol][part of speech]" was matched by one regular expression as a whole and the two items were then picked out by group. What we want now is to treat [phonetic symbol] and [part of speech] as separate matches of the same regular expression, finding one first and then the next. In other words, the expression was (\[[^]]+\])(\[[^]]+\]) and should now be \[[^]]+\].
We already know that matching can continue from the position of the last match only if a PatternMatcherInput object is passed instead of a String to the three matching methods. The program fragment is as follows:

    PatternMatcherInput input = new PatternMatcherInput(content);
    while (matcher.contains(input, pattern)) {
        result = matcher.getMatch();
        System.out.println(result.group(0));
    }

The output is:

    ['kevin]
    [noun]

Now for something more complex. From the string

    ['kevin][noun](Kevin, a man's name){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

we first find the part enclosed in braces, then pick out example sentence 1 and example sentence 2 separately, and finally list the items of each example sentence (English sentence, Chinese translation, part of speech, meaning).

The first step is of course to determine the regular expressions. There are two. One matches the whole example-sentence part (the part enclosed in braces): "\{.+\}". The other matches each individual example sentence (the content of a pair of parentheses): "\([^)]+\)". Because the items of each example sentence also have to be separated, a further expression matches the parts inside a sentence with groups: "([^(]+)/(.+)/(.+):([^)]+)".
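Before wiring these expressions into a program, it can help to see what each level extracts. The sketch below applies the three patterns with java.util.regex, used only as a stand-in that runs without the Jakarta-ORO jar; the class name, the method name and the sample string are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in using java.util.regex (not Jakarta-ORO): braces, then sentences, then fields.
final class ExampleSentenceParser {
    static final Pattern BRACES = Pattern.compile("\\{.+\\}");
    static final Pattern PARENS = Pattern.compile("\\([^)]+\\)");
    static final Pattern FIELDS = Pattern.compile("([^(]+)/(.+)/(.+):([^)]+)");

    // Returns one row per example sentence: sentence, translation, part of speech, meaning.
    static List<String[]> parse(String content) {
        List<String[]> rows = new ArrayList<>();
        Matcher braces = BRACES.matcher(content);
        if (!braces.find()) {
            return rows; // no example-sentence part
        }
        // Search only inside the braces, so the explanation in parentheses is skipped.
        Matcher parens = PARENS.matcher(braces.group());
        while (parens.find()) {
            Matcher f = FIELDS.matcher(parens.group());
            if (f.find()) {
                rows.add(new String[] {f.group(1), f.group(2), f.group(3), f.group(4)});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String content = "['kevin][noun](Kevin, a man's name)"
                + "{(Kevin loves comic./Kevin loves comics/noun:Kevin)"
                + "(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}";
        for (String[] row : parse(content)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The first group uses [^(]+ so that the match starts after the opening parenthesis, and the last group uses [^)]+ so that it stops before the closing one.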

To keep things simple we no longer read from the database but construct a String variable with the same content. The program fragment is as follows:

    try {
        String content = "['kevin][noun](Kevin, a man's name){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}";
        String ps1 = "\\{.+\\}";
        String ps2 = "\\([^)]+\\)";
        String ps3 = "([^(]+)/(.+)/(.+):([^)]+)";
        String sentence;
        PatternCompiler orocom = new Perl5Compiler();
        Pattern pattern1 = orocom.compile(ps1);
        Pattern pattern2 = orocom.compile(ps2);
        Pattern pattern3 = orocom.compile(ps3);
        PatternMatcher matcher = new Perl5Matcher();
        // First find the whole example-sentence part.
        if (matcher.contains(content, pattern1)) {
            MatchResult result = matcher.getMatch();
            String example = result.toString();
            PatternMatcherInput input = new PatternMatcherInput(example);
            // Find each example sentence in turn.
            while (matcher.contains(input, pattern2)) {
                result = matcher.getMatch();
                sentence = result.toString();
                // Separate the items of each example sentence.
                if (matcher.contains(sentence, pattern3)) {
                    result = matcher.getMatch();
                    System.out.println("English sentence: " + result.group(1));
                    System.out.println("Chinese translation of the sentence: " + result.group(2));
                    System.out.println("Part of speech: " + result.group(3));
                    System.out.println("Meaning: " + result.group(4));
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }

The output is:

    English sentence: Kevin loves comic.
    Chinese translation of the sentence: Kevin loves comics
    Part of speech: noun
    Meaning: Kevin
    English sentence: Kevin is living in ZhuHai now.
    Chinese translation of the sentence: Kevin now lives in Zhuhai
    Part of speech: noun
    Meaning: Kevin

★ Find and replace:

The two applications above only search for matches. Now let us see how to replace text in the target string after finding it. Suppose I want to change the second example sentence to:

    Kevin has seen "Leon" several times, because it is a good film./Kevin has seen "This Killer Is Not Too Cold" several times because it is a good film./noun:Kevin.

That is, change

    ['kevin][noun](Kevin, a man's name){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin is living in ZhuHai now./Kevin now lives in Zhuhai/noun:Kevin)}

to

    ['kevin][noun](Kevin, a man's name){(Kevin loves comic./Kevin loves comics/noun:Kevin)(Kevin has seen "Leon" several times, because it is a good film./Kevin has seen "This Killer Is Not Too Cold" several times because it is a good film./noun:Kevin.)}

We have already met the Util.substitute() method, the Substitution interface and its two implementations, StringSubstitution and Perl5Substitution. We will now use Util.substitute() together with Perl5Substitution to carry out the replacement described above.

Determine the regular expression: we must first find the whole example-sentence part, that is, the string enclosed in braces, and put the two example sentences into groups, so the regular expression is:

    "\{(\([^)]+\))(\([^)]+\))\}"

If the groups are written as replacement variables, the expression above can be read as "\{$1$2\}", which makes the relationship between the replacement variables and the groups easier to see.

From this regular expression we can construct the Perl5Substitution object:

    Perl5Substitution("{$1(Kevin has seen "Leon" several times, because it is a good film./Kevin has seen "This Killer Is Not Too Cold" several times because it is a good film./noun:Kevin.

When reposting, please cite the original address: https://www.9cbs.com/read-96485.html
