I. Overview
Most Java programmers have used the java.util.StringTokenizer class. It is a convenient string tokenizer, mainly used to split a string into tokens according to a delimiter and then return each token on request. This process is called tokenization: a sequence of characters is converted into tokens that the application can understand.
Although StringTokenizer is convenient, its functionality is limited. The class simply looks for the delimiter in the input string and splits the string wherever the delimiter is found. It does not check whether the delimiter lies inside a quoted substring, and when two consecutive delimiters appear in the input string, it does not return a "" (zero-length string) token.
To overcome these limitations, the Java 2 platform provides the BreakIterator class, a tokenizer that improves on StringTokenizer. Since JDK 1.1.x does not provide this class, developers often spend a lot of time writing a tokenizer from scratch. Such custom tokenizers are common in large projects that involve data formatting.
The goal of this article is to help you write an advanced string tokenizer on top of the existing StringTokenizer class.
II. Limitations of StringTokenizer
You can create a StringTokenizer with any of the following three constructors:
StringTokenizer(String sInput): splits the string on whitespace characters (" ", "\t", "\n").
StringTokenizer(String sInput, String sDelimiter): splits the string using sDelimiter as the delimiter.
StringTokenizer(String sInput, String sDelimiter, boolean bReturnTokens): splits the string using sDelimiter as the delimiter, but if bReturnTokens is true, the delimiters are also returned as tokens.
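The three constructors above can be tried out with a short sketch. The class name ConstructorDemo and the helper drain are hypothetical names chosen for illustration; only the StringTokenizer calls come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ConstructorDemo {
    // Drains a tokenizer into a list so the tokens are easy to inspect.
    static List<String> drain(StringTokenizer st) {
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // 1. Default delimiters: whitespace.
        System.out.println(drain(new StringTokenizer("a b\tc")));         // [a, b, c]
        // 2. Explicit delimiters: every character of ",;" is a delimiter.
        System.out.println(drain(new StringTokenizer("a,b;c", ",;")));    // [a, b, c]
        // 3. Delimiters are returned as tokens too.
        System.out.println(drain(new StringTokenizer("a,b", ",", true))); // [a, ,, b]
    }
}
```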
The first constructor does not check whether the input string contains a quoted substring. For example, if the string "Hello. Today \"I am\" going to my home town" is split on whitespace, the result is Hello., Today, "I, am", going, and so on, rather than Hello., Today, "I am", going, and so on.
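A minimal sketch of this behavior follows. The class name QuoteDemo and the method split are hypothetical; the input is the example sentence with the quoted phrase "I am".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class QuoteDemo {
    // Splits on whitespace with the one-argument constructor.
    static List<String> split(String input) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The quoted phrase is broken into two tokens: "I and am"
        System.out.println(split("Hello. Today \"I am\" going to my home town"));
        // [Hello., Today, "I, am", going, to, my, home, town]
    }
}
```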
The second constructor does not handle delimiters that appear consecutively. For example, if you split the string "book, author, publication,,,date published" with "," as the delimiter, StringTokenizer returns the four tokens book, author, publication, and date published, rather than the six tokens book, author, publication, "", "", and date published (where "" denotes a zero-length string). To obtain all six values, you must set the StringTokenizer's bReturnTokens parameter to true.
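The difference can be checked with a small sketch. The class name ConsecutiveDelimDemo and the method tokens are hypothetical, and the input is written without spaces after the commas to keep the output easy to read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ConsecutiveDelimDemo {
    // Collects all tokens, optionally including the delimiters themselves.
    static List<String> tokens(String input, String delim, boolean returnDelims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delim, returnDelims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "book,author,publication,,,date published";
        // Empty fields vanish: only the four non-empty values come back.
        System.out.println(tokens(input, ",", false).size()); // 4
        // With delimiters returned: 4 values + 5 commas.
        System.out.println(tokens(input, ",", true).size());  // 9
    }
}
```

With the delimiters included, the caller can see that three commas follow publication and infer the two empty fields.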
Being able to set bReturnTokens to true is an important feature because it makes it possible to account for consecutive delimiters. For example, suppose data is collected dynamically and used to update a table in a database, with each token in the input string corresponding to a column value. With the second constructor, we cannot map the tokens to the database columns, because we cannot tell which columns should be set to "". Suppose we want to insert a record into a table with six columns, and the input data contains two consecutive delimiters. StringTokenizer then produces five tokens (the empty token between the two consecutive delimiters is ignored), while we have six fields to set. We also do not know where the consecutive delimiters occurred, so we do not know which column should be set to "".
The third constructor fails when a token itself equals the delimiter (in both length and value) and lies within a quoted substring. For example, suppose we tokenize the string "book, author, publication,\",\",date published" (this string contains a token, the comma, that is the same as the delimiter). The result is the six tokens book, author, publication, ", ", and date published, rather than the five tokens book, author, publication, , (the comma character), and date published. Note that even setting the StringTokenizer's bReturnTokens parameter to true does not help in this case.
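The column-mapping problem described above can be illustrated with a minimal sketch that rebuilds the empty fields from a StringTokenizer created with the third constructor. The class name FieldSplitter and the method fields are hypothetical, and a single-character delimiter is assumed. The sketch does not handle quoted substrings, so the second limitation remains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class FieldSplitter {
    // Returns one entry per field, using "" where two delimiters are adjacent.
    // Assumption: delim is a single character.
    static List<String> fields(String input, String delim) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delim, true);
        boolean lastWasDelim = true; // the start of the input behaves like a delimiter
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(delim)) {
                if (lastWasDelim) {
                    out.add(""); // nothing between two delimiters: empty field
                }
                lastWasDelim = true;
            } else {
                out.add(token);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            out.add(""); // trailing delimiter (or empty input) leaves a final empty field
        }
        return out;
    }

    public static void main(String[] args) {
        // Six fields, so each one can be mapped to one of six columns.
        System.out.println(fields("book,author,publication,,,date published", ",").size()); // 6
    }
}
```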
III. An Advanced String Tokenizer
Before writing the code, you have to work out the basic requirements of a good tokenizer. Because Java developers are accustomed to the StringTokenizer class, a good tokenizer should provide all the useful methods that StringTokenizer offers, such as hasMoreTokens(), nextToken(), and countTokens().
The code presented here is simple and largely self-explanatory. It relies on the StringTokenizer class (created with the bReturnTokens parameter set to true) and provides the methods mentioned above. In most cases the tokens differ from the delimiter, but occasionally (though rarely) the delimiter itself is wanted as a token; if that is requested, the tokenizer returns the delimiter as a token. When you create a PowerfulTokenizer object, you only need to supply the input string and the delimiter; internally, PowerfulTokenizer always creates its StringTokenizer with bReturnTokens set to true. (The reason is that a StringTokenizer created without bReturnTokens set to true is subject to the limitations described earlier.) To control the tokenizer properly, the code checks in several places (when computing the total number of tokens and in nextToken()) whether bReturnTokens is set to true.
You may have noticed that PowerfulTokenizer implements the Enumeration interface, so it also implements the two methods hasMoreElements() and nextElement(), which simply delegate to hasMoreTokens() and nextToken(). (Because it implements the Enumeration interface, PowerfulTokenizer is backward compatible with StringTokenizer.) Let's take an example. Assume the input string is "Hello, Today,,, \"I, am \", going to,,, \"buy, a, book\"" and the delimiter is ",". The result of tokenizing this string is shown in Table 1:
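The delegation described above can be sketched as follows. This is not the article's PowerfulTokenizer; DelegatingTokenizer is a hypothetical, stripped-down class that only shows how the Enumeration methods forward to hasMoreTokens() and nextToken(), and how the inner StringTokenizer is always created with delimiters returned.

```java
import java.util.Enumeration;
import java.util.StringTokenizer;

class DelegatingTokenizer implements Enumeration<Object> {
    private final StringTokenizer inner;

    DelegatingTokenizer(String input, String delim) {
        // Delimiters are always requested from the inner tokenizer,
        // regardless of what the caller wants to see.
        this.inner = new StringTokenizer(input, delim, true);
    }

    public boolean hasMoreTokens() {
        return inner.hasMoreTokens();
    }

    public String nextToken() {
        return inner.nextToken();
    }

    // Enumeration methods delegate to the token methods.
    @Override
    public boolean hasMoreElements() {
        return hasMoreTokens();
    }

    @Override
    public Object nextElement() {
        return nextToken();
    }

    public static void main(String[] args) {
        Enumeration<Object> e = new DelegatingTokenizer("a,b", ",");
        while (e.hasMoreElements()) {
            System.out.println(e.nextElement());
        }
    }
}
```

Code written against Enumeration can therefore consume either a StringTokenizer or the wrapper without changes.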
Table 1: String Decomposition Results
The input string contains 11 comma (,) characters, three of which lie within quoted substrings and four of which appear consecutively (for instance, "Today,,," contains two consecutive commas; the first comma is the delimiter that ends Today). Below is the algorithm PowerfulTokenizer uses to compute the total number of tokens:
If bReturnTokens = true, multiply the number of delimiters inside quoted substrings by 2 and subtract the result from the actual total to obtain the number of tokens. The reason is that for the substring "buy, a, book", StringTokenizer returns 5 tokens (that is, "buy:,:a:,:book"), whereas PowerfulTokenizer returns a single token (that is, "buy, a, book"). The difference between the two is 4 (that is, 2 times the number of delimiters in the substring). This formula holds for every substring that contains the delimiter.
Similarly, for bReturnTokens = false, we subtract from the actual total (19) the expression [total delimiters (11) - consecutive delimiters (4) + delimiters within quoted substrings (3)]. Since the delimiters are not returned, those that appear non-consecutively or within a quoted substring are of no use to us, and the formula above gives the total number of tokens (9).
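The arithmetic in the two formulas can be checked with a short sketch. The class name TokenCountDemo and its helper methods are hypothetical; this is not the article's countTokens() implementation. The input is assumed to be the example string with three-comma runs after Today and after going to, which is what the stated counts of 11, 4, and 3 require.

```java
import java.util.StringTokenizer;

class TokenCountDemo {
    static final String INPUT =
        "Hello, Today,,, \"I, am \", going to,,, \"buy, a, book\"";

    // "Actual total": what StringTokenizer reports when delimiters are returned.
    static int rawTotal(String s) {
        return new StringTokenizer(s, ",", true).countTokens();
    }

    // All commas in the input.
    static int delimiters(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c == ',') n++;
        }
        return n;
    }

    // Commas that immediately follow another comma.
    static int consecutive(String s) {
        int n = 0;
        for (int i = 1; i < s.length(); i++) {
            if (s.charAt(i) == ',' && s.charAt(i - 1) == ',') n++;
        }
        return n;
    }

    // Commas that lie between a pair of double quotes.
    static int inQuotes(String s) {
        int n = 0;
        boolean quoted = false;
        for (char c : s.toCharArray()) {
            if (c == '"') quoted = !quoted;
            else if (c == ',' && quoted) n++;
        }
        return n;
    }

    static int totalWithDelims(String s) {
        return rawTotal(s) - 2 * inQuotes(s);
    }

    static int totalWithoutDelims(String s) {
        return rawTotal(s) - (delimiters(s) - consecutive(s) + inQuotes(s));
    }

    public static void main(String[] args) {
        System.out.println(rawTotal(INPUT));           // 19
        System.out.println(totalWithDelims(INPUT));    // 19 - 2*3 = 13
        System.out.println(totalWithoutDelims(INPUT)); // 19 - (11 - 4 + 3) = 9
    }
}
```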
Keep these two formulas in mind, because they are the core of PowerfulTokenizer. They apply to almost all cases under their respective conditions. However, if you have more complex requirements that these two formulas cannot cover, you should analyze the possible cases before writing code and design your own formula.
    // Check whether the delimiter lies within a quoted substring
    for (int i = 1; i < aiIndex.length; i++) // loop bound is cut off in the source; aiIndex.length is assumed
    {
        iIndex = sInput.indexOf(sDelim, iIndex + 1);
        if (iIndex == -1)
            break;
        // If the delimiter is within a substring, keep scanning until the substring ends
        while (sInput.substring(iIndex - iLen, iIndex).equals(sDelim))
        {
            iNextIndex = sInput.indexOf(sDelim, iIndex + 1);
            if (iNextIndex == -1)
                break;
            iIndex = iNextIndex;
        }
        aiIndex[i] = iIndex;
        // System.out.println("aiIndex[" + i + "] = " + iIndex);
        if (isWithinQuotes(iIndex))
        {
            if (bIncludeDelim)
                iTokens -= 2;
            else
                iTokens -= 1;
        }
    }

The countTokens() method checks whether the input contains double-quoted substrings. If it does, the method reduces the total and advances the index to the position of the next double quote in the string (as the code fragment above shows). If bReturnTokens is false, it also subtracts from the total the number of non-consecutive delimiters that appear in the input string.

    // Return "" as a token
    // (second half of the condition and the body are damaged in the source;
    // reconstructed from the description of nextToken() below)
    if ((sPrevToken.equals(sDelim)) && (sToken.equals(sDelim)))
    {
        sPrevToken = sToken;
        iTokenNo++;
        return "";
    }

    // Check whether the token itself is equal to the delimiter
    if ((sToken.trim().startsWith("\"")) && (sToken.length() == 1))
    {
        // Special case: the token itself equals the delimiter
        String sNextToken = oTokenizer.nextToken();
        // (loop condition is damaged in the source; reconstructed)
        while (!sNextToken.trim().endsWith("\""))
        {
            sToken += sNextToken;
            sNextToken = oTokenizer.nextToken();
        }
        sToken += sNextToken;
        sPrevToken = sToken;
        iTokenNo++;
        return sToken.substring(1, sToken.length() - 1);
    }
    // Check whether the token starts a quoted substring
    else if ((sToken.trim().startsWith("\""))
             && (!((sToken.trim().endsWith("\"")) && (!sToken.trim().endsWith("\"\"")))))
    {
        if (oTokenizer.hasMoreTokens())
        {
            String sNextToken = oTokenizer.nextToken();
            // Check for the presence of "\"\""
            while (!((sNextToken.trim().endsWith("\"")) && (!sNextToken.trim().endsWith("\"\""))))
            {
                sToken += sNextToken;
                if (!oTokenizer.hasMoreTokens())
                {
                    sNextToken = "";
                    break;
                }
                sNextToken = oTokenizer.nextToken();
            }
            sToken += sNextToken;
        }
    }

The nextToken() method obtains tokens through StringTokenizer.nextToken() and checks each token for double quotes. If it finds them, it keeps fetching tokens until it reaches the token that closes the quoted substring. It also saves the token in a variable (sPrevToken; see the complete source code) in order to detect consecutive delimiters. If nextToken() finds consecutive tokens that are equal to the delimiter, it returns "" (a zero-length string) as a token.
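The quote-merging and empty-token behavior described for nextToken() can be illustrated with a compact, self-contained sketch. It is a simplified variant, not the article's implementation: QuoteAwareSplit and split are hypothetical names, a single-character delimiter is assumed, the quotes are kept in the returned token, and all fields are collected eagerly into a list instead of being returned one at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class QuoteAwareSplit {
    // Splits on delim, keeps delimiters that appear inside double quotes,
    // and emits "" for empty fields.
    static List<String> split(String input, char delim) {
        String d = String.valueOf(delim);
        // Delimiters must be returned, otherwise empty fields cannot be seen.
        StringTokenizer st = new StringTokenizer(input, d, true);
        List<String> out = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(d) && !inQuotes) {
                out.add(field.toString()); // delimiter outside quotes ends the field
                field.setLength(0);
                continue;
            }
            field.append(token); // ordinary text, or a delimiter inside quotes
            for (int i = 0; i < token.length(); i++) {
                if (token.charAt(i) == '"') {
                    inQuotes = !inQuotes; // each quote character flips the state
                }
            }
        }
        out.add(field.toString()); // the last field has no closing delimiter
        return out;
    }

    public static void main(String[] args) {
        // Five fields; the fourth is the quoted comma.
        System.out.println(split("book,author,publication,\",\",date published", ',').size()); // 5
        // Three fields; the middle one is empty.
        System.out.println(split("a,,b", ',').size()); // 3
    }
}
```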
In a similar way, the hasMoreTokens() method checks whether the number of tokens already returned is less than the total number of tokens.
IV. Conclusion
This article has shown how to write a powerful string tokenizer with little effort. Following the principles described here, you can quickly write a sophisticated string tokenizer and save a great deal of development time.