Analysis of the Chinese String Comparison Problem in Java
Crystal (YOUSP@yeah.net)
Java's problems with Chinese text have been around for a long time. The author needed to sort Chinese strings in memory by comparing them. Even after encoding the strings as GBK or GB2312, the String.compareTo method did not give the correct result. The author therefore examined the source code of the String class in the JDK. (The author used JDK version 1.3.1.)
The following is the source code of compareTo in String.java; please pay attention to the comments:
public class String
{
    ...
    public int compareTo(String anotherString) {
        int len1 = count;
        int len2 = anotherString.count;
        // n is the smaller of the two string lengths
        int n = Math.min(len1, len2);
        // Get the character arrays
        char v1[] = value;
        char v2[] = anotherString.value;
        //
        /** The offset is the first index of the storage that is used. */
        // offset is the first storage index
        int i = offset;
        int j = anotherString.offset;
        // If i == j
        // This may be the case where the two strings live in the same memory:
        //   a  <-    <----
        //   b   s1       |
        //   c  <-        |
        //   d            s2
        //   e            |
        //   f            |
        //   g  <----------
        // In this case i may equal j
        if (i == j) {
            int k = i;
            int lim = n + i;
            while (k < lim) {
                char c1 = v1[k];
                char c2 = v2[k];
                if (c1 != c2) // at the first unequal character, return c1 - c2
                    return c1 - c2;
                k++;
            }
        } else {
            while (n-- != 0) // until the shorter length is used up
            {
                char c1 = v1[i++]; // take one character from each string
                char c2 = v2[j++];
                if (c1 != c2) { // unequal characters found, return c1 - c2 immediately
                    return c1 - c2;
                }
            }
        }
        // Reached only when no unequal character was found in the first n
        // characters, which includes the case where the strings are equal
        return len1 - len2;
    }
    ...
} // end of class String

Why does Java have problems when Chinese strings are compared with compareTo? The compareTo source code shows the key point: the JDK's compareTo implementation compares char values directly:

char c1 = v1[k];
char c2 = v2[k];

However, when Java works with GB2312-encoded text, the char values obtained for Chinese characters look irregular. A Chinese character is handled in Java as a single char (a double-byte character), and when such a char is cast to int, the resulting value does not follow the order of the Chinese character encoding. A set of test data makes this visible:

Character     char value   byte[] values   Value synthesized from byte[]
我 (I)        25105        [-50:-46]       -5046
爱 (love)     29233        [-80:-82]       -8082
北 (north)    21271        [-79:-79]       -7979
京 (capital)  20140        [-66:-87]       -6687
天 (day)      22825        [-52:-20]       -5220
安 (safe)     23433        [-80:-78]       -8078
门 (gate)     38376        [-61:-59]       -6159
A             65           [65]            65
B             66           [66]            66
C             67           [67]            67
D             68           [68]            68

In Chinese order, "我" should come after "爱", so the char value of "我" should be larger than that of "爱". In fact it is smaller (25105 versus 29233). I do not know why Java's conversion of a Chinese char (two bytes) to int deviates this much and loses the internal-code order that the GBK specification gives Chinese characters.
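A likely explanation is that a Java char holds a UTF-16 code unit, so casting it to int yields the Unicode value of the character rather than its GBK internal code, and Unicode does not arrange Chinese characters in GB2312 order. The minimal sketch below prints both views of two characters from the table. It assumes a runtime that provides the GBK charset; the class name CharVersusGbk and the helper gbkBytes are illustrative and not part of the original code. The characters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.nio.charset.Charset;

public class CharVersusGbk {
    // Assumption: the runtime provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Returns the GBK bytes of a string as unsigned values (0..255). */
    static int[] gbkBytes(String s) {
        byte[] raw = s.getBytes(GBK);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // undo the sign of Java's byte type
        }
        return out;
    }

    public static void main(String[] args) {
        String wo = "\u6211"; // "I", pinyin wo
        String ai = "\u7231"; // "love", pinyin ai
        for (String s : new String[] {wo, ai}) {
            int[] b = gbkBytes(s);
            // char value on the left, the two GBK bytes on the right
            System.out.println((int) s.charAt(0) + " -> " + b[0] + " " + b[1]);
        }
        // compareTo works on the char values, so this is negative:
        // "wo" sorts before "ai", the opposite of the GBK order
        System.out.println(wo.compareTo(ai));
    }
}
```

Read as unsigned numbers, the GBK bytes of "爱" are smaller than those of "我", which matches the order the table's byte[] column suggests, while the char values point the other way.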
However, when a Chinese character is converted to a 2-byte byte[], the resulting values keep the GBK code order. This points to the solution: for a GB2312-encoded String, take each Chinese character, split it into its 2-byte byte[], and build the correct internal code from those bytes. So I wrote the following functions, which basically solve the Chinese comparison problem. There are three of them, and they can be placed in any class as helper methods.

- public int compa(String s1, String s2): mainly does the encoding preparation before the comparison; it can be seen as the outer shell of the whole routine.
- public int chineseCompareTo(String s1, String s2): the body of the Chinese string comparison. It implements the basic comparison logic, which is the same as that of the JDK's compareTo, and it has the same calling interface.
- public static int getCharCode(String s): converts the character held in a string into an int code without losing its position in the encoding order. The input is normally a single character such as "我" (I) or "a"; if a longer string is passed, the function returns the value of the first character.
private static String __encode__ = "GBK"; // must be GBK
private static String __server_encode__ = "GB2312"; // default encoding on the server

/* Compare two strings */
public int compa(String s1, String s2) {
    String m_s1 = null, m_s2 = null;
    try {
        // First re-encode both strings as GBK
        m_s1 = new String(s1.getBytes(__server_encode__), __encode__);
        m_s2 = new String(s2.getBytes(__server_encode__), __encode__);
    } catch (Exception ex) {
        return s1.compareTo(s2);
    }
    int res = chineseCompareTo(m_s1, m_s2);
    System.out.println("Compare: " + s1 + " | " + s2 + " ==== Result: " + res);
    return res;
}

// Get the code value of a Chinese character or letter
public static int getCharCode(String s) {
    if (s == null || s.equals("")) return -1; // guard code
    // Uses the platform default encoding, taken here to be the server's GB2312/GBK
    byte[] b = s.getBytes();
    int value = 0;
    // Only the first character (Chinese or English) is used
    for (int i = 0; i < b.length && i < 2; i++) {
        value = value * 100 + b[i];
    }
    return value;
}

// Compare two strings
public int chineseCompareTo(String s1, String s2) {
    int len1 = s1.length();
    int len2 = s2.length();
    int n = Math.min(len1, len2);
    for (int i = 0; i < n; i++) {
        int s1_code = getCharCode(s1.charAt(i) + "");
        int s2_code = getCharCode(s2.charAt(i) + "");
        if (s1_code != s2_code) return s1_code - s2_code;
    }
    return len1 - len2;
}

As this shows, dissecting the system source code gives us a chance to uncover what goes on inside the system. What I find puzzling is that some of the coding style inside Java is quite poor, and there are some bugs. But this may be only the author's personal impression. I am happy to share what I have learned, and I would be grateful to have any omissions or mistakes pointed out.
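For comparison, the same idea of ordering by GBK internal code can be sketched with a Comparator that reads each byte as an unsigned value instead of combining signed bytes with a factor of 100. This is a variant under stated assumptions, not the author's code: the class name GbkOrder, the constant BY_GBK_BYTES and the helper sorted are hypothetical, and the runtime is assumed to provide the GBK charset. Unlike the functions above, it places ASCII letters before Chinese characters, because single-byte codes are smaller than any GBK lead byte.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Comparator;

public class GbkOrder {
    // Assumption: the runtime provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Orders strings by their GBK byte sequences, each byte read as 0..255. */
    static final Comparator<String> BY_GBK_BYTES = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            byte[] x = a.getBytes(GBK);
            byte[] y = b.getBytes(GBK);
            int n = Math.min(x.length, y.length);
            for (int i = 0; i < n; i++) {
                // Mask with 0xFF so that 0xB0 compares as 176, not -80
                int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
                if (diff != 0) {
                    return diff;
                }
            }
            // One byte sequence is a prefix of the other: shorter comes first
            return x.length - y.length;
        }
    };

    /** Returns a sorted copy of the given words, leaving the input untouched. */
    static String[] sorted(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, BY_GBK_BYTES);
        return copy;
    }

    public static void main(String[] args) {
        // wo, ai, bei, jing written as Unicode escapes, plus an ASCII letter
        String[] result = sorted("\u6211", "\u7231", "\u5317", "\u4eac", "A");
        System.out.println(Arrays.toString(result));
    }
}
```

With the sample input, the sorted order is A, ai, bei, jing, wo, which agrees with the byte[] column of the test data. Masking the bytes also avoids a limit of the factor-100 scheme, which can only keep the order while the second byte spans fewer than 100 values.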