In Java programming we often have to process and display Chinese characters. Garbled text is not something anyone wants to see, so how do we get Chinese characters to display correctly? Java's default encoding is Unicode, while the files and databases we normally use for Chinese are encoded in GB2312 or BIG5. How do we choose the right Chinese character encoding and handle the encoded characters correctly? Starting from the basics of Chinese character encoding and using Java programming examples, this article analyzes these two problems and proposes solutions.
The Java programming language is now widely used on the Internet. Sun considered support for non-English characters when it developed the language. The Java Runtime Environment (JRE) released by Sun comes in an English edition and an international edition, and only the international edition supports non-English characters. In practice, however, support for Chinese characters is not as complete as JavaSoft's standard specification claims, because there is more than one Chinese character set and different operating systems support Chinese characters differently. As a result, many problems related to Chinese character encoding plague application development. Many answers to these problems are available, but they are fragmentary and do not satisfy those who want to solve the problem thoroughly. Systematic studies of the Java Chinese problem are rare. This article starts from the basics of Chinese character encoding and analyzes the Java Chinese problem, in the hope of helping readers solve it.
Basics of Chinese character encoding
English characters are typically represented by one byte, and the most commonly used encoding is ASCII. One byte can distinguish only 256 characters, while Chinese has thousands of characters, so Chinese characters are represented with two bytes. To distinguish them from English characters, the highest bit of each byte must be 1. A pair of bytes can represent at most 64K characters. The encodings we meet most often are GB2312, BIG5 and Unicode. Readers interested in the details of a specific encoding can consult the relevant references; here I discuss only GB2312 and Unicode, which concern us most closely. GB2312, the national standard code for Chinese character information interchange of the People's Republic of China, is a simplified Chinese character code issued by the national standards administration. It is used in mainland China and Singapore and is commonly called the national standard code. Of its two bytes, the first (high) byte is the zone number plus 32 (20h) and the second (low) byte is the position number plus 32 (20h); together the two values encode one Chinese character. (In the form normally stored in files, the highest bit of each byte is also set, so the total offset is A0h.) Unicode is a fixed-width multi-byte encoding designed to solve the multinational character problem; contrary to a common belief it was developed by the Unicode Consortium, not by Microsoft alone. It handles English characters by adding a zero byte in front: the ASCII code of "A" is 0x41, and its Unicode form is 0x00, 0x41. Special tools can convert between the various encodings.
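To make these byte layouts concrete, here is a minimal sketch that prints the bytes of "A" and of one Chinese character under several encodings. The character 中 (U+4E2D) and the class and method names are chosen here only for illustration, and the sketch assumes a JDK that ships the GB2312 charset.

```java
import java.nio.charset.Charset;

class EncodingBytes {
    // Render a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode text in the named charset and return the bytes as hex.
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // The character is written as a Unicode escape so the example does not
        // depend on the encoding of this source file.
        String zhong = "\u4e2d";
        System.out.println(encode("A", "US-ASCII"));   // 41
        System.out.println(encode("A", "UTF-16BE"));   // 00 41
        System.out.println(encode(zhong, "GB2312"));   // D6 D0
        System.out.println(encode(zhong, "UTF-16BE")); // 4E 2D
    }
}
```

The GB2312 bytes D6 D0 correspond to zone 54 and position 48, each offset by A0h.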
A first look at the Java Chinese problem
When we develop applications in Java, we inevitably have to handle Chinese. Java's default encoding is Unicode, while the databases and files we normally use are encoded in GB2312. We often run into situations like these: a JSP-based website shows garbled text in the browser, a file is garbled when opened, or database content modified through Java can no longer supply correct information when used elsewhere.
String sEnglish = "apple";
String sChinese = "苹果";
String s = "苹果 apple";
The length of sEnglish is 5. Counted in bytes under GB2312, the length of sChinese is 4, two bytes per character (String.length() itself reports 2, because it counts Unicode characters). The default length of s is given as 14. For sEnglish, all Java classes provide good support and the string will certainly display correctly. For sChinese and s, however, although JavaSoft states that Java's basic classes support multinational characters (Unicode by default), a correct result is not guaranteed when the operating system's default encoding is not Unicode but, for example, the national standard code. Getting from Java source code to a correct result on screen involves the chain "Java source -> Java bytecode -> virtual machine -> operating system -> display device". Every step of this chain must handle the encoding of Chinese characters correctly for the final display to be correct.
"Java source -> Java bytecode": the standard Java compiler javac uses the system default character set, such as GBK on Chinese Windows and ISO-8859-1 on Linux. That is why Chinese characters in source files compiled on Linux cause problems. The solution is to pass the encoding parameter when compiling, which makes the result independent of the platform. Use
javac -encoding GBK <source files>
"Java bytecode -> virtual machine -> operating system": the Java Runtime Environment (JRE) comes in an English edition and an international edition, and only the international edition supports non-English characters. The Java Development Kit (JDK) certainly supports multinational characters, but not every computer user has the JDK installed. Many operating systems and applications embed the international edition of the JRE in order to support Java better, which makes it easier to support multinational characters.
"Operating system -> display device": the operating system itself must support Chinese characters and be able to display them. A system that lacks this support cannot display Chinese unless special application software is installed alongside it.
There is one more issue: Chinese strings must be correctly encoded and converted within the Java program itself. For example, when you output a Chinese string to a web page, whether you use
out.println(string); // string is a string containing Chinese
or
<%= string %>, the string must be converted to GBK, either manually or automatically. In JSP 1.0 the output character set can be declared so that the internal code is converted automatically. The usage is a page directive such as <%@ page contentType="text/html;charset=gb2312" %>. Some JSP versions (for example JSP 0.92) do not support declaring the output character set, so the output has to be encoded manually. There are many ways to do this; the most common is
String s1 = request.getParameter("keyword");
String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");
The getBytes method converts the string into a byte array using the "ISO-8859-1" encoding, and "GBK" is the target encoding. When we read a Chinese string s1 from a database encoded in ISO-8859-1 and pass it through this conversion, the resulting string s2 displays correctly in operating systems and applications that support the GBK character set.
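The conversion can be tried without a servlet container. The following minimal sketch first simulates the damage (GBK bytes decoded as ISO-8859-1) and then applies the same getBytes / new String repair. The class and method names are hypothetical, Charset objects are used instead of charset names only to avoid the checked exception, and a JDK with the GBK charset is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Recover {
    static final Charset GBK = Charset.forName("GBK");

    // Repair a string whose GBK bytes were wrongly decoded as ISO-8859-1.
    static String recover(String misdecoded) {
        // ISO-8859-1 maps every char 0x00-0xFF back to the same byte value,
        // so this step returns the original bytes unchanged.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two characters 中文
        // Simulate a container that decodes GBK request bytes as ISO-8859-1.
        String garbled = new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                  // 4, one char per byte
        System.out.println(recover(garbled).equals(original)); // true
    }
}
```

The repair works only because ISO-8859-1 decoding is lossless for every byte value; the same trick is not reliable with charsets that reject or replace some byte sequences.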
Surface analysis and handling of the Java Chinese problem
Background
Development environment: JDK1.15, VCafe 2.0, JPadPro
Server: NT IIS, Sybase System, jConnect (JDBC)
Client: IE 5.0, PWin98
The .class files are stored on the server and the applet is run by the client's browser. The applet only loads the main program, such as the Frame class. The interface includes TextField, TextArea, List, Choice and similar components.
I. Reading Chinese
After executing a SELECT statement through JDBC and reading data (Chinese) from the server, appending the data to a TextArea (TA) with the append method does not display it correctly. When the data is added to a List, however, most Chinese characters display correctly.
If the data is converted to a byte array using the "ISO-8859-1" encoding and then converted back to a String using the system default encoding, it displays correctly in both the TA and the List.
The code segment is as follows:
dbstr2 = results.getString(1);
// After reading the result from the DB server, convert it to a String.
dbbyte1 = dbstr2.getBytes("ISO-8859-1");
dbstr1 = new String(dbbyte1);
If "GBK" or "GB2312" is specified directly for the string conversion instead of the system default encoding, reading data from the database works without problems in both case A and case B (the two cases are described under "Problem" below).
II. Writing Chinese to the database
The processing is the reverse of reading Chinese: first convert the SQL statement to a byte array using the system default encoding, then convert the bytes to a String using the "ISO-8859-1" encoding, and finally send it for execution. The Chinese information is then written to the database correctly.
The code segment is as follows:
sqlstmt = tf_input.getText();
// Before sending the statement to the DB server, convert it to an ISO-8859-1 string.
dbbyte1 = sqlstmt.getBytes();
sqlstmt = new String(dbbyte1, "ISO-8859-1");
_stmt = _con.createStatement();
_stmt.executeUpdate(sqlstmt);
......
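The write direction can be sketched in isolation as well. The version below names GBK explicitly instead of relying on the system default encoding, which is an assumption made here so that the result does not change between machines; the class name, method name and SQL text are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ToLatin1 {
    static final Charset GBK = Charset.forName("GBK");

    // Pack the GBK bytes of a string into chars 0x00-0xFF, one char per byte.
    // This is the form produced by the getBytes / new String step for writing.
    static String toLatin1Form(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical statement containing the two characters 中文.
        String sql = "INSERT INTO t VALUES ('\u4e2d\u6587')";
        String wire = toLatin1Form(sql);
        // The ASCII part is unchanged; each Chinese character becomes two chars.
        System.out.println(wire.length() - sql.length()); // 2
    }
}
```

Naming the charset keeps the conversion stable when the default encoding differs between clients, which is the situation analyzed in cases A and B.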
Problem: if the client has a JDK installed with CLASSPATH pointing to it (case A), the code above runs correctly. But if the client has only a browser, with no JDK and no CLASSPATH (case B), the Chinese characters cannot be converted correctly.
Our analysis:
1. Testing shows that in case A the system default encoding is GBK or GB2312. In case B, the following error message appears in the browser's Java console when the program starts:
Can't find resource for sun.awt.windows.awtLocalization_zh_CN
The system default encoding is then "8859-1".
2. If "GBK" or "GB2312" is specified directly for the string conversion instead of the system default encoding, the program still runs normally in case A, but in case B the system raises an error:
UnsupportedEncodingException.
3. On the client, we extracted the JDK's classes.zip into another directory and set CLASSPATH to contain only that directory. We then deleted .class files from this directory step by step while running the test program. In the end, among more than a thousand class files, one proved essential: sun.io.CharToByteDoubleByte.class.
We placed this file on the server together with the other classes and imported it at the beginning of the program, but the program still did not run properly in case B.
4. In case A, if sun.io.CharToByteDoubleByte.class is removed from CLASSPATH, the running program reports the default encoding as "8859-1"; otherwise it is "GBK" or "GB2312".
If the JDK version is 1.2 or later, the problem encountered in case B is solved well. The test steps are the same as above, and interested readers can try them.
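On later Java versions, the presence of a converter can be checked before it is used, instead of catching UnsupportedEncodingException afterwards. This is a minimal sketch under the assumption that the java.nio.charset API (added in JDK 1.4, so not available in the environment tested here) can be used; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class CharsetProbe {
    // Return the first supported charset among the candidates,
    // falling back to the JVM default when none is available.
    static Charset firstSupported(String... names) {
        for (String name : names) {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("default: " + Charset.defaultCharset());
        System.out.println("GBK supported: " + Charset.isSupported("GBK"));
        System.out.println("chosen: " + firstSupported("GBK", "GB2312"));
    }
}
```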
Root-cause analysis and solution of the Java Chinese problem
Under Simplified Chinese MS Windows 98 with JDK 1.3, System.getProperties() returns some basic properties of the Java runtime environment. The class PoorChinese helps us obtain these properties.
Source code of class PoorChinese:
public class PoorChinese {
    public static void main(String[] args) {
        // Print every system property of the running JVM to standard output.
        System.getProperties().list(System.out);
    }
}
Running java PoorChinese shows the following:
The value of the system property file.encoding is GBK, the value of user.language is zh, and the value of user.region is CN. These properties determine that the system default encoding is GBK.
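When only the encoding-related properties are of interest, they can be read individually. The sketch below is a hypothetical helper, not part of PoorChinese; note as an assumption that newer JVMs may leave user.region unset, in which case the marker text is printed instead.

```java
import java.nio.charset.Charset;

class EncodingProps {
    // Read a system property, substituting a marker when it is not set.
    static String prop(String key) {
        return System.getProperty(key, "(not set)");
    }

    public static void main(String[] args) {
        String[] keys = {"file.encoding", "user.language", "user.region"};
        for (String key : keys) {
            System.out.println(key + " = " + prop(key));
        }
        // The charset the JVM uses when no encoding is named explicitly.
        System.out.println("default charset = " + Charset.defaultCharset().name());
    }
}
```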
On the system above, the following code converts a GB2312 file into a BIG5 file. It helps us understand how Java converts between Chinese character encodings:
import java.io.*;
import java.util.*;

public class GB2BIG5 {
    static int iCharNum = 0;

    public static void main(String[] args) {
        System.out.println("Input GB2312 file, output Big5 file.");
        if (args.length != 2) {
            System.err.println("Usage: jview GB2BIG5 gbfile big5file");
            System.exit(1);
        }
        String inputString = readInput(args[0]);
        writeOutput(inputString, args[1]);
        System.out.println("Number of characters in file: " + iCharNum + ".");
    }

    // Encode the Unicode string as Big5 and write it to strOutFile.
    static void writeOutput(String str, String strOutFile) {
        try {
            FileOutputStream fos = new FileOutputStream(strOutFile);
            Writer out = new OutputStreamWriter(fos, "Big5");
            out.write(str);
            out.close();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Decode strInFile as GB2312 into a Unicode string, counting the characters read.
    static String readInput(String strInFile) {
        StringBuffer buffer = new StringBuffer();
        try {
            FileInputStream fis = new FileInputStream(strInFile);
            InputStreamReader isr = new InputStreamReader(fis, "GB2312");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                iCharNum += 1;
                buffer.append((char) ch);
            }
            in.close();
            return buffer.toString();
        }
        catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
The encoding conversion process is as follows:
         ByteToCharGB2312             CharToByteBig5
GB2312 ------------------> Unicode ------------------> Big5
Run java GB2BIG5 gb.txt big5.txt. If gb.txt contains the Chinese text for "Today,", the characters in the resulting file big5.txt display correctly. If gb.txt contains the Chinese text for "Happy Valentine", some of the corresponding characters in the resulting file are the symbol "?" (0x3F). This suggests that the two basic classes sun.io.ByteToCharGB2312 and sun.io.CharToByteBig5 were not written well.
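Which characters will turn into "?" can be checked before the file is written. The following minimal sketch uses CharsetEncoder.canEncode from the later java.nio.charset API; the class name, method name and sample characters are chosen here for illustration, and a JDK that ships the Big5 charset is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Big5Check {
    static final Charset BIG5 = Charset.forName("Big5");

    // Collect the characters of text that Big5 cannot represent.
    static String unmappable(String text) {
        CharsetEncoder encoder = BIG5.newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        String shared = "\u4eca\u5929";    // 今天, written the same way in both scripts
        String simplifiedOnly = "\u4eec";  // 们, a simplified-only form
        System.out.println("unmappable in shared text: " + unmappable(shared).length());
        System.out.println("unmappable in simplified text: " + unmappable(simplifiedOnly).length());
        // String.getBytes substitutes the charset's replacement bytes for such characters.
        byte[] out = simplifiedOnly.getBytes(BIG5);
        System.out.println("first output byte: 0x" + Integer.toHexString(out[0] & 0xFF));
    }
}
```

A check like this separates characters that have no Big5 code at all from genuine converter defects, since both show up as "?" in the output file.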
As the example above shows, Java's basic classes may themselves have problems. Because the internationalization work was not done in China, these basic classes were not rigorously tested before release, so support for Chinese characters is not as complete as JavaSoft claims. Not long ago, a technical friend of mine wrote to tell me that he had finally found the root of the Java servlet Chinese problem. For two weeks he had been troubled by it, because every string containing Chinese characters had to be forcibly converted to get the correct result (this seemed to be the only solution). Eventually he was no longer willing to put up with it, since such work should not be left to senior programmers, and he went through the source code for servlet decoding because he suspected the problem lay there. After four hours of struggle he found the root cause. His suspicion was correct: the decoding part of the servlet does not consider double-byte characters at all and treats each %XX directly as one character. (Even JavaSoft makes such elementary mistakes!)
If you are interested in this problem or have the same trouble, you can modify servlet.jar by following his steps:
Find static private String parseName in the source code of HttpUtils. Before it returns, copy sb (the StringBuffer) into a byte array bs[], then return new String(bs, "GB2312"). After making this modification, you need to do the decoding yourself: Hashtable form = HttpUtils.parseQueryString(request.getQueryString()) or
form = HttpUtils.parsePostData(...)
Don't forget to put the compiled class back into servlet.jar.
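The effect of decoding %XX sequences as bytes of a double-byte charset, rather than as one character each, can be seen with the standard java.net.URLDecoder class. This is only an illustration of the decoding idea, not the patched HttpUtils code; the class name and sample value are hypothetical, and a JDK with the GBK charset is assumed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecode {
    // Decode a percent-encoded value, interpreting the escaped bytes in charsetName.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%D6%D0%CE%C4"; // GBK bytes of 中文, percent-encoded
        // Consecutive %XX escapes are combined into double-byte characters.
        System.out.println(decode(encoded, "GBK").equals("\u4e2d\u6587")); // true
        // Treating each %XX as one character yields four chars instead of two.
        System.out.println(decode(encoded, "ISO-8859-1").length());         // 4
    }
}
```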
Summary of the Java Chinese problem
The Java programming language grew up in the networked world, which requires it to support multinational characters well. Java met the needs of network computing, and this laid a solid foundation for its rapid growth there. JavaSoft has considered Java's support for multinational characters, but the current solution has many defects and we need some compensating measures. International standards bodies are also working to unify all human scripts in a single encoding. One such scheme is ISO 10646, which uses four bytes to represent a character. Until such a scheme is adopted, we hope JavaSoft will test its products rigorously and bring more convenience to users.
Attached is a function for repairing garbled Chinese read from a database or the network. The input is the problematic string, and the output is the repaired string.
String parseChinese(String in)
{
    String s = null;
    byte temp[];
    if (in == null)
    {
        System.out.println("Warn: Chinese null found!");
        return new String("");
    }
    try
    {
        // Recover the raw bytes, then decode them with the system default encoding.
        temp = in.getBytes("ISO-8859-1");
        s = new String(temp);
    }
    catch (UnsupportedEncodingException e)
    {
        System.out.println(e.toString());
    }
    return s;
}
Reference
BBS Shuimu Tsinghua Station, Java discussion forum
The Java discussion area of China's largest electronic bulletin board, where Java enthusiasts from many universities discuss Java technology.
About the Author
Duan Minghui, Tsinghua University Electronic Engineering Department
He is currently engaged in research and development of Java smart card microprocessors at Tsinghua University and is active in the Java discussion group of BBS Shuimu Tsinghua Station, providing solutions for many users of Java technology.