Note: This is an original article. The author can be reached at josserchai@yahoo.com. Garbled Chinese text is a problem that Chinese programmers run into often in Java. After reading many proposed solutions and comparing them with my own programming practice, I found that many of the methods discussed in the past neither explain the problem clearly nor solve it, especially across platforms. This article therefore analyzes the Chinese problem for classes run on the console, servlets, JSP, and EJB classes, and suggests solutions. Comments and corrections are welcome. If you quote this article, please cite the source.
Abstract: This article analyzes in depth how the Java compiler and the JVM encode and decode Java source files and class files. By working through this process it identifies the root cause of Chinese-text problems in Java programming, and finally presents a recommended way to solve them.
1. The source of Chinese issues
The earliest operating systems supported only single-byte character encodings, so all programs were initially written to process single-byte English text. As computing developed, Unicode was proposed to accommodate the languages of the world (including, of course, Chinese). In the form Java adopts it is a double-byte (16-bit) encoding that is compatible with English characters and with the characters of other languages, so most international software now uses Unicode internally. When such software runs, it obtains the default encoding of the local system (most of the time, the operating system) and converts its internal Unicode text to that encoding for display. Java's JDK and JVM work this way. The JDK referred to here is the international version of the JDK, which is what most programmers use; every mention of the JDK in this article means the international version.
Chinese is a double-byte encoded language. To let computers process Chinese, standards such as GB2312, GBK, and GBK2K were developed. Most operating systems therefore offer a customized Chinese version that uses GBK or GB2312 to display Chinese characters correctly. For example, Chinese Win2K uses GBK by default for display, and it also uses GBK by default when saving files, so every file saved on Chinese Win2K is GBK-encoded by default. Note: GBK is an extension of GB2312.
Java uses Unicode internally, so text must be converted between Unicode and the encoding supported by the operating system or the browser. This conversion takes place in a series of steps, and if any one of them goes wrong, the Chinese characters are displayed as garbage. This is the common Java Chinese problem.
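The effect of one wrong conversion step can be sketched in a few lines of Java. This is only an illustration, not part of the original article: the class name MojibakeDemo and the method transcode are hypothetical, and the sketch assumes the running JRE ships the GBK charset. The test string is written with Unicode escapes so that the example itself does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text with one charset and decodes the resulting bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Two Chinese characters (the word for "Chinese language").
        String original = "\u4e2d\u6587";
        // Assumption: the GBK charset is available in this JRE.
        Charset gbk = Charset.forName("GBK");

        // Same charset on both sides: the text survives.
        String matched = transcode(original, gbk, gbk);
        // GBK bytes decoded as ISO-8859-1: each byte becomes one Latin-1 character.
        String mismatched = transcode(original, gbk, StandardCharsets.ISO_8859_1);

        System.out.println("matched intact:    " + matched.equals(original));
        System.out.println("mismatched intact: " + mismatched.equals(original));
        System.out.println("mismatched length: " + mismatched.length());
    }
}
```

When the encoder and decoder agree, the two characters come back unchanged; when they disagree, the four GBK bytes are read as four unrelated Latin-1 characters, which is the garbling described above.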
Java is also a cross-platform language: a program we write can run not only on Chinese Windows but also on Chinese Linux and other systems, and on English-language systems as well (we often see people take a Java program written on Chinese Win2K and run it on English Linux). Moving a program this way also causes Chinese problems.
In addition, some people use an English operating system and an English version of IE to run programs that contain Chinese characters and to browse Chinese web pages. These environments do not support Chinese themselves, which causes Chinese problems too. Finally, almost all browsers send parameters in UTF-8 by default rather than in a Chinese encoding, so passing Chinese parameters can also produce garbled text. In short, these are the main sources of Chinese problems in Java, and we call the failure of a program to run correctly for any of these reasons the Java Chinese problem.
2. The detailed process of Java encoding conversion
Our common Java programs fall into the following categories:
* Classes run directly on the console (including classes with a graphical interface)
* JSP code classes (note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly
Any of these files may contain Chinese strings. The first three kinds of program usually interact with the user directly, outputting and reading characters; for example, a JSP or servlet receives characters sent by the client, and these may include Chinese characters. Whatever role these classes play, their life cycle is as follows:
* The programmer chooses an editor on some operating system, writes the source code, and saves it with the .java extension; for example, we edit a Java source file on Chinese Win2K;
* The programmer compiles the source code with javac.exe from the JDK to produce .class files (for a JSP file, the container calls the JDK to compile it);
* The classes are run directly, or deployed to a web container and run there, and the results are output.
So how do the JDK and the JVM encode and decode these files along the way?
Here we use Chinese Win2K as the example operating system to explain how Java classes are encoded and decoded.
Step 1: we use an editor such as Notepad on Chinese Win2K to write a Java source file (any of the five kinds of program above). By default the editor saves the file in the encoding that the operating system supports by default (the file.encoding value for that system), producing a .java file. In other words, before compilation our Java source file is stored in the operating system's default file.encoding format, and it contains both Chinese text and English program code. To view the file.encoding value of a system, you can use the following code:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default character encoding seen by the JVM
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}
Step 2: we compile the source file with javac.exe from the JDK. Because the JDK is an international version, if we do not specify the source encoding with the -encoding option, javac.exe first obtains the operating system's default encoding. That is, when no source encoding is given, the JDK reads the file.encoding value (which holds the operating system's default encoding; on Win2K it is GBK) and converts the source from the file.encoding format into Java's internal Unicode format in memory. javac then compiles the converted Unicode text into a class, which at this point is Unicode-encoded and held in memory, and the JDK then writes the compiled class to the operating system as the .class file we see. The .class file we end up with is therefore stored in a Unicode format. It contains the Chinese strings from our source, but they have been converted to Unicode from the file.encoding format.
For JSP source files this step is different. The web container calls the JSP compiler, which first checks whether the JSP file declares a file encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file, using the JVM's default character encoding (that is, the default file.encoding of the operating system on which the web container runs), into a temporary servlet class, then compiles it into a Unicode-format class and saves it in a temporary folder. For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and then compiles it into a temporary servlet class that responds to the user's request.
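The compile step can be reproduced from Java itself through the standard javax.tools compiler API, which accepts the same -encoding option as javac.exe. The following is a rough sketch rather than the author's procedure: the class CompileEncodingDemo, the generated class Greeting, and its field TEXT are hypothetical names, and the sketch assumes it runs on a full JDK (so that a system compiler is available) that ships the GBK charset. It saves a source file in one encoding, compiles it while declaring another, and reads back the string constant that ended up in the class file.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class CompileEncodingDemo {
    // Saves a tiny source file in sourceCharset, compiles it with
    // "-encoding javacEncoding", and returns the string constant javac stored.
    static String compileAndRead(String sourceCharset, String javacEncoding) {
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path source = dir.resolve("Greeting.java");
            // Greeting and TEXT are hypothetical names used only for this sketch.
            String code = "public class Greeting {"
                    + " public static final String TEXT = \"\u4e2d\u6587\"; }";
            Files.write(source, code.getBytes(Charset.forName(sourceCharset)));

            // Assumption: a JDK is in use; on a bare JRE this call returns null.
            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            int status = javac.run(null, null, null,
                    "-encoding", javacEncoding, source.toString());
            if (status != 0) {
                throw new IllegalStateException("javac exited with " + status);
            }

            // Load the freshly compiled class and read the constant back.
            try (URLClassLoader loader =
                    new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                return (String) loader.loadClass("Greeting").getField("TEXT").get(null);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u4e2d\u6587";
        // Source saved as GBK and declared as GBK.
        System.out.println(compileAndRead("GBK", "GBK").equals(expected));
        // Source saved as GBK but declared as ISO-8859-1.
        System.out.println(compileAndRead("GBK", "ISO-8859-1").equals(expected));
    }
}
```

When the declared encoding matches the one the file was saved in, the constant in the class file is the original Chinese string. When it does not match, compilation still succeeds, but the constant is already garbled inside the class file, and no later step can tell that anything went wrong.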
Step 3: running the classes compiled in step 2. There are four cases:
A, classes run directly on the console
B, EJB classes and support classes that cannot be run directly (such as JavaBean classes)
C, JSP code and servlet classes
D, Java programs and databases
We discuss these four cases in turn.
A, classes run directly on the console
In this case, running the class requires JVM support, that is, a JRE must be installed on the operating system. The process is as follows: first, the java launcher starts the JVM, which reads the class file stored in the operating system and loads its content into memory, where it is in Unicode format, and then runs it. If the class needs user input, it decodes that input using the file.encoding format by default and stores it in memory as Unicode (the user can set the encoding of the input stream). After the program runs, the resulting string (Unicode-encoded) is returned to the JVM, and finally the JRE converts the string to the file.encoding format (the user can set the encoding of the output stream), passes it to the operating system's display interface, and outputs it there.
For classes run directly on the console, the conversion process is shown in Figure 1:
Figure 1
Every conversion above must use the correct encoding for the final output to be free of garbled characters.
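Setting the encoding of the input and output streams explicitly, as mentioned above, might look like the following sketch. The class ConsoleEncodingDemo and its two helper methods are hypothetical; they work on in-memory streams so the behavior can be checked without a real console, but the same InputStreamReader and PrintStream constructors can wrap System.in and System.out.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ConsoleEncodingDemo {
    // Decodes one line of raw input bytes into a Unicode String.
    static String readLine(InputStream in, String charsetName) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, charsetName));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes a Unicode String into the bytes that would be sent to the console.
    static byte[] write(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The JVM default charset, normally derived from file.encoding.
        String encoding = Charset.defaultCharset().name();
        byte[] bytes = write("\u4e2d\u6587\n", encoding);
        String back = readLine(new ByteArrayInputStream(bytes), encoding);
        System.out.println(encoding + " round trip intact: "
                + "\u4e2d\u6587".equals(back));
    }
}
```

If the default encoding cannot represent Chinese (for example, a plain ASCII locale), the round trip in main fails; passing an explicit name such as "GBK" or "UTF-8" to both helpers is the programmatic counterpart of the user-set stream encodings described above.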
B, EJB classes and support classes that cannot be run directly (such as JavaBean classes)
EJB classes and support classes that cannot be run directly generally do not interact with the user for input and output; they usually interact with other classes. After being compiled in step 2 they are stored in the operating system as Unicode-encoded classes, and as long as nothing is lost when parameters are passed between them and other classes, they run correctly. The conversion process for these classes is shown in Figure 2:
Figure 2
C, JSP code and servlet classes
After step 2, a JSP file has also been converted into a servlet class file. Unlike a standard servlet it is not stored in the classes directory but in a temporary directory of the web container, so in this step we treat it as a servlet as well.
For a servlet, when a client requests it, the web container calls its JVM to run the servlet. First, the JVM reads the servlet's class file from the system and loads it into memory, where the servlet's code is Unicode-encoded, and then runs it. If the servlet needs to accept characters from the client while it runs, such as values entered in a form or passed in the URL, and the program does not set the encoding to use when accepting parameters, the web container uses ISO-8859-1 by default to decode the incoming values and converts them to Unicode in the JVM's memory. After the servlet runs, it produces output, and the output string is in Unicode. The container then sends the Unicode strings produced by the servlet (HTML markup, user output strings, and so on) to the client browser, which displays them to the user. If an output encoding was specified, the strings are sent to the browser in that encoding; if not, they are sent in ISO-8859-1 by default. The conversion process for JSP code and servlet classes is shown in Figure 3:
Figure 3
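The parameter-decoding step can be simulated without a web container. The sketch below is an assumption-laden illustration: ParameterRepairDemo, decodeLikeContainer, and repair are hypothetical names, the browser is assumed to have sent UTF-8 bytes, and the container is assumed to have decoded them as ISO-8859-1. In a real servlet the mis-decoded string would come from the request's parameters rather than from a byte array.

```java
import java.nio.charset.StandardCharsets;

class ParameterRepairDemo {
    // Simulates a container that decodes incoming request bytes as ISO-8859-1.
    static String decodeLikeContainer(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the wrong decoding: recover the original bytes, then decode
    // them with the encoding the browser is assumed to have used (UTF-8).
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typedByUser = "\u4e2d\u6587";
        byte[] sentByBrowser = typedByUser.getBytes(StandardCharsets.UTF_8);

        String seenByServlet = decodeLikeContainer(sentByBrowser);
        System.out.println("servlet sees original: " + seenByServlet.equals(typedByUser));
        System.out.println("after repair:          " + repair(seenByServlet).equals(typedByUser));
    }
}
```

The repair works because ISO-8859-1 maps every byte value to exactly one character, so the original bytes can always be recovered. Telling the container the right encoding before parameters are read avoids the detour altogether.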
D, Java programs and databases
For almost every database's JDBC driver, data passed between a Java program and the database uses ISO-8859-1 as the default encoding. When our program stores data containing Chinese, JDBC first converts the program's Unicode data to ISO-8859-1 and then passes it to the database, and the database also saves it as ISO-8859-1 by default. This is why Chinese data we read back from a database is so often garbled.
The data transfer between a Java program and a database is shown in Figure 4:
Figure 4
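Why a Unicode to ISO-8859-1 conversion damages Chinese text can be shown with plain Java, without any database. This sketch does not use JDBC; LossyStoreDemo, fitsInLatin1, and storeAndLoad are hypothetical names that only imitate the conversion described above.

```java
import java.nio.charset.StandardCharsets;

class LossyStoreDemo {
    // True if every character of the text can be represented in ISO-8859-1.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Imitates storing text as ISO-8859-1 bytes and reading it back.
    static String storeAndLoad(String text) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587";
        System.out.println("fits in ISO-8859-1: " + fitsInLatin1(chinese));
        // Characters with no ISO-8859-1 equivalent are replaced by '?'.
        System.out.println("read back: " + storeAndLoad(chinese));
    }
}
```

Unlike the servlet case, this loss cannot be undone afterwards: once the characters have been replaced, the original bytes are gone, so the encoding has to be right before the data is written.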
3. Principles that must be clear when analyzing common Java Chinese issues
First, the detailed analysis above shows that in the life cycle of any Java program the key encoding conversions are two: the conversion that happens when the source is first compiled into a class file, and the conversion that happens when output is finally sent to the user.
Second, we must know the encodings Java supports at compile time. The commonly used ones are:
* ISO-8859-1, 8-bit, also known as 8859_1, ISO-8859-1, ISO_8859_1, and so on
* Cp1252, American English encoding, the same as the ANSI standard code
* UTF-8, the same as Unicode encoding
* GB2312, the same as gb2312-80, gb2312-1980
* GBK, the same as MS936, an extension of GB2312
There are also other encodings, for Korean, Japanese, Traditional Chinese, and so on. We should also pay attention to the compatibility relationships among these encodings, which are as follows:
Unicode and UTF-8 correspond one to one. GB2312 can be regarded as a subset of GBK, that is, GBK is an extension of GB2312. GBK contains 20902 Chinese characters, its code range is 0x8140-0xFEFE, and all of its characters map one to one to Unicode 2.0.
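These relationships can be checked against the charsets that the JDK provides. The sketch below assumes the running JRE ships both the GB2312 and GBK charsets; CharsetRelationDemo and its methods are hypothetical names. It uses U+4E2D (a simplified character present in GB2312) and U+9F8D (the traditional form of "dragon", which GB2312 lacks) as sample characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRelationDemo {
    // True if the named charset has a code for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // True if both charsets encode the text to exactly the same bytes.
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        return Arrays.equals(text.getBytes(Charset.forName(charsetA)),
                             text.getBytes(Charset.forName(charsetB)));
    }

    // True if encoding to UTF-8 and decoding again returns the same text.
    static boolean utf8RoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        // A GB2312 character has the same bytes in GBK.
        System.out.println(sameBytes("\u4e2d", "GB2312", "GBK"));
        // A traditional character: covered by GBK but not by GB2312.
        System.out.println(canEncode("GBK", '\u9f8d'));
        System.out.println(canEncode("GB2312", '\u9f8d'));
        // Unicode text survives a UTF-8 round trip unchanged.
        System.out.println(utf8RoundTrip("\u4e2d\u9f8d"));
    }
}
```

A practical consequence is that declaring GBK is the safer choice whenever a file might contain characters outside GB2312, since text that is valid GB2312 is encoded identically in GBK.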
Third, for a .java source file stored in the operating system, we can specify the encoding of its content at compile time with the -encoding option. Note: if the source contains Chinese characters and you use -encoding to specify some other, non-Chinese encoding, the result is obviously wrong. If you use -encoding to specify GBK or GB2312 as the source encoding, there is no problem whether or not the source contains Chinese characters; the Chinese characters are converted to Unicode correctly and stored in the class file.
Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, and that almost all browsers pass parameters in UTF-8 by default. So even though our Java source file specifies the correct encoding on the way in and on the way out, inside the container it is still processed as ISO-8859-1.