Description: This article is the author's original work. The author can be reached at JOSSERCHAI@yahoo.com. The Chinese problem in Java programming is an old topic. After reading many discussions of it and comparing them with my own programming practice, I found that many of the methods proposed in the past neither explain the problem clearly nor solve it, especially where cross-platform use is concerned. I therefore wrote this article, which covers the Chinese problem in classes run from the console, Servlets, JSP, and EJB classes, analyzes each case, and recommends solutions. Comments and corrections are welcome.
Abstract: This article analyzes in depth how the Java compiler encodes and decodes Java source files and how the JVM encodes and decodes class files. From this analysis it derives the root cause of the Chinese problem in Java programming, and it finally proposes a recommended best solution.
1. The source of the Chinese problem
Early operating systems supported only single-byte character encodings, so software was originally written to process single-byte English text. As computing spread, Unicode was proposed to accommodate the languages of the world (Chinese included). Unicode, in the double-byte form discussed here, is compatible with English characters and with the characters of other languages, so most internationalized software now uses Unicode internally. At run time such software obtains the default encoding of the local platform (usually the operating system) and converts its internal Unicode text into that encoding for display. Java's JDK and JVM work this way. The JDK discussed here is the international version, which is what most programmers use; every mention of the JDK below means the international version.
Chinese is a double-byte language, and standards such as GB2312, GBK, and GBK2K were developed so that computers could process it. Most operating systems are localized for Chinese users and use GBK or GB2312 to display Chinese correctly. For example, Chinese Win2K uses GBK by default both for display and for saving files, so every file saved on Chinese Win2K is GBK-encoded unless another encoding is chosen. Note: GBK is an extension of GB2312.
Because Java uses Unicode internally, text must be converted between Unicode and the encodings supported by the operating system and the browser whenever a Java program runs. This conversion happens in a series of steps, and if any one of them goes wrong, the Chinese characters are displayed as garbage. This is the familiar Java Chinese problem.
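The effect of a single wrong conversion step can be reproduced in a few lines. The following is only a minimal sketch: the class and method names (GarbleDemo, transcode) are made up for illustration, the Chinese text is written with Unicode escapes so that the source file itself contains no Chinese bytes, and it assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbleDemo {
    // Encode text to bytes with one charset, then decode those bytes with another.
    // When the two charsets differ, the result is usually garbled.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters, as Unicode escapes
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        // Same charset on both sides: the text survives the round trip.
        System.out.println(transcode(chinese, gbk, gbk).equals(chinese)); // true

        // GBK bytes decoded as ISO-8859-1: four Latin-1 characters instead of two Chinese ones.
        String garbled = transcode(chinese, gbk, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(chinese)); // false
        System.out.println(garbled.length());        // 4
    }
}
```

Each GBK byte becomes one ISO-8859-1 character, which is why two Chinese characters turn into four unrelated Latin characters.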
In addition, Java is a cross-platform language: the programs we write should run not only on Chinese Windows but also on Chinese Linux and similar systems, and even on English-language systems (we often see people take a Java program written on Chinese Win2K and run it on English Linux). Moving a program between platforms in this way also causes Chinese problems.
Furthermore, some people use an English operating system and an English browser such as IE to run programs containing Chinese characters and to browse Chinese web pages. These environments do not support Chinese themselves, which again causes Chinese problems.
Finally, almost all browsers send parameters in UTF-8 by default rather than in a Chinese encoding, so passing Chinese parameters can also go wrong and produce garbage. In summary, these are the main sources of Chinese problems in Java. We call a program's failure to run correctly for any of these reasons the Java Chinese problem.
2. The detailed process of Java encoding conversion
Our common Java programs fall into the following categories:
* Classes run directly on the console (including classes with a visual interface)
* JSP code (note: a JSP is a variant of a Servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly
Any of these files may contain Chinese strings. The first three kinds usually interact with the user directly to read and write characters. For example, in a JSP or Servlet we receive characters sent by the client, and these may include Chinese characters. Whatever their role, all of these Java programs have the same life cycle:
* The programmer chooses an editor on some operating system, writes the source code, and saves it with the .java extension; for example, we edit a Java source file with Notepad on Chinese Win2K.
* The programmer compiles the source with javac.exe from the JDK to produce .class files (JSP files are compiled by the container when they are requested).
* The classes are run directly, or deployed to a web container and run there, and the results are output.
So how do the JDK and the JVM encode and decode these files along the way? We take the Chinese Win2K operating system as the example and explain how Java classes are encoded and decoded.
In the first step, we write the Java source file (any of the five kinds of Java program above) with an editor such as Notepad on Chinese Win2K. When the file is saved, the editor uses the operating system's default encoding, GBK, to produce the .java file. In other words, before compilation the Java source is stored in the operating system's default file.encoding format, and it contains both Chinese text and English program code. To see the system's file.encoding value, you can use the following code:
    public class ShowSystemDefaultEncoding {
        public static void main(String[] args) {
            // file.encoding holds the default encoding the JVM picked up from the platform
            String encoding = System.getProperty("file.encoding");
            System.out.println(encoding);
        }
    }
In the second step, we compile the Java source with javac.exe from the JDK. Because the JDK is an international version, if we do not use the -encoding option to state the encoding of the source file, javac first obtains the operating system's default encoding. That is, the JDK reads the file.encoding parameter (which holds the operating system's default encoding; on Chinese Win2K its value is GBK) and converts the source from the file.encoding format into Java's internal Unicode format in memory. javac then compiles this Unicode text into a class in memory and finally writes it to disk as the .class file we see. The .class file we end up with is therefore stored in a Unicode-based form (strictly speaking, string constants in a class file are held in a modified UTF-8). It contains the Chinese strings from our source, but by now they have been converted from the file.encoding format to Unicode.
JSP source files are handled differently in this step. The web container calls the JSP compiler, which first checks whether the JSP file declares its encoding. If it does not, the JSP compiler has the JDK convert the JSP file, using the JVM's default character encoding (that is, the default file.encoding of the operating system the web container runs on), into a temporary Servlet class. This class is then compiled into a class file in Unicode format and saved in a temporary folder.
For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and then compiles the temporary Servlet class that responds to the user's request.
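The two conversions in this step (source bytes to Unicode, then Unicode to the stored class-file form) can be imitated by hand. The sketch below is only an illustration and is not how javac is implemented: the class and method names are hypothetical, and DataOutputStream.writeUTF is used merely because it writes the same modified UTF-8 layout (a two-byte length followed by the encoded characters) that class files use for string constants.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ClassFileStringDemo {
    // Step 1: decode the bytes of a "source file" using the stated source encoding,
    // which is what -encoding (or file.encoding) tells the compiler to do.
    static String decodeSource(byte[] sourceBytes, String sourceEncoding) {
        return new String(sourceBytes, Charset.forName(sourceEncoding));
    }

    // Step 2: serialize the in-memory Unicode string as modified UTF-8.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // The GBK bytes of U+4E2D U+6587, as a GBK-saved source file would contain them.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        String text = decodeSource(gbkBytes, "GBK"); // assumption: GBK is available
        System.out.println(text.equals("\u4e2d\u6587")); // true

        // 2 length bytes + 3 bytes for each of the two characters.
        System.out.println(toModifiedUtf8(text).length); // 8
    }
}
```

If decodeSource is given the wrong encoding name, the string is already wrong before anything is written to the class file, and no later step can detect that.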
In the third step, we run the classes compiled in the second step. There are four cases:
A. Classes run directly on the console
B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
C. JSP code and Servlet classes
D. Java programs and databases
A. Classes run directly on the console
Running such a class requires JVM support, that is, a JRE must be installed on the operating system. The process is as follows. Java starts the JVM, which reads the class file stored on the operating system into memory; what is in memory at this point is the class in Unicode format. The JVM then runs it. If the class needs to receive user input, the input is by default decoded using the file.encoding format and converted to Unicode in memory (the user can set the encoding of the input stream). When the program has run, the resulting strings (in Unicode) are handed back to the JVM, and finally the JRE converts them to the file.encoding format (the user can set the encoding of the output stream), passes them to the operating system's display interface, and they appear on screen.
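The input and output conversions described above can be tried without a real console by using in-memory streams. This is a sketch under stated assumptions: the class and method names are invented for the example, a ByteArrayInputStream stands in for System.in, and GBK is assumed to be available in the JDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleCharsetDemo {
    // Input side: decode one line of bytes into a Unicode string with an explicit charset.
    static String readLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Output side: encode a Unicode string into bytes with an explicit charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // What a GBK console would deliver for U+4E2D U+6587 followed by Enter.
        byte[] typed = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4, 0x0A};

        String line = readLine(new ByteArrayInputStream(typed), gbk);
        System.out.println(line.equals("\u4e2d\u6587"));                   // true
        System.out.println(encode(line, gbk).length);                       // 4
        System.out.println(encode(line, StandardCharsets.UTF_8).length);    // 6
    }
}
```

The same Unicode string becomes four bytes in GBK and six in UTF-8, so the output charset has to match what the display expects. In a real console program the reader would wrap System.in and would normally not be closed.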
For classes run directly on the console, the conversion process is shown in Figure 1:
Figure 1
Every one of these conversions must use the correct encoding for the final output to be free of garbage.
B. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
EJB classes and support classes that cannot be run directly generally do not interact with the user; they exchange input and output with other classes. They were compiled in the second step into Unicode-encoded classes stored on the operating system, and from then on, as long as nothing is lost when parameters are passed between them and other classes, they run correctly. The conversion process for EJB classes and support classes that cannot be run directly is shown in Figure 2:
Figure 2
C. JSP code and Servlet classes
After the second step, a JSP file has also been turned into a Servlet class file. Unlike a standard Servlet, it does not live in the classes directory but in the web container's temporary directory, so in this step we treat it as a Servlet as well.
When a client requests a Servlet, the web container calls its JVM to run it. First, the JVM reads the Servlet's class file from the system and loads it into memory; what is in memory is the Servlet class in Unicode. The JVM then runs the Servlet class in memory. If the Servlet needs to receive characters from the client, such as values submitted in a form or passed in the URL, and the program has not set the encoding to use when receiving parameters, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the container's memory. When the Servlet has run, it produces output, and the output strings are in Unicode. The container then sends these strings (HTML markup, user-visible text, and so on) to the client browser, where they are displayed to the user. If an output encoding has been specified, the strings are sent to the browser in that encoding; if not, they are sent in the container's default encoding. The conversion process for JSP code and Servlet classes is shown in Figure 3:
Figure 3
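The default ISO-8859-1 decoding of request parameters described above, and the usual way to undo it after the fact, can be sketched in plain Java without a container. The class and method names below are hypothetical, and containerDefaultDecode only imitates what a container does when no request encoding has been set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRecoveryDemo {
    // Imitates a container that decodes incoming parameter bytes as ISO-8859-1.
    static String containerDefaultDecode(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    // Recovers the text: ISO-8859-1 maps every byte to one character and back,
    // so the original bytes can be restored and decoded with the charset
    // the browser actually used.
    static String recover(String misdecoded, Charset actualCharset) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        String typedByUser = "\u4e2d\u6587";
        Charset browserCharset = StandardCharsets.UTF_8; // assumption for this run

        byte[] onTheWire = typedByUser.getBytes(browserCharset);
        String seenByServlet = containerDefaultDecode(onTheWire);

        System.out.println(seenByServlet.equals(typedByUser));                          // false
        System.out.println(recover(seenByServlet, browserCharset).equals(typedByUser)); // true
    }
}
```

This repair only works when the first, wrong decoding was ISO-8859-1, because that charset loses no bytes. Setting the request encoding before any parameter is read avoids the need for it.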
D. Java programs and databases
With almost all JDBC drivers, data is by default transferred between a Java program and the database in ISO-8859-1. So when our program stores data containing Chinese, JDBC first converts the Unicode data inside the program to ISO-8859-1 and then passes it to the database, and the database saves it as ISO-8859-1 by default. This is why the Chinese data we read back from a database is so often garbled. The data transfer between Java programs and databases is shown in Figure 4:
Figure 4
3. Principles to be clear about when analyzing common Java Chinese problems
First, the detailed analysis above shows that in the life cycle of any Java program the key encoding conversions happen at two points: when the source is first compiled into a class file, and when output is finally sent to the user.
Second, we must know which encodings Java supports at compile time. The commonly used ones are:
* ISO-8859-1: 8-bit; the same as 8859_1, ISO-8859-1, ISO_8859_1, and similar names
* Cp1252: US English encoding; the same as the ANSI standard code
* UTF-8: the same as Unicode encoding
* GB2312: the same as gb2312-80, gb2312-1980, and similar names
* GBK: the same as MS936; an extension of GB2312
* Other encodings, such as Korean, Japanese, and Traditional Chinese
We also have to note how these encodings relate to each other. Unicode and UTF-8 correspond one to one. GB2312 can be regarded as a subset of GBK, that is, GBK extends GB2312. GBK contains 20902 Chinese characters in the code range 0x8140-0xFEFE, and every one of them maps one to one to a character in Unicode 2.0.
Third, when compiling a .java source file stored on the operating system, we can state the encoding of its content with the -encoding option. Note: if the source contains Chinese characters and -encoding names some unrelated encoding, the result is obviously wrong. If the source encoding is given as GBK or GB2312, then no matter what system we compile on, javac correctly converts the Chinese characters to Unicode and stores them in the class file.
Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers send parameters in UTF-8 by default.
So even though our Java source specifies the correct encoding at its entry and exit points, the data is still handled as ISO-8859-1 inside the container.
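The subset relationship between GB2312 and GBK stated above can be checked directly with the JDK's charset classes. The following is a small sketch, assuming the JDK in use includes the GB2312 and GBK charsets; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class CharsetRelationsDemo {
    // True when the text encodes to exactly the same bytes in both charsets.
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        byte[] a = text.getBytes(Charset.forName(charsetA));
        byte[] b = text.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    // True when every character of the text has a code in the named charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String simplified = "\u4e2d\u6587"; // characters present in GB2312
        String traditional = "\u9f8d";      // a traditional character, outside GB2312

        // Characters in GB2312 keep the same byte codes in GBK.
        System.out.println(sameBytes(simplified, "GB2312", "GBK")); // true

        // GBK covers characters that GB2312 lacks.
        System.out.println(canEncode(traditional, "GBK"));    // true
        System.out.println(canEncode(traditional, "GB2312")); // false
    }
}
```

A practical consequence is that naming GBK is the safer choice when it is unclear whether a text stays within GB2312.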
4. Classification of Chinese problems and the recommended best solutions
Now that we understand how Java processes files, we can propose what we consider the best way to handle Chinese characters. Our goal is this: a Java source program that contains Chinese strings or processes Chinese should, once compiled, run correctly when moved to any other operating system, or compile and run correctly on another operating system, pass Chinese and English parameters correctly, and exchange Chinese and English strings correctly with a database. Our approach is to convert encodings at the entry and exit points of the Java program, and to fix the encoding explicitly at the points closest to the user.
The specific solutions are as follows:
1. Classes run directly on the console
For this case we recommend that, if the program needs to receive input that may contain Chinese or to output Chinese, it use character streams for input and output. Specifically, use the following character-oriented node streams:
* For files: FileReader, FileWriter (the byte-oriented equivalents are FileInputStream, FileOutputStream)
* For memory (arrays): CharArrayReader, CharArrayWriter (the byte-oriented equivalents are ByteArrayInputStream, ByteArrayOutputStream)
* For memory (strings): StringReader, StringWriter
* For pipes: PipedReader, PipedWriter (the byte-oriented equivalents are PipedInputStream, PipedOutputStream)
In addition, input and output should pass through the following character-oriented processing streams:
* BufferedWriter, BufferedReader (the byte-oriented equivalents are BufferedInputStream, BufferedOutputStream)
* InputStreamReader, OutputStreamWriter (the byte-oriented equivalents are DataInputStream, DataOutputStream)
InputStreamReader and OutputStreamWriter convert between byte streams and character streams using a specified character set, for example:
    InputStreamReader in = new InputStreamReader(System.in, "GB2312");
    OutputStreamWriter out = new OutputStreamWriter(System.out, "GB2312");
The following Java program is an example that meets these requirements:
    //Read.java
    import java.io.*;
    public class Read {
        public static void main(String[] args) throws IOException {
            String str = "\nChinese test, this is internal hard-coded string" + "\ntest english character";
            String strin = "";
            // Set the input interface to use the Chinese encoding
            BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "gb2312"));
            // Set the output interface to use the Chinese encoding
            BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "gb2312"));
            stdout.write("Please enter:");
            stdout.flush();
            strin = stdin.readLine();
            stdout.write("This is from User Enter String: " + strin);
            stdout.write(str);
            stdout.flush();
        }
    }
We compile it as follows: javac -encoding gb2312 Read.java
The result of running it is shown in Figure 5:
Figure 5
2. EJB classes and support classes that cannot be run directly (such as JavaBean classes)
Because these classes are called by other classes and do not interact with the user directly, our recommendation is that input and output inside the program handle Chinese strings with character streams (as described in the section above), and that the class be compiled with the -encoding gb2312 option to state that the source file is in a Chinese encoding.
3. Servlet classes
For Servlets, we recommend the following:
When compiling the source of a Servlet class, use -encoding to give the encoding as GBK or GB2312. When outputting to the user, set the output encoding with response.setContentType("text/html; charset=GBK"); (or GB2312). When receiving user input, use request.setCharacterEncoding("GB2312");. With these settings, wherever the Servlet class is deployed, the output is displayed correctly as long as the client browser supports Chinese. The following is a correct example:
    //HelloWorld.java
    package hello;
    import java.io.*;
    import javax.servlet.*;
    import javax.servlet.http.*;
    public class HelloWorld extends HttpServlet {
        public void init() throws ServletException {}
        public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
            request.setCharacterEncoding("GB2312"); // set the input encoding format
            response.setContentType("text/html; charset=GB2312"); // set the output encoding format
            PrintWriter out = response.getWriter(); // PrintWriter is recommended for output
            out.println("
Test this Servlet program with a JSP page such as the following:
    <%@ page contentType="text/html; charset=GB2312" %>
    <% request.setCharacterEncoding("GB2312"); %>