An In-Depth Analysis of Chinese Character Problems in Java Programming, with Recommended Solutions. Abnerchai (original)
Keywords: Java, Chinese problem, Unicode, GB2312, XML, encoding
Description: This is an original article; the author can be reached at josserchai@yahoo.com. The Chinese character problem in Java programming is an old topic, but after reading many discussions of it and comparing them with my own programming practice, I found that many of the methods proposed in the past neither explain the problem clearly nor solve it, especially when programs move across platforms. I therefore wrote this article, which analyzes the Chinese problem in classes run from the console, Servlets, JSP, and EJB classes, and recommends solutions. Comments are welcome. If you quote this article, please cite the source.

Abstract: This article analyzes in depth how the Java compiler encodes and decodes Java source files and how the JVM encodes and decodes class files. By walking through this process it identifies the root cause of the Chinese problem in Java programming, and finally gives recommended solutions.

1. The source of the Chinese problem

Early operating systems supported only single-byte character encodings, so all programs were originally written to process single-byte English text. As computing developed, and in order to support the languages of every nation (including, of course, Chinese), Unicode was proposed. It was designed around double-byte encoding and is compatible with English characters and with the characters of other languages, so most international software now uses Unicode internally. When such software runs, it obtains the default encoding of the local system (usually the operating system) and converts its internal Unicode text to that encoding for display. The JDK and JVM work this way. The JDK discussed here is the international version, which is what most programmers use.
Throughout this article, "JDK" means the international version. Chinese is a double-byte encoded language. To let computers process Chinese, standards such as GB2312, GBK, and GBK2K were developed. Most operating systems used for Chinese are therefore localized versions that use GBK or GB2312 to display Chinese characters correctly. Chinese Win2K, for example, displays text in GBK by default and also saves files in GBK by default. Note: GBK is an extension of GB2312.

Because the Java language uses Unicode internally, a Java program must convert between Unicode and the encoding supported by the operating system or the browser, both when it is compiled and when it runs. This conversion involves a series of steps, and if any step goes wrong the Chinese characters are displayed as garbled text. This is the familiar Java Chinese problem.

Java is also a cross-platform language: a program we write may run not only on Chinese Windows but also on Chinese Linux, and may even be expected to run on English-language systems (people often take a Java program written on Chinese Win2K and run it on English Linux). Porting a program this way can also cause Chinese problems.

Some people, moreover, use English operating systems and English browsers such as IE to run programs containing Chinese characters or to browse Chinese web pages. These systems do not support Chinese, which again causes Chinese problems.
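The garbling described above can be reproduced in a few lines. The following is a minimal sketch, not code from the original article: the class name EncodingMismatchDemo and the helper transcode are illustrative, and it assumes a JDK that ships the GBK charset. The Chinese text is written with Unicode escapes so the example does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: one Unicode string yields different bytes per encoding,
// and decoding bytes with the wrong encoding yields garbled text.
class EncodingMismatchDemo {
    // Two Chinese characters (zhong wen, "Chinese language") as Unicode escapes
    static final String TEXT = "\u4e2d\u6587";
    // Assumption: the running JDK supports the GBK charset
    static final Charset GBK = Charset.forName("GBK");

    // Encode the text with one charset, then decode the bytes with another
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        System.out.println("GBK byte count:   " + TEXT.getBytes(GBK).length);
        System.out.println("UTF-8 byte count: " + TEXT.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("GBK -> GBK intact:        "
                + transcode(TEXT, GBK, GBK).equals(TEXT));
        System.out.println("GBK -> ISO-8859-1 intact: "
                + transcode(TEXT, GBK, StandardCharsets.ISO_8859_1).equals(TEXT));
    }
}
```

The string survives only when the decoding charset matches the encoding charset; every conversion step in the chain has to agree in this way.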
In addition, almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so Chinese parameters can arrive garbled.

In summary, these are the main sources of Chinese problems in Java. We call any failure of a program to run correctly for the reasons above a Java Chinese problem.

2. Java encoding conversion in detail

Common Java programs fall into the following categories:

* Classes run directly from the console (including classes with a visual interface)
* JSP code (note: JSP is a variant of the Servlet class)
* Servlets
* EJB classes
* Other supporting classes that cannot be run directly

Any of these class files may contain Chinese strings. The first three kinds usually interact with users directly, taking character input and producing character output; in JSP and Servlets, for example, we receive characters sent by the client, and those characters may include Chinese. Whatever its category, every Java program has the following life cycle:

* The programmer chooses an editor on some operating system, writes the source code, and saves it as a file in that operating system. For example, we edit a Java source program on Chinese Win2K.
* The programmer compiles the source code with javac.exe from the JDK, producing .class files (JSP files are compiled by the container).
* The classes are run directly, or deployed to a web container and run there, and the results are output.

So how do the JDK and the JVM encode and decode these files along the way? We use Chinese Win2K as the example operating system to explain how Java classes are encoded and decoded.
Step 1. We write the Java source program (any of the five kinds listed above) in an editor such as Notepad on Chinese Win2K. When the file is saved, it is saved as a .java file in the operating system's default encoding, GBK. In other words, before compilation the Java source, which contains Chinese characters alongside the English program code, is stored in the encoding named by the system's file.encoding property. To view the file.encoding property of your system, you can use the following code:

    public class ShowSystemDefaultEncoding {
        public static void main(String[] args) {
            // file.encoding holds the default encoding the JVM obtained from the operating system
            String encoding = System.getProperty("file.encoding");
            System.out.println(encoding);
        }
    }

Step 2. We compile the Java source program with javac.exe from the JDK. Because the JDK is the international version, if we do not specify the encoding of the source file with the -encoding option, javac first reads the operating system's default encoding, that is, the file.encoding property (on Win2K its value is GBK). It then converts the source program from the file.encoding format to Java's internal Unicode format and holds it in memory.
javac then compiles the Unicode source into a .class file. At this point the class file is Unicode encoded and held temporarily in memory; the JDK then writes the compiled, Unicode-encoded class to the operating system, producing the .class file we see. The .class file we finally obtain is therefore stored in Unicode and contains the Chinese strings from our source program, now converted from the file.encoding format to Unicode.

JSP source files are handled differently in this step. The web container calls the JSP compiler, which first checks whether the JSP file sets a file encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file into a temporary Servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system that hosts the web container), then compiles it into a Unicode class file and saves it in a temporary folder. For example, on Chinese Win2K the web container converts the JSP file from GBK to Unicode and then compiles it into a temporary Servlet class that responds to the user's request.

Step 3. We run the classes compiled in step 2. There are four cases:

A. Classes run directly from the console
B. EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)
C. JSP code and Servlet classes
D. Communication between Java programs and databases

We analyze the four cases in turn.

A. Classes run directly from the console

Running such a class requires JVM support, that is, a JRE must be installed on the operating system.
The process runs as follows. Java first starts the JVM, which reads the class file saved in the operating system into memory; what is in memory at this point is the class in Unicode format. The JVM then runs it. If the class needs user input, by default it interprets that input in the file.encoding format and converts it to Unicode in memory (the program can set the encoding of the input stream). After the program runs, the resulting strings (Unicode encoded) are handed back to the JVM, and finally the JRE converts them to the file.encoding format (the program can set the encoding of the output stream), passes them to the operating system's display interface, and outputs them. Figure 1 shows the conversion process for a class run directly from the console.

Figure 1

Every conversion step in this process must use the correct encoding for the final output to be free of garbled text.

B. EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)

Because these classes cannot be run directly, they generally do not interact with users for input and output; they exchange input and output with other classes. After being compiled in step 2 they are saved in the operating system as Unicode-encoded class files, and as long as nothing is lost when parameters are passed between them and other classes, they run correctly.
Figure 2 shows the conversion process for EJB classes and supporting classes that cannot be run directly.

Figure 2

C. JSP code and Servlet classes

After step 2, a JSP file has also been converted to a Servlet class file. Unlike a standard Servlet, it does not live in the classes directory but in a temporary directory of the web container, so in this step we treat it as a Servlet too.

When a client requests a Servlet, the web container calls its JVM to run it. The JVM first reads the Servlet's class file from the system and loads it into memory, where the Servlet's code is in Unicode, and then runs it. If the Servlet needs to accept characters from the client while it runs, such as values entered in a form or passed in the URL, and the program has not set the encoding to use when accepting parameters, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the memory of the container's JVM. When the Servlet has run it produces output, and the output strings are in Unicode. The container then sends these strings (HTML markup, strings output to the user, and so on) to the client's browser for display. If the program specifies an output encoding, the strings are sent to the browser in that encoding; if not, they are sent in ISO-8859-1 by default.
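The effect of the container's ISO-8859-1 default on an incoming parameter can be simulated without a web container. The sketch below is illustrative only: the class RequestParameterRepair and its methods are hypothetical names, the "wire" is modeled as a plain byte array, and GBK is assumed to be available in the JDK. It also shows a common workaround, which is to turn the mis-decoded string back into bytes as ISO-8859-1 and decode them again with the charset the client actually used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: a parameter decoded as ISO-8859-1 by default, then repaired.
class RequestParameterRepair {
    // Models the browser: the parameter value leaves as bytes in the client's charset
    static byte[] clientSend(String value, Charset clientCharset) {
        return value.getBytes(clientCharset);
    }

    // Models a container that was never told the request encoding
    static String containerDecode(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 maps every byte to exactly one char, so the original bytes
    // can be recovered and decoded again with the real charset
    static String repair(String garbled, Charset clientCharset) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), clientCharset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        String original = "\u4e2d\u6587";     // two Chinese characters as Unicode escapes
        String garbled = containerDecode(clientSend(original, gbk));
        System.out.println("garbled equals original:  " + garbled.equals(original));
        System.out.println("repaired equals original: " + repair(garbled, gbk).equals(original));
    }
}
```

The repair works only because ISO-8859-1 decoding loses no bytes. Telling the container the right encoding before any parameter is read avoids the detour altogether.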
Figure 3 shows the conversion process for JSP code and Servlet classes.

Figure 3

D. Communication between Java programs and databases

With almost all database JDBC drivers, data passed between a Java program and the database uses ISO-8859-1 as the default encoding. When our program stores data containing Chinese in the database, JDBC first converts the program's internal Unicode data to ISO-8859-1 and then passes it to the database, and the database in turn saves data as ISO-8859-1 by default. This is why Chinese data read back from a database is so often garbled. Figure 4 shows how data passes between a Java program and a database.

Figure 4

3. Analyzing common Java Chinese problems

First, the detailed analysis above shows that in the life cycle of any Java program, the critical encoding conversions are the one performed when the source is first compiled into a class file and the one performed when output is finally delivered to the user.

Second, we must know which encodings Java supports. The commonly used ones are:

* ISO-8859-1: 8-bit, the same as 8859_1, ISO-8859-1, ISO_8859_1, and so on
* Cp1252: US English encoding, the same as the ANSI standard encoding
* UTF-8: the same as Unicode encoding
* GB2312: the same as gb2312-80, gb2312-1980, and so on
* GBK: the same as MS936; an extension of GB2312

There are other encodings as well, for Korean, Japanese, Traditional Chinese, and so on. We should also note the compatibility relationships among these encodings: Unicode and UTF-8 correspond one to one, and GB2312 can be regarded as a subset of GBK, that is, GBK is an extension of GB2312.
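The alias and subset relationships above can be checked against a running JVM. This is a small sketch under stated assumptions: the class CharsetRelationsDemo and its helpers are illustrative names, and it assumes a JDK that ships both the GB2312 and GBK charsets. It uses a traditional-form character as an example of something GBK covers and GB2312 does not.

```java
import java.nio.charset.Charset;

// Illustrative sketch: querying charset aliases and coverage from the JVM.
class CharsetRelationsDemo {
    // True if the named charset can represent the character
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Canonical name that the JVM resolves an alias to
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        char simplified = '\u4e2d';  // a common character used in simplified Chinese
        char traditional = '\u570b'; // a traditional-form character

        System.out.println("8859_1 resolves to: " + canonicalName("8859_1"));
        System.out.println("GB2312 encodes simplified:  " + canEncode("GB2312", simplified));
        System.out.println("GBK encodes simplified:     " + canEncode("GBK", simplified));
        System.out.println("GB2312 encodes traditional: " + canEncode("GB2312", traditional));
        System.out.println("GBK encodes traditional:    " + canEncode("GBK", traditional));
    }
}
```

A character that GB2312 cannot encode is replaced on output, typically with a question mark, which is one reason to prefer GBK when the text may contain characters outside GB2312.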
GBK contains 20902 Chinese characters in the code range 0x8140-0xFEFE, and every one of them maps one to one to a character in Unicode 2.0.

Third, when compiling a .java source file stored in the operating system, we can specify the encoding of its content with the -encoding option. Note that if the source program contains Chinese characters, naming some non-Chinese encoding with -encoding is obviously wrong. If we specify GBK or GB2312 as the source encoding, then no matter which system we compile on, javac converts the Chinese characters to Unicode correctly and stores them in the class file.

Fourth, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters in UTF-8 by default. So even though our Java source file specifies the correct encoding at its entry and exit points, inside the container it is still handled as ISO-8859-1.

4. Classification of Chinese problems and recommended solutions

Now that we understand how Java processes files, we can propose recommended solutions for handling Chinese characters. Our goal is this: a Java source program that contains Chinese strings or processes Chinese, written on a Chinese system, should, once compiled, run correctly when moved to any other operating system, or compile and run correctly on another operating system; it should pass Chinese and English parameters correctly and exchange strings with the database correctly. Our approach is to restrict the encoding at the entry and exit points where the Java program converts encodings, and at the points where it exchanges input and output with the user.
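The idea of fixing the encoding at a program's entry and exit points can be tried without a console by wrapping in-memory byte streams. The following is a minimal sketch, not the article's own code: EchoWithCharset and its methods are hypothetical names, and GBK is assumed to be available. The same wrapping applies to System.in and System.out.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Illustrative sketch: byte streams wrapped in character streams with an explicit charset.
class EchoWithCharset {
    // Reads one line from 'in' and writes it to 'out', decoding and encoding with 'cs'
    static void echoLine(InputStream in, OutputStream out, Charset cs) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, cs));
        String line = reader.readLine();          // Unicode inside the program
        writer.write(line == null ? "" : line);   // back to bytes in the chosen charset
        writer.flush();
    }

    // Convenience wrapper over byte arrays, so the sketch needs no real console
    static byte[] echoBytes(byte[] typed, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            echoLine(new ByteArrayInputStream(typed), out, cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");               // assumption: GBK is available
        byte[] typed = "\u4e2d\u6587\n".getBytes(gbk);      // what a GBK console would deliver
        byte[] printed = echoBytes(typed, gbk);
        System.out.println("bytes written: " + printed.length);
        System.out.println("text intact:   " + new String(printed, gbk).equals("\u4e2d\u6587"));
    }
}
```

Because the charset is passed explicitly at both ends, the result does not depend on the file.encoding of the machine that runs the program.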
The specific solutions are as follows:

1. For classes run directly from the console, we recommend that if the program needs to receive user input that may contain Chinese, or to output Chinese, it use character streams for input and output. Specifically, use the following character-oriented node streams:

* Files: FileReader, FileWriter (byte-oriented counterparts: FileInputStream, FileOutputStream)
* Memory (arrays): CharArrayReader, CharArrayWriter (byte-oriented counterparts: ByteArrayInputStream, ByteArrayOutputStream)
* Memory (strings): StringReader, StringWriter
* Pipes: PipedReader, PipedWriter (byte-oriented counterparts: PipedInputStream, PipedOutputStream)

Also use the following character-oriented processing streams:

* BufferedWriter, BufferedReader (byte-oriented counterparts: BufferedInputStream, BufferedOutputStream)
* InputStreamReader, OutputStreamWriter (byte-oriented counterparts: DataInputStream, DataOutputStream)

InputStreamReader and OutputStreamWriter convert between a byte stream and a character stream according to a specified character set, for example:

    InputStreamReader in = new InputStreamReader(System.in, "GB2312");
    OutputStreamWriter out = new OutputStreamWriter(System.out, "GB2312");

The following Java example meets these encoding requirements:

    // Read.java
    import java.io.*;

    public class Read {
        public static void main(String[] args) throws IOException {
            // In the original listing the first literal is Chinese text
            String str = "\nChinese test: this is an internal hard-coded string"
                    + "\ntest english character";
            String strin = "";
            // Set the input interface to use the Chinese encoding
            BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "GB2312"));
            // Set the output interface to use the Chinese encoding
            BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "GB2312"));
            stdout.write("Please enter:");
            stdout.flush();
            strin = stdin.readLine();
            stdout.write("This is a string input from the user: " + strin);
            stdout.write(str);
            stdout.flush();
        }
    }

Compile the program with:

    javac -encoding GB2312 Read.java

The result of running it is shown in Figure 5.

Figure 5

2. For EJB classes and supporting classes that cannot be run directly (such as JavaBean classes): because these classes are called by other classes and do not interact with the user directly, we recommend that the program use character streams to handle any Chinese strings internally (as in item 1 above), and that the classes be compiled with the -encoding GB2312 option to state that the source file is in a Chinese encoding.

3. For Servlet classes, we recommend the following: when compiling the Servlet source, use -encoding to specify GBK or GB2312 as its encoding; when outputting to the user, use response.setContentType("text/html;charset=GBK"); (or GB2312) to set the output encoding; and when receiving user input, use request.setCharacterEncoding("GB2312");. This way, no matter which operating system our Servlet class is moved to, Chinese is displayed correctly as long as the client's browser supports Chinese display.
The following is a correct example:

    // HelloWorld.java
    package hello;

    import java.io.*;
    import javax.servlet.*;
    import javax.servlet.http.*;

    public class HelloWorld extends HttpServlet {
        public void init() throws ServletException {
        }

        public void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException, ServletException {
            request.setCharacterEncoding("GB2312");              // set the input encoding
            response.setContentType("text/html;charset=GB2312"); // set the output encoding
            PrintWriter out = response.getWriter();              // PrintWriter is recommended for output
            out.println("
Test this Servlet program as follows:

    <%@ page contentType="text/html; charset=GB2312" %>
    <% request.setCharacterEncoding("GB2312"); %>