Chinese character handling in Java is an old and much-discussed topic. After reading many articles about it and comparing them with my own programming practice, I found that many of the methods proposed in the past neither explain the problem clearly nor solve it, especially where cross-platform use is concerned.
So I wrote this article, which covers Chinese character problems in classes run from the console, servlets, JSP, and EJB classes. I analyze each case and recommend a solution. Comments and corrections are welcome.
1. The source of the Chinese problem
The earliest operating systems supported only single-byte character encodings, so all software was originally written to process single-byte English text.
As computing developed, Unicode was proposed so that the languages of the world (including, of course, Chinese) could be handled. It was originally a double-byte encoding that is compatible with English characters and with the characters of other languages, so most internationalized software now uses Unicode internally. When such software runs, it asks the local platform (usually the operating system) for its default encoding and converts its internal Unicode text to that encoding for display.
Java's JDK and JVM work this way. The JDK discussed here is the international version, which is what most programmers use; every mention of the JDK below refers to it. Chinese characters need a double-byte encoding, so standards such as GB2312, GBK, and GBK2K were developed to let computers process Chinese.
Most operating systems therefore come in localized versions that use GBK or GB2312 to display Chinese correctly. Chinese Windows, for example, displays text in GBK by default, and files saved on Chinese Windows 2000 are also GBK-encoded by default. Note: GBK is an extension of GB2312.
Java uses Unicode internally, so a running Java program has to convert between Unicode and the encodings supported by the operating system and the browser. The conversion happens in a series of steps, and if any one of them goes wrong, the Chinese characters are displayed as garbage. This is the familiar Java Chinese problem.
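The effect of one wrong step in that chain can be reproduced in a few lines. The sketch below is an illustration only: the class name GarbleDemo and its helper roundTrip are hypothetical, and it assumes a JDK that ships the GBK charset (standard full JDKs do). The Chinese text is written with Unicode escapes so the example does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbleDemo {
    // Encode text to bytes with one charset, then decode the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String chinese = "\u4e2d\u6587";      // two Chinese characters, written as escapes

        // Same charset on both sides: the text survives.
        String ok = roundTrip(chinese, gbk, gbk);
        // Mismatched charsets: each GBK byte becomes a separate Latin-1 character.
        String garbled = roundTrip(chinese, gbk, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(chinese));      // true
        System.out.println(garbled.equals(chinese)); // false
        System.out.println(garbled.length());        // 4, one char per GBK byte
    }
}
```

The bytes themselves are never damaged here; only the interpretation is wrong, which is why the same data can look fine on one system and garbled on another.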
Java is also a cross-platform language: a program we write should run not only on Chinese Windows but also on Chinese Linux and other systems, and on English-language systems as well (we often see people move a Java program written on Chinese Windows 2000 to English Linux). Porting a program in this way also causes Chinese problems.
Some people also use an English operating system and an English browser such as IE to run programs containing Chinese characters or to browse Chinese web pages. If that environment does not support Chinese, problems follow.
Almost all browsers send parameters in UTF-8 by default rather than in a Chinese encoding, so Chinese parameters can arrive garbled.
In summary, these are the main sources of Chinese problems in Java. We call a program's failure to run correctly for any of these reasons the Java Chinese problem.
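What a UTF-8 parameter looks like on the wire, and what happens when the receiver decodes it with a different charset, can be sketched with the standard URLEncoder and URLDecoder classes. This is an illustration rather than a browser capture: the class ParamEncodingDemo and its method names are hypothetical, and the Charset overloads used here assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParamEncodingDemo {
    // What a client that submits in UTF-8 would send for this value.
    static String encodeAsUtf8(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // What the receiver sees when it decodes the wire form with the given charset.
    static String decodeAs(String wire, Charset charset) {
        return URLDecoder.decode(wire, charset);
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        String wire = encodeAsUtf8(value);

        System.out.println(wire); // %E4%B8%AD%E6%96%87
        System.out.println(decodeAs(wire, StandardCharsets.UTF_8).equals(value));      // true
        System.out.println(decodeAs(wire, StandardCharsets.ISO_8859_1).equals(value)); // false
    }
}
```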
2. The encoding conversion process in detail
Common Java programs fall into the following categories:
* Classes run directly from the console (including classes with a graphical interface)
* JSP code classes (note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB classes
* Other supporting classes that cannot be run directly
Any of these files may contain Chinese strings. The first three kinds of program often exchange characters directly with the user; for example, a JSP or servlet receives characters sent by the client, and those characters may include Chinese. Whatever their role, all of these programs go through the following life cycle:
* The programmer chooses an editor on some operating system, writes the source code, and saves it with the .java extension; for example, we edit a Java source file on Chinese Windows 2000.
* The programmer compiles the source with javac.exe from the JDK, producing .class files (JSP files are compiled by the container, which calls the JDK).
* The classes are run directly, or deployed to a web container and run there, and the results are output.
So how do the JDK and the JVM encode and decode these files along the way?
We take Chinese Windows 2000 as the example operating system.
In the first step, we write a Java source file (any of the five kinds above) in an editor on Chinese Windows 2000. By default the editor saves the file in the operating system's default encoding, GBK, producing a .java file. In other words, before compilation the source file, which contains both Chinese text and English program code, is stored in the encoding given by the system's file.encoding property. The following code prints the value of file.encoding:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default encoding the JVM obtained from the platform
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}
In the second step, we compile the source file with javac.exe. Because the JDK is the international version, if we do not specify the source encoding with the -encoding option, javac first obtains the operating system's default encoding, that is, the file.encoding property (GBK on Windows 2000), and then converts the source from that encoding to Java's internal Unicode form in memory.
javac then compiles the Unicode text into a class file, which at this point exists only in memory, and finally writes it to disk as the .class file we see.
The resulting .class file therefore stores the Chinese strings from our source in a Unicode form (the class file format keeps string constants in a modified UTF-8 encoding), converted from the file.encoding format.
JSP source files are handled differently at this step. The web container calls the JSP compiler, which first checks whether the JSP file declares its own encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file, using the JVM's default character encoding (the file.encoding of the operating system on which the web container runs), into a temporary servlet class, then compiles it into a Unicode-format class file and saves it in a temporary folder. For example, on Chinese Windows 2000 the web container converts the JSP file from GBK to Unicode and then compiles it into a temporary servlet class that responds to user requests.
In the third step, the classes compiled in the second step are run. There are four cases:
A, classes run directly from the console
B, EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)
C, JSP code and servlet classes
D, Java programs and databases
We look at each of the four cases in turn.
A, classes run directly from the console
Running such a class requires a JVM, so a JRE must be installed on the operating system. The process is as follows. The java command starts the JVM, which reads the class file from disk into memory, where the class is held in Unicode form, and runs it. If the class needs user input, it decodes that input using the file.encoding encoding by default and stores it in memory as Unicode (the program can set the encoding of the input stream explicitly).
When the program has run, the resulting string (in Unicode) is handed back to the JVM, and the JRE converts it to the file.encoding format (again, the program can set the encoding of the output stream) and passes it to the operating system for display. Every one of these conversions must use the correct encoding if the final output is not to be garbled.
B, EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)
EJB classes and supporting classes that cannot be run directly generally do not interact with the user; they interact with other classes. After the second step they are stored on the operating system as class files whose content is in Unicode form. As long as nothing is lost when parameters are passed between them and other classes, they run correctly.
C, JSP code and servlet classes
After the second step a JSP file has also become a servlet class file. Unlike a standard servlet, it is not stored in the classes directory but in the web container's temporary directory, so in this step we treat it as a servlet as well.
When a client requests a servlet, the web container has its JVM run it. The JVM first reads the servlet's class file from disk and loads it into memory, where the servlet's code is held in Unicode form, and then runs it. If the servlet needs to accept characters from the client, such as form field values or values passed in the URL, and the program has not set the encoding to use for incoming parameters, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the JVM's memory.
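A commonly used workaround for a value that has already been decoded as ISO-8859-1 is to recover the raw bytes and decode them again with the charset the client really used. The sketch below simulates the container with plain strings rather than a real servlet request; the class ParamRepairDemo and the method repair are hypothetical names, and GB2312 is only assumed as the client's encoding for the sake of the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParamRepairDemo {
    // Undo a wrong ISO-8859-1 decoding: recover the raw bytes, then decode them properly.
    // This works because ISO-8859-1 maps every byte value to exactly one character.
    static String repair(String wronglyDecoded, Charset actual) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: the client sent GB2312 bytes
        String sent = "\u4e2d\u6587";

        // Simulate the container: GB2312 bytes decoded as ISO-8859-1.
        String received = new String(sent.getBytes(gb2312), StandardCharsets.ISO_8859_1);

        System.out.println(received.equals(sent));                 // false
        System.out.println(repair(received, gb2312).equals(sent)); // true
    }
}
```

The repair only works if the wrong decoding was lossless and the real client charset is known; setting the request encoding before any parameter is read avoids the need for it.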
When the servlet runs it produces output, and the output strings are in Unicode form. The container then sends these strings (HTML markup, user-visible text, and so on) to the client's browser. If an output encoding has been specified, the strings are sent to the browser in that encoding; if not, they are sent in ISO-8859-1 by default.
D, Java programs and databases
With the JDBC drivers of many databases, data passed between a Java program and the database uses ISO-8859-1 as the default encoding. When our program stores data containing Chinese, JDBC first converts the program's Unicode data to ISO-8859-1 and then passes it to the database, which by default also saves it as ISO-8859-1. This is why Chinese data read back from a database is so often garbled.
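Why a conversion from Unicode to ISO-8859-1 destroys Chinese text can be shown without a database. The sketch below does not call JDBC at all; it only imitates the conversion step described above, and the class LossyEncodeDemo and the method toLatin1 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyEncodeDemo {
    // Encode text as ISO-8859-1. Characters the charset cannot represent
    // are replaced by the charset's replacement byte, which is '?' (63).
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = toLatin1("\u4e2d\u6587 ok"); // two Chinese characters, a space, "ok"

        System.out.println(Arrays.toString(stored));                         // [63, 63, 32, 111, 107]
        System.out.println(new String(stored, StandardCharsets.ISO_8859_1)); // ?? ok
    }
}
```

Unlike a wrong decoding, this loss cannot be undone afterwards, because the original bytes are never stored.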
3. Points that must be clear before analyzing common Java Chinese problems
First, the analysis above shows that in the life cycle of any Java program the key encoding conversions are the one performed when the source is compiled into a class file and the one performed when output is finally sent to the user.
Second, we must know which encodings Java supports at compile time. The commonly used ones are:
* ISO-8859-1: 8-bit; also known as 8859_1, ISO-8859-1, ISO_8859_1, and so on
* Cp1252: the Windows Western (American English) encoding, commonly called the ANSI code page
* UTF-8: an encoding of Unicode
* GB2312: also known as GB2312-80 and GB2312-1980
* GBK: also known as MS936; an extension of GB2312
* Other encodings, such as those for Korean, Japanese, and Traditional Chinese
We should also note the following relationships between these encodings:
Unicode and UTF-8 correspond one-to-one. GB2312 can be regarded as a subset of GBK; that is, GBK is an extension of GB2312. GBK encodes the 20902 Chinese characters of Unicode 2.0 in the code range 0x8140 to 0xFEFE, and each of them maps one-to-one to its Unicode 2.0 character.
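The subset relationship can be checked directly with the charset API. The following sketch assumes a JDK that ships both the GB2312 and GBK charsets; the class GbSubsetDemo and its helpers are hypothetical names. It uses the simplified and traditional forms of one character, since GB2312 covers simplified characters only.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GbSubsetDemo {
    // True if the named charset has a byte sequence for this character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // True if both charsets encode the text to exactly the same bytes.
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        return Arrays.equals(text.getBytes(Charset.forName(charsetA)),
                             text.getBytes(Charset.forName(charsetB)));
    }

    public static void main(String[] args) {
        char simplified = '\u9f99';  // simplified form, present in GB2312
        char traditional = '\u9f8d'; // traditional form, present in GBK only

        System.out.println(canEncode("GB2312", simplified));  // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true
        // Text that GB2312 can represent has identical bytes in GBK.
        System.out.println(sameBytes("\u4e2d\u6587", "GB2312", "GBK")); // true
    }
}
```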
Third, for a .java source file stored on the operating system, we can specify the encoding of its content at compile time with the -encoding option. Note: if the source contains Chinese characters and -encoding names an encoding that cannot represent them, the result is obviously wrong.
If we specify GBK or GB2312 as the source encoding, then no matter which system we compile a Java source file containing Chinese on, javac converts the Chinese correctly to Unicode and stores it in the class file.
Fourth, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers send parameters in UTF-8 by default.
So although our Java source file has the correct encoding specified at its entry and exit points, inside the container the data is still handled as ISO-8859-1.
4. Classification of Chinese problems and the recommended solutions
Now that we understand how Java processes files, we can propose what we consider the best way to handle Chinese characters. Our goal is that a Java source program that contains Chinese strings or processes Chinese can be compiled on one system and moved to any other operating system, or compiled on that other system, and still run correctly, pass Chinese and English parameters correctly, and exchange strings with a database correctly. The idea is to fix the encoding explicitly at the entry and exit points where the Java program transcodes data, and wherever the program takes input from or sends output to the user. The specific solutions are as follows:
1. Classes run directly from the console
If such a program needs to receive input that may contain Chinese, or to produce output that contains Chinese, we recommend that it use character streams for input and output. Specifically, use the following character-oriented node streams:
For files: FileReader, FileWriter
Their byte node streams are: FileInputStream, FileOutputStream
For memory (arrays): CharArrayReader, CharArrayWriter
Their byte node streams are: ByteArrayInputStream, ByteArrayOutputStream
For memory (strings): StringReader, StringWriter
For pipes: PipedReader, PipedWriter
Their byte node streams are: PipedInputStream, PipedOutputStream
You should also use the following character-oriented processing streams for input and output:
BufferedWriter, BufferedReader
Their byte processing streams are: BufferedInputStream, BufferedOutputStream
InputStreamReader, OutputStreamWriter
Their byte processing streams are: DataInputStream, DataOutputStream
InputStreamReader and OutputStreamWriter convert between byte streams and character streams according to a specified character encoding, for example:
InputStreamReader in = new InputStreamReader(System.in, "GB2312");
OutputStreamWriter out = new OutputStreamWriter(System.out, "GB2312");
For example, the following Java program meets these requirements:
// Read.java
import java.io.*;

public class Read {
    public static void main(String[] args) throws IOException {
        String str = "\nChinese test, this is internal hardcoded string" + "\ntest english character";
        String strin = "";
        // Input interface that decodes with the Chinese encoding
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "gb2312"));
        // Output interface that encodes with the Chinese encoding
        BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "gb2312"));
        stdout.write("Please enter:");
        stdout.flush();
        strin = stdin.readLine();
        stdout.write("This is from the user input string:" + strin);
        stdout.write(str);
        stdout.flush();
    }
}
At the same time, we compile the program as follows:
javac -encoding GB2312 Read.java
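The same two bridge classes work on any byte stream, not only the console. The sketch below replaces System.in and System.out with in-memory streams so that the bytes produced by the conversion can be inspected; the class StreamBridgeDemo and its methods are hypothetical names, and a JDK with the GB2312 charset is assumed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class StreamBridgeDemo {
    // Write text through an OutputStreamWriter and return the bytes it produced.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Read one line of text from bytes through an InputStreamReader.
    static String decodeLine(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        byte[] bytes = encode(text, gb2312);
        System.out.println(bytes.length);                           // 4, two bytes per character
        System.out.println(decodeLine(bytes, gb2312).equals(text)); // true
    }
}
```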
2. EJB classes and supporting classes that cannot be run directly (such as JavaBean classes)
Because classes of this kind are called by other classes and do not interact with the user directly, our recommendation is that they handle Chinese strings internally with character streams (as described in the previous section) and that they be compiled with the -encoding GB2312 option to indicate that the source file uses a Chinese encoding.
3. Servlet classes
For servlets we recommend the following:
When compiling the servlet's source, use -encoding to specify GBK or GB2312. When sending output to the user, set the output encoding with response.setContentType("text/html; charset=GBK"); (or GB2312). When receiving user input, call request.setCharacterEncoding("GB2312");. Then, wherever the servlet class is moved to, Chinese is displayed correctly as long as the client's browser supports Chinese. Here is a correct example:
// HelloWorld.java
package hello;

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloWorld extends HttpServlet {
    public void init() throws ServletException {
    }

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        request.setCharacterEncoding("GB2312"); // set the input encoding
        response.setContentType("text/html; charset=GB2312"); // set the output encoding
        PrintWriter out = response.getWriter(); // PrintWriter is recommended for output
        out.println("
The following JSP can be used to test this servlet:
<%@ page contentType="text/html; charset=GB2312" %>
<% request.setCharacterEncoding("GB2312"); %>