Overview
Every region of the world has its own language, and regional differences lead directly to differences in language environment. When developing an internationalized product, handling language issues correctly is important.
This is a worldwide problem, so Java provides a worldwide solution. The methods described here are used to handle Chinese, but they apply equally to the languages of other countries and regions.
Chinese characters are double-byte. Double-byte means that one character occupies two bytes (16 bits), called the high byte and the low byte. China's national standard for encoding Chinese characters is GB2312, which is mandatory; virtually every application that can handle Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters and 9 zones of symbols. The high byte runs from 0xA1 to 0xFE and the low byte from 0xA1 to 0xFE; the Chinese characters themselves occupy 0xB0A1 to 0xF7FE.
There is also an encoding called GBK, but it is a specification rather than a mandatory standard. GBK provides 20,902 Chinese characters, is compatible with GB2312, and covers the range 0x8140 to 0xFEFE. Every character in GBK maps one-to-one to Unicode 2.0.
In the near future China will promulgate another standard, GB18030-2000 (GBK2K). It includes the scripts of China's minority nationalities and fundamentally solves the shortage of code positions. Note that it is no longer fixed-length: its two-byte part is compatible with GBK, while its four-byte part holds the extended characters and glyphs. In the four-byte part, the first and third bytes range from 0x81 to 0xFE and the second and fourth bytes from 0x30 to 0x39.
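The byte ranges quoted above can be checked mechanically. The following is a minimal sketch, not part of the original article: the class GbRanges and its method names are hypothetical, and the checks use only the ranges stated in the text (real GBK additionally excludes 0x7F as a second byte, which this coarse check ignores).

```java
// Hypothetical helper: classifies byte values using only the ranges quoted in the text.
class GbRanges {
    /** True if (high, low) lies in the GB2312 Chinese-character area 0xB0A1..0xF7FE. */
    static boolean isGb2312Hanzi(int high, int low) {
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    /** Coarse check against the overall GBK range 0x8140..0xFEFE. */
    static boolean isInGbkRange(int high, int low) {
        return high >= 0x81 && high <= 0xFE && low >= 0x40 && low <= 0xFE;
    }

    /** True if the bytes match the four-byte pattern: bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39. */
    static boolean isGb18030FourByte(int b1, int b2, int b3, int b4) {
        return b1 >= 0x81 && b1 <= 0xFE && b2 >= 0x30 && b2 <= 0x39
            && b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312Hanzi(0xD6, 0xD0));                 // D6D0 is inside the hanzi area
        System.out.println(isGb2312Hanzi(0x81, 0x40));                 // 8140 is outside GB2312
        System.out.println(isInGbkRange(0x81, 0x40));                  // but inside the GBK range
        System.out.println(isGb18030FourByte(0x81, 0x30, 0x81, 0x30)); // four-byte pattern
    }
}
```

Because a second byte in 0x30..0x39 can never be a valid GBK second byte (those start at 0x40), a decoder can tell the two-byte and four-byte forms apart after reading two bytes.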
This article does not intend to introduce Unicode; interested readers can browse http://www.unicode.org/ for more information. Unicode has one key feature: it aims to include all the characters in the world. The language of every region can therefore be mapped to Unicode, and Java uses this to convert between different character sets.
In the JDK, the encodings related to Chinese are:
Table 1 Encodings related to Chinese in the JDK
Encoding name | Description
ASCII | 7-bit; same as ASCII7
ISO8859-1 | 8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, Latin1, etc.
GB2312-80 | 16-bit; same as GB2312, GB2312-1980, EUC_CN, EUCCN, 1381, CP1381, 1383, CP1383, ISO2022CN, ISO2022CN_GB, etc.
GBK | Note: the name is case-sensitive
UTF8 | Same as UTF-8
GB18030 | Same as CP1392, 1392; supported by very few JDKs at present
In actual programming, the encodings encountered most often are GB2312 (GBK) and ISO8859-1.
Why is there "?"
As stated above, conversion between different character sets goes through Unicode. Suppose there are two character sets A and B. The conversion steps are: first convert A to Unicode, then convert Unicode to B.
For example, the Chinese character "李" (Li) is encoded in GB2312 as "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "李" to Unicode, giving "674E", then convert "674E" to an ISO8859-1 character. Of course this mapping will not succeed, because ISO8859-1 has no character corresponding to "674E".
When a mapping fails, the problem appears. When converting from some character set to Unicode, if the character set has no such character, the result is the Unicode code "\uFFFD" ("\u" denotes a Unicode code). When converting from Unicode to some character set, if that character set has no corresponding character, the result is "0x3F" ("?"). This is where the "?" comes from. For example, applying new String(buf, "GB2312") to the byte stream buf = "0x80 0x40 0xB0 0xA1" gives "\uFFFD\u554A"; printing it with println gives "?啊", because "0x80 0x40" is not a valid GB2312 code.
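Both failure directions described above can be observed directly. This is a minimal sketch, assuming a JDK that ships the GB2312 charset; the class name ReplacementDemo and its helper methods are hypothetical. How a modern decoder treats the byte 0x40 that follows the invalid 0x80 may differ from the result quoted in the text, so the sketch only relies on the replacement character appearing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the two replacement rules: U+FFFD on decode, 0x3F on encode.
class ReplacementDemo {
    /** Unicode -> target charset; unmappable characters become the encoder's replacement byte. */
    static byte[] encode(String s, Charset target) {
        return s.getBytes(target);
    }

    /** Source charset -> Unicode; invalid byte sequences become U+FFFD. */
    static String decode(byte[] bytes, Charset source) {
        return new String(bytes, source);
    }

    public static void main(String[] args) {
        // U+674E has no counterpart in ISO-8859-1, so a single '?' byte comes out.
        byte[] li = encode("\u674e", StandardCharsets.ISO_8859_1);
        System.out.printf("0x%02X%n", li[0] & 0xFF);

        // 0x80 cannot start a GB2312 character; B0 A1 is a valid one.
        byte[] buf = {(byte) 0x80, 0x40, (byte) 0xB0, (byte) 0xA1};
        String decoded = decode(buf, Charset.forName("GB2312"));
        for (char c : decoded.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```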
As another example, calling getBytes("GBK") on the string "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" gives "3FA8ACA8A6463FA8B4". Here "\u00d6" has no corresponding character in GBK and yields "3F"; "\u00ec" corresponds to "A8AC"; "\u00e9" corresponds to "A8A6"; "\u0046" corresponds to "46" (it is an ASCII character); "\u00bb" is not found and yields "3F"; finally "\u00f9" corresponds to "A8B4". Printing this string with println gives "?ìéF?ù". Notice that not everything is a question mark: the mapping between GBK and Unicode covers characters other than Chinese characters, and this example illustrates that well.
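The GBK example can be reproduced with a few lines of Java. This is an illustrative sketch, assuming the JDK in use provides the GBK charset; GbkHexDemo, toHex and gbkHex are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical demo: encode a Latin string to GBK and show the bytes as hex.
class GbkHexDemo {
    /** Formats bytes as upper-case hex without separators. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the string with GBK; characters missing from GBK become 0x3F. */
    static String gbkHex(String s) {
        return toHex(s.getBytes(Charset.forName("GBK")));
    }

    public static void main(String[] args) {
        System.out.println(gbkHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9"));
    }
}
```

Printing per-character results (for instance gbkHex of each single character) makes it easy to see which characters were mapped and which were replaced.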
So when transcoding Chinese characters goes wrong, the result is not necessarily a question mark. Wrong is still wrong, though; a small error is no more acceptable than a large one.
Someone may ask: what happens if the source character set contains a character that Unicode does not? I don't know, because I have no such source character set to test with. One thing is certain, however: such a source character set is not standardized. In Java, if this happens, an exception is thrown.
What is UTF
UTF stands for Unicode Transformation Format. For a 16-bit Unicode character it is defined as follows:
(1) If the top 9 bits of the 16-bit Unicode character are 0, one byte is used. The first bit of this byte is "0" and the remaining 7 bits are the same as the low 7 bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented as "34" (0011 0100), the same as the source Unicode character.
(2) If the top 5 bits of the 16-bit Unicode character are 0, two bytes are used. The first byte begins with "110", followed by the 5 bits of the source character that come after the 5 leading zeros; the second byte begins with "10", followed by the low 6 bits of the source character. For example, "\u025d" (0000 0010 0101 1101) becomes "C99D" (1100 1001 1001 1101).
(3) If neither rule applies, three bytes are used. The first byte begins with "1110", followed by the high 4 bits of the source character; the second byte begins with "10", followed by the middle 6 bits; the third byte begins with "10", followed by the low 6 bits. For example, "\u9da7" (1001 1101 1010 0111) becomes "E9B6A7" (1110 1001 1011 0110 1010 0111).
The relationship between Unicode and UTF in a Java program can be described as follows, although not absolutely: while a string is in memory it is represented in Unicode, and when it is saved to a file or other medium it is in UTF. The conversion is done by writeUTF and readUTF.
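The three rules can be written out as bit operations. The sketch below is an illustration, not the JDK's implementation: Utf8ByHand and its methods are hypothetical names, it handles one 16-bit char at a time exactly as the rules do (so surrogate pairs are not combined), and it ignores the special two-byte form that writeUTF uses for \u0000.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical hand-written encoder following rules (1) to (3) for a single 16-bit char.
class Utf8ByHand {
    static byte[] encode(char c) {
        if (c <= 0x007F) {
            // Rule (1): top 9 bits are zero, one byte 0xxxxxxx.
            return new byte[]{(byte) c};
        } else if (c <= 0x07FF) {
            // Rule (2): top 5 bits are zero, two bytes 110xxxxx 10xxxxxx.
            return new byte[]{(byte) (0xC0 | (c >> 6)), (byte) (0x80 | (c & 0x3F))};
        } else {
            // Rule (3): three bytes 1110xxxx 10xxxxxx 10xxxxxx.
            return new byte[]{
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    /** Formats bytes as upper-case hex without separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('\u0034')));
        System.out.println(hex(encode('\u025d')));
        System.out.println(hex(encode('\u9da7')));
        // Cross-check against the JDK's own UTF-8 encoder.
        System.out.println(hex("\u9da7".getBytes(StandardCharsets.UTF_8)));
    }
}
```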
With the basics covered, we now turn to the main topic.
First treat the problem as a black box. Here is the first level of the black box:
Input (charsetA) -> Process (Unicode) -> Output (charsetB)
Simply put, this is an IPO model: input, process, output. The same content goes through the transformation "from charsetA to Unicode to charsetB".
Now the second level:
SourceFile (JSP, Java) -> Class -> Output
Here the JSP and Java source files are the input; during processing the class file is the carrier; then comes the output. Refined further to a third level:
JSP -> temp file -> Class -> browser, OS console, DB
App, servlet -> Class -> browser, OS console, DB
This picture is clearer. A JSP file is first turned into an intermediate Java file and then compiled into a class file. Servlets and ordinary applications are compiled directly into class files. The output then goes from the class to the browser, the console, or a database.
JSP: from source file to class file
A JSP source file is a text file ending in ".jsp". This section explains how a JSP file is translated and compiled, and tracks what happens to the Chinese characters along the way.
1. The JSP conversion tool (JSPC) provided by the JSP/Servlet engine searches the JSP file for the charset specified in <%@ page contentType="text/html; charset=JSP-charset" %>. If the JSP file specifies none, the default setting file.encoding is used.
2. JSPC uses the equivalent of "javac -encoding JSP-charset" to interpret the characters in the JSP file, maps them to Unicode, and writes them in UTF format to a temporary Java file.
3. The engine uses the equivalent of "javac -encoding Unicode" to compile the Java file into a class file.
First look at how Chinese characters are converted in these steps. Take the following source code:
<%@ page contentType="text/html; charset=GB2312" %>
<html><body>
<% String a = "中文"; out.println(a); %>
</body></html>
This code was written in UltraEdit for Windows. After saving, the hexadecimal code of the two characters "中文" ("Chinese") is "D6 D0 CE C4" (GB2312 encoding). A lookup shows that the Unicode code of "中文" is "\u4E2D\u6587", which in UTF is "E4 B8 AD E6 96 87". Opening the Java file generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87"; the class file compiled from that Java file contains exactly the same bytes.
Now look at the case where the charset specified in the JSP is ISO-8859-1.
<%@ page contentType="text/html; charset=ISO-8859-1" %>
<html><body>
<% String a = "中文"; out.println(a); %>
</body></html>
Again the file was written in UltraEdit, and "中文" is stored as the GB2312 codes "D6 D0 CE C4". First, simulate how the Java file and class file are generated: JSPC interprets "中文" using ISO-8859-1 and maps it to Unicode. Because ISO-8859-1 is 8-bit and belongs to the Latin family, its mapping rule is to put "00" in front of each byte, so the mapped Unicode codes should be "\u00D6\u00D0\u00CE\u00C4", which in UTF should be "C3 96 C3 90 C3 8E C3 84". Opening the files confirms it: "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".
If no JSP-charset is specified, the default file.encoding applies, and the result is the same as for ISO-8859-1 (see Table 2).
This completes the explanation of how Chinese characters are mapped on the way from a JSP file to a class file. In one sentence: "from JSP-charset to Unicode to UTF". The following table summarizes the process:
Table 2 Transformation of "中文" from JSP to class
JSP-charset | JSP file | Java file | Class file
GB2312 | D6 D0 CE C4 (GB2312) | from \u4E2D\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | from \u00D6\u00D0\u00CE\u00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding) | same as ISO-8859-1 | same as ISO-8859-1 | same as ISO-8859-1
The next section discusses the conversion from a Java file to a class file, and then explains how a class file outputs to the client. The material is arranged this way because JSP and servlets behave the same on output.
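The first two rows of Table 2 can be reproduced outside a JSP engine. The sketch below only imitates the "JSP-charset to Unicode to UTF" mapping with the JDK's charset classes; it is not how any particular engine is implemented, and JspCharsetTrace and classFileHex are hypothetical names. It assumes the GB2312 charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical trace of: source bytes -> Unicode (via JSP-charset) -> UTF bytes.
class JspCharsetTrace {
    /** The bytes UltraEdit stored for the two Chinese characters (GB2312). */
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Interprets the source bytes with jspCharset and returns the resulting UTF-8 bytes as hex. */
    static String classFileHex(byte[] sourceBytes, String jspCharset) {
        String unicode = new String(sourceBytes, Charset.forName(jspCharset));
        return hex(unicode.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println("GB2312     -> " + classFileHex(SOURCE, "GB2312"));
        System.out.println("ISO-8859-1 -> " + classFileHex(SOURCE, "ISO-8859-1"));
    }
}
```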
Servlet: from source file to class file
A servlet source file is a text file ending in ".java". This section discusses how a servlet is compiled and tracks what happens to the Chinese characters.
The servlet source file is compiled with "javac". javac accepts a "-encoding compile-charset" parameter, which names the encoding used to interpret the source file.
When the source file is compiled, all characters, Chinese and ASCII alike, are interpreted with compile-charset; string constants are then converted to Unicode and finally written to the class file in UTF.
A servlet has one more place where a charset is set: the output stream. Usually the setContentType method of HttpServletResponse is called before the results are output, which has the same effect as setting JSP-charset in a JSP. This value is called servlet-charset.
Note the three variables mentioned in this article: JSP-charset, compile-charset, and servlet-charset.
Here is an example:
import javax.servlet.*;
import javax.servlet.http.*;

public class testServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, java.io.IOException {
        // servlet-charset: the charset used to encode the response
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("#中文#");
        out.println("</html>");
    }
}
This file was also written in UltraEdit for Windows, and "中文" is saved as "D6 D0 CE C4" (GB2312 encoding).
Start compiling. The table below shows the result for different values of compile-charset.
Compile-charset | Servlet source file | Class file | Equivalent Unicode code
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4E2D\u6587 (in Unicode = "中文")
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00D6\u00D0\u00CE\u00C4 (each of D6 D0 CE C4 prefixed with 00)
None (default) | D6 D0 CE C4 (GB2312) | same as ISO-8859-1 | same as ISO-8859-1
An ordinary Java program is compiled in exactly the same way as a servlet.
Is the representation of Chinese in the class file now clear? Good. Next, let's look at how a class outputs Chinese.
Class: outputting strings
As stated above, a string is held in memory as Unicode. What that Unicode represents depends on which character set it was mapped from; in other words, on its origin. It is like checked baggage: from the outside it is just a cardboard box, and only the person who packed it knows what is inside.
Take the example above. Given the Unicode string "00D6 00D0 00CE 00C4": if you do not convert it and simply look it up in the Unicode table, it is four characters (special characters at that). If you map it to "ISO8859-1", the leading "00" is simply removed, giving "D6 D0 CE C4", four characters in the extended ASCII code table. If you map it to GB2312, the result is likely to be garbage, because GB2312 may have no characters corresponding to 00D6 and the others (a character with no counterpart becomes 0x3F, the question mark; one with a counterpart is most likely some special symbol, since real Chinese characters start at 4E00 in Unicode).
As you can see, the same Unicode characters can be interpreted to look quite different. Only one interpretation is the one we expect, of course. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as "Simplified Chinese", the two characters "中文" appear clearly. (If you insist on viewing it as "Western European", nothing can be done; you will see no Chinese at all.) Why? Because "00D6 00D0 00CE 00C4" was originally converted from ISO8859-1.
The following conclusions can be drawn:
Before a class outputs a string, the Unicode string is converted back to bytes according to some internal code and then sent out as a byte stream, which is equivalent to a "String.getBytes(???)" operation, where ??? stands for a character set. For a servlet, this internal code is the one specified in HttpServletResponse.setContentType(), that is, the servlet-charset defined above.
For a JSP, it is the internal code specified in <%@ page contentType="" %>, that is, the JSP-charset defined above.
For an ordinary Java program, it is the internal code given by file.encoding, which the JVM takes from the platform settings (ISO8859-1 in the environment discussed here).
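The output step can be pictured as a single getBytes call. This is a rough sketch of that idea rather than what a servlet container literally executes; OutputStepDemo and outputBytes are hypothetical names, and passing null here is simply this sketch's way of saying "no charset specified, use the JVM default".

```java
import java.nio.charset.Charset;

// Hypothetical model of the output step: in-memory Unicode -> bytes in the chosen charset.
class OutputStepDemo {
    /** Encodes the string with the named charset, or with the JVM default when charsetName is null. */
    static byte[] outputBytes(String inMemory, String charsetName) {
        Charset cs = (charsetName == null) ? Charset.defaultCharset() : Charset.forName(charsetName);
        return inMemory.getBytes(cs);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same four bytes come out of two different in-memory strings.
        System.out.println(hex(outputBytes("\u00d6\u00d0\u00ce\u00c4", "ISO-8859-1")));
        System.out.println(hex(outputBytes("\u4e2d\u6587", "GB2312")));
        // The default depends on the platform, so no particular value is assumed here.
        System.out.println("file.encoding = " + System.getProperty("file.encoding")
            + ", default charset = " + Charset.defaultCharset().name());
    }
}
```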
When the output object is a browser
Take the popular browser IE as an example. IE supports many internal codes. If IE receives the byte stream "D6 D0 CE C4", you can try viewing it with the various internal codes, and you will find that "Simplified Chinese" gives the correct result, because "D6 D0 CE C4" is the code of "中文" in Simplified Chinese.
OK, let's walk through the whole process.
JSP: the source file is a text file in GB2312 format, and it contains the two characters "中文".
If JSP-charset is GB2312, the process is as follows:
Table 4 Change process when JSP-charset = GB2312
Serial number | Step description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | JSPC converts the JSP source file into a temporary Java file, mapping the string to Unicode according to GB2312 and writing it to the Java file in UTF format | E4 B8 AD E6 96 87
3 | Compile the temporary Java file into a class file | E4 B8 AD E6 96 87
4 | At run time, the string is first read from the class file; in memory it is Unicode | 4E 2D 65 87 (in Unicode, 4E2D = 中, 6587 = 文)
5 | According to JSP-charset = GB2312, convert the Unicode into a byte stream | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to GB2312 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE displays the result using "Simplified Chinese" | "中文" (displayed correctly)
If JSP-charset is ISO8859-1, the process is as follows:
Table 5 Change process when JSP-charset = ISO8859-1
Serial number | Step description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | JSPC converts the JSP source file into a temporary Java file, mapping the string to Unicode according to ISO8859-1 and writing it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, the string is first read from the class file; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4 (nothing meaningful!)
5 | According to JSP-charset = ISO8859-1, convert the Unicode into a byte stream | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to ISO8859-1 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE displays the result using "Western European" | Garbled: these are really four single-byte characters, but because their values are greater than 128 they show up as strange symbols
8 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)
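Strange! Why does the page display correctly whether JSP-charset is set to GB2312 or to ISO8859-1? Because steps 2 and 5 in Tables 4 and 5 are inverses of each other and cancel out; the ISO8859-1 case merely needs the extra step 8.
Now let's see what happens when JSP-charset is not specified: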
Table 6 Change process when JSP-charset is not specified
Serial number | Step description | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | JSPC converts the JSP source file into a temporary Java file, mapping the string to Unicode according to ISO8859-1 and writing it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, the string is first read from the class file; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4 (nothing meaningful!)
5 | According to JSP-charset = ISO8859-1, convert the Unicode into a byte stream | D6 D0 CE C4
6 | Output the byte stream to IE | D6 D0 CE C4
7 | IE displays the result using the encoding of the page from which the request was issued | If that is Simplified Chinese, the text displays correctly; otherwise step 8 of Table 5 is required
Servlet: the source file is a Java file in GB2312 format, and it contains the two characters "中文".
If compile-charset = servlet-charset = GB2312, the process is as follows:
Table 7 Change process when compile-charset = servlet-charset = GB2312
Serial number | Step description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding GB2312 | E4 B8 AD E6 96 87 (UTF)
3 | At run time, the string is first read from the class file; in memory it is Unicode | 4E 2D 65 87 (Unicode)
4 | According to servlet-charset = GB2312, convert the Unicode into a byte stream | D6 D0 CE C4 (GB2312)
5 | Output the byte stream to IE and set IE's encoding attribute to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
6 | IE displays the result using "Simplified Chinese" | "中文" (displayed correctly)
If compile-charset = servlet-charset = ISO8859-1, the process is as follows:
Table 8 Change process when compile-charset = servlet-charset = ISO8859-1
Serial number | Step description | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding ISO8859-1 | C3 96 C3 90 C3 8E C3 84 (UTF)
3 | At run time, the string is first read from the class file; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4 (nothing meaningful!)
4 | According to servlet-charset = ISO8859-1, convert the Unicode into a byte stream | D6 D0 CE C4
5 | Output the byte stream to IE and set IE's encoding attribute to servlet-charset = ISO8859-1 | D6 D0 CE C4 (GB2312)
6 | IE displays the result using "Western European" | Garbled (for the same reason as in Table 5)
7 | Change IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)
If compile-charset or servlet-charset is not specified, its default value is ISO8859-1.
When compile-charset = servlet-charset, step 2 and step 4 are inverses of each other and cancel out, so the result displays correctly. Readers can try writing out the case where compile-charset and servlet-charset differ; the result will certainly be incorrect.
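The cancelling effect is easy to try out. The following sketch models only the two charset-dependent steps (interpreting the source bytes, then producing the output bytes); it is an illustration under that simplification, and CharsetRoundTrip and pipeline are hypothetical names. It assumes the GB2312 charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model: source bytes --compile-charset--> Unicode --servlet-charset--> output bytes.
class CharsetRoundTrip {
    static byte[] pipeline(byte[] sourceBytes, Charset compileCharset, Charset servletCharset) {
        String inMemory = new String(sourceBytes, compileCharset); // what the compile step does
        return inMemory.getBytes(servletCharset);                  // what the output step does
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        Charset gb2312 = Charset.forName("GB2312");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Same charset on both sides: the original bytes come back.
        System.out.println(Arrays.equals(source, pipeline(source, gb2312, gb2312)));
        System.out.println(Arrays.equals(source, pipeline(source, latin1, latin1)));
        // Different charsets: the bytes are damaged.
        System.out.println(Arrays.equals(source, pipeline(source, latin1, gb2312)));
        System.out.println(Arrays.toString(pipeline(source, gb2312, latin1)));
    }
}
```

With GB2312 on the compile side and ISO8859-1 on the output side, both Chinese characters are unmappable, so the output is two "?" bytes and the original text cannot be recovered by changing the browser encoding.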
When the output object is a database
The principle of output to a database is the same as for output to a browser. This section uses only a servlet as the example; the JSP case is left for the reader to derive.
Suppose a servlet receives a Chinese string from the client (IE, Simplified Chinese), writes it into a database whose internal code is ISO8859-1, then reads the string back from the database and displays it on the client.
Table 9 Change process when the output object is a database (1)
Serial number | Step description | Result | Where
1 | Enter "中文" in IE | D6 D0 CE C4 | IE
2 | IE converts the string to UTF and sends it into the transport stream | E4 B8 AD E6 96 87 |
3 | The servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | servlet
4 | The programmer must restore the string to a byte stream according to GB2312 in the servlet | D6 D0 CE C4 |
5 | The programmer generates a new string according to the database internal code ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 |
6 | Submit the newly generated string to JDBC | 00 D6 00 D0 00 CE 00 C4 |
7 | JDBC detects that the database internal code is ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | JDBC
8 | JDBC converts the received string into a byte stream according to ISO8859-1 | D6 D0 CE C4 |
9 | JDBC writes the byte stream into the database | D6 D0 CE C4 |
10 | Data storage is complete | D6 D0 CE C4 | database
The following is the process of reading the data back from the database:
11 | JDBC reads the byte stream from the database | D6 D0 CE C4 | JDBC
12 | JDBC generates a string according to the database character set ISO8859-1 and submits it to the servlet | 00 D6 00 D0 00 CE 00 C4 (Unicode) |
13 | The servlet gets the string | 00 D6 00 D0 00 CE 00 C4 (Unicode) | servlet
15 | The programmer must restore the original byte stream according to the database internal code ISO8859-1 | D6 D0 CE C4 |
16 | The programmer must generate a new string according to the client character set GB2312 | 4E 2D 65 87 (Unicode) |
The servlet is now ready to output the string to the client:
17 | The servlet, depending on the ... | | IE
Notes: steps 4 and 5 and steps 15 and 16 in the table are the ones where the programmer must do the conversion explicitly. Steps 4 and 5 amount to a single statement: new String(source.getBytes("GB2312"), "ISO8859-1"). Steps 15 and 16 are likewise a single statement: new String(source.getBytes("ISO8859-1"), "GB2312"). Dear reader, were you aware of every one of these details when you wrote code like this?
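The two statements can be wrapped in a pair of helpers to make the round trip explicit. This is a minimal sketch assuming a GB2312 client and an ISO8859-1 database as in Table 9; DbTranscode, toDbString and fromDbString are hypothetical names, and no real JDBC call is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for the programmer-side conversions in Table 9.
class DbTranscode {
    static final Charset CLIENT = Charset.forName("GB2312");   // client character set
    static final Charset DB = StandardCharsets.ISO_8859_1;      // database internal code

    /** Steps 4 and 5: re-express the client string so each GB2312 byte becomes one ISO8859-1 char. */
    static String toDbString(String fromClient) {
        return new String(fromClient.getBytes(CLIENT), DB);
    }

    /** Steps 15 and 16: recover the original string from what the database returned. */
    static String fromDbString(String fromDatabase) {
        return new String(fromDatabase.getBytes(DB), CLIENT);
    }

    public static void main(String[] args) {
        String stored = toDbString("\u4e2d\u6587");
        for (char c : stored.toCharArray()) {
            System.out.printf("%04X ", (int) c);   // the 00 D6 00 D0 00 CE 00 C4 form
        }
        System.out.println();
        System.out.println(fromDbString(stored).equals("\u4e2d\u6587"));
    }
}
```

The trick works because ISO8859-1 maps every byte value to exactly one character, so no information is lost in the intermediate string.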
As for other combinations of client and database internal codes, and for output to the system console, readers are invited to work them out themselves. Once you understand the principles above, you should be able to write them down easily.
That brings the article to a close. We have come back to where we started, and for programmers almost nothing changes,
because we have been doing exactly this all along.
The following conclusions serve as the ending.
1. In a JSP file, specify contentType, with the charset value equal to the character set used by the client browser. String constants need no internal code conversion. String variables must be handled according to the character set specified in contentType; put simply, "string variables are based on the JSP-charset character set".
2. In a servlet, set the charset with HttpServletResponse.setContentType(), and set it to the client's internal code. For string constants, specify the encoding when running javac; this encoding must match the character set of the platform on which the source file was written, generally GB2312 or GBK. For string variables, as with JSP, "string variables are based on the servlet-charset character set".