Every region of the world has its own languages, and regional differences lead directly to differences in language environment. When developing an internationalized product, handling language issues correctly is essential.
Because this is a worldwide problem, Java provides a worldwide solution. The methods described here are used to handle Chinese, but they apply equally to the languages of other countries and regions.
Chinese characters are double-byte characters. Double-byte means that one character occupies two bytes (16 bits), called the high byte and the low byte. The Chinese national standard for encoding Chinese characters is GB2312. It is mandatory, and virtually every application that can handle Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters plus nine zones of symbols. The high byte ranges from 0xA1 to 0xFE and the low byte from 0xA1 to 0xFE; the Chinese characters themselves occupy 0xB0A1 to 0xF7FE.
In addition there is an encoding called GBK, which is a specification rather than a mandatory standard. GBK provides 20,902 Chinese characters, is compatible with GB2312, and ranges from 0x8140 to 0xFEFE. Every character in GBK maps one-to-one to Unicode 2.0.
China will soon promulgate another standard, GB18030-2000 (GBK2K). It includes the scripts of minority nationalities and fundamentally solves the shortage of code positions. Note that it is no longer fixed-length: its two-byte part is compatible with GBK, and its four-byte part holds the extended characters and glyphs. In the four-byte part, the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.
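The byte ranges quoted above can be turned into simple range checks. The following is only a sketch built from the numbers in the text; the class and method names are made up for illustration, and the exclusion of 0x7F as a GBK low byte is an extra detail not stated above.

```java
final class ChineseCodeRanges {
    /** GB2312 Chinese-character area: 0xB0A1 to 0xF7FE, both bytes checked. */
    static boolean isGb2312Hanzi(int high, int low) {
        return inRange(high, 0xB0, 0xF7) && inRange(low, 0xA1, 0xFE);
    }

    /** GBK two-byte area as quoted in the text: 0x8140 to 0xFEFE. */
    static boolean isGbkPair(int high, int low) {
        // 0x7F is excluded as a low byte; that detail is not in the text above.
        return inRange(high, 0x81, 0xFE) && inRange(low, 0x40, 0xFE) && low != 0x7F;
    }

    /** GB18030 four-byte form: bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39. */
    static boolean isGb18030FourByte(int b1, int b2, int b3, int b4) {
        return inRange(b1, 0x81, 0xFE) && inRange(b2, 0x30, 0x39)
                && inRange(b3, 0x81, 0xFE) && inRange(b4, 0x30, 0x39);
    }

    private static boolean inRange(int value, int min, int max) {
        return value >= min && value <= max;
    }

    public static void main(String[] args) {
        System.out.println("0xB0A1 is a GB2312 Chinese character: " + isGb2312Hanzi(0xB0, 0xA1));
        System.out.println("0x8140 is a GB2312 Chinese character: " + isGb2312Hanzi(0x81, 0x40));
        System.out.println("0x8140 is a GBK pair: " + isGbkPair(0x81, 0x40));
        System.out.println("81 30 81 30 is GB18030 four-byte: "
                + isGb18030FourByte(0x81, 0x30, 0x81, 0x30));
    }
}
```

These checks only test whether a byte sequence falls inside a range; they do not say whether a character is actually assigned to that code.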
This article does not intend to introduce Unicode; interested readers can browse "http://www.unicode.org/" for more information. Unicode has one key property: it aims to include all the characters in the world. The languages of every region can therefore establish a mapping to Unicode, and Java uses this to convert between different languages.
In the JDK, the encodings related to Chinese are:
Table 1. Encodings related to Chinese in the JDK
Encoding name | Description
ASCII | 7-bit; same as ascii7
ISO8859-1 | 8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, etc.
GB2312-80 | 16-bit; same as gb2312, GB2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, etc.
GBK | Note: the name is case-sensitive
UTF8 | Same as UTF-8
GB18030 | Same as cp1392, 1392; supported by very few JDKs
In actual programming, the encodings encountered most often are GB2312 (GBK) and ISO8859-1.
Why is there "?"
As stated above, conversion between different languages goes through Unicode. Suppose there are two different languages A and B. The conversion steps are: first convert A into Unicode, then convert Unicode into B.
For example, GB2312 contains the Chinese character "李", encoded as "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "李" into Unicode, giving "674E", then convert "674E" into an ISO8859-1 character. This mapping cannot succeed, of course, because ISO8859-1 has no character corresponding to "674E".
When a mapping fails, the problem appears. When converting from some language to Unicode, if the language has no such character, the result is the Unicode code "\ufffd" ("\u" denotes a Unicode code). When converting from Unicode to some language, if the language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from. For example, applying new String(buf, "gb2312") to the byte stream buf = "0x80 0x40 0xb0 0xa1" gives "\ufffd\u554a"; printing it with println gives "?啊", because "0x80 0x40" is not a valid GB2312 code.
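The two failure directions described above can be reproduced with a few lines of Java. This is a minimal sketch, not the author's code: the class and method names are invented, it assumes the runtime ships the GB2312 charset, and the decoding failure is shown with US-ASCII (my choice) because that case is easy to verify.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {
    /** Step 1 of a conversion: bytes in language A become Unicode. */
    static String decode(byte[] bytes, Charset source) {
        return new String(bytes, source);
    }

    /** Step 2 of a conversion: Unicode becomes bytes in language B. */
    static byte[] encode(String text, Charset target) {
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be installed

        // C0 EE is the GB2312 code from the example; decoding yields U+674E.
        String li = decode(new byte[] {(byte) 0xC0, (byte) 0xEE}, gb2312);
        System.out.printf("Unicode: %04X%n", (int) li.charAt(0));

        // ISO8859-1 has no such character, so the encoder substitutes 0x3F ("?").
        byte[] latin = encode(li, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO8859-1 byte: %02X%n", latin[0] & 0xFF);

        // The other direction: a byte with no meaning in the source charset
        // decodes to the replacement character U+FFFD.
        String bad = decode(new byte[] {(byte) 0x80}, StandardCharsets.US_ASCII);
        System.out.printf("Decoded: %04X%n", (int) bad.charAt(0));
    }
}
```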
As another example, encoding the string "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" as GBK gives the bytes "3fa8aca8a6463fa8b4". "\u00d6" has no corresponding character in GBK and yields "3f"; "\u00ec" corresponds to "a8ac"; "\u00e9" corresponds to "a8a6"; "\u0046" corresponds to "46" (because it is an ASCII character); "\u00bb" is not found either and yields "3f"; finally, "\u00f9" corresponds to "a8b4". Printing this string with println gives "?ìéF?ù". See? It is not all question marks, because the mapping between GBK and Unicode also covers characters other than Chinese characters. This example shows that well.
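The GBK example can be checked directly. The sketch below assumes the runtime provides the GBK charset; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

final class GbkPartialMapping {
    /** Formats bytes as contiguous upper-case hex, e.g. "3FA8AC". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text in the named charset and returns the bytes as hex. */
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        Charset gbk = Charset.forName("GBK"); // assumed to be installed

        // Characters missing from GBK become 3F; the others keep a real code.
        System.out.println(encodeHex(text, "GBK"));

        // Decoding the bytes again shows which characters survived.
        System.out.println(new String(text.getBytes(gbk), gbk));
    }
}
```

Only the two characters without a GBK code turn into "?"; the accented letters survive because GBK contains them.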
So when Chinese characters are transcoded, garbled output is not necessarily a question mark! Still, wrong is wrong; being off by fifty paces is no better than being off by a hundred.
Some may ask: what happens if a character exists in the source character set but not in Unicode? I don't know, because I have no source character set to test this with. One thing is certain, though: such a source character set is not standardized. In Java, if this happens, an exception is thrown.
What is UTF
UTF is an abbreviation of Unicode Transformation Format. For a 16-bit Unicode character, UTF is defined as follows:
(1) If the first 9 bits of the 16-bit Unicode character are 0, one byte is used. The first bit of this byte is "0" and the remaining 7 bits are the low 7 bits of the source character. For example, "\u0034" (0000 0000 0011 0100) is represented by "34" (0011 0100), the same as the low byte of the source Unicode character.
(2) If the first 5 bits of the 16-bit Unicode character are 0, two bytes are used. The first byte begins with "110", and its remaining 5 bits are the 5 bits that follow the five leading zeros in the source character; the second byte begins with "10", and its remaining 6 bits are the low 6 bits of the source character. For example, "\u025d" (0000 0010 0101 1101) becomes "C99D" (1100 1001 1001 1101).
(3) If neither rule applies, three bytes are used. The first byte begins with "1110", and its last 4 bits are the high 4 bits of the source character; the second byte begins with "10", and its last 6 bits are the middle 6 bits of the source character; the third byte begins with "10", and its last 6 bits are the low 6 bits of the source character. For example, "\u9da7" (1001 1101 1010 0111) becomes "E9B6A7" (1110 1001 1011 0110 1010 0111).
The relationship between Unicode and UTF in a Java program can be described as follows, although this is not absolute: while a string is in memory it is represented as Unicode, and when it is saved to a file or another medium it is stored as UTF. This conversion is done by writeUTF and readUTF.
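The three rules translate almost literally into bit operations. The following is a sketch for a single 16-bit character, as in the rules above; the class and method names are hypothetical. It also shows what DataOutputStream.writeUTF produces: a two-byte length followed by the encoded bytes. Note that writeUTF uses a modified form in which the character with code 0 takes two bytes, which rule (1) as written does not cover.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UtfRules {
    /** Encodes one 16-bit character following rules (1) to (3). */
    static byte[] encode(char c) {
        if (c <= 0x007F) {
            // Rule (1): top 9 bits are zero, one byte "0xxxxxxx".
            return new byte[] {(byte) c};
        }
        if (c <= 0x07FF) {
            // Rule (2): top 5 bits are zero, two bytes "110xxxxx 10xxxxxx".
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
        // Rule (3): three bytes "1110xxxx 10xxxxxx 10xxxxxx".
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    /** Returns the bytes written by DataOutputStream.writeUTF for the string. */
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        for (char c : new char[] {0x0034, 0x025D, 0x9DA7}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(c)) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            System.out.printf("U+%04X -> %s%n", (int) c, hex);
        }
    }
}
```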
Good, that covers the basics. Now for the main topic.
First, treat the problem as a black box. Look at the first level of the black box:
Input (charsetA) -> Process (Unicode) -> Output (charsetB)
Simple: this is an IPO model, that is, input, process, and output. The same content goes through the conversion "from charsetA to Unicode to charsetB".
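The IPO model maps naturally onto Java's stream classes: a Reader decodes the input from charsetA into Unicode characters, and a Writer encodes them to charsetB. The sketch below is one possible way to express it; the class name and method are hypothetical, and the usage in main assumes the GB2312 charset is available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    /** Input (charsetA) -> Process (Unicode chars) -> Output (charsetB). */
    static byte[] transcode(byte[] input, Charset charsetA, Charset charsetB) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(input), charsetA);
             Writer out = new OutputStreamWriter(sink, charsetB)) {
            char[] unicode = new char[1024]; // the intermediate Unicode form
            int n;
            while ((n = in.read(unicode)) != -1) {
                out.write(unicode, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and flushed) by try-with-resources.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        byte[] utf = transcode(gb, Charset.forName("GB2312"), StandardCharsets.UTF_8);
        for (byte b : utf) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```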
Now look at the second level:
SourceFile (jsp, java) -> class -> output
In this diagram the input is the JSP and Java source files; during processing the class file acts as the carrier; then comes the output. Refined to a third level:
jsp -> temp file -> class -> browser, OS console, DB
app, servlet -> class -> browser, OS console, DB
This diagram is clearer. A JSP file first becomes an intermediate Java file, which is then compiled into a class file. Servlets and ordinary applications are compiled directly into class files. Output then goes from the class to the browser, the console, or a database.
JSP: Process from source files to Class
A JSP source file is a text file ending in ".jsp". This section explains how a JSP file is interpreted and compiled, and tracks what happens to Chinese characters along the way.
1. The JSP conversion tool (jspc) provided by the JSP/servlet engine searches the JSP file for the charset specified in <%@ page contentType="text/html; charset=..."%>;
2. jspc uses the equivalent of "javac -encoding" with that charset to interpret the characters in the JSP file, maps them to Unicode, and saves them in the temporary Java file in UTF format;
3. The engine uses the equivalent of a "javac -encoding UNICODE" command to compile the Java file into a class file.
First look at how Chinese characters are converted in these steps. Take the following source code:
<%@ page contentType="text/html; charset=gb2312"%>
This code was written in UltraEdit for Windows. After saving, the hexadecimal code of the two characters "中文" is "D6 D0 CE C4" (GB2312 encoding). A lookup shows that the Unicode code of "中文" is "\u4E2D\u6587", which in UTF is "E4 B8 AD E6 96 87". Opening the Java file that the engine generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87"; the class file compiled from that Java file contains exactly the same bytes.
Now look at the case where the charset specified in the JSP is ISO-8859-1:
<%@ page contentType="text/html; charset=ISO-8859-1"%>
<% String a = "中文"; out.println(a); %>
</body></html>
Again the file is written in UltraEdit, and the two characters "中文" are again stored as the GB2312 code "D6 D0 CE C4". First simulate how the Java file and class file are generated: jspc interprets "中文" using ISO-8859-1 and maps it to Unicode. Because ISO-8859-1 is an 8-bit Latin encoding, its mapping rule is to put "00" in front of each byte, so the mapped Unicode code should be "\u00D6\u00D0\u00CE\u00C4", which converted to UTF should be "C3 96 C3 90 C3 8E C3 84". Now open the files: "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".
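Both cases can be reproduced outside a JSP engine. The sketch below does not run jspc; it only imitates the mapping described above (file bytes, interpreted with the page charset, then written as UTF). The class and method names are hypothetical, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class JspCharsetSimulation {
    /** Interprets the raw file bytes with the charset declared in the page. */
    static String decodeAs(byte[] fileBytes, String jspCharset) {
        return new String(fileBytes, Charset.forName(jspCharset));
    }

    /** Renders each char as a backslash-u escape so the Unicode form is visible. */
    static String unicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    /** Hex of the UTF bytes that would end up in the Java and class files. */
    static String utfHex(byte[] fileBytes, String jspCharset) {
        byte[] utf = decodeAs(fileBytes, jspCharset).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters as saved by the editor in GB2312.
        byte[] saved = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        for (String charset : new String[] {"GB2312", "ISO-8859-1"}) {
            System.out.println(charset + ": "
                    + unicodeEscapes(decodeAs(saved, charset)) + " -> "
                    + utfHex(saved, charset));
        }
    }
}
```

The same four bytes produce two different Unicode strings, which is why the declared charset has to match the encoding the file was actually saved in.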
If
This completes the explanation of how Chinese characters are mapped as a JSP file is turned into a class file. In one sentence: "from JSP-charset to Unicode to UTF". The following table summarizes the process:
Table 2. Conversion of "中文" from JSP to class
JSP-charset | In the JSP file | In the Java file | In the class file
GB2312 | D6 D0 CE C4 (GB2312) | from \u4E2D\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF)
ISO-8859-1 | D6 D0 CE C4 (GB2312) | from \u00D6\u00D0\u00CE\u00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding) | same as ISO-8859-1
The next section discusses how a servlet is converted from a Java file to a class file, and after that explains how output travels from the class file to the client. The material is arranged this way because JSPs and servlets handle output in the same way.
Servlet: Process from source files to Class
A servlet source file is a text file ending in ".java". This section discusses how a servlet is compiled and tracks what happens to Chinese characters.
Compile the servlet source file with "javac". javac accepts an "-encoding" parameter.
When the source file is compiled, every character in it, Chinese and ASCII alike, is interpreted with the charset given to javac, referred to here as Compile-charset.
A servlet has one more place where an encoding is set: the output stream. Usually the setContentType method of HttpServletResponse is called before any result is output; it has the same effect as JSP-charset in a JSP and is referred to here as servlet-charset.
Note that the text mentions three variables in total: JSP-charset, Compile-charset, and servlet-charset.
Look at the example:
import javax.servlet.*;
import javax.servlet.http.*;

class testServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, java.io.IOException {
        // Declare the output charset before obtaining the writer.
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("#中文#");
        out.println("</html>");
    }
}
This file was also written in UltraEdit for Windows, and the two characters "中文" are saved as "D6 D0 CE C4" (GB2312 encoding).
Now compile. The table below shows the result for each Compile-charset:
Compile-charset | Servlet source file | Class file | Equivalent Unicode code
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4E2D\u6587 (in Unicode = "中文")
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00D6\u00D0\u00CE\u00C4 (a 00 added in front of each of D6 D0 CE C4)
None (default) | D6 D0 CE C4 (GB2312) | same as ISO-8859-1 | same as ISO-8859-1
An ordinary Java program is compiled in exactly the same way as a servlet.
Is the representation of Chinese in the class file clear now? Good. Next, let's look at how a class outputs Chinese.
Class: Output string
As stated above, a string in memory is represented in Unicode. What that Unicode code actually stands for depends on which character set it was mapped from, that is, on its ancestry. It is like checking in luggage: from the outside every piece is a cardboard box, and what is inside depends on who packed it.
Take the example above. Given a string with the Unicode code "00D6 00D0 00CE 00C4", if it is not converted and is looked up directly in the Unicode code table, it is four characters (special characters at that). If it is mapped with "ISO8859-1", the leading "00" is simply removed to give "D6 D0 CE C4", which are four characters in the ISO8859-1 code table. If it is mapped with GB2312, the result is very likely a mess, because GB2312 may or may not contain characters such as 00D6. Where there is no corresponding character the result is 0x3F, the question mark; where there is one, it is presumably some special symbol, because real Chinese characters start at 4E00 in Unicode.
As you can see, the same Unicode characters can be interpreted in different ways, and only one of them is the result we expect. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed as "Simplified Chinese", the two characters "中文" appear clearly. (Of course, if you insist on viewing it as "Western European", nothing can be done and you will not get any Chinese.) Why? Because "00D6 00D0 00CE 00C4" was originally converted from ISO8859-1.
The conclusion is as follows: before a class outputs a string, the Unicode string is converted back into a byte stream according to some internal code, and the byte stream is then written to the output. This is equivalent to a "String.getBytes(???)" operation, where ??? stands for a character set. For a servlet, this internal code is the one specified in the HttpServletResponse.setContentType() method.
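The conclusion can be tried out with getBytes directly. The sketch below stands in for the output step only: output imitates the class writing a string with a given charset, and viewAs imitates a browser decoding the received bytes. Both names are hypothetical, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;

final class OutputBytes {
    /** What the class sends: the in-memory string encoded with the output charset. */
    static byte[] output(String inMemory, String outputCharset) {
        return inMemory.getBytes(Charset.forName(outputCharset));
    }

    /** What a viewer shows when it decodes the received bytes with its own charset. */
    static String viewAs(byte[] wire, String viewerCharset) {
        return new String(wire, Charset.forName(viewerCharset));
    }

    public static void main(String[] args) {
        // The string whose ancestry is ISO8859-1: 00D6 00D0 00CE 00C4.
        String fromLatin = "\u00D6\u00D0\u00CE\u00C4";

        // Encoding with ISO8859-1 strips the leading 00 of each character.
        byte[] wire = output(fromLatin, "ISO-8859-1");
        for (byte b : wire) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();

        // Viewed as Simplified Chinese, the bytes read as the two intended characters.
        System.out.println(viewAs(wire, "GB2312"));

        // Encoding real Chinese characters with ISO8859-1 loses them: each becomes 3F.
        System.out.println(output("\u4E2D\u6587", "ISO-8859-1").length);
    }
}
```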
No. | Step | Result
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | jspc converts the JSP source file into a temporary Java file, maps the string to Unicode according to ISO8859-1, and writes it to the Java file in UTF format | C3 96 C3 90 C3 8E C3 84
3 | Compile the temporary Java file into a class file | C3 96 C3 90 C3 8E C3 84
4 | At run time, the string is first read from the class file with readUTF; in memory it is Unicode | 00 D6 00 D0 00 CE 00 C4
5 | Convert the Unicode into a byte stream according to JSP-charset = ISO8859-1 | D6 D0 CE C4
6 | Output the byte stream to IE and set IE's encoding to ISO8859-1 (author's note: this information is carried in the HTTP header) | D6 D0 CE C4
7 | IE displays the result as "Western European" | Garbled: actually four single-byte characters, but because their values are greater than 128 they look strange
8 | Switch IE's page encoding to "Simplified Chinese" | "中文" (displayed correctly)
Strange! Why is it possible to set
No. | Step | Result
1 | Write the servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0 = 中, CEC4 = 文)
2 | Compile the Java source file into a class file with javac -encoding GB2312 | E4 B8 AD E6 96 87 (UTF)
3 | At run time, the string is first read from the class file; in memory it is Unicode | 4E 2D 65 87 (Unicode)
4 | Convert the Unicode into a byte stream according to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
5 | Output the byte stream to IE and set IE's encoding to servlet-charset = GB2312 | D6 D0 CE C4 (GB2312)
6 | IE displays the result as "Simplified Chinese" | "中文"
If
No. | Step | Result | Domain
1 | Enter "中文" in IE | D6 D0 CE C4 | IE
2 | IE converts the string into UTF and sends it into the transport stream | E4 B8 AD E6 96 87 |
3 | The servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | Servlet
4 | The programmer must restore the string to a byte stream according to GB2312 | D6 D0 CE C4 |
5 | The programmer generates a new string according to the database internal code ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 |
6 | Submit the new string to JDBC | 00 D6 00 D0 00 CE 00 C4 |
7 | JDBC detects that the database internal code is ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | JDBC
8 | JDBC converts the received string into a byte stream according to ISO8859-1 | D6 D0 CE C4 |
9 | JDBC writes the byte stream into the database | D6 D0 CE C4 |
10 | Data storage is complete | D6 D0 CE C4 | Database
The following is the process of reading the data back from the database:
11 | JDBC reads the byte stream from the database | D6 D0 CE C4 | JDBC
12 | JDBC generates a string according to the database character set ISO8859-1 and submits it to the servlet | 00 D6 00 D0 00 CE 00 C4 (Unicode) |
13 | The servlet receives the string | 00 D6 00 D0 00 CE 00 C4 (Unicode) | Servlet
14 | The programmer must restore the original byte stream according to the database internal code | D6 D0 CE C4 |
15 | The programmer must generate a new string according to the client character set GB2312 | 4E 2D 65 87 (Unicode) |
The servlet is now ready to output the string to the client
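The two conversions the programmer performs in the steps above (before handing a string to JDBC, and after getting one back) might look like the following. This is a sketch with invented names that involves no real database or JDBC call; it only shows the string re-encoding, assuming a client charset of GB2312 and a database internal code of ISO8859-1 as in the table.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DatabaseRoundTrip {
    /** Before storing: client bytes are repackaged as a string in the database charset. */
    static String toDatabaseString(String clientText, Charset clientCharset, Charset dbCharset) {
        byte[] raw = clientText.getBytes(clientCharset); // restore the byte stream
        return new String(raw, dbCharset);               // new string for JDBC
    }

    /** After reading: the original bytes are recovered and decoded for the client. */
    static String fromDatabaseString(String dbText, Charset dbCharset, Charset clientCharset) {
        byte[] raw = dbText.getBytes(dbCharset);          // restore the original bytes
        return new String(raw, clientCharset);            // new string for the client
    }

    public static void main(String[] args) {
        Charset client = Charset.forName("GB2312");       // assumed to be installed
        Charset db = StandardCharsets.ISO_8859_1;

        String stored = toDatabaseString("\u4E2D\u6587", client, db);
        for (char c : stored.toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println();

        String restored = fromDatabaseString(stored, db, client);
        System.out.println(restored.equals("\u4E2D\u6587"));
    }
}
```

The round trip works because ISO8859-1 maps every byte value to a character and back, so no byte is lost while the text is in the database's charset.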