"Unlike C / C , the character data in Java is 16-bit non-symbolic data, which represents the Unicode set, not just ASCII set" 1. This is a good practice, which solves more programming problems on WWW, such as the internationalization of low cost (INTERNATIONAL), but uses 16 characters, but it brings waste, after all, the information processed by Java. Most of them are English, and the 7-bit ACSII code is already enough, but Unicode needs double space, so this kind of Java's practice is compromising with storage resources and efficiency. . For China's Java programmers (especially primary), Java adopts Unicode characters, but it brings us that even the nightmares --Web page is not Chinese, but garbled.
First, common character sets
A character set is a collection of mappings between characters and their internal codes. The ASCII character A, for example, has the internal code 0x41, which is why in many programming languages character variables and integer variables are closely related and can be converted into one another.
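The mapping between a character and its internal code can be seen directly in Java. The following is only a minimal sketch; the class name CharCodeDemo and the helper codeOf are made up for illustration.

```java
// Sketch: a Java char converts to and from its internal code as an int.
class CharCodeDemo {
    // Returns the internal code of a character (hypothetical helper).
    static int codeOf(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        char a = 'A';
        System.out.println(Integer.toHexString(codeOf(a))); // prints 41
        System.out.println((char) 0x41);                    // prints A
    }
}
```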
1. ISO8859 series
ISO8859 comprises a series of character sets such as ISO8859-1 and ISO8859-2. All of them are 8-bit character sets: codes 0 to 0x7F remain compatible with the ASCII character set, while codes greater than 0x7F are extensions for various Latin or other European characters.
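As a rough illustration of the 8-bit layout, the sketch below encodes one ASCII letter and one accented Latin letter as ISO-8859-1 and shows the byte values. The class name Latin1Demo and the method latin1Codes are hypothetical; the accented letter is written as a Unicode escape so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

// Sketch: ISO-8859-1 uses one byte per character; 0..0x7F matches ASCII.
class Latin1Demo {
    // Encodes text as ISO-8859-1 and returns each byte as an unsigned value.
    static int[] latin1Codes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        int[] codes = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            codes[i] = bytes[i] & 0xFF; // mask the signed byte to 0..255
        }
        return codes;
    }

    public static void main(String[] args) {
        // "A" followed by U+00E9 (Latin small letter e with acute)
        for (int code : latin1Codes("A\u00e9")) {
            System.out.println("0x" + Integer.toHexString(code));
        }
    }
}
```

The first byte stays in the ASCII range, while the second falls in the range above 0x7F that each ISO8859 part assigns differently.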
2. GB2312 character set
If, as in the ISO8859 series, only the codes greater than 0x7F were used to represent Chinese characters, just 128 of them could be represented, which is obviously not enough. Hence the character set defined by the GB2312 standard: if the current byte (8 bits) is less than 0x80, it is still treated as an English character; if it is greater than or equal to 0x80, it combines with the next byte to form one Chinese character. In this way the GB2312 character set can hold 6,763 commonly used simplified Chinese characters plus several hundred other symbols. Similar Chinese character sets include GBK (an extension of GB2312), GB18030, and BIG5 (traditional Chinese, used in Taiwan). For detailed specifications, see: http://www.unihan.com.cn/cjk/ana17.htm
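The byte rule just described can be sketched as a small scanner that counts characters in a mixed English/Chinese byte sequence. This is an illustration of the simplified rule as stated in the text, not a validating GB2312 decoder; the class name DoubleByteScan and the method countChars are hypothetical. The bytes 0xCE 0xD2 are the first two bytes of the GB2312 sequence quoted later in this article.

```java
// Sketch: count characters using the "below 0x80 = one byte, otherwise two bytes" rule.
class DoubleByteScan {
    static int countChars(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;   // treat the byte as unsigned
            i += (b < 0x80) ? 1 : 2;  // English: 1 byte; Chinese: this byte plus the next
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', one two-byte Chinese character, then 'B'
        byte[] data = {0x41, (byte) 0xCE, (byte) 0xD2, 0x42};
        System.out.println(countChars(data)); // prints 3
    }
}
```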
3. Unicode character set
The Unicode character set was originally 16-bit (it was later extended as needs grew) and stays compatible with 7-bit US-ASCII. Microsoft's Windows NT/2000/XP and Sun's Java use it as their default character set. It began as a de facto standard from a consortium of US companies, and it follows the Universal Character Set (UCS) standard, ISO/IEC 10646. The main goal of Unicode is to provide a "universal character set" that covers all the languages, letters, and scripts of the world. In the Unicode character set, therefore, not only is "i" a letter, "我" is a letter too, and in Java you can even write "int 我是中国人 = 0xFF;" (the identifier means "I am Chinese").
A 16-bit Unicode character set holds only 2^16 = 65536 characters, which is not enough to represent every character needed in practice. And in an Internet age dominated by English, using, storing, and transmitting it wastes a great deal of space. For these reasons two specifications appeared: UTF-8 (Unicode Transformation Format, 8-bit form) and UTF-16. In UTF-8, a US-ASCII character is still represented by one byte, which keeps it compatible with US-ASCII; the other 16-bit characters are encoded in two or three bytes, each greater than 0x7F. UTF-8 is variable-length and more complex, it is unfriendly to non-ASCII characters, and it has begun to depart from the original intention of Unicode. UTF-16 is a very simple encoding that follows the Unicode standard closely, using a fixed 16-bit unit for each character in the 16-bit range.
For more Unicode specifications, visit the Unicode Consortium site: http://www.unicode.org. UTF-8 and UTF-16 are defined in the IETF's RFC 2279 and RFC 2781 respectively, available at http://www.ietf.org/rfc2279.txt and http://www.ietf.org/rfc2781.txt. Character set names are generally case-insensitive, so GB2312 can also be written gb2312 or Gb2312.
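To make the size difference concrete, here is a small sketch comparing how many bytes the same characters take in UTF-8 and in UTF-16. The class name UtfLengthDemo and its two helpers are hypothetical; UTF_16BE is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

// Sketch: byte counts of the same text under UTF-8 and UTF-16 (big-endian, no BOM).
class UtfLengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {"A", "\u00e9", "\u4e2d"}; // ASCII, Latin, Chinese
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.charAt(0))
                    + " UTF-8: " + utf8Length(s)
                    + " bytes, UTF-16: " + utf16Length(s) + " bytes");
        }
    }
}
```

For English text UTF-8 halves the size compared with UTF-16, while a Chinese character grows from two bytes to three, which is the trade-off the paragraph above describes.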
Second, the embarrassment of garbled text
1. Start with a JSP
A JSP (Java Server Page) is in essence still a servlet, so a JSP can expose the same problems that occur in servlets, and JSP code is generally simpler than servlet code.
Let's do an experiment with a JSP. The following JSP file contains the constant string "我是中国人" ("I am Chinese"). Will it come out garbled in the browser?

<%-- discomfiture.jsp --%>
<%
String str = "我是中国人";
System.out.println(str);
out.println(str);
%>
Open it in the browser: nothing is garbled, and the string "我是中国人" ("I am Chinese") is displayed. Don't celebrate yet, though. Look at the server's output window: as shown in Figure 2-1, the server console prints garbled text (underlined in red).
Figure 2-1 Garbled output in the server window
The garbling appears only on the server, and the client browser is completely correct, but something is clearly wrong here; otherwise the output would be correct on both sides.
The garbled server-side output shows that the server's Java virtual machine (JVM) did not "get" the correct string. To make sure the JVM correctly "gets" the Chinese constant string we specify, we can write each character as its Unicode internal code. That is, in place of
String str = "我是中国人";
use
String str = "\u6211\u662f\u4e2d\u56fd\u4eba";
With the string spelled out to the JVM this explicitly, will there still be garbled text? Getting a character's Unicode internal code is easy, because both Java and JavaScript use the Unicode character set. Here is a Java program that prints the Unicode code of each character:
public class getCode
{
    public static void main(String args[])
    {
        // Print each character of the first argument with its Unicode code
        char chs[] = args[0].toCharArray();
        for (int i = 0; i < chs.length; i++)
        {
            System.out.println(chs[i] + "=" + (int) chs[i]);
        }
        System.out.println(args[0]);
    }
}

Compile and execute it. The result is as follows:
Figure 2-2 JVM output
For this job, however, JavaScript is quicker to use than Java, so here is a JavaScript version as well:

<script>
var str = "我是中国人";
for (var i = 0; i < str.length; i++)
{
    document.write(str.charAt(i) + "=" + str.charCodeAt(i) + "<br>");
}
document.write(str);
</script>

Save it as an HTML file. The output is shown below:
Figure 2-2 JavaScript output in IE6.0
Now replace the Chinese string in discomfiture.jsp:

<%-- discomfiture1.jsp --%>
<%
String str = "\u6211\u662f\u4e2d\u56fd\u4eba";
System.out.println(str);
out.println(str);
%>

The result is encouraging: the server output window now prints the string correctly. The client browser, however, now shows garbled text, as shown below:
Figure 2-3 The JSP is garbled in IE6.0
Figure 2-4 Server window output after the client refreshes the page
Because the string is written directly as Unicode codes, the value of str in discomfiture1.jsp is guaranteed to be "我是中国人" ("I am Chinese"), and the server window output confirms this. It follows that the value of str in the servlet discomfiture$jsp was not "我是中国人", since it came out garbled in the server window. Looking at Figure 2-1 again, 10 characters were output, so the length of str was 10, not 5. But then why did the browser display the string correctly? To answer that, we first need two concepts: encoding and decoding.
2. Encoding and decoding
Encoding and decoding are two opposite actions. Encoding converts characters into a byte sequence according to some mapping standard (a character set); that standard is what we call the encoding used when the encoding action is performed.
For example, take the Unicode string "我是中国人" ("I am Chinese"). Encoded according to the GB2312 standard (byte bsg[] = "我是中国人".getBytes("GB2312");), it yields a byte sequence, shown here as hexadecimal values:
0xCE 0xD2 0xCA 0xC7 0xD6 0xD0 0xB9 0xFA 0xC8 0xCB
Encoded according to the UTF-8 standard (byte bsu[] = "我是中国人".getBytes("UTF-8");), it yields the byte sequence:
0xE6 0x88 0x91 0xE6 0x98 0xAF 0xE4 0xB8 0xAD 0xE5 0x9B 0xBD 0xE4 0xBA 0xBA
Decoding converts a byte sequence back into a string according to a character standard (the decoding). If we decode the byte sequence
0xCE 0xD2 0xCA 0xC7 0xD6 0xD0 0xB9 0xFA 0xC8 0xCB
according to GB2312 (new String(bsg, "GB2312")), or decode the byte sequence
0xE6 0x88 0x91 0xE6 0x98 0xAF 0xE4 0xB8 0xAD 0xE5 0x9B 0xBD 0xE4 0xBA 0xBA
according to UTF-8 (new String(bsu, "UTF-8")), we get the string "我是中国人" back. But if we decode with UTF-8 a byte sequence that was encoded with GB2312, or the reverse, the two are mismatched and the resulting string is garbled. Let us look at a test.

import java.io.UnsupportedEncodingException;

public class U2G
{
    public static void main(String args[]) throws UnsupportedEncodingException
    {
        String str = args[0];
        char chs[] = str.toCharArray();
        System.out.println("Unicode characters:");
        for (int i = 0; i < chs.length; i++)
            System.out.print(chs[i] + "=" + (int) chs[i] + ";");
        System.out.println();
        String messages[] = {
            "Encodes this String into a sequence of bytes using the"
                + "\nplatform's default charset.",
            "Encodes this String into a sequence of bytes using GB2312.",
            "Encodes this String into a sequence of bytes using UTF-8."};
        String encodings[] = {null, "GB2312", "UTF-8"};
        byte bs[][] = new byte[3][];
        // Encode the string three ways and dump the bytes in hex
        for (int h = 0; h < 3; h++)
        {
            System.out.println(messages[h]);
            if (encodings[h] == null)
                bs[h] = str.getBytes();
            else
                bs[h] = str.getBytes(encodings[h]);
            for (int l = 0; l < bs[h].length; l++)
            {
                if (l % 4 == 0) System.out.println();
                System.out.print("Byte[" + l + "]=" + Integer.toHexString(bs[h][l] & 0xff) + ";");
            }
            System.out.println();
        }
        // Decode each byte sequence with the encoding that produced it
        System.out.println("Decodes the sequence of bytes using corresponding encoding.");
        for (int i = 0; i < 3; i++)
        {
            if (encodings[i] == null)
                System.out.println(new String(bs[i]));
            else
                System.out.println(new String(bs[i], encodings[i]));
        }
        String messages1[] = {
            "Decodes the sequence of bytes encoded by GB2312 into a String \nusing UTF-8.",
            "Decodes the sequence of bytes encoded by UTF-8 into a String \nusing GB2312."};
        // Deliberately decode each byte sequence with the wrong encoding
        for (int h = 0; h < 2; h++)
        {
            System.out.println(messages1[h]);
            str = new String(bs[h + 1], encodings[h == 0 ? 2 : 1]);
            chs = str.toCharArray();
            System.out.print("Unicode characters:");
            for (int i = 0; i < chs.length; i++)
            {
                if (i % 4 == 0) System.out.println();
                System.out.print(chs[i] + "=" + (int) chs[i] + ";");
            }
            System.out.println();
        }
        System.out.println("The default encoding of system is " + System.getProperty("file.encoding"));
    }
}

The JVM output is shown in Figure 2-5. Clearly, decoding a UTF-8 encoded byte stream with GB2312 fails completely, and the original string is not recovered. The system I used is MS Windows 2000 Server, whose default character set is GBK; the experiment also shows that GBK is compatible with GB2312.
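The mismatch can also be reproduced without relying on the platform's default charset. The sketch below hard-codes the two byte sequences listed above and decodes them with charsets that every JVM provides. The class name MismatchDemo and its helpers are hypothetical. The ISO-8859-1 step is an assumption added purely for illustration: it shows one way a 5-character string can turn into 10 characters, but the article has not said which charset produced the 10-character string in Figure 2-1.

```java
import java.nio.charset.StandardCharsets;

// Sketch: decoding bytes with the wrong charset, using the sequences quoted above.
class MismatchDemo {
    // GB2312 bytes of the five-character string, as listed in the text
    static final byte[] GB = {
        (byte) 0xCE, (byte) 0xD2, (byte) 0xCA, (byte) 0xC7, (byte) 0xD6,
        (byte) 0xD0, (byte) 0xB9, (byte) 0xFA, (byte) 0xC8, (byte) 0xCB};
    // UTF-8 bytes of the same string, as listed in the text
    static final byte[] UTF8 = {
        (byte) 0xE6, (byte) 0x88, (byte) 0x91, (byte) 0xE6, (byte) 0x98,
        (byte) 0xAF, (byte) 0xE4, (byte) 0xB8, (byte) 0xAD, (byte) 0xE5,
        (byte) 0x9B, (byte) 0xBD, (byte) 0xE4, (byte) 0xBA, (byte) 0xBA};

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Assumption for illustration: a one-byte-per-character decoding
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Matching decoder: the original five characters come back
        System.out.println(decodeUtf8(UTF8).length());                 // prints 5
        // Wrong decoder: malformed input becomes U+FFFD replacement characters
        System.out.println(decodeUtf8(GB).indexOf('\uFFFD') >= 0);     // prints true
        // One byte per character: 10 bytes become 10 characters
        System.out.println(decodeLatin1(GB).length());                 // prints 10
    }
}
```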
");