Chinese and related issues in J2EE Web Components
Author: whodsow (original)
"Unlike C / C , the character data in Java is 16-bit non-symbolic data, which represents the Unicode set, not just ASCII set" 1. This is a good practice, which solves more programming problems on WWW, such as the internationalization of low cost (INTERNATIONAL), but uses 16 characters, but it brings waste, after all, the information processed by Java. Most of them are English, and the 7-bit ACSII code is already enough, but Unicode needs double space, so this kind of Java's practice is compromising with storage resources and efficiency. . For China's Java programmers (especially primary), Java adopts Unicode characters, but it brings us that even the nightmares --Web page is not Chinese, but garbled. First, the Common Character Set Introduction The character set is a collection of mappings between the characters to the character of the character. ASCII Character A is the form of performance in the internal code 0x41, so in many programming languages, character variables and integer variables are only different. 1. ISO8859 Series ISO8859 includes a series of characters such as ISO8859-1, ISO8859-2, all of which are 8-bit character set, 0 ~ 0x
7f
Still compatible with the ASCII character set, greater than 0x
7f
It is the extension of various Latin characters or European characters. 2. GB2312 character set If it is like ISO8859 series, it is greater than 0x
7f
The characters are used to represent Chinese characters, then up to 128, which is obviously insufficient, so there is a character set generated by the GB2312 standard. If the current byte (8 bit) is less than 0x80, it is still as it is English characters; if it It is greater than or equal to 0x80, which makes a Chinese character character, so that the GB2312 character set can contain approximately more than 4,000 common symbols in other Chinese characters (such as 12). Other similar Chinese character sets also have GBK (GB2312 expansion), GB18030, BIG5 (used, Taiwan), detailed specification, refer to: http://www.unihan.com.cn/cjk/ana17.htm 3 The Unicode character set Unicode characters were originally 16-bit (forced to add substitute), and it was compatible with 7 US-ASCII, and the MS's Windows NT / 2000 / XP and Sun's Java used it as The default character set, which was originally the factual standard for the US Business Alliance, which follows the International General Character (UCS) set standard: ISO / IEC 10646. The main goal of Unicode is to provide a "general character set", which includes all the language, letters and text in the world, so in the Unicode character set, not only "i" is a letter, "I" is also a letter, written Java You can also "int I am Chinese = 0xFF;". After all, the 16-bit Unicode character set is only 216 = 65536 characters, it is not enough to represent all characters in practical applications, and in the Internet age in English, its use, storage and transmission, extremely wasting space Therefore, the two specification of UTF-8 (UNICODE TRANSFORMATION FORM 8-bit Form) and UTF-16 appears, in UTF-8, it is characterized in US-ASCII, still used One byte is represented, and is compatible with US-ASCII, encoding other characters, use 1 (greater than 0x7f
Part) to 3 bytes. UTF-8 becomes longature and complexity, the characters of non-ASCII are not friendly, and it has also begun to violate the original intention of Unicode. UTF-16 is a very simple encoding method, which fully follows the Unicode standard, with 16-bit fixed-length space to represent some Unicode character sets. For more specifications for Unicode, visit the Unicode Alliance Site: http: //www.unicode.org ,UTF-8 and UTF-16 are defined in the RFC 2279 and RFC 2781 of IETF, respectively, can pass through http: // www. Ietf.org/rfc2279.txt or http://www.ietf.org/rfc2781.txt access them. In general, the character set name is not sensitive, so GB2312 can also write GB2312 or GB2312. Second, the embarrassment of garbled 1. First look at a JSP JSP (Java Server Page) or a servlet, so use JSP, you can also explain some of the Questions in the servlet, in general, JSP code is more than servlet code. simple. Let's use JSP to do an experiment. The following JSP file contains a constant string "I am Chinese", see if it is garbled in the browser output? <% - discomfiture.jsp -%> <% string str = "I am Chinese"; system.out.println (STR); out.println (str);%> Open it from the browser, did not garbled The display is "I am Chinese" string. Be happy, look at the server's output window, as shown in Figure 2-1, the server monitor window outputs garbled (red underlined). Figure 2-1 The garbled code output in the server window is just garbled in the server side, and the client browser is completely correct, but it is obviously what problems here, otherwise it should be the correct output. The server side outputs garbled, indicating that the server Java virtual machine (Java Virtual Machine, JVM) does not "get" the correct string. 
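One way to picture what "not getting the correct string" can mean is to decode the same bytes with the wrong character set. The sketch below is only an illustration under stated assumptions, not a diagnosis of the server in the experiment: it assumes the text is stored as GB2312 bytes and that something reads those bytes as ISO8859-1. The class and method names (CharsetRoundTrip, misdecode, recover) are made up for this example, and the four Chinese characters are written as Unicode escapes so the source file stays pure ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class CharsetRoundTrip {
    // Assumed encoding of the stored text (an assumption of this sketch).
    static final Charset GB2312 = Charset.forName("GB2312");

    // Encode as GB2312, then decode those bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] raw = text.getBytes(GB2312);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: ISO8859-1 maps every byte value to exactly one
    // character, so the original bytes can be recovered without loss.
    static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, GB2312);
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as escapes (ASCII-only source).
        String original = "\u6211\u662f\u4e2d\u6587";
        String garbled = misdecode(original);

        // Each Chinese character is two GB2312 bytes, and each byte
        // becomes one ISO8859-1 character when decoded wrongly.
        System.out.println("chars in original: " + original.length()); // 4
        System.out.println("chars in garbled:  " + garbled.length());  // 8
        System.out.println("round trip restores original: "
                + recover(garbled).equals(original));

        // Byte counts of the same text in the two Unicode encodings.
        System.out.println("UTF-8 bytes:    "
                + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE bytes: "
                + original.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```

Because the wrongly decoded string still carries every original byte, writing it back out as ISO8859-1 reproduces the GB2312 bytes unchanged. That is one way a page could look right to a reader who decodes it as GB2312 even though the intermediate Java string is not the intended text.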
To make sure the JVM "gets" exactly the Chinese constant string we mean, we can write each character as its Unicode code instead of typing the character itself, replacing

    String str = "I am Chinese";

with

    String str = "\u0049\u0020\u0061\u006d\u0020\u0043\u0068\u0069\u006e\u0065\u0073\u0065";

Now that the JVM is told explicitly which characters the string contains, will there still be garbled output?

Finding a character's Unicode code is easy, since both Java and JavaScript use the Unicode character set. First, a Java program that prints the Unicode code of each character:

    public class GetCode {
        public static void main(String[] args) {
            // Split the first command-line argument into its characters
            char[] chs = args[0].toCharArray();
            for (int i = 0; i < chs.length; i++) {
                // Print each character next to its numeric (decimal) code
                System.out.println(chs[i] + " = " + (int) chs[i]);
            }
            System.out.println(args[0]);
        }
    }

Compile it and run it to see the result. JavaScript can do the same job, and more quickly than Java. Here is the JavaScript code: