Chinese-language and related issues in J2EE Web components (1)

zhaozj 2021-02-16

"Unlike C/C++, character data in Java is 16-bit unsigned data that represents the Unicode set, not just the ASCII set." [1] This is a good practice: it solves more of the programming problems on the WWW, such as low-cost internationalization. But using 16-bit characters also brings waste. After all, most of the information Java processes is English, for which 7-bit ASCII is already enough, while Unicode needs double the space, so Java's approach is a compromise on storage resources and efficiency. For Java programmers in China (especially beginners), Java's adoption of Unicode characters has also brought a lingering nightmare: the Web page shows not Chinese, but garbled text.

I. Common character sets

A character set is a collection of mappings between characters and their internal codes. The ASCII character A, for example, is represented by the internal code 0x41, which is why in many programming languages character variables and integer variables differ only in length.
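The mapping between a character and its internal code can be seen directly in Java. The following is only a minimal sketch; the class name CharCodeDemo and the helper codeOf are made up for illustration.

```java
// Hypothetical demo class: shows that a char is just a number underneath.
class CharCodeDemo {
    // Returns the internal (numeric) code of a character.
    static int codeOf(char c) {
        return c; // widening conversion from char to int, no cast needed
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(codeOf('A'))); // prints 41
        System.out.println((char) 0x41);                      // prints A
    }
}
```

Going from code to character needs an explicit cast because an int is wider than a char.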

1. ISO8859 series

ISO8859 comprises a series of character sets such as ISO8859-1 and ISO8859-2. They are all 8-bit character sets: the codes 0 to 0x7F remain compatible with the ASCII character set, and the codes greater than 0x7F are extensions for various Latin or other European characters.

2. GB2312 character set

If, as in the ISO8859 series, only the codes greater than 0x7F were used to represent Chinese characters, just 128 of them could be represented, which is obviously not enough. The GB2312 standard therefore defines a character set in which a byte (8 bits) less than 0x80 is still treated as an English character, while a byte greater than or equal to 0x80 combines with the byte after it to form one Chinese character. In this way GB2312 can hold several thousand commonly used simplified Chinese characters (6,763 in the standard) together with other symbols used in Chinese text. Similar Chinese character sets include GBK (an extension of GB2312), GB18030, and BIG5 (used in Taiwan). For the detailed specifications, see: http://www.unihan.com.cn/cjk/ana17.htm
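The byte rule described above can be sketched as a small scanner. This is an illustration of the rule as stated in the text, not a full GB2312 decoder; the class name Gb2312ScanDemo and the method countChars are hypothetical, and the sample bytes 0xD6 0xD0 are the GB2312 code for 中.

```java
// Hypothetical demo class: counts characters in a GB2312-style byte stream.
class Gb2312ScanDemo {
    // A byte below 0x80 is one single-byte character; a byte at or above 0x80
    // is taken together with the byte that follows it.
    static int countChars(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;   // view the signed Java byte as 0..255
            i += (b < 0x80) ? 1 : 2;  // one byte for ASCII, two for a Chinese character
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', then the two GB2312 bytes for one Chinese character, then 'B'
        byte[] sample = { 0x41, (byte) 0xD6, (byte) 0xD0, 0x42 };
        System.out.println(countChars(sample)); // prints 3
    }
}
```

Four bytes yield three characters, which is why byte counts and character counts must not be confused when handling Chinese text.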

3. Unicode character set

The Unicode character set was originally 16-bit (it was later extended as needs grew) and remains compatible with 7-bit US-ASCII. Microsoft's Windows NT/2000/XP and Sun's Java use it as their default character set. It began as a de facto standard from a US industry consortium and follows the Universal Character Set (UCS) standard, ISO/IEC 10646. The main goal of Unicode is to provide a "universal character set" that covers all the languages, letters, and scripts in the world. In the Unicode character set, therefore, not only is "i" a letter, "我" is a letter too, and in Java you can even write "int 我是中文 = 0xFF;".

A 16-bit Unicode character set, however, holds only 2^16 = 65536 characters, which is not enough to represent every character needed in practice. And in an Internet age dominated by English, using, storing, and transmitting 16-bit characters wastes a great deal of space. Hence the two specifications UTF-8 (Unicode Transformation Format, 8-bit form) and UTF-16. In UTF-8, characters in US-ASCII are still represented by a single byte, compatible with US-ASCII; other characters in the 16-bit range are encoded with two or three bytes, each greater than 0x7F (characters beyond that range need four). UTF-8 is variable-length and more complex, is unfriendly to non-ASCII characters, and has begun to depart from Unicode's original intent. UTF-16 is a very simple encoding that follows the Unicode standard directly, using fixed-length 16-bit units to represent part of the Unicode character set (characters outside that part take a pair of 16-bit units).

For more of the Unicode specifications, visit the Unicode Consortium site: http://www.unicode.org. UTF-8 and UTF-16 are defined in IETF RFC 2279 and RFC 2781 respectively, available at http://www.ietf.org/rfc2279.txt and http://www.ietf.org/rfc2781.txt. In general, character set names are not case-sensitive, so GB2312 can also be written gb2312 or Gb2312.
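The space trade-off between UTF-8 and UTF-16, and the case-insensitivity of character set names, can be checked with the standard java.nio.charset classes. This is a small sketch; the class name UtfLengthDemo and its helper methods are made up for illustration, and the Chinese character 中 is written as a Unicode escape so the result does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares encoded lengths in UTF-8 and UTF-16.
class UtfLengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE writes no byte-order mark, so only the characters are counted
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Looks up both names and reports whether they resolve to the same charset.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));                // prints 1
        System.out.println(utf8Length("\u4e2d"));           // prints 3
        System.out.println(utf16Length("A"));               // prints 2
        System.out.println(utf16Length("\u4e2d"));          // prints 2
        System.out.println(sameCharset("utf-8", "UTF-8"));  // prints true
    }
}
```

English text is half the size in UTF-8, while a Chinese character costs one byte more than in UTF-16.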

II. The embarrassment of garbled text

1. Let's start with a JSP

A JSP (JavaServer Pages) page is in essence still a servlet, so a JSP can also reveal problems that occur in servlets, and in general JSP code is simpler than servlet code.

Let's use a JSP to do an experiment. The following JSP file contains the constant string 我是中文 ("I am Chinese"). Will it come out garbled in the browser?

<%-- discomfiture.jsp --%>
<%
    String str = "我是中文";  // "I am Chinese"
    System.out.println(str);  // written to the server console
    out.println(str);         // written to the browser
%>

Open it in the browser: nothing is garbled, and the string is displayed correctly. Don't celebrate yet. Look at the server's output window, shown in Figure 2-1: the server console prints garbled text (underlined in red).

Figure 2-1 Garbled output in the server window

Although the text is garbled only on the server and the client browser displays it correctly, something is clearly wrong here; otherwise the output would be correct on both sides.
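One common way for the same text to look right in one place and garbled in another is for bytes written under one encoding to be read under another. The sketch below does not reproduce the server's actual behaviour; the choice of UTF-8 and ISO-8859-1 is an assumption made only to illustrate the mechanism, and the class name MojibakeDemo and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encodes text one way and decodes it another way.
class MojibakeDemo {
    // Encodes as UTF-8 but decodes as ISO-8859-1, producing garbled text.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: recovers the original bytes, then decodes them correctly.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as Unicode escapes
        String text = "\u6211\u662f\u4e2d\u6587";
        String garbled = misread(text);
        System.out.println(garbled.length());             // prints 12
        System.out.println(repair(garbled).equals(text)); // prints true
    }
}
```

Each of the four characters takes three bytes in UTF-8, and every byte turns into one unrelated Latin character, so the four characters become twelve. The repair works here only because ISO-8859-1 maps every byte value to a character and back without loss.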

The garbled output on the server side indicates that the server's Java Virtual Machine (JVM) did not "get" the correct string. To make sure the JVM correctly "gets" the Chinese constant string we specify, we can replace the characters in the string with their Unicode internal codes. That is, instead of

    String str = "我是中文";

use

    String str = "\u6211\u662f\u4e2d\u6587";

With the string spelled out to the JVM this explicitly, will the output still be garbled? Getting a character's Unicode internal code is very easy, because both Java and JavaScript use the Unicode character set for their characters. Let's look at a Java program that outputs the Unicode code of each character:

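public class getCode
{
    public static void main(String args[])
    {
        // Split the first command-line argument into its characters
        char chs[] = args[0].toCharArray();
        for (int i = 0; i < chs.length; i++) {
            // Print each character followed by its Unicode code as a decimal number
            System.out.println(chs[i] + "=" + (int) chs[i]);
        }
        System.out.println(args[0]);
    }
}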

Compile and run it; the result is as follows:

Figure 2-2 JVM output
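The program above prints the codes as decimal numbers, whereas a Java string literal needs them in the four-digit hexadecimal \uXXXX form. A possible variation that produces that form directly is sketched below; the class name EscapeDemo and the method toUnicodeEscapes are made up for illustration and are not part of the original program.

```java
// Hypothetical demo class: converts a string into Java-style Unicode escapes.
class EscapeDemo {
    // Emits one backslash-u escape with four hex digits for every char in the input.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Uses the first command-line argument, or "A" when none is given
        String input = args.length > 0 ? args[0] : "A";
        System.out.println(toUnicodeEscapes(input));
    }
}
```

The output can be pasted straight into a string literal, which keeps the source file pure ASCII regardless of how it is saved or compiled.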

However, when it comes to quick use, JavaScript is faster than Java, so here is a piece of JavaScript code as well: