An In-Depth Discussion of the Chinese Character Problem [zz]

xiaoxiao2021-03-06  93

I. Topic: Java's Chinese character problem. Chinese text handling is a prominent problem in Java, showing up mainly in console output, JSP page output, and database access. This article sets font issues aside and discusses only encoding. It explains where Java's Chinese problem comes from and how to solve it, and gives an approach for accessing a database through JDBC.

II. Problem description: 1) The program is compiled and run in the Chinese console window of Chinese Windows 2000 (W2000), using the international version of the JDK, and connects to a SQL Server database on Chinese W2000 (CP936 encoding).

J:/supercise/demo/encode/helloworld> make
Created by xcompiler. Philosoft All Rights Reserved.
Wed May 30 02:54:45 CST 2001

J:/supercise/demo/encode/helloworld> run
Created by xrunner. Philosoft All Rights Reserved.
Wed May 30 02:51:33 CST 2001
中文
[B@7bc8b569
[B@7b08b569
[B@7860b569
中文
中文
????
中文
中文
????
??
??
??

2) If the program is compiled in the Western console window (code page 437) of Chinese W2000 and run with java there, nothing legible appears because that window has no Chinese font. If it is instead run, as in case 1, in the Chinese console window of Chinese W2000, the output is:

J:/supercise/demo/encode/helloworld> run
Created by xrunner. Philosoft All Rights Reserved.
Wed May 30 02:51:33 CST 2001
????
[B@7b04b66a
[B@7818b66a
????
????
????
????
????
????
中文
中文
????

III. Analysis

1) Garbled output appears (that is, "?"). Because only "?" appears and no small boxes, this is an encoding problem, not a font problem. When text is converted from one character set to another, typically from GB2312 to ISO8859_1 (Latin-1, an 8-bit extension of ASCII), many Chinese characters (more precisely, each half of a Chinese character) cannot be mapped to a Western character, and the system substitutes "?" for them. Likewise, some characters cannot be mapped when converting from a small character set to a large one. The reasons are not covered here.
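The substitution described above can be reproduced with a minimal sketch. The class and method names (`UnmappableDemo`, `toLatin1`) are made up for illustration, and the string is written with Unicode escapes so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encode text as ISO-8859-1. Characters that have no Latin-1 mapping
    // are replaced by the charset's replacement byte, '?' (0x3F).
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character string used in this article.
        byte[] bytes = toLatin1("\u4e2d\u6587");
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b); // prints 0x3F for each character
        }
        System.out.println();
    }
}
```

Once a character has been replaced this way the original information is gone, which is why no later conversion can bring it back.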

2) Chinese characters display correctly in some places when the program is compiled and run in the Chinese environment, and a similar situation occurs when it is compiled in the Western environment and run in the Chinese environment. This is the result of automatic (default) or manual transcoding, that is, of new String(bytes[, encode]) and String.getBytes([encode]).

2.1) In Java, every step along the chain source file -> javac -> class -> java -> getBytes() -> new String() involves an encoding conversion. This conversion always takes place, although sometimes with default parameters. Below we go through, step by step, why the output above appears.

2.2) Here is the source code:

HelloWorld.java:
------------------------
public class HelloWorld {
    public static void main(String[] argv) {
        try {
            System.out.println("中文"); // 1
            System.out.println("中文".getBytes()); // 2
            System.out.println("中文".getBytes("GB2312")); // 3
            System.out.println("中文".getBytes("ISO8859_1")); // 4

            // default-encoded bytes, decoded three ways
            System.out.println(new String("中文".getBytes())); // 5
            System.out.println(new String("中文".getBytes(), "GB2312")); // 6
            System.out.println(new String("中文".getBytes(), "ISO8859_1")); // 7

            // GB2312-encoded bytes, decoded three ways
            System.out.println(new String("中文".getBytes("GB2312"))); // 8
            System.out.println(new String("中文".getBytes("GB2312"), "GB2312")); // 9
            System.out.println(new String("中文".getBytes("GB2312"), "ISO8859_1")); // 10

            // ISO8859_1-encoded bytes, decoded three ways
            System.out.println(new String("中文".getBytes("ISO8859_1"))); // 11
            System.out.println(new String("中文".getBytes("ISO8859_1"), "GB2312")); // 12
            System.out.println(new String("中文".getBytes("ISO8859_1"), "ISO8859_1")); // 13
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

For convenience, each conversion is followed by a sequence number in a comment: 1, 2, ..., 13.

2.3) Note that javac reads the source file using the system default encoding and then encodes the text as Unicode. At run time Java also works in Unicode internally, while default input and output use the operating system's default encoding. That is:

new String(bytes[, encode]): the system treats the input as a byte stream encoded in encode. In other words, interpreting bytes according to encode gives the correct characters. Because the result is finally stored inside Java, it still has to be converted from encode to Unicode, so there is a bytes -> encode characters -> Unicode characters conversion;

String.getBytes([encode]): the system performs a Unicode characters -> encode characters -> bytes conversion.
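The two directions described in 2.3 can be sketched as follows. This is only an illustration: the names `RoundTripDemo`, `encode`, and `decode` are hypothetical, and it assumes the JDK in use provides the GBK charset (standard desktop JDK builds do).

```java
import java.nio.charset.Charset;

class RoundTripDemo {
    // Assumption: the running JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Unicode characters -> GBK characters -> bytes
    static byte[] encode(String s) {
        return s.getBytes(GBK);
    }

    // bytes -> GBK characters -> Unicode characters
    static String decode(byte[] bytes) {
        return new String(bytes, GBK);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d";          // the single character traced in this article
        byte[] bytes = encode(zhong);      // two bytes: 0xD6 0xD0
        String back = decode(bytes);       // the same Unicode character again
        System.out.printf("%d bytes: 0x%02X 0x%02X%n", bytes.length, bytes[0], bytes[1]);
        System.out.println(back.equals(zhong));
    }
}
```

As long as the same encoding is used in both directions, the round trip is lossless. The trouble discussed below starts when the two directions disagree.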

In this example, except where the Western window's encoding applies, the default encoding is GBK (for this discussion we treat GBK and GB2312 as equivalent).

2.4) Because a conversion that does not specify an encode uses the default encoding (GBK here), cases 5, 6, 7 are the same as cases 8, 9, 10 respectively; likewise 8 is the same as 9, and 11 is the same as 12. We therefore discuss only cases 1, 9, 10, 12, and 13. Cases 2, 3, and 4 are only for testing and are outside this discussion.
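The role of the default encoding in 2.4 can be checked with a small sketch. The class and method names are hypothetical. Which charset is reported depends on the platform and JDK version, so on a system whose default is not GBK the printed name will differ from the one assumed in this article.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // getBytes() with no argument uses the platform default charset,
    // so it should match an explicit call with Charset.defaultCharset().
    static boolean sameAsDefault(String s) {
        byte[] implicit = s.getBytes();
        byte[] explicit = s.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("implicit == explicit: " + sameAsDefault("\u4e2d\u6587"));
    }
}
```

Passing the encoding explicitly, as cases 9, 10, 12, and 13 do, removes this dependence on the environment.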

2.5) Let us trace the conversions of the character "中" through the program, starting with compilation and execution in the Chinese window. Note that in the letter subscripts below I deliberately use numbers to show which values are the same, different, or related.

2.5.1) Let's first take case 9 of the 13 code segments above:

Step  Content  Location                    Description
01    C1       HelloWorld.java             C1 denotes a GBK character
02    U1       javac reads the file        U1 denotes a Unicode character
03    C1       getBytes(), first step      Java first consults the operating system
04    B1, B2   getBytes(), second step     then returns the byte array
05    C1       new String(), first step    Java first consults the operating system
06    U1       new String(), second step   then returns the character
07    C1       println(String)             "中" is displayed, identical to the original content

2.5.2) Next take case 10 as an example. It differs only from step 05 onward:

Step  Content  Location                    Description
01    C1       HelloWorld.java             C1 denotes a GBK character
02    U1       javac reads the file        U1 denotes a Unicode character
03    C1       getBytes(), first step      Java first consults the operating system
04    B1, B2   getBytes(), second step     then returns the byte array
05    C3, C4   new String(), first step    Java first consults the operating system, then parses the bytes incorrectly
06    U5, U6   new String(), second step   then returns the characters
07    C3, C4   println(String)             the Chinese character has been split into two halves, and neither resulting ISO8859_1 character can be mapped for output, so "??" is displayed

In the example above, "中文" is therefore displayed as "????".

2.5.3) The other cases in the all-Chinese environment are similar, so I will not go through them.
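The wrong decoding traced in 2.5.2 can be sketched as below. The names `MisdecodeDemo` and `misdecode` are hypothetical, and the sketch assumes the JDK provides the GBK charset. It prints code points rather than the characters themselves, so the output does not depend on the console.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Assumption: the running JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Encode as GBK, then (wrongly) decode the bytes as ISO-8859-1.
    // Each two-byte Chinese character becomes two unrelated Latin-1 characters.
    static String misdecode(String s) {
        byte[] gbkBytes = s.getBytes(GBK);
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wrong = misdecode("\u4e2d\u6587");
        System.out.println(wrong.length()); // 4 characters instead of 2
        for (char c : wrong.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+00D6 U+00D0 U+00CE U+00C4
        }
        System.out.println();
    }
}
```

The four characters carry the original byte values unchanged, which is the property the recovery in 2.6.2 relies on.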

2.6) Next, let's look at why a class compiled in the Western DOS window behaves similarly when run in the Chinese window, and especially why in some cases the Chinese characters can still be displayed correctly.

2.6.1) Again we start with case 9:

Step  Content    Location                    Description
01    C1C2       HelloWorld.java             C1C2 denote ISO8859_1 characters; the character "中" has been split apart
02    U3U4       javac reads the file        U3U4 denote Unicode characters
03    C5C6       getBytes(), first step      Java first consults the operating system, then parses incorrectly
04    B5B6B7B8   getBytes(), second step     then returns the byte array
05    C5C6       new String(), first step    Java first consults the operating system
06    U3U4       new String(), second step   then returns the characters
07    C5C6       println(String)             there are still two characters, but they are no longer the original "two ISO8859_1 characters"; they are "two GBK characters", so "中" is displayed as "??" and "中文" as "????"

2.6.2) Now take case 12 as an example, because it displays the Chinese characters correctly:

Step  Content  Location                    Description
01    C1C2     HelloWorld.java             C1C2 denote ISO8859_1 characters; the character "中" has been split apart
02    U3U4     javac reads the file        U3U4 denote Unicode characters
03    C1C2     getBytes(), first step      Java first consults the operating system (note: still correct!)
04    B5B6     getBytes(), second step     then returns the byte array (this is the key step!)
05    C12      new String(), first step    Java first consults the operating system (this is the extra step: Java now knows to parse B5B6 as one Chinese character!)
06    U7       new String(), second step   then returns the character (two really have become one! U7 contains the information of U3U4)
07    C12      println(String)             this is the original "中" character, wrongly split apart by javac but put right by the programmer!

Of course, "中文" is then displayed correctly as well!
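The repair in 2.6.2 can be sketched as follows. The names `RecoverDemo` and `recover` are hypothetical. The sketch follows the tables above in treating the wrong encoding as ISO-8859-1 and the right one as GBK, and it assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecoverDemo {
    // Assumption: the running JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Undo a wrong ISO-8859-1 decoding of GBK bytes.
    // ISO-8859-1 maps each char U+0000..U+00FF to the byte of the same value,
    // so getBytes recovers the original bytes exactly (the key step 04),
    // and decoding them as GBK merges each pair back into one character.
    static String recover(String misread) {
        byte[] originalBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";
        // Simulate the damage: GBK bytes read as if they were ISO-8859-1.
        String misread = new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
        String fixed = recover(misread);
        System.out.println(misread.length()); // 4
        System.out.println(fixed.length());   // 2
        System.out.println(fixed.equals(original));
    }
}
```

This only works because the wrong decoding was lossless. If the first conversion had already replaced characters with "?", as in case 11 under the Chinese window, there would be nothing left to recover.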

3) So why does garbled text sometimes appear when using JDBC's new String(RecordSet.getBytes(int)[, encode]), RecordSet.getString(int), RecordSet.setBytes(String.getBytes([encode])), and RecordSet.setString(String)?

In fact, the problem lies in how the JDBC driver is written. After it reads data from the database, it may take it upon itself to convert from GB2312 (the default encoding) to Unicode. My WebLogic JDBC driver for SQL Server behaves this way: when I read a string back, what I get is not the correct Chinese characters, yet annoyingly I can write a Chinese string directly, which is hard to accept! In other words, we have to transcode when reading or when writing, although this transcoding is sometimes not obvious, because the default encoding is used. As for what the JDBC driver actually does, we could only find out by reading its source code, couldn't we?
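One way to keep such transcoding in a single place is a pair of helpers applied around the driver calls. The sketch below is only an assumption-laden illustration: the class and method names (`JdbcTranscode`, `fromDriver`, `toDriver`) are hypothetical, and it assumes a driver that hands back GBK data decoded as ISO-8859-1. A real driver may use a different pair of encodings, or need no correction at all, so the charsets must be checked against the driver in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JdbcTranscode {
    // Assumptions: the database stores GBK bytes, and the driver converts
    // between bytes and String using ISO-8859-1. Both are illustrative only.
    static final Charset STORED = Charset.forName("GBK");
    static final Charset DRIVER = StandardCharsets.ISO_8859_1;

    // Apply to a value returned by getString(...): re-decode it with the real encoding.
    static String fromDriver(String value) {
        return value == null ? null : new String(value.getBytes(DRIVER), STORED);
    }

    // Apply to a value before setString(...): the inverse conversion.
    static String toDriver(String value) {
        return value == null ? null : new String(value.getBytes(STORED), DRIVER);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        String onTheWire = toDriver(text);       // what would be passed to the driver
        String readBack = fromDriver(onTheWire); // what the application sees after reading
        System.out.println(readBack.equals(text));
    }
}
```

Whether only reads, only writes, or both need this treatment depends on the driver, which matches the asymmetry described above.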

When reposting, please cite the original address: https://www.9cbs.com/read-106682.html
