Java Chinese-character problems in detail: an anatomy of the underlying encoding

xiaoxiao2021-03-06  44

1. Bytes and Unicode

Java is Unicode at its core, and so are class files, but many media, including files and streams, store data as byte streams. Java therefore has to convert these byte streams. A char is Unicode; a byte is a byte. The byte/char conversion functions in Java live in the sun.io package. The ByteToCharConverter class does the dispatching, and it can tell you which converter you are using. Two of its most commonly used static functions are:

public static ByteToCharConverter getDefault();
public static ByteToCharConverter getConverter(String encoding);

If you don't specify a converter, the system automatically uses the current default encoding: GBK on a GB (Chinese) platform, 8859_1 on an English platform.

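Let's look at a simple example:

The GB code of "你" ("you") is 0xC4E3; its Unicode value is 0x4F60.

You use:

String encoding = "GB2312";
byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
char[] c = converter.convertAll(b);
// The loop is cut off in the source after "i"; printing each char in hex is assumed here.
for (int i = 0; i < c.length; i++) {
    System.out.println("0x" + Integer.toHexString(c[i]));
}

This prints 0x4f60.

But if you use the 8859_1 encoding, it prints:

0x00c4, 0x00e3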

Example 1

In the reverse direction:

String encoding = "GB2312";
char[] c = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
byte[] b = converter.convertAll(c);
// The loop is cut off in the source after "i"; printing each byte in hex is assumed here.
for (int i = 0; i < b.length; i++) {
    System.out.println("0x" + Integer.toHexString(b[i] & 0xff));
}

This prints 0xC4, 0xE3.

Example 2

If you use 8859_1, the result is 0x3f, that is "?", which means the character cannot be converted.

Many Chinese-character problems derive from these two simple classes. However, many classes do not accept an encoding argument directly, which is inconvenient. Many programs rarely specify an encoding and rely on the default one instead, which makes porting them very difficult.
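The two conversions above can be reproduced with the public java.nio.charset API instead of the internal sun.io classes. This is only a minimal sketch: the class and helper names (ByteCharDemo, hexChars, hexBytes) are made up for illustration, and it assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCharDemo {
    // Format every char of s as a 4-digit hex code unit, separated by spaces.
    static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) ch));
        }
        return sb.toString();
    }

    // Format every byte as a 2-digit hex value, separated by spaces.
    static String hexBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        Charset latin1 = StandardCharsets.ISO_8859_1;
        byte[] b = {(byte) 0xC4, (byte) 0xE3};

        // bytes -> chars (Example 1)
        System.out.println(hexChars(new String(b, gb2312))); // 4f60
        System.out.println(hexChars(new String(b, latin1))); // 00c4 00e3

        // chars -> bytes (Example 2)
        String you = "\u4f60";
        System.out.println(hexBytes(you.getBytes(gb2312)));  // c4 e3
        System.out.println(hexBytes(you.getBytes(latin1)));  // 3f
    }
}
```

Passing the charset explicitly on every call is the point of the sketch: none of the four results depends on the platform default.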

2. UTF-8

UTF-8 maps onto Unicode in a very simple way:

7-bit Unicode: 0 _ _ _ _ _ _ _

11-bit Unicode: 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _

16-bit Unicode: 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

21-bit Unicode: 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

In most cases only Unicode values of 16 bits or fewer are used:

The GB code of "你" is 0xC4E3; its Unicode value is 0x4F60.

We continue with the example above.

Example 1: 0xC4E3 in binary: 1 1 0 0 0 1 0 0 1 1 1 0 0 0 1 1

Since there are only two bytes, we try to match them against the two-byte pattern, but this does not work: the second byte, 11100011, would have to start with 1 0, yet its seventh bit is 1 rather than 0, so the result is "?".

Example 2: 0x4F60 in binary:

0 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0

Filling these bits into the three-byte UTF-8 pattern gives:

11100100 10111101 10100000

E4 - BD - A0

So the result is 0xE4, 0xBD, 0xA0.
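The bit-filling in Example 2 can be written out by hand. The following is a sketch only: Utf8ByHand, encode and isValidUtf8 are illustrative names, the encoder covers just the 7-, 11- and 16-bit patterns, and it ignores surrogates. It also checks Example 1 by asking the JDK's strict UTF-8 decoder whether C4 E3 is well-formed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Encode one value below 0x10000 with the 1-, 2- or 3-byte pattern.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("only 16-bit values are handled in this sketch");
        }
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // True if the bytes are well-formed UTF-8 according to the JDK decoder.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4F60);
        byte[] library = "\u4f60".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library));                        // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC4, (byte) 0xE3}));    // false
    }
}
```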

3. String and byte[]

At its core a String is a char[]; to turn bytes into a String, however, they must be decoded with some encoding. String.length() is really the length of the char array, so with a different encoding the same bytes may be split differently, which leads to broken and garbled text. Example:

byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
String str = new String(b, encoding);

If encoding = 8859_1 there are two characters, but with encoding = GB2312 there is only one.

This problem often comes up when handling pagination.
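To see why this matters for pagination, here is a small sketch that counts characters for the same bytes under both encodings. StringLengthDemo and decodedLength are hypothetical names, the two-character string "你好" is an added illustration, and the GB2312 charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

public class StringLengthDemo {
    // Number of chars produced when the bytes are decoded with the given encoding.
    static int decodedLength(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding)).length();
    }

    public static void main(String[] args) {
        // "你好" encoded as GB2312 takes four bytes, two per character.
        byte[] b = "\u4f60\u597d".getBytes(Charset.forName("GB2312"));
        System.out.println(b.length);                       // 4
        System.out.println(decodedLength(b, "ISO-8859-1")); // 4
        System.out.println(decodedLength(b, "GB2312"));     // 2
    }
}
```

A page boundary computed from the 8859_1 length can therefore fall in the middle of a two-byte character.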

4. Reader, Writer / InputStream, OutputStream

Reader and Writer work on char at their core; InputStream and OutputStream work on byte.

But the main purpose of Reader and Writer is to read and write chars through an InputStream / OutputStream.

An example with a reader:

The file test.txt contains only the single character "你": 0xC4, 0xE3.

String encoding = "GB2312"; // value missing in the source; "GB2312" or "8859_1" per the text below
InputStreamReader reader = new InputStreamReader(new FileInputStream("test.txt"), encoding);
char[] c = new char[10];
int length = reader.read(c);
// The loop is cut off in the source after "i"; printing each char is assumed here.
for (int i = 0; i < length; i++) {
    System.out.println(c[i]);
}

If encoding is GB2312, there is only one character; if encoding is 8859_1, there are two characters.
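The same experiment can be run without a file on disk. This is a minimal sketch in which a ByteArrayInputStream stands in for test.txt; ReaderDemo and countChars are illustrative names, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ReaderDemo {
    // Read the bytes through an InputStreamReader and return how many chars came out.
    static int countChars(byte[] data, String encoding) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(encoding))) {
            char[] c = new char[10];
            int length = reader.read(c); // -1 if the stream is empty
            return Math.max(length, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContent = {(byte) 0xC4, (byte) 0xE3}; // stands in for test.txt
        System.out.println(countChars(fileContent, "GB2312"));     // 1
        System.out.println(countChars(fileContent, "ISO-8859-1")); // 2
    }
}
```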

2. What we need to know about Java's compiler: javac -encoding

We often leave out the -encoding parameter, yet it is important for cross-platform work. If you do not specify an encoding, the system default is used: GB2312 on a GB platform and ISO8859_1 on an English platform. The Java compiler actually calls the sun.tools.javac.Main class to compile files. This class has an encoding variable inside its compile function, and the -encoding parameter is passed straight to that variable. The compiler reads the .java file according to this variable and then compiles it into a class file in UTF-8 form. An example:

public void test() throws IOException {
    String str = "你";
    FileWriter writer = new FileWriter("test.txt");
    writer.write(str);
    writer.close();
}

Example 3

If you compile with GB2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the characters are 00C4 00E3, in binary:

00000000 11000100 00000000 11100011

Because each character needs more than 7 bits, the 11-bit pattern is used:

11000011 10000100 11000011 10100011

C3 - 84 - C3 - A3

You will find C3 84 C3 A3 in the class file.

But we tend to ignore this parameter, and that often causes cross-platform problems:

Example 3 compiled on a Chinese platform produces ZhClass.
Example 3 compiled on an English platform produces EnClass.

1. ZhClass runs OK on the Chinese platform, but not on the English platform.
2. EnClass runs OK on the English platform, but not on the Chinese platform.

Reason:

1. After compiling on the Chinese platform, str at run time is the char[] 0x4F60. When it runs on the Chinese platform, FileWriter's default encoding is GB2312, so CharToByteConverter automatically uses the GB2312 converter to turn str into a byte[] and pass it to the FileOutputStream, and 0xC4, 0xE3 is written to the file. On the English platform, however, the default CharToByteConverter is 8859_1. FileWriter automatically uses 8859_1 to convert str, but 8859_1 cannot represent the character, so "?" is written.

2. After compiling on the English platform, str at run time is the char[] 0x00C4 0x00E3. On the Chinese platform these cannot be recognized as Chinese, so "?" appears. On the English platform, 0x00C4 -> 0xC4 and 0x00E3 -> 0xE3, so 0xC4, 0xE3 is written to the file.
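The four combinations can be simulated in one program. This is a sketch under stated assumptions rather than what javac and FileWriter literally do: decoding the source bytes with an explicit charset stands in for the -encoding step, and an OutputStreamWriter with an explicit charset stands in for FileWriter with that platform default. CompileRunDemo and its methods are hypothetical names, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CompileRunDemo {
    // "Compile": decode the bytes of the string literal with the compiler's encoding.
    static String compileLiteral(byte[] sourceBytes, String javacEncoding) {
        return new String(sourceBytes, Charset.forName(javacEncoding));
    }

    // "Run": write str through a Writer that uses the platform's encoding,
    // and return the bytes that would end up in the file, as hex.
    static String runAndDump(String str, String platformEncoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, Charset.forName(platformEncoding))) {
            w.write(str);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(out.toByteArray());
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xC4, (byte) 0xE3}; // "你" as saved in a GB2312 source file
        String zh = compileLiteral(source, "GB2312");     // char[] 0x4F60
        String en = compileLiteral(source, "ISO-8859-1"); // char[] 0x00C4 0x00E3

        // What the class file stores for each literal.
        System.out.println(hex(zh.getBytes(StandardCharsets.UTF_8))); // e4 bd a0
        System.out.println(hex(en.getBytes(StandardCharsets.UTF_8))); // c3 84 c3 a3

        System.out.println(runAndDump(zh, "GB2312"));     // c4 e3
        System.out.println(runAndDump(zh, "ISO-8859-1")); // 3f
        System.out.println(runAndDump(en, "GB2312"));     // 3f 3f
        System.out.println(runAndDump(en, "ISO-8859-1")); // c4 e3
    }
}
```

Only the combinations where compile-time and run-time encodings match reproduce the original bytes C4 E3 in the file.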

1. How JSP text is interpreted: Tomcat first checks whether your page contains the "<%@ page" directive. If it does, Tomcat calls response.setContentType(..) at that point and reads the file according to the given encoding; if not, it reads the file as 8859_1. It then writes a .java file in UTF-8, reads that file back with sun.tools.javac.Main (using UTF-8, of course), and compiles it into a class file. setContentType changes the properties of out; the default encoding of the out variable is 8859_1.

2. How parameters are interpreted: unfortunately, request parameters are only ever interpreted as ISO8859_1. This can be seen in the implementation code of the servlet.

3. How include is interpreted: very unfortunately, the person who wrote "org.apache.jasper.compiler.Parser" forgot to add one parameter, encoding, to the array JspUtil.ValidAttribute[], so specifying an encoding there is not supported. You can recompile the source code to add support for encoding.

To sum up:

If you are on NT, the easiest way is to fool Java and not set any encoding at all:

Hello <%= request.getParameter("value") %>
http://localhost/test/test.jsp?value=你

Result: Hello 你

However, this method is quite limited. For example, it breaks down when an uploaded article has to be split into segments. The best solution is this one:

<%@ page contentType="text/html; charset=GB2312" %>
Hello <%= new String(request.getParameter("value").getBytes("8859_1"), "GB2312") %>
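The re-decoding trick in the JSP above works outside a servlet container too. In this minimal sketch, ParameterRedecode and redecode are made-up names, the "parameter" is simulated by decoding the raw bytes as ISO-8859-1 the way the text says the servlet does, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRedecode {
    // Recover the original bytes from a string that was decoded as ISO-8859-1,
    // then decode them again with the encoding the client really used.
    static String redecode(String misdecoded, String realEncoding) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(realEncoding));
    }

    public static void main(String[] args) {
        byte[] sentByBrowser = {(byte) 0xC4, (byte) 0xE3}; // "你" in GB2312
        // What request.getParameter would hand back under the 8859_1 interpretation.
        String parameter = new String(sentByBrowser, StandardCharsets.ISO_8859_1);
        System.out.println(parameter.length());                               // 2
        System.out.println(redecode(parameter, "GB2312").equals("\u4f60"));   // true
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value 0x00 to 0xFF to the char with the same number, so getBytes returns exactly the bytes that arrived.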

When reposting, please cite the original address: https://www.9cbs.com/read-91054.html
