Solutions and experience for Chinese garbled-text problems in Java (1)

xiaoxiao 2021-03-06

First, bytes and Unicode

Java is Unicode at its core, and even class files are, but many media, including files and streams, store data as bytes. Java therefore has to convert these byte streams. A char is Unicode, while a byte is just a byte. The functions that convert between byte and char live in the sun.io package. The ByteToCharConverter class acts as the dispatcher and can tell you which converter you are using. Two of its most commonly used static functions are:

    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);

If you don't specify a converter, the system automatically uses the platform's current encoding: GBK on a GB (Simplified Chinese) platform, 8859_1 on an English platform.

byte -> char:

The GB code of "你" ("you") is 0xC4E3; its Unicode value is 0x4F60.

    String encoding = "GB2312";
    // The two GB bytes of "你"
    byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char c[] = converter.convertAll(b);
    for (int i = 0; i < c.length; i++) {
        System.out.println(Integer.toHexString(c[i]));
    }

What is the result? 0x4f60

If encoding = "8859_1", what is the result? 0x00c4, 0x00e3

If the code is changed to:

    byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
    // No encoding given: use the platform default converter
    ByteToCharConverter converter = ByteToCharConverter.getDefault();
    char c[] = converter.convertAll(b);
    for (int i = 0; i < c.length; i++) {
        System.out.println(Integer.toHexString(c[i]));
    }

What will the result be?

That depends on the platform's encoding.
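The sun.io converters are internal JDK classes and may be missing from the JDK you run. As a rough sketch, the byte -> char results above can be reproduced with the public java.nio.charset API instead. The class and method names (ByteToCharDemo, decodeToHex) are made up for illustration, and the code assumes the runtime ships the GB2312 charset.

```java
import java.nio.charset.Charset;

class ByteToCharDemo {
    // Decode raw bytes with the named charset and return each char as hex,
    // separated by spaces.
    static String decodeToHex(byte[] bytes, String encoding) {
        String decoded = new String(bytes, Charset.forName(encoding));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(Integer.toHexString(decoded.charAt(i)));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodeToHex(b, "GB2312"));     // 4f60
        System.out.println(decodeToHex(b, "ISO-8859-1")); // c4 e3
    }
}
```

Passing the charset explicitly keeps the result independent of the platform default.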

char -> byte:

    String encoding = "GB2312";
    char c[] = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte b[] = converter.convertAll(c);
    for (int i = 0; i < b.length; i++) {
        // Mask with 0xff so a negative byte is not sign-extended to ffffffc4
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }

What is the result? 0xc4, 0xe3

If encoding = "8859_1", what is the result? 0x3f, which is the "?" character, because 8859_1 cannot represent this character.

If the code is changed to:

    String encoding = "GB2312"; // no longer used: the default converter ignores it
    char c[] = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getDefault();
    byte b[] = converter.convertAll(c);
    for (int i = 0; i < b.length; i++) {
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }

What will the result be? Again, it depends on the platform's encoding.
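The char -> byte direction can be sketched the same way with String.getBytes and an explicit charset, which is a public API rather than the sun.io converter used above. CharToByteDemo and encodeToHex are hypothetical names, and the GB2312 charset is assumed to be installed in the runtime.

```java
import java.nio.charset.Charset;

class CharToByteDemo {
    // Encode text with the named charset and return each byte as hex,
    // separated by spaces.
    static String encodeToHex(String text, String encoding) {
        byte[] bytes = text.getBytes(Charset.forName(encoding));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            // Mask with 0xff so negative bytes are not sign-extended.
            hex.append(Integer.toHexString(bytes[i] & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeToHex("\u4f60", "GB2312"));     // c4 e3
        System.out.println(encodeToHex("\u4f60", "ISO-8859-1")); // 3f
    }
}
```

The 3f in the second case is the replacement byte "?" that the encoder substitutes for a character the target charset cannot represent.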

Many Chinese-text problems stem from these two very simple classes. To make things worse, many classes do not accept an encoding argument directly, which is inconvenient. Many programs rarely specify an encoding at all and simply use the default one, which causes us a lot of difficulty.
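To illustrate the difference between relying on the default encoding and naming one, here is a small sketch. DefaultEncodingDemo and its helper methods are invented for this example; only Charset.defaultCharset, String.getBytes and StandardCharsets come from the standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultEncodingDemo {
    // getBytes() without an argument uses the platform default charset,
    // so its output can differ from one machine to another.
    static boolean implicitMatchesDefault(String text) {
        byte[] implicit = text.getBytes();
        byte[] viaDefault = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, viaDefault);
    }

    // With an explicit charset the output is the same on every platform.
    static int explicitUtf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u4f60";
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        System.out.println(implicitMatchesDefault(text)); // true
        System.out.println(explicitUtf8Length(text));     // 3
    }
}
```

The first line of output varies by platform, which is exactly why code that omits the encoding behaves differently when it is moved to another machine.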

Second, UTF-8

UTF-8 maps to Unicode in a very simple way:

7-bit Unicode: 0 _ _ _ _ _ _ _

11-bit Unicode: 1 1 0 _ _ _ _ _ 1 0 _ _ _ _ _ _

16-bit Unicode: 1 1 1 0 _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

21-bit Unicode: 1 1 1 1 0 _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _ 1 0 _ _ _ _ _ _

In most cases only Unicode values of 16 bits or fewer are used:

The GB code of "你" ("you") is 0xC4E3; its Unicode value is 0x4F60.

0xC4E3 in binary:

1100, 0100, 1110, 0011

Since there are only two bytes, we try to lay them out as a two-byte encoding, but this does not work: the second byte would have to start with 1 0, and its 7th bit is not 0. Therefore "?" is returned.
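A quick way to check this is a strict UTF-8 decoder that reports malformed input instead of replacing it. The following is a sketch only: StrictUtf8Demo and isValidUtf8 are hypothetical names, and a lenient decoder such as new String(bytes, StandardCharsets.UTF_8) would substitute a replacement character rather than fail.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // Returns true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder rejected the byte sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        // GB bytes: 0xC4 opens a two-byte sequence, but 0xE3 is not a
        // continuation byte of the form 10xxxxxx.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC4, (byte) 0xE3})); // false
        // A plain 7-bit byte is valid UTF-8 on its own.
        System.out.println(isValidUtf8(new byte[] {0x41})); // true
    }
}
```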

0x4F60 in binary:

0100, 1111, 0110, 0000

Filling these 16 bits into the three-byte UTF-8 pattern gives:

1110, 0100, 1011, 1101, 1010, 0000

E4 - BD - A0

The result is: 0xE4, 0xBD, 0xA0.
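The bit-filling above can be written out as a few shifts and masks. This is a minimal sketch limited to values below 0x10000 (it does not handle the 21-bit pattern or surrogates), and Utf8ByHand and encode are names invented for the example. The main method compares the hand-built bytes with what the standard library produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encode one Unicode value below 0x10000 following the 7-, 11- and
    // 16-bit patterns. Larger values and surrogates are not handled.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),           // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx
            };
        } else {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),          // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx
            };
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4F60);
        byte[] library = "\u4f60".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
        for (byte value : manual) {
            System.out.println(Integer.toHexString(value & 0xff)); // e4, bd, a0
        }
    }
}
```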

Third, String and byte[]

At its core a String is really a char[]; to turn bytes into a String, however, they must be decoded with some encoding. String.length() is really the length of the char array, so if a different encoding is used the text is likely to be split in the wrong places, producing broken characters and garbled text. For example:

    String encoding = ""; // set to "8859_1" or "GB2312"
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);

If encoding = 8859_1, the string has two characters, but with encoding = GB2312 it has only one. This problem often shows up when handling paging.
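The length difference is easy to observe directly. In this sketch StringLengthDemo and decodedLength are hypothetical names, and the GB2312 charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;

class StringLengthDemo {
    // Number of chars obtained when the bytes are decoded with the
    // named charset.
    static int decodedLength(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding)).length();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodedLength(b, "ISO-8859-1")); // 2
        System.out.println(decodedLength(b, "GB2312"));     // 1
    }
}
```

Code that cuts text by length, for instance to fill a page, therefore has to decode with the right encoding before counting.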

Fourth, Reader, Writer / InputStream, OutputStream

Reader and Writer work on chars at their core, while InputStream and OutputStream work on bytes. The main purpose of a Reader or Writer, however, is to read or write chars through an InputStream or OutputStream. For example:

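The file test.txt contains only the single character "你" ("you"): 0xc4, 0xe3

    String encoding = "GB2312";
    // Wrap the byte stream in a Reader that decodes with the given encoding
    InputStreamReader reader = new InputStreamReader(new FileInputStream("test.txt"), encoding);
    char c[] = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++) {
        System.out.println(c[i]);
    }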

What is the result? It is "你". If encoding = "8859_1", what is the result? Two characters, "??", meaning the text was not recognized. The reverse example is left for you to try.
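A self-contained version of this experiment might look like the sketch below. It writes the two bytes to a temporary file instead of relying on an existing test.txt; ReaderDemo, readAll and roundTrip are names made up for illustration, and the GB2312 charset is assumed to be available. With ISO-8859-1 the two chars read are U+00C4 and U+00E3, which a console that cannot display them may well print as "??".

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderDemo {
    // Read the whole file as text, decoding with the named charset.
    static String readAll(Path file, String encoding) throws IOException {
        try (Reader reader = new InputStreamReader(
                Files.newInputStream(file), Charset.forName(encoding))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[10];
            int length;
            while ((length = reader.read(buffer)) != -1) {
                text.append(buffer, 0, length);
            }
            return text.toString();
        }
    }

    // Write the bytes to a temporary file, read them back as text,
    // then delete the file.
    static String roundTrip(byte[] content, String encoding) {
        try {
            Path file = Files.createTempFile("test", ".txt");
            try {
                Files.write(file, content);
                return readAll(file, encoding);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(roundTrip(content, "GB2312").length());     // 1
        System.out.println(roundTrip(content, "ISO-8859-1").length()); // 2
    }
}
```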

When reprinting, please cite the original address: https://www.9cbs.com/read-57315.html
