Exploring and understanding character encoding in Java

xiaoxiao, 2021-03-06

Today I finally sorted out one of the bigger headaches in Java, character encoding, so I wrote this article to mark the occasion and to give everyone a small reference.

As we all know, Java stores characters as Unicode so that it works internationally. Unicode itself is only a character set. The encoding formats that define how its characters are stored and represented are the UTF-8, UTF-16, and so on that we often see. UTF-8 is the most common, so people often treat it as a synonym for Unicode (I used to do this too). That is harmless in some situations, but in Java this understanding leads to confusion. The small program below demonstrates why.

Define a string

String name = "堂";

This string contains a single character. Take it out.

char c_name = name.charAt(0);

The char type in Java is 16 bits (two bytes). If UTF-8 were used, a character would not necessarily take two bytes (UTF-8 is a variable-length encoding), so Java itself cannot be storing characters as UTF-8. Words prove nothing, though, so let's test it. First, look at what is stored in the char.

int low = c_name & 0xff; // low byte of c_name
int high = (c_name >> 8) & 0xff; // high byte of c_name
// %02X prints each byte as two hex digits; Integer.toHexString would drop the leading zero
System.out.println(String.format("%02X %02X", high, low));

The result is 58 02
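For readers who want to run this step on its own, here is a minimal self-contained sketch. The class name CharBytes and the helpers highByte and lowByte are made up for illustration and are not part of the original program. The character is written as the escape \u5802 so that the result does not depend on the encoding of the source file.

```java
public class CharBytes {
    // High-order 8 bits of a 16-bit char.
    public static int highByte(char c) {
        return (c >> 8) & 0xFF;
    }

    // Low-order 8 bits of a 16-bit char.
    public static int lowByte(char c) {
        return c & 0xFF;
    }

    public static void main(String[] args) {
        String name = "\u5802"; // the character used in this article
        char c = name.charAt(0);
        // Two hex digits per byte, so 0x02 is printed as 02 rather than 2.
        System.out.println(String.format("%02X %02X", highByte(c), lowByte(c)));
    }
}
```

The masks matter because a char is promoted to int before shifting; `& 0xFF` keeps only the eight bits of interest.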

Only two bytes (16 bits). So what does the real UTF-8 encoding of this character look like? Let's take a look.

For convenience I wrote a helper method, printByte, which prints each element of a byte array in hexadecimal. Also for convenience, I made it a static method.

public static void printByte(byte[] bt)
{
    for (int i = 0; i < bt.length; i++)
    {
        int hex = bt[i] & 0xFF; // mask so the byte is treated as unsigned
        System.out.print(String.format("%02X ", hex)); // two hex digits per byte
    }
    System.out.println("Length = " + bt.length);
}

byte[] utf_8 = name.getBytes("UTF-8"); // may throw the checked UnsupportedEncodingException
printByte(utf_8);

The result is E5 A0 82 Length = 3
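The three bytes above are one instance of UTF-8 being a variable-length encoding. As a rough sketch (the class Utf8Lengths and the method utf8Length are illustrative names, not from the original program), the following counts the UTF-8 bytes for three sample characters from different ranges:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",      // U+0041, plain ASCII
            "\u00E9", // U+00E9, Latin small letter e with acute
            "\u5802"  // U+5802, the character used in this article
        };
        for (String s : samples) {
            String codePoint = Integer.toHexString(s.codePointAt(0)).toUpperCase();
            System.out.println("U+" + codePoint + " -> " + utf8Length(s) + " byte(s)");
        }
    }
}
```

Each of these samples is a single char in Java, yet they take one, two, and three bytes in UTF-8, which is why a fixed 16-bit char cannot be holding UTF-8 data. Passing a StandardCharsets constant instead of a charset name also avoids the checked UnsupportedEncodingException.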

Wow, three bytes! So Java really is not using UTF-8 internally. What does it use, then? UTF-16? Let's look.

byte[] utf_16 = name.getBytes("UTF-16");
printByte(utf_16);

The result is FE FF 58 02 Length = 4. Whoa, four bytes. Huh? Aren't the last two bytes exactly the same as the hexadecimal value of c_name from the beginning? It seems that Java's real internal encoding is closely related to UTF-16. So which encoding is actually used inside Java? I searched for a long time, and later I saw an example using UTF-16BE in chapter 12 of Thinking in Java, 3rd edition. Could that be it?

byte[] utf_16be = name.getBytes("UTF-16BE");
printByte(utf_16be);

The result came out: 58 02 Length = 2
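One way to double-check that these two bytes really are the char is to go in the other direction and rebuild the char from them. This is only a sketch; the class Utf16BeRoundTrip and the method fromBigEndian are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BeRoundTrip {
    // Combine two bytes, high byte first, into one 16-bit char.
    public static char fromBigEndian(byte high, byte low) {
        return (char) (((high & 0xFF) << 8) | (low & 0xFF));
    }

    public static void main(String[] args) {
        String name = "\u5802"; // the character used in this article
        byte[] be = name.getBytes(StandardCharsets.UTF_16BE);
        char rebuilt = fromBigEndian(be[0], be[1]);
        // Compare the rebuilt char with the one stored in the String.
        System.out.println(rebuilt == name.charAt(0));
    }
}
```

If the comparison prints true, the UTF-16BE bytes carry exactly the 16-bit value held in the char, with the high byte first.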

Haha, got it! Exactly two bytes, no more and no less, and the content is the same. So that is it after all. I also noticed that there is a UTF-16LE variant. I believe BE and LE stand for big-endian and little-endian.

OK, that settles the question of Java's character encoding. This is my first original article, so it may be rather superficial. If anything here is wrong, please do not hesitate to correct me. Thank you.
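To see the byte-order variants side by side, here is a small sketch that encodes the same character with the three UTF-16 charsets that the JDK exposes as StandardCharsets constants. The class Utf16Variants and its hex helper are illustrative names, not part of the original program.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Variants {
    // Format bytes as two-digit uppercase hex values separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u5802"; // the character used in this article
        System.out.println("UTF-16   : " + hex(name.getBytes(StandardCharsets.UTF_16)));   // FE FF 58 02
        System.out.println("UTF-16BE : " + hex(name.getBytes(StandardCharsets.UTF_16BE))); // 58 02
        System.out.println("UTF-16LE : " + hex(name.getBytes(StandardCharsets.UTF_16LE))); // 02 58
    }
}
```

The extra FE FF in the plain UTF-16 result is the byte order mark that Java's UTF-16 encoder writes before big-endian data. The BE and LE charsets fix the byte order in their names and write no mark, which is why they produce only two bytes.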

When reposting, please credit the original URL: https://www.9cbs.com/read-75701.html
