Character encoding summary

xiaoxiao  2021-03-14

File headers (byte order marks) of the various encodings: FF FE (Unicode, little endian), FE FF (Unicode big endian), EF BB BF (UTF-8).
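As a minimal sketch of how these headers can be checked, the following Java snippet inspects the first bytes of a buffer. The class name BomSniffer and the method detect are hypothetical names chosen for illustration. The sketch assumes only the three headers listed above matter and ignores UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original article.
class BomSniffer {
    /** Returns the charset implied by a leading byte order mark, or null if none matches. */
    static Charset detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        return null;                            // no recognized header
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample));     // prints UTF-8
    }
}
```

A file without a header is not necessarily in a legacy encoding: UTF-8 files are often written without EF BB BF, so a null result only means that no header was found.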

Unicode is also a character encoding scheme, but it was designed by an international organization to accommodate the scripts of all the world's languages. The Universal Multiple-Octet Coded Character Set is abbreviated UCS. Starting with Unicode 2.0, the Unicode project uses the same character repertoire and code points as ISO 10646-1.

1. Unicode: In the UCS-2 form, each character is encoded in two bytes, so it is a fixed-length encoding. The Unicode encoding of "汉" (Han) is "\u6C49", and the Unicode encoding of "a" is "\u0061". When Java source is compiled into a class file, the compiler converts the source text to Unicode, whether the source file is encoded in GBK or in UTF-8. The class file therefore holds the Unicode code of each character, not its source-file encoding.
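The two escapes above can be checked with a short Java sketch. The class and method names are hypothetical. The code uses only Unicode escapes, so its behavior does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of the original article.
class UnicodeEscapeDemo {
    /** Formats a 16-bit code unit as four uppercase hex digits. */
    static String hexCodeUnit(char c) {
        return String.format("%04X", (int) c);
    }

    public static void main(String[] args) {
        char han = '\u6C49';
        System.out.println(hexCodeUnit(han));   // prints 6C49
        System.out.println(hexCodeUnit('a'));   // prints 0061

        // As big-endian UTF-16, the character occupies exactly two bytes: 6C 49.
        byte[] twoBytes = String.valueOf(han).getBytes(StandardCharsets.UTF_16BE);
        System.out.println(twoBytes.length);    // prints 2
    }
}
```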

2. GB2312 and GBK: Chinese characters are encoded in two bytes and English characters in one byte. In both GB2312 and GBK, the code of "汉" (Han) is BABA. The GB2312 code can be obtained by adding A0 to both the high byte and the low byte of the location code.
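The "add A0" rule is plain arithmetic, sketched below in Java. The location code used here (row 26, cell 26) is an assumption derived by subtracting A0 from each byte of BABA; it is not given in the text. The class and method names are hypothetical.

```java
// Hypothetical demo class for illustration; not part of the original article.
class Gb2312FromLocation {
    /** Adds 0xA0 to the row and the cell of a location code to get the two GB2312 bytes. */
    static int[] toGb2312(int row, int cell) {
        return new int[] {row + 0xA0, cell + 0xA0};
    }

    public static void main(String[] args) {
        // Assumed location code: row 26, cell 26 (0x1A, 0x1A), inferred from BABA.
        int[] code = toGb2312(26, 26);
        System.out.printf("%02X%02X%n", code[0], code[1]);   // prints BABA
    }
}
```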

3.UTF-8:

UTF-8 encodes UCS in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx

A Unicode code below 0x0080 is converted to one UTF-8 byte. A code greater than or equal to 0x0080 and less than 0x0800 is converted to two bytes. A code greater than or equal to 0x0800 and less than or equal to 0xFFFF is converted to three bytes.

For example, the Unicode encoding of "汉" (Han) is 6C49. 6C49 lies between 0800 and FFFF, so it must use the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary and grouped as 4, 6, and 6 bits, 6C49 is 0110 110001 001001. Substituting these bits, in order, for the x's in the template gives 11100110 10110001 10001001, i.e., E6 B1 89.
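The template substitution can be written out with shifts and masks. The following Java code is a minimal sketch covering only the three ranges discussed here (0000 to FFFF). It does not reject surrogate values, and the names Utf8ByHand and encode are hypothetical. Its result is compared with Java's built-in UTF-8 encoder as a sanity check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration; not part of the original article.
class Utf8ByHand {
    /** Encodes one UCS-2 code (0x0000 to 0xFFFF) using the one-, two-, or three-byte template. */
    static byte[] encode(int code) {
        if (code < 0 || code > 0xFFFF) {
            throw new IllegalArgumentException("outside the UCS-2 range handled by this sketch");
        }
        if (code < 0x0080) {                        // 0xxxxxxx
            return new byte[] {(byte) code};
        }
        if (code < 0x0800) {                        // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (code >> 6)),
                (byte) (0x80 | (code & 0x3F))
            };
        }
        return new byte[] {                         // 1110xxxx 10xxxxxx 10xxxxxx
            (byte) (0xE0 | (code >> 12)),
            (byte) (0x80 | ((code >> 6) & 0x3F)),
            (byte) (0x80 | (code & 0x3F))
        };
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0x6C49);
        for (byte b : byHand) {
            System.out.printf("%02X ", b & 0xFF);   // prints E6 B1 89
        }
        System.out.println();
        byte[] builtIn = "\u6C49".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, builtIn));   // prints true
    }
}
```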

Reminder: UCS only specifies how characters are coded; it does not specify how the codes are transmitted or stored. For example, the UCS code of "汉" (Han) is 6C49. It can be transmitted as the four ASCII characters "6C49", or as the three consecutive UTF-8 bytes E6 B1 89. What matters is that both communicating parties agree on the scheme. UTF-8, UTF-7, and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with 7-bit ASCII: every ASCII character is encoded as the same single byte. UTF is an abbreviation for "UCS Transformation Format".
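To illustrate that one code can travel in several forms, this Java sketch prints the bytes of 6C49 as four ASCII characters, as UTF-8, and as big-endian UTF-16. The class TransferForms and the helper hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of the original article.
class TransferForms {
    /** Renders bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String han = "\u6C49";
        // The code written out as four ASCII characters.
        System.out.println(hex("6C49".getBytes(StandardCharsets.US_ASCII)));   // 36 43 34 39
        // The same character as UTF-8 and as big-endian UTF-16.
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_8)));         // E6 B1 89
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_16BE)));      // 6C 49
        // An ASCII character keeps its single-byte value in UTF-8.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_8)));         // 61
    }
}
```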

Code analysis:

1: String s = "汉"; byte[] bytes = s.getBytes(); for (int i = 0; i < bytes.length; i++) { System.out.println(Integer.toHexString(bytes[i] & 0xFF)); }

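2: String s = "汉"; byte[] bytes = null; try { bytes = s.getBytes("UTF-8"); } catch (UnsupportedEncodingException e) { /* UTF-8 is always supported */ } for (int i = 0; i < bytes.length; i++) { System.out.println(Integer.toHexString(bytes[i] & 0xFF)); }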

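Code 1 calls getBytes() without an argument, which uses the platform's default charset, so its output can differ from one platform to another. Code 2 gave the same result on the phone and on the PC: the length of the bytes array is 3, and the three bytes are E6, B1, 89, which is the correct UTF-8 encoding of the Chinese character.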
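On Java 7 and later, the same effect can be had without the checked exception by passing a Charset constant. This is a sketch of that alternative, not the article's original code. The class name ExplicitCharsetDemo and the method utf8Bytes are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of the original article.
class ExplicitCharsetDemo {
    /** Encodes with an explicit charset, so the result does not depend on the platform. */
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u6C49";
        // These two lines vary by platform, which is why code 1 is not portable.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("bytes with default charset: " + s.getBytes().length);

        // The explicit charset always yields the three bytes e6, b1, 89.
        for (byte b : utf8Bytes(s)) {
            System.out.println(Integer.toHexString(b & 0xFF));
        }
    }
}
```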

When reposting, please cite the original URL: https://www.9cbs.com/read-129377.html
