About UTF8, UTF16, UTF32, UTF16-LE, UTF16-BE

xiaoxiao2021-03-06 129

Unicode is the character coding standard maintained by Unicode.org, and it is now supported by most operating systems and programming languages. Unicode.org's official description is: "Unicode provides a unique number for every character." In other words, Unicode exists to define one number for each character. For example, the Unicode value of "a" is 0x0061 and the Unicode value of "一" (the Chinese character for "one") is 0x4E00. This is the simplest case, where each character is represented by 2 bytes.

Unicode.org defines more than 100,000 characters, so representing every character in one fixed-width format takes 4 bytes. The representation of "a" then becomes 0x00000061 and that of "一" becomes 0x00004E00. This is UTF32, the Unicode scheme used for wide characters on the Linux operating system.

On closer analysis, however, most characters can be represented in only 2 bytes. The English range is 0x0000-0x007F and the Chinese range is 0x4E00-0x9F**. Few characters really need 4 bytes, so some systems use 2 bytes to represent Unicode. On Windows, for example, Unicode is two bytes. Characters that need 4 bytes are expanded with surrogates: a 2-byte unit is tagged as a surrogate, which means it must be joined with a second 2-byte unit to form one character. The benefit is a large saving in storage space, which also improves processing speed. This representation is UTF16. On the Windows platform, "Unicode" generally means UTF16.

UTF16-LE and UTF16-BE are related to the CPU architecture: LE stands for little endian and BE for big endian, and there are many posts online on the subject. The common x86 systems are little endian, so on them UTF16 can be treated as UTF16-LE. In Europe and North America, the code points actually in use mostly fall in 0x0000-0x00FF, so a single byte would be enough to represent every character.
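The fixed-width, surrogate, and byte-order points above can be checked with a short Python sketch. The helper name utf16_units is made up for this illustration, and the emoji U+1F600 is only an example of a character outside the 2-byte range; neither comes from the original post. The built-in codec names ("utf-32-be", "utf-16-be", "utf-16-le") are standard Python.

```python
def utf16_units(code_point):
    """Return the 16-bit UTF16 code units for one code point.

    Hypothetical helper for illustration only. It does not validate
    its input (for example, it accepts values above 0x10FFFF).
    """
    if code_point < 0x10000:
        # Fits in a single 2-byte unit.
        return [code_point]
    # Needs a surrogate pair: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # tagged as a high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # tagged as a low surrogate
    return [high, low]


for ch in ["a", "一", "\U0001F600"]:
    cp = ord(ch)
    print(
        f"U+{cp:04X}",
        "UTF32-BE:", ch.encode("utf-32-be").hex(),
        "UTF16-BE:", ch.encode("utf-16-be").hex(),
        "UTF16-LE:", ch.encode("utf-16-le").hex(),
        "units:", [hex(u) for u in utf16_units(cp)],
    )
```

For "a" and "一" the UTF16 output is one 2-byte unit, with the two bytes swapped between the BE and LE forms. For U+1F600 the helper returns two units, 0xd83d and 0xde00, which is the surrogate pair the built-in codec also produces.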
Even with UTF16 as the in-memory format, that is a huge waste of space, which is why the UTF8 encoding exists. It is a very flexible encoding. Characters that need only one byte (the ASCII range 0x00-0x7F) use one byte. Characters such as Chinese, Japanese, and Korean, which take two bytes in UTF16, are converted to and from UTF8 by a UTF16-UTF8 algorithm and generally take 3 bytes. For characters that need 4 bytes in UTF16, UTF8 as originally designed could extend to 6 bytes per character, but the current standard (RFC 3629) limits it to 4 bytes, which is enough for every Unicode code point.
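To make the variable-length idea concrete, here is a minimal Python sketch of a UTF8 encoder for a single code point. The function name utf8_encode is hypothetical and the code is not taken from the original post. It follows the byte layout in RFC 3629, where the longest sequence is 4 bytes, and the sample characters are chosen only for illustration.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF8 bytes (RFC 3629 layout, 1 to 4 bytes).

    Hypothetical helper for illustration; Python's str.encode("utf-8")
    is what real code should use.
    """
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # ASCII: one byte, unchanged.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (most CJK characters)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["a", "é", "一", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex())
```

The sample characters come out as 1, 2, 3, and 4 bytes respectively, and each result matches what the built-in ch.encode("utf-8") returns.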

When reposting, please credit the original URL: https://www.9cbs.com/read-98325.html
