UTF-8 encoding

xiaoxiao, 2021-03-06

UTF-8

Section: Linux Programmer's Manual (7)

Updated: 1995-11-26

-------------------------------------------------- ------------------------------

Name

UTF-8 - an ASCII-compatible multibyte Unicode encoding

Description

The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes such as '\0' or '/' that have a special meaning in file names and in other C library function parameters. In addition, most UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode for file names, text files, environment variables, and so on. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies an even larger 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.

The UTF-8 encoding of Unicode and UCS does not have these problems. UTF-8 is therefore the obvious way to use the Unicode character set under UNIX-style operating systems.
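The byte-level problem described above can be seen in a short sketch. It assumes Python 3.8 or later and uses Python's "utf-16-be" codec as a stand-in for big-endian UCS-2 (the two agree for code numbers below 0x10000). The two sample characters are chosen only for illustration.

```python
# Compare big-endian UCS-2 (via the "utf-16-be" codec, an assumption that
# holds for code numbers below 0x10000) with UTF-8 for two sample characters.
samples = ["A", "\u012f"]  # U+0041 and U+012F, picked for illustration

for ch in samples:
    ucs2 = ch.encode("utf-16-be")
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  UCS-2: {ucs2.hex(' ')}  UTF-8: {utf8.hex(' ')}")

# "A" in UCS-2 is 00 41: the first byte is NUL, which ends a C string.
ucs2_has_nul = b"\x00" in "A".encode("utf-16-be")

# U+012F in UCS-2 is 01 2f: the second byte is the ASCII code of '/'.
ucs2_has_slash = b"/" in "\u012f".encode("utf-16-be")

# The UTF-8 forms of the same characters contain neither special byte.
utf8_bytes = "A\u012f".encode("utf-8")
utf8_is_clean = b"\x00" not in utf8_bytes and b"/" not in utf8_bytes
```

A C function that treats its argument as a NUL-terminated string would stop at the first byte of the UCS-2 form of "A", while the UTF-8 form passes through unchanged.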

Properties

The UTF-8 encoding has the following nice properties:

UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have exactly the same encoding under both ASCII and UTF-8.

All UCS characters greater than 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character, and there are no problems with special characters such as '\0' or '/'.

The lexicographic sorting order of UCS-4 strings is preserved.

All possible 2^31 UCS codes can be encoded using UTF-8.

The bytes 0xfe and 0xff are never used in the UTF-8 encoding.

The first byte of a multibyte sequence that represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long the sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

UTF-8 encoded UCS characters may be up to six bytes long, whereas Unicode characters can only be up to three bytes long. Because Linux uses only the 16-bit Unicode subset of UCS, UTF-8 multibyte sequences under Linux are never longer than three bytes.
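Several of these properties can be checked directly. The following is a minimal sketch, not a reference implementation. The helper names `sequence_length` and `resync` are hypothetical and exist only for this illustration.

```python
def sequence_length(lead: int) -> int:
    """Length implied by a first byte, following the 1- to 6-byte scheme."""
    if lead < 0x80:
        return 1  # plain ASCII byte
    if lead < 0xC0:
        raise ValueError("continuation byte, not a first byte")
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    raise ValueError("0xfe and 0xff never occur in UTF-8")


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


text = "a\u00a9\u2260"       # 'a', copyright sign, not-equal sign
data = text.encode("utf-8")  # 61 c2 a9 e2 89 a0

# The first byte of each character announces the sequence length.
lengths = [sequence_length(data[0]), sequence_length(data[1]),
           sequence_length(data[3])]

# Simulate a missing byte: the stream now starts inside the copyright sign.
damaged = data[2:]           # a9 e2 89 a0
start = resync(damaged, 0)   # skip the orphaned continuation byte
recovered = damaged[start:].decode("utf-8")

# Sorting the encoded byte strings gives the same order as sorting by
# code number (Python compares str values by code point).
words = ["z", "\u00e9", "a", "\u2260", "\u00a9"]
by_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
order_preserved = by_bytes == sorted(words)
```

Only the damaged character is lost. Decoding resumes at the next first byte without any state carried over from earlier input.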

Encoding

The following byte sequences are used to represent a character. The sequence to use depends on the UCS code number of the character:

0x00000000 - 0x0000007F:

0xxxxxxx

0x00000080 - 0x000007FF:

110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:

1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:

1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multibyte sequence that can represent the code number of the character may be used.
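The table can be turned into a small encoder. This is a sketch under the assumption that the full 31-bit range described above is wanted. The function name `utf8_encode` is hypothetical. Python's built-in codec is used only as a cross-check, and it accepts code numbers up to 0x10FFFF only.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code number using the 1- to 6-byte table."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])  # 0xxxxxxx

    # (upper limit of the range, total length, marker bits of the first byte)
    table = [
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x
    ]
    # Picking the first range that fits yields the shortest sequence.
    for limit, length, marker in table:
        if code <= limit:
            break

    out = bytearray()
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3F))  # 10xxxxxx, lowest six bits first
        code >>= 6
    out.append(marker | code)             # remaining bits go in the first byte
    out.reverse()
    return bytes(out)


# Cross-check against Python's codec at the range boundaries it supports.
for c in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(c) == chr(c).encode("utf-8")
```

Because the ranges are tried from shortest to longest, the encoder cannot produce an overlong sequence.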

Examples

The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xc2 0xa9

The character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
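Reading the two examples in the opposite direction, a decoder strips the marker bits and concatenates the remaining payload bits. The sketch below is limited to sequences of one to three bytes, which is an assumption made to keep it short. The function name `utf8_decode_one` is hypothetical.

```python
def utf8_decode_one(seq: bytes) -> int:
    """Decode one complete UTF-8 sequence (1 to 3 bytes) to its code number."""
    lead = seq[0]
    if lead < 0x80:
        length, code = 1, lead          # 0xxxxxxx
    elif 0xC0 <= lead < 0xE0:
        length, code = 2, lead & 0x1F   # 110xxxxx
    elif 0xE0 <= lead < 0xF0:
        length, code = 3, lead & 0x0F   # 1110xxxx
    else:
        raise ValueError("first byte is invalid or not handled by this sketch")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this first byte")

    for b in seq[1:]:
        if not 0x80 <= b <= 0xBF:
            raise ValueError("expected a continuation byte (10xxxxxx)")
        code = (code << 6) | (b & 0x3F)

    # Enforce the shortest-sequence rule: reject overlong forms.
    minimum = (0x00, 0x80, 0x800)[length - 1]
    if code < minimum:
        raise ValueError("overlong sequence")
    return code


for seq in (b"\xc2\xa9", b"\xe2\x89\xa0"):
    bits = " ".join(f"{b:08b}" for b in seq)
    print(f"{bits} -> 0x{utf8_decode_one(seq):04x}")
```

The overlong check matters for the '/' guarantee: the two-byte sequence 0xc0 0xaf carries the payload bits of '/' (0x2f), and rejecting it keeps '/' representable only as the single byte 0x2f.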

Please credit the original source when reposting: https://www.9cbs.com/read-102143.html
