UTF8 encoding

xiaoxiao, 2021-03-06

UTF-8

Section: Linux Programmer's Manual (7)

Updated: 1995-11-26

-------------------------------------------------- ------------------------------

Name

UTF-8 - an ASCII-compatible multi-byte Unicode encoding

Description

The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes such as '\0' or '/' that have a special meaning in file names and other C library function parameters. In addition, most UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables, and so on. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies an even larger 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems. The UTF-8 encoding of Unicode and UCS does not have these problems, which makes it the natural choice for using the Unicode character set under UNIX-style operating systems.
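To make the problem with 16-bit words concrete, the following sketch compares the bytes of two characters in a 16-bit encoding and in UTF-8. It assumes Python's "utf-16-be" codec as a stand-in for UCS-2 (the two agree for the 16-bit characters used here); the helper names are invented for this illustration.

```python
# Bytes that Unix treats specially in file names and C strings.
NUL, SLASH = 0x00, 0x2F


def special_bytes(data: bytes) -> list:
    """Return the NUL and '/' byte values that occur in data."""
    return [b for b in data if b in (NUL, SLASH)]


def show(data: bytes) -> str:
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


letter_a = "A"        # U+0041: the high byte of the 16-bit word is 0x00
yin_yang = "\u262f"   # U+262F: the low byte of the 16-bit word is 0x2F ('/')

for ch in (letter_a, yin_yang):
    wide = ch.encode("utf-16-be")   # stand-in for UCS-2 (assumption)
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  16-bit: {show(wide)}  special: {special_bytes(wide)}")
    print(f"        UTF-8:  {show(utf8)}  special: {special_bytes(utf8)}")
```

In the 16-bit form, "A" carries a NUL byte and U+262F carries a '/' byte, while neither UTF-8 form contains any special byte.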

Properties

The UTF-8 encoding has the following useful properties:

*

UCS characters 0x00000000 to 0x0000007F (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

*

All UCS characters greater than 0x7F are encoded as multi-byte sequences consisting only of bytes in the range 0x80 to 0xFD, so no ASCII byte can appear as part of another character, and there are no problems with special characters such as '\0' or '/'.

*

The lexicographic sorting order of UCS-4 strings is preserved.

*

All 2^31 possible UCS codes can be encoded using UTF-8.

*

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

*

The first byte of a multi-byte sequence that represents a single non-ASCII UCS character is always in the range 0xC0 to 0xFD and indicates how long the sequence is. All further bytes of the sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

*

UTF-8 encoded UCS characters can be up to six bytes long, whereas Unicode characters can be at most three bytes long. Because Linux uses only the 16-bit Unicode subset of UCS, UTF-8 multi-byte sequences under Linux are never longer than three bytes.
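Several of the properties listed above can be checked directly. The sketch below is only an illustration: sequence_length and resync are hypothetical helper names, and Python's built-in "utf-8" codec is assumed as the encoder (it stops at four-byte sequences, while sequence_length follows the one-to-six-byte scheme described in this page).

```python
def sequence_length(lead: int) -> int:
    """Length announced by the first byte of a sequence (1 to 6 bytes)."""
    if lead < 0x80:
        return 1                      # plain ASCII byte
    if lead < 0xC0:
        raise ValueError("continuation byte, not a first byte")
    if lead > 0xFD:
        raise ValueError("0xFE and 0xFF never occur in UTF-8")
    length, mask = 0, 0x80
    while lead & mask:                # count the leading 1 bits
        length += 1
        mask >>= 1
    return length


def resync(data: bytes, pos: int) -> int:
    """Skip continuation bytes (0x80 to 0xBF) until a character start."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


samples = ["abc", "ab\u00e9", "z", "\u00a9x", "\u2260"]

# ASCII compatibility: pure ASCII text is byte-for-byte identical.
ascii_same = "plain text".encode("utf-8") == "plain text".encode("ascii")

# Every byte of a multi-byte sequence is 0x80 or above.
high_only = all(b >= 0x80 for b in "\u2260".encode("utf-8"))

# Sorting by code number and sorting by UTF-8 bytes give the same order.
by_code = sorted(samples)             # Python compares str by code point
by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# Resynchronization: drop the first two bytes of "a", U+2260, "b".
damaged = "a\u2260b".encode("utf-8")[2:]          # 89 A0 62
recovered = damaged[resync(damaged, 0):].decode("utf-8")

print(ascii_same, high_only, by_code == by_bytes, repr(recovered))
```

After the damaged prefix is skipped, decoding resumes cleanly at "b", which is the resynchronization behaviour the list describes.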

Encoding

The following byte sequences are used to represent a character. The sequence to use depends on the UCS code number of the character:

0x00000000 - 0x0000007F:

0xxxxxxx

0x00000080 - 0x000007FF:

110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:

1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:

1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character's code number in binary representation. Only the shortest multi-byte sequence that can represent the code number may be used.
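The table above can be turned into a small encoder. This is a minimal sketch, not a reference implementation: encode_ucs is a name invented here, and it follows the one-to-six-byte scheme of this page, which covers the full 31-bit code space.

```python
# (highest code number, sequence length, first-byte marker) per table row
ROWS = (
    (0x000007FF, 2, 0xC0),   # 110xxxxx
    (0x0000FFFF, 3, 0xE0),   # 1110xxxx
    (0x001FFFFF, 4, 0xF0),   # 11110xxx
    (0x03FFFFFF, 5, 0xF8),   # 111110xx
    (0x7FFFFFFF, 6, 0xFC),   # 1111110x
)


def encode_ucs(code: int) -> bytes:
    """Encode one UCS code number using the shortest possible sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS code number must fit in 31 bits")
    if code <= 0x7F:
        return bytes([code])              # 0xxxxxxx
    # Pick the first (shortest) row whose range contains the code number.
    for limit, length, marker in ROWS:
        if code <= limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F)) # 10xxxxxx, lowest six bits first
        code >>= 6
    # The remaining high bits go into the first byte.
    return bytes([marker | code] + tail[::-1])


print(encode_ucs(0xA9).hex())     # c2a9
print(encode_ucs(0x2260).hex())   # e289a0
```

Because the rows are tried from shortest to longest, the function always picks the shortest sequence that can hold the code number, as the rule above requires.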

Examples

The Unicode character 0xA9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

The character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
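The two examples can also be read in the other direction. The following sketch decodes a single sequence back to its code number and rejects sequences that are longer than necessary; decode_one is a hypothetical helper written for this illustration, following the one-to-six-byte scheme of this page.

```python
# Smallest code number that needs a sequence of 1, 2, ... 6 bytes.
MINIMUM = (0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)


def decode_one(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence to its UCS code number."""
    lead = data[0]
    if lead < 0x80:
        length, code = 1, lead
    elif 0xC0 <= lead <= 0xFD:
        length = 2
        while lead & (0x80 >> length):    # count the leading 1 bits
            length += 1
        code = lead & (0x7F >> length)    # payload bits of the first byte
    else:
        raise ValueError("invalid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this first byte")
    for byte in data[1:]:
        if not 0x80 <= byte <= 0xBF:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (byte & 0x3F)
    if code < MINIMUM[length - 1]:
        raise ValueError("not the shortest possible sequence")
    return code


print(hex(decode_one(bytes([0xC2, 0xA9]))))         # 0xa9
print(hex(decode_one(bytes([0xE2, 0x89, 0xA0]))))   # 0x2260
```

A sequence such as 0xC1 0x81, which would spell the ASCII letter "A" in two bytes, is rejected because it is not the shortest form.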
