UTF-8
Section: Linux Programmer's Manual (7)
Updated: 1995-11-26
-------------------------------------------------- ------------------------------
Name
UTF-8 - an ASCII-compatible multi-byte Unicode encoding
Description
The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes such as '\0' or '/' that have a special meaning in file names and in other C library function parameters. In addition, most UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode for file names, text files, environment variables, and so on. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies an even larger 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.
The UTF-8 encoding of Unicode and UCS does not have these problems. UTF-8 is therefore the natural choice for using the Unicode character set under UNIX-style operating systems.
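The problem with 16-bit words can be made concrete with a small sketch. It assumes that each 16-bit code is stored high byte first (the description does not fix a byte order), and the helper names ucs2_put_be and count_special_bytes are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store one 16-bit code as two bytes, high byte first
 * (the byte order is an assumption made for this sketch). */
static void ucs2_put_be(unsigned int code, unsigned char out[2])
{
    assert(code <= 0xFFFF);
    out[0] = (unsigned char)(code >> 8);
    out[1] = (unsigned char)(code & 0xFF);
}

/* Count the bytes in buf that equal '\0' or '/', the two bytes that are
 * special in file names and C strings. */
static size_t count_special_bytes(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] == '\0' || buf[i] == '/')
            n++;
    return n;
}
```

With this layout, the code 0x0041 (the letter A) becomes the bytes 0x00 0x41, so a C string function would stop at the first byte, and the code 0x012F becomes 0x01 0x2F, whose second byte is indistinguishable from '/'.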
Properties
The UTF-8 encoding has the following useful properties:
*
UCS characters 0x00000000 to 0x0000007F (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings containing only 7-bit ASCII characters have exactly the same encoding under both ASCII and UTF-8.
*
All UCS characters greater than 0x7F are encoded as multi-byte sequences consisting only of bytes in the range 0x80 to 0xFD. No ASCII byte can therefore appear as part of another character, and special bytes such as '\0' and '/' cause no problems.
*
The lexicographic sorting order of UCS-4 strings is preserved.
*
All 2^31 possible UCS codes can be encoded using UTF-8.
*
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
*
The first byte of a multi-byte sequence that represents a non-ASCII UCS character is always in the range 0xC0 to 0xFD and indicates how long the sequence is. All further bytes of the sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
*
UTF-8 encoded UCS characters can be up to six bytes long, whereas Unicode characters can be at most three bytes long. Since Linux uses only the 16-bit Unicode subset of UCS, UTF-8 multi-byte sequences under Linux are never longer than three bytes.
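The first-byte and resynchronization properties can be sketched in C as follows. This is a minimal illustration rather than a library interface: the function names utf8_seq_len and utf8_resync are hypothetical, and the byte ranges are taken directly from the list above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the sequence length announced by a first byte (1 to 6), or 0 if
 * the byte is a continuation byte (0x80..0xBF) or one of the unused bytes
 * 0xFE and 0xFF.
 */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* plain ASCII */
    if (b < 0xC0) return 0;   /* continuation byte, not a first byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 0;                 /* 0xFE and 0xFF never occur */
}

/*
 * Starting at index pos, skip continuation bytes and return the index of
 * the next byte that is not a continuation byte (or len if none is left).
 */
static size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    while (pos < len && (s[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

Because a reader that lands in the middle of a character only has to skip bytes of the form 10xxxxxx, no state from earlier in the stream is needed to find the next character boundary.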
Encoding
The following byte sequences are used to represent a character. The sequence to use depends on the UCS code number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character's code number in binary representation. Only the shortest multi-byte sequence that can represent the code number may be used.
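The table above translates fairly directly into code. The following is a minimal sketch of an encoder, assuming the caller supplies a buffer of at least six bytes; the name utf8_encode is hypothetical and is not a C library function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one UCS code (0 .. 0x7FFFFFFF) into out, which must have room for
 * six bytes. Returns the number of bytes written, or 0 if the code is out
 * of range. The range checks pick the shortest sequence for each code.
 */
static size_t utf8_encode(unsigned long code, unsigned char *out)
{
    size_t len, i;
    unsigned char lead;

    assert(out != NULL);

    if (code <= 0x7FUL) {
        out[0] = (unsigned char)code;   /* 0xxxxxxx */
        return 1;
    }
    else if (code <= 0x7FFUL)      { len = 2; lead = 0xC0; }  /* 110xxxxx */
    else if (code <= 0xFFFFUL)     { len = 3; lead = 0xE0; }  /* 1110xxxx */
    else if (code <= 0x1FFFFFUL)   { len = 4; lead = 0xF0; }  /* 11110xxx */
    else if (code <= 0x3FFFFFFUL)  { len = 5; lead = 0xF8; }  /* 111110xx */
    else if (code <= 0x7FFFFFFFUL) { len = 6; lead = 0xFC; }  /* 1111110x */
    else return 0;

    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The remaining high bits go into the first byte. */
    out[0] = (unsigned char)(lead | code);
    return len;
}
```

Checking the ranges from smallest to largest is what enforces the shortest-sequence rule: a code is never given a longer form than its range requires.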
Examples
The Unicode character 0xA9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as:
11000010 10101001 = 0xC2 0xA9
The character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
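The two examples can also be read in the other direction. The following sketch of a decoder assumes the six-byte scheme from the encoding table; the name utf8_decode and the min_code table are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one character from s (len bytes available). On success, store the
 * UCS code in *code and return the number of bytes consumed. Return 0 for
 * truncated, malformed, or non-shortest sequences.
 */
static size_t utf8_decode(const unsigned char *s, size_t len,
                          unsigned long *code)
{
    /* Smallest code that needs a sequence of n bytes, indexed by n. */
    static const unsigned long min_code[7] = {
        0, 0, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
    };
    size_t n, i;
    unsigned long c;

    assert(s != NULL && code != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80)      { *code = s[0]; return 1; }
    else if (s[0] < 0xC0) return 0;          /* stray continuation byte */
    else if (s[0] < 0xE0) { n = 2; c = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { n = 3; c = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { n = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xFC) { n = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xFE) { n = 6; c = s[0] & 0x01; }
    else return 0;                           /* 0xFE and 0xFF are never used */

    if (len < n)
        return 0;                            /* sequence is cut short */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not of the form 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_code[n])
        return 0;                            /* not the shortest form */
    *code = c;
    return n;
}
```

Given the bytes 0xC2 0xA9 this returns 2 and stores 0xA9, and given 0xE2 0x89 0xA0 it returns 3 and stores 0x2260. The two-byte sequence 0xC0 0xAF is rejected, because it would be a longer-than-necessary form of '/'.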