UTF & C 1: Unicode and UTF-8

xiaoxiao  2021-03-06

What is UTF-8?

UCS and Unicode are, first of all, just code tables that assign an integer to each character. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious methods store Unicode text as a sequence of 2-byte or 4-byte units. The official names of these two encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (the big-endian convention). An ASCII or Latin-1 file can be converted to UCS-2 simply by inserting a 0x00 byte before each ASCII byte. To convert to UCS-4, three 0x00 bytes must be inserted before each ASCII byte instead.
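The zero-padding conversion described above can be sketched in C as follows. The function name latin1_to_ucs_be and its parameters are hypothetical, chosen only for this illustration, and the caller is assumed to supply an output buffer of at least n * width bytes.

```c
#include <stddef.h>

/* Widen Latin-1 (or ASCII) bytes to big-endian UCS-2 or UCS-4.
 * `width` is the number of bytes per output character (2 or 4).
 * `out` must have room for n * width bytes.
 * Returns the number of bytes written, or 0 if width is not 2 or 4.
 * Hypothetical helper, for illustration only. */
size_t latin1_to_ucs_be(const unsigned char *in, size_t n,
                        unsigned char *out, size_t width)
{
    if (width != 2 && width != 4)
        return 0;

    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        /* The high-order bytes come first and are all zero. */
        for (size_t pad = 1; pad < width; pad++)
            out[o++] = 0x00;
        /* The low-order byte carries the Latin-1 value unchanged. */
        out[o++] = in[i];
    }
    return o;
}
```

Calling it on the two bytes of "Hi" with width 2 yields the four bytes 0x00 'H' 0x00 'i'; with width 4, each character is preceded by three zero bytes.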

Using UCS-2 (or UCS-4) under UNIX would cause very serious problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as '\0' or '/', which have a special meaning in file names and in other C library function parameters. In addition, most UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modification. For these reasons, UCS-2 is not suitable as an external encoding of Unicode in file names, text files, environment variables, and similar places.
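A small C sketch makes the problem concrete. The helper names are hypothetical. It relies on two facts: the letter A in big-endian UCS-2 is the byte pair 0x00 0x41, and U+012F (Latin small letter i with ogonek) is the byte pair 0x01 0x2F, where 0x2F is the ASCII code of '/'.

```c
#include <stddef.h>
#include <string.h>

/* Length that byte-oriented C functions such as strlen() would see:
 * they stop at the first 0x00 byte, wherever it occurs. */
size_t c_string_view_length(const unsigned char *bytes)
{
    return strlen((const char *)bytes);
}

/* Returns 1 if the n-byte buffer contains the byte '/' (0x2F),
 * which a file-name parser would treat as a path separator. */
int contains_slash_byte(const unsigned char *bytes, size_t n)
{
    return memchr(bytes, '/', n) != NULL;
}
```

For the UCS-2 bytes of "A", c_string_view_length reports 0 because the string appears to end at its first byte. For the UCS-2 bytes of U+012F, contains_slash_byte reports a spurious '/'. The UTF-8 form of the same character, 0xC4 0xAF, contains neither a zero byte nor a slash.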

The UTF-8 encoding defined in ISO 10646-1 Annex R and RFC 2279 does not have these problems. It is the natural way to use Unicode under UNIX-style operating systems.

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character occupies. All remaining bytes of a multi-byte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

All possible 2^31 UCS codes can be encoded.

UTF-8 encoded characters can theoretically be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long.

The sorting order of big-endian UCS-4 byte strings is preserved.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
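Because continuation bytes always fall in the range 0x80 to 0xBF and no other byte does, character boundaries can be found without decoding anything. The following is a minimal sketch, assuming well-formed, NUL-terminated UTF-8 input. The function names are hypothetical.

```c
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx (0x80..0xBF). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count the characters in a NUL-terminated UTF-8 string by counting
 * every byte that starts a character (ASCII bytes and lead bytes). */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (!is_continuation((unsigned char)*s))
            count++;
    return count;
}

/* Resynchronize: starting at an arbitrary byte position, skip forward
 * to the next byte that begins a character. The terminating NUL is not
 * a continuation byte, so the loop cannot run past the end. */
const char *utf8_resync(const char *p)
{
    while (is_continuation((unsigned char)*p))
        p++;
    return p;
}
```

If a reader lands in the middle of a character, for example after a lost byte, utf8_resync skips at most the remaining continuation bytes of that one character before decoding can resume.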

The following byte sequences are used to represent a character. Which sequence is used depends on the Unicode number of the character:

U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character's code number in binary representation. The rightmost x bit is the least significant bit. Only the shortest multi-byte sequence that can represent the character's code number may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the entire sequence.
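The byte-sequence rules above can be turned into a short encoder. This is a sketch under stated assumptions, not a reference implementation: the name utf8_encode is hypothetical, the caller supplies a 6-byte buffer, and the code follows the 31-bit scheme described here. The later RFC 3629 restricts UTF-8 to U+10FFFF, which this sketch does not enforce.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code (0 .. 0x7FFFFFFF) as the shortest UTF-8 sequence.
 * `out` must have room for 6 bytes.
 * Returns the number of bytes written, or 0 if the code is out of range.
 * Hypothetical helper, for illustration only. */
size_t utf8_encode(uint32_t code, unsigned char out[6])
{
    size_t len;
    unsigned char lead;   /* leading 1 bits that announce the length */

    if (code < 0x80) {
        out[0] = (unsigned char)code;   /* ASCII is stored unchanged */
        return 1;
    } else if (code < 0x800) {
        len = 2; lead = 0xC0;           /* 110xxxxx */
    } else if (code < 0x10000) {
        len = 3; lead = 0xE0;           /* 1110xxxx */
    } else if (code < 0x200000) {
        len = 4; lead = 0xF0;           /* 11110xxx */
    } else if (code < 0x4000000) {
        len = 5; lead = 0xF8;           /* 111110xx */
    } else if (code <= 0x7FFFFFFF) {
        len = 6; lead = 0xFC;           /* 1111110x */
    } else {
        return 0;
    }

    /* Fill the continuation bytes from the right, six low bits each. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The bits that remain go into the first byte after the length marker. */
    out[0] = (unsigned char)(lead | code);
    return len;
}
```

Choosing the range by comparing against the upper bound of each row guarantees that the shortest possible sequence is always selected.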

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
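The two worked examples can be checked in the reverse direction with a small decoder. This is a minimal sketch: the name utf8_decode and its signature are assumptions made for illustration. It rejects truncated input, stray continuation bytes, the bytes 0xFE and 0xFF, and sequences that are longer than necessary.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from the n bytes available at s.
 * Stores the code in *code and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, malformed, or not the
 * shortest possible sequence. Hypothetical helper, for illustration. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *code)
{
    /* Smallest code that legitimately needs a sequence of each length. */
    static const uint32_t min_code[7] =
        {0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000};
    size_t len;
    uint32_t c;

    if (n == 0)
        return 0;

    unsigned char b = s[0];
    if (b < 0x80) { *code = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; }
    else if ((b & 0xFC) == 0xF8) { len = 5; c = b & 0x03; }
    else if ((b & 0xFE) == 0xFC) { len = 6; c = b & 0x01; }
    else return 0;   /* continuation byte, 0xFE, or 0xFF as first byte */

    if (n < len)
        return 0;    /* sequence is cut short */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a 10xxxxxx byte */
        c = (c << 6) | (s[i] & 0x3F);     /* append six payload bits */
    }

    if (c < min_code[len])
        return 0;    /* a shorter sequence would have sufficed */

    *code = c;
    return len;
}
```

Given the bytes 0xC2 0xA9 it yields 0xA9 and consumes 2 bytes; given 0xE2 0x89 0xA0 it yields 0x2260 and consumes 3 bytes.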

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write it any other way (such as UTF8 or UTF_8) in documentation, unless of course you are referring to a variable name and not to the encoding itself.

What programming languages support Unicode?

Most modern programming languages developed after about 1993 have a special data type for Unicode / ISO 10646-1 characters. In Ada95 it is called Wide_Character; in Java it is called char.

ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, and more were added with Amendment 1 to ISO C in September 1994. These mechanisms were designed mainly with various East Asian encodings in mind, and they are considerably more elaborate than what is needed to handle UCS. UTF-8 is an example of what the ISO C standard calls a multi-byte encoding, and the wchar_t type can be used to hold Unicode characters.
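As a rough sketch of how these ISO C facilities fit together, the function below converts a UTF-8 multi-byte string to wchar_t with the standard mbstowcs function. The wrapper name utf8_to_wide is hypothetical, and the locale names it tries ("C.UTF-8" and "en_US.UTF-8") are assumptions that may not exist on every system.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a multi-byte string to wide characters using the encoding of
 * the selected locale. Note that this changes the process-wide LC_CTYPE
 * setting. The locale names are platform-dependent assumptions.
 * Returns the number of wide characters stored (excluding any
 * terminator), or (size_t)-1 if no UTF-8 locale could be selected or
 * the input is not valid in that encoding. */
size_t utf8_to_wide(const char *mbs, wchar_t *out, size_t out_len)
{
    static const char *const candidates[] = {"C.UTF-8", "en_US.UTF-8"};
    const char *selected = NULL;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        selected = setlocale(LC_CTYPE, candidates[i]);
        if (selected != NULL)
            break;
    }
    if (selected == NULL)
        return (size_t)-1;

    /* mbstowcs() interprets the bytes according to LC_CTYPE. */
    return mbstowcs(out, mbs, out_len);
}
```

Whether the resulting wchar_t values are Unicode code numbers depends on the implementation; C99 compilers that guarantee this define the macro __STDC_ISO_10646__.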

