[Repost] Understanding character encoding and Unicode, ISO 10646, UCS, UTF-8, UTF-16, GBK, GB2312 [HOLEN @ DONEWS]

xiaoxiao2021-03-06 26


-------------------------------

Unicode:

The encoding scheme of Unicode.org, intended to cover all of the world's scripts in common use. Version 1.0 was a 16-bit encoding, from U+0000 to U+FFFF, in which each 2-byte code corresponded to one character. Starting with version 2.0 the 16-bit limit was dropped: the original 16-bit range became the Basic Multilingual Plane, and 16 supplementary planes were added. The supplementary planes are equivalent to a 20-bit encoding, giving an overall code range of 0 to 0x10FFFF.
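The plane arithmetic behind the 0x10FFFF limit can be checked with a few lines. This is a minimal sketch in Python 3 (a language choice made here, not by the article); the names `BMP_SIZE`, `SUPPLEMENTARY_PLANES`, and `plane_of` are illustrative only.

```python
# Assumption: Python 3, whose chr() accepts any code point up to 0x10FFFF.
BMP_SIZE = 0x10000           # 65,536 code points in one 16-bit plane
SUPPLEMENTARY_PLANES = 16    # planes added on top of the BMP

total_code_points = BMP_SIZE * (1 + SUPPLEMENTARY_PLANES)
max_code_point = total_code_points - 1


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that a code point falls in."""
    return code_point >> 16


print(hex(max_code_point))          # highest code point of the 17 planes
print(max_code_point.bit_length())  # bits needed to hold any code point
print(plane_of(0x4E2D), plane_of(0x10FFFF))
```

The 16 supplementary planes alone hold 16 x 65,536 = 2^20 code points, which is where the "20-bit" figure comes from; the whole range including the BMP needs 21 bits.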

UCS:

The Universal Character Set defined by the ISO 10646 standard, using a 4-byte encoding.

Relationship between Unicode and UCS:

ISO and Unicode.org are two different organizations and therefore initially developed different standards. Since Unicode 2.0, however, Unicode has used the same character repertoire and code points as ISO 10646-1. ISO has also committed that ISO 10646 will not assign UCS-4 code values above 0x10FFFF, so that the two standards remain consistent.

UCS encoding methods:

UCS-2: essentially the same as Unicode's 2-byte encoding.

UCS-4: a 4-byte encoding; at present it is the UCS-2 code with two all-zero bytes added in front.

UTF: Unicode / UCS Transformation Format

UTF-8: an encoding based on 8-bit units. ASCII characters are left untransformed, and the other UCS-2 characters get a variable-length encoding, so each character takes 1 to 3 bytes. It is usually used as an external code. It has the following advantages:

* It is independent of CPU byte order, so data can be exchanged between different platforms.
* It is fault tolerant: if any one byte is damaged, at most one character is lost and there is no chain of follow-on errors (unlike the GB encodings, where one wrong byte garbles the text that follows).

UTF-16: an encoding based on 16-bit units. It is a variable-length code, roughly equivalent to a 20-bit encoding, with values from 0 to 0x10FFFF, and it is essentially the direct implementation of the Unicode encoding. It depends on CPU byte order, but because it is compact it is often used as the external code for network transmission. UTF-16 is Unicode's preferred encoding.

UTF-32: a 32-bit encoding that uses only the Unicode range (0 to 0x10FFFF); it is equivalent to a subset of UCS-4.

Relationship between UTF and Unicode:

Unicode is a character set and can be regarded as an internal code. UTF is an encoding method; it exists because raw Unicode values are not suitable for direct transmission and processing in some situations. UTF-16 is the Unicode code used directly, with no transformation, but its encoded form contains 0x00 bytes: for the first 256 code points the first byte is always 0x00. That byte has a special meaning in the operating system (in the C language) and causes problems. Transforming the direct Unicode code into UTF-8 avoids this problem and brings some further advantages.

Chinese national standard encodings:

GB 13000: completely equivalent to ISO 10646-1 / Unicode 2.1, and will be kept in sync with future changes to the ISO 10646 / Unicode standards.

GBK: an extension of GB2312 that accommodates the part of Unicode 2.1 lying outside the GB2312 character set, and adds some further characters.
GB 18030-2000: based on GB 13000, it is the extension of GBK to Unicode 3.0 and covers all Unicode code points. Its status is equivalent to that of UTF-8 and UTF-16: it is a Unicode encoding form. It is a variable-length code that uses one, two, or four bytes per character. GB 18030 is backward compatible with GB2312 and GBK, and it is a mandatory standard for all computer systems in China other than handheld and embedded ones.

-------------------------------

What are UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character sets. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then back to its original encoding, no information is lost.

UCS contains the characters needed to represent all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also the Chinese, Japanese, and Korean ideographs, as well as Hiragana, Katakana, Bengali, Gurmukhi, Tamil, Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many others. For scripts that have not yet been added, research on how best to encode them for computer use is still under way, and they will be added eventually. These include Tibetan, Khmer, Runic (the ancient Nordic script), Ethiopic, other hieroglyphic scripts, various historic Indo-European scripts, and even selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also covers a large number of graphical, typographical, mathematical, and scientific symbols, including all those provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, OCR fonts, and many other word-processing and publishing systems.

ISO 10646 defines a 31-bit character set. Of this huge code space, however, only the first 65534 positions (0x0000 to 0xFFFD) have been assigned so far. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). The characters expected to be encoded outside the 16-bit BMP all belong to rather exotic scripts (such as hieroglyphics) that only specialists use for historical and scientific purposes. According to current plans, characters may never be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers more than one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the contents of the BMP. A second part, covering characters encoded outside the BMP, is in preparation but may take a few more years. New characters are still being added to the BMP, but the existing characters are stable and will not change again.

UCS assigns each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is usually prefixed with "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, as well as larger ranges outside the BMP, is reserved for private use.

What are combining characters?

Some code points in UCS are assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a complete character by itself; it is an accent or other diacritical mark that is added to the preceding character. In this way, an accent can be placed on any character.
The most important accented characters, such as those used in the orthographies of common languages, have code positions of their own in UCS to ensure backward compatibility with older character sets. Accented characters that have their own code position, but could also be represented as another character followed by a combining character, are called precomposed characters. Precomposed characters exist in UCS for backward compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows accents and other diacritical marks to be added to any character. This is especially useful in scientific notation, such as mathematical formulae and the International Phonetic Alphabet, where a base character may need to be combined with one or more diacritical marks.

Combining characters follow the character they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can be represented either by the precomposed UCS code U+00C4, or by a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308.

Several combining characters can be applied when multiple accents need to be stacked, or when marks are needed both above and below the base character. In Thai, for example, one base character can carry two combining characters.

What are UCS implementation levels?

Not all systems need to support all the advanced mechanisms of UCS, such as combining characters. ISO 10646 therefore specifies the following three implementation levels:

Level 1

Combining characters and Hangul Jamo characters are not supported. (Hangul Jamo is a special, more complex encoding of Korean that uses two or three sub-characters to encode one Korean syllable.)

Level 2

Like level 1, except that for some scripts a fixed list of combining characters is allowed (for example, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least a minimal set of combining characters, UCS cannot represent these languages completely.

Level 3

All UCS characters are supported, so that, for example, mathematicians can put a tilde or an arrow (or both) on any character.
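The two representations of Ä described above (precomposed U+00C4 versus U+0041 U+0308) can be compared directly. The sketch below uses Python's standard `unicodedata` module; the normalization forms NFC and NFD come from the Unicode standard and are not discussed in this article, so treat their use here as an added assumption.

```python
import unicodedata

precomposed = "\u00C4"       # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "\u0041\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Both strings display as the same letter but differ as code point sequences.
same_code_points = precomposed == combining

# NFC composes base + combining mark into the precomposed character;
# NFD decomposes the precomposed character into base + combining mark.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)

# A non-zero combining class marks U+0308 as a combining character.
is_combining_mark = unicodedata.combining("\u0308") != 0

print(same_code_points, composed == precomposed, decomposed == combining)
print(unicodedata.name(precomposed), is_combining_mark)
```

A system that compares strings byte by byte would treat the two forms as different text, which is one reason software typically normalizes to a single form before comparing.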

What is Unicode?

Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of (mostly US) manufacturers of multilingual software. Fortunately, the participants in both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate any future extensions.

So how do Unicode and ISO 10646 differ? The Unicode Standard published by the Unicode Consortium corresponds exactly to the Basic Multilingual Plane of ISO 10646-1 at implementation level 3. All characters are at the same positions and have the same names in both standards.

The Unicode Standard additionally defines much more semantics associated with the characters, and is in general the better reference for implementers of high-quality typographic publishing systems. Unicode specifies in detail algorithms for rendering the presentation forms of some scripts (such as Arabic), for handling bidirectional text (such as mixed Latin and Hebrew), for sorting and string comparison, and much more.

The ISO 10646 standard, on the other hand, is little more than a simple character set table, like the well-known ISO 8859. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards; ISO 14651, for instance, covers the sorting of UCS strings.

Because the Unicode Standard is published as a readily obtainable book, is available in any good bookshop, costs only a fraction of the ISO version, and contains much more helpful information, it is the more widely used reference. It is generally believed, however, that the fonts used to print the ISO 10646-1 standard are in some respects of higher quality than those used to print Unicode 2.0. Professional font designers are advised to obtain both, though some of the sample glyphs differ significantly. The ISO 10646-1 standard shows the ideographic characters of Chinese, Japanese, and Korean (CJK) in four different style variants, whereas Unicode 2.0 shows them only in a Chinese variant. This has led to the widespread belief that Unicode is unacceptable to Japanese users, although that belief is wrong.

What is UTF-8?

First of all, UCS and Unicode are just code tables that assign an integer to each character. There are several ways to represent a string of such characters as a string of bytes. The two most obvious ways store Unicode text as sequences of 2-byte or 4-byte units. The official names of these two encodings are UCS-2 and UCS-4 respectively. Unless otherwise specified, the most significant byte comes first (the big-endian convention). To convert an ASCII or Latin-1 file to UCS-2, simply insert a 0x00 byte before every ASCII byte. To convert it to UCS-4, three 0x00 bytes must be inserted before every ASCII byte.

Using UCS-2 (or UCS-4) under Unix would lead to very serious problems. Strings in these encodings contain bytes such as '\0' or '/', which have a special meaning in file names and in the parameters of other C library functions. In addition, most Unix tools expect ASCII files and cannot read 16-bit characters without major modification. For these reasons, UCS-2 is not a suitable external encoding of Unicode for file names, text files, environment variables, and so on.

The UTF-8 encoding defined in ISO 10646-1 Annex R and RFC 2279 does not have these problems. It is clearly the way to use Unicode under Unix-style operating systems.

UTF-8 has the following properties:

* UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
* All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore no ASCII byte (0x00-0x7F) can appear as part of any other character.
* The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character occupies. All remaining bytes of the multi-byte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against lost bytes.
* All possible 2^31 UCS codes can be encoded.
* UTF-8 encoded characters can be up to 6 bytes long; however, 16-bit BMP characters are at most 3 bytes long.
* The sorting order of big-endian UCS-4 byte strings is preserved.
* The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. Which sequence is used depends on the character's Unicode number:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character's code number in binary representation. The rightmost x is the least significant bit. Only the shortest multi-byte sequence that can represent the character's code number may be used. Note that in a multi-byte sequence, the number of leading "1" bits in the first byte equals the number of bytes in the whole sequence. For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
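The bit-filling rule and the two worked examples above can be sketched as a small encoder. This is an illustrative sketch in Python, not a reference implementation: `utf8_encode` is a hypothetical name, and it follows the original 1-to-6-byte scheme from the table rather than calling Python's built-in codec. Note that the later RFC 3629 restricts UTF-8 to code points up to U+10FFFF, so it never uses the 5- and 6-byte forms.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code number using the 1-to-6-byte table above."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])              # 0xxxxxxx: plain ASCII

    # (upper limit of the range, sequence length, lead-byte marker bits)
    ranges = (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    )
    # Pick the shortest sequence that can hold the code number.
    for limit, length, marker in ranges:
        if code <= limit:
            break

    out = []
    # Fill continuation bytes (10xxxxxx) from the least significant bits up.
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3F))
        code >>= 6
    out.append(marker | code)             # remaining bits go in the lead byte
    return bytes(reversed(out))


print(utf8_encode(0x00A9).hex())   # copyright sign
print(utf8_encode(0x2260).hex())   # not equal to
```

For code points in the range both schemes share, the result should agree with the built-in codec, for example `utf8_encode(0x2260) == "\u2260".encode("utf-8")`.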

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write it any other way (such as UTF8 or UTF_8) in documentation, unless of course you are referring to a variable name and not to the encoding itself.

What programming languages support Unicode?

Most modern programming languages developed after about 1993 have a special data type for Unicode / ISO 10646-1 characters. It is called Wide_Character in Ada95 and char in Java. ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, and more were added with Amendment 1 to ISO C in September 1994. These mechanisms were designed mainly for the various East Asian encodings, and they are considerably more sophisticated than what is needed just to handle UCS. UTF-8 is an example of what the ISO C standard calls a multi-byte encoding, and the wchar_t type can be used to store Unicode characters.
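As a rough illustration of a language with a built-in Unicode string type, the sketch below encodes one short string with several of the encodings covered in this article. Python 3 is an assumption made here (the article itself names Ada95, Java, and C); the codec names `utf-16-be`, `utf-32-be`, and `gb18030` are those of Python's standard library.

```python
text = "A\u4e2d"   # U+0041 (Latin capital letter A) and U+4E2D (a CJK ideograph)

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = text.encode("utf-32-be")   # big-endian, no byte order mark
gb18030 = text.encode("gb18030")

for name, data in (("UTF-8", utf8), ("UTF-16", utf16),
                   ("UTF-32", utf32), ("GB 18030", gb18030)):
    print(f"{name:9} {len(data)} bytes  {data.hex(' ')}")

# The fixed-width forms pad ASCII with 0x00 bytes, which C string functions
# treat as a terminator; the UTF-8 form of the same text contains none.
print(b"\x00" in utf16, b"\x00" in utf32, b"\x00" in utf8)
```

Every encoded form decodes back to the same string, which is the round-trip property the article attributes to UCS.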

When reposting, please cite the original URL: https://www.9cbs.com/read-40389.html
