Unicode and ISO10646


Unicode and ISO10646 (Author: Tseng Bear)

I. Origins

In the early 1960s, Henriette Avram and others at the US Library of Congress (LC) began work on a machine-readable cataloging format, while James Agenbroad and others developed an English character set and interchange code to serve as a common standard for American libraries. The LC interchange code later developed into the national standard ASCII (American Standard Code for Information Interchange), and went on to become the worldwide computer character encoding standard ISO 646 (full name: 7-bit Coded Character Set for Information Interchange). Today, although a byte has grown from 7 bits to 8 bits, ASCII and ISO 646 remain the foundation standards of computing worldwide.

According to ASCII and ISO 646, the 128 code positions available in 7 bits (code range 0 to 127) are divided into two parts: 94 graphic character codes and 34 control character codes. The graphic characters comprise 52 uppercase and lowercase English letters, 10 Arabic numerals, 9 punctuation marks, 6 brackets, and 17 other symbols, coded 33 to 126. The control characters comprise 10 transmission control characters, 6 format effectors, 4 device control characters, 4 information separators, and 10 special control characters, coded 0 to 32 and 127.
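The counts in the preceding paragraph can be checked with a few lines of Python. This is only a sketch: the dictionary keys are shortened labels chosen for illustration, and the grouping of code 32 (space) with the control codes follows this article's own counting.

```python
# Partition of the 7-bit code space as counted in this article.
# Code 32 (space) is grouped with the control codes here.
control_codes = list(range(0, 33)) + [127]
graphic_codes = list(range(33, 127))

# Category sizes quoted in the text (labels are illustrative).
graphic_breakdown = {"letters": 52, "digits": 10, "punctuation": 9,
                     "brackets": 6, "other symbols": 17}
control_breakdown = {"transmission control": 10, "format effectors": 6,
                     "device control": 4, "information separators": 4,
                     "special control": 10}

print(len(graphic_codes), sum(graphic_breakdown.values()))   # 94 94
print(len(control_codes), sum(control_breakdown.values()))   # 34 34
print(len(graphic_codes) + len(control_codes))               # 128
```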

When a computer or network device receives a stream of bits, it usually splits the stream into bytes (groups of 8 bits) as it arrives, and immediately determines whether the byte just received is a control character code or a graphic character code. If it is a control character that concerns the device (for example a transmission control character, or Bell, coded 7), the computer or network device intercepts the character and acts on it at once (the Bell character, for instance, makes the device ring); otherwise it passes the byte unprocessed to the next device. In other words, computers and network devices "eat" certain control character codes in the byte stream.

As computers have grown more powerful and cheaper, their applications have become ever wider. The variety of coding requirements that followed, however, made the single-byte coding space too small to serve them. Chinese characters, non-English alphabetic scripts, graphic symbols and the like need 2 or more bytes to encode. At the same time, to keep these multi-byte character codes from being "eaten" by computers or network devices, every byte must avoid the 34 control codes 0 to 32 and 127. This practice wastes coding space severely. Under the multi-byte extension standard ISO 2022, two 8-bit bytes can provide at most 188 control characters and 35,344 code positions, 35,532 code positions in all, whereas a 16-bit coding space actually holds 65,536. Comparing the two, the 16-bit coding of ISO 2022 reaches only 54% of the maximum coding space, which is clearly poor utilization. Meanwhile, at the application level, vendors lacked consensus: each devised its own code, and the result was a chaotic proliferation of a dozen or so incompatible codes.
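The utilization figure quoted above can be reproduced as follows. This is a sketch under one assumption that the text does not state: that 188 is the number of byte values left once the 34 avoided codes are excluded in both the lower and the upper half of the 8-bit range.

```python
AVOIDED_PER_HALF = 34   # codes 0 to 32 and 127, as stated in the text
# Assumption: the corresponding 34 positions of the upper half are avoided too.
usable_per_byte = 256 - 2 * AVOIDED_PER_HALF              # 188

two_byte_positions = usable_per_byte ** 2                 # 35,344
total_positions = two_byte_positions + usable_per_byte    # 35,532
full_16_bit_space = 2 ** 16                               # 65,536

utilization = total_positions / full_16_bit_space
print(usable_per_byte, two_byte_positions, total_positions)
print(f"{utilization:.1%}")   # 54.2%
```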

To accommodate the characters and symbols of the whole world, some ISO member states began promoting a new international character set coding standard in 1984. The new standard was drafted by the working group ISO/IEC JTC1/SC2/WG2 (Note 1; hereafter WG2) and was finally named "Universal Multiple-Octet Coded Character Set" (abbreviated UCS), numbered ISO/IEC 10646. In WG2's original plan, the coding structure of ISO 10646 followed the ISO 2022 eight-bit extension coding structure in avoiding the C0 and C1 control code areas (Note 2), but dropped the restriction that bit 8 (the leftmost bit, whose value is 2^7 = 128) of every byte in a character code must be uniformly set to 0 or uniformly set to 1, in order to make better use of the coding space. In addition, to have enough positions for the characters and symbols of all the world's languages, and to match the trend of microprocessors working in units of 8, 16, 32 or 64 bits, the character code length of ISO 10646 was fixed at 4 octets.

As soon as the third draft of ISO 10646 was published, its coding structure met with opposition from the US computer industry. In early 1988, Joe Becker of Xerox proposed a new coding structure and a separate world character encoding standard: the basic unit of computer character coding would be extended from the current 7 or 8 bits to 16 bits, and all 65,536 code positions would be fully used to hold the characters and common symbols of the whole world. The new character set coding standard was named "Unicode" (Note 3). A group of engineers from Xerox and Apple formed a working group responsible for the original design of Unicode. In January 1991, more than ten computer hardware, software, network and information service vendors, including IBM, DEC, Sun, Xerox, Apple, Microsoft and Novell, jointly funded the Unicode Consortium, and through the Consortium established the non-profit Unicode, Inc. After the Consortium was founded, the original working group was expanded into the Unicode Technical Committee, which is responsible for collecting, organizing and encoding Unicode characters. The work of promoting Unicode as an international standard is handled by Unicode, Inc. The first draft of Unicode was published in September 1989, and after several revisions the first edition of the Unicode standard was published in 1991 and 1992 (The Unicode Standard, Version 1.0, Volumes 1 and 2).

Under continued lobbying and pressure from the Unicode Consortium, WG2 finally abandoned the original ISO 2022 eight-bit extension coding structure and adopted Unicode's coding scheme, that is, continuous coding that no longer avoids the C0 and C1 control code areas. In October 1991, after several months of negotiation, WG2 and the Unicode Consortium reached an agreement under which Unicode was incorporated into ISO 10646 as plane 0. Since then, the collection, organization and coding of characters has been led by WG2 with the active assistance of the Unicode Consortium, but each party still publishes its own coding standard. Because the merger took place after the first volume of the Unicode standard had been released, the second volume devotes a chapter to listing the revisions to coding areas and character sets made in response to the merger. ISO published the first edition of ISO 10646-1 (Note 4) in September, and the revised second edition was published in March this year. The second edition of the Unicode standard, The Unicode Standard, Version 2.0, was published in 1996, and the third edition of the Unicode standard, corresponding to the revised ISO 10646-1, was published in January this year. In March this year, the 38th meeting of WG2, held in Beijing, formally approved the final draft of ISO 10646-2 (Note 5). It is scheduled to be sent to member states for review once editing is completed in May and, barring surprises, will be officially published in the near future.

II. Encoding Structure and Character Set

The canonical form of an ISO 10646 character code (referred to as UCS-4) is 32 bits, divided into four octets, as shown in [Figure 1]. From left to right these four octets are named the group octet (G-octet), plane octet (P-octet), row octet (R-octet) and cell octet (C-octet), representing the Group, Plane, Row and Cell of the coding structure respectively. ISO 10646 requires bit 32 of a character code to be 0, so the whole coding space divides into 128 groups (group octet values 00 ~ 7FH (Note 6)). Each group consists of 256 planes (plane octet 00 ~ FFH), each plane of 256 rows (row octet 00 ~ FFH), and each row of 256 cells (cell octet 00 ~ FFH), a cell being one code position. In addition, ISO 10646 reserves the last two code positions of every plane, FFFEH and FFFFH, which may not be used. The whole coding space of ISO 10646 therefore has 256 × 128 = 32,768 planes, each with 256 × 256 - 2 = 65,534 code positions, for a total of 65,534 × 32,768 = 2,147,418,112 code positions.
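The four-octet structure and the size of the coding space can be sketched in Python as follows. The function name split_ucs4 and the constant names are chosen here for illustration and do not come from the standard.

```python
def split_ucs4(code):
    """Split a 32-bit UCS-4 code into its (group, plane, row, cell) octets."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("bit 32 of a UCS-4 code must be 0")
    return ((code >> 24) & 0xFF, (code >> 16) & 0xFF,
            (code >> 8) & 0xFF, code & 0xFF)

GROUPS = 128                            # group octet 00 ~ 7FH
PLANES_PER_GROUP = 256                  # plane octet 00 ~ FFH
POSITIONS_PER_PLANE = 256 * 256 - 2     # FFFEH and FFFFH are reserved

total_planes = GROUPS * PLANES_PER_GROUP
total_positions = total_planes * POSITIONS_PER_PLANE

print(split_ucs4(0x00005716))           # (0, 0, 87, 22)
print(total_planes, total_positions)    # 32768 2147418112
```

A code whose group and plane octets are both 0, as in the example, lies in plane 0 of group 0.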

Plane 0 of group 0 of ISO 10646 (group octet and plane octet both 00H) is called the "Basic Multilingual Plane" (BMP), and its coded characters are identical to Unicode. The 32,767 planes outside the BMP are divided into supplementary planes and private use planes. Supplementary planes hold the characters of all languages that WG2 collects, organizes and encodes. The content of private use planes is not specified by WG2; they are reserved for users to add characters not included in ISO 10646. There are 8,226 private use planes in all: 34 planes in group 00H (planes 0FH, 10H and E0 ~ FFH), and 8,192 planes in the 32 groups 60 ~ 7FH. Apart from these 8,226 private use planes, the remaining 24,541 planes are supplementary planes.
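The plane counts above can be verified by enumerating every (group, plane) pair. This is a sketch; is_private_use_plane is a helper name introduced here, and it encodes only the ranges listed in the preceding paragraph.

```python
def is_private_use_plane(group, plane):
    """True if (group, plane) is a private use plane as listed in the text."""
    if group == 0x00:
        return plane in (0x0F, 0x10) or 0xE0 <= plane <= 0xFF
    return 0x60 <= group <= 0x7F

all_planes = [(g, p) for g in range(128) for p in range(256)]
private = sum(is_private_use_plane(g, p) for g, p in all_planes)

# Every plane that is neither the BMP nor private use is supplementary.
supplementary = len(all_planes) - 1 - private

print(private, supplementary)   # 8226 24541
```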

When a computer system uses only BMP character codes, the group octet and plane octet can be omitted, shortening the character code from 32 bits to 16 bits. This is called the basic form of the ISO 10646 character code (referred to as UCS-2), and it can also be regarded as Unicode.

Of all the planes of ISO 10646, only planes 0, 1 and 2 currently have coded characters. The coded characters of the ISO 10646 BMP, that is of Unicode, are laid out by UCS-2 code as follows:

(1) 0000 ~ 007FH: Basic Latin. 0000 ~ 001FH are the C0 control codes, 0020H is the space (SPACE), 0021 ~ 007EH are the ASCII graphic characters, and 007FH is the control code DEL. In fact, these 128 character codes become ASCII simply by dropping the leading 8 bits to leave an 8-bit form.

(2) 0080 ~ 00A0H: Control code area. 0080 ~ 009FH are the C1 control codes, and 00A0H is the no-break space (NO-BREAK SPACE).

(3) 00A1 ~ 1FFFH: Alphabetic script area. Holds the alphabetic scripts other than Basic Latin, including European languages, Greek, Slavic languages, Hebrew, Arabic, Armenian, Indian, Malay, Thai, Lao, Khmer, Manchu, Mongolian, Tibetan, Native American languages, and so on.

(4) 2000 ~ 28FFH: Symbol area. Holds symbols of all kinds, including punctuation, superscripts and subscripts, currency symbols, number forms, arrows, mathematical symbols, technical symbols, optical character recognition symbols, circled or parenthesized letters and numbers, box-drawing symbols, geometric shapes, Braille, dingbats, and so on.

(5) 2E80 ~ 33FFH: CJK symbol area. Holds the Kangxi dictionary radicals, supplementary CJK radicals, Bopomofo phonetic symbols, Japanese kana, Korean Hangul letters, CJK symbols and punctuation, circled or parenthesized CJK letters and months, Japanese kana combinations, units, years, months, days, hours, and so on.

(6) 3400 ~ 4DFFH: CJK Unified Ideographs Extension A, holding 6,582 CJK Han characters.

(7) 4E00 ~ 9FFFH: CJK Unified Ideographs area, holding 20,902 CJK Han characters.

(8) A000 ~ A4FFH: Yi script area, holding the script of the Yi people of southern China and its radicals.

(9) AC00 ~ D7FFH: Hangul syllable area, holding the characters composed from Hangul phonetic letters.

(10) D800 ~ DFFFH: S (surrogate) zone, reserved for UTF-16; see Section III.

(11) E000 ~ F8FFH: Private use zone. Its content is not specified by WG2; it is reserved for users to add characters not included in ISO 10646.

(12) F900 ~ FAFFH: CJK compatibility ideographs area, holding 302 CJK Han characters. What a compatibility ideograph is will be explained later.

(13) FB00 ~ FFFDH: Presentation forms area, holding Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, halfwidth forms, fullwidth forms, and so on.
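The layout listed above amounts to a range table, and a lookup over it can be sketched as below. The zone names are shortened labels chosen for this example, and bmp_zone is a hypothetical helper; codes that fall in none of the thirteen listed ranges return None.

```python
from bisect import bisect_right

# (first code, last code, label) for the 13 areas listed in the text.
BMP_ZONES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00A0, "control code area"),
    (0x00A1, 0x1FFF, "alphabetic scripts"),
    (0x2000, 0x28FF, "symbols"),
    (0x2E80, 0x33FF, "CJK symbols"),
    (0x3400, 0x4DFF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xD800, 0xDFFF, "surrogates (UTF-16)"),
    (0xE000, 0xF8FF, "private use"),
    (0xF900, 0xFAFF, "CJK compatibility ideographs"),
    (0xFB00, 0xFFFD, "presentation forms"),
]
_STARTS = [start for start, _, _ in BMP_ZONES]

def bmp_zone(code):
    """Return the label of the listed area containing a UCS-2 code, or None."""
    i = bisect_right(_STARTS, code) - 1   # last area starting at or before code
    if i >= 0:
        start, end, name = BMP_ZONES[i]
        if start <= code <= end:
            return name
    return None

print(bmp_zone(0x0041))   # Basic Latin
print(bmp_zone(0x5716))   # CJK Unified Ideographs
print(bmp_zone(0xFFFE))   # None
```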

WG2 brings together experts from around the world to compile the world's scripts and symbols and add them progressively to ISO 10646. WG2 divides scripts by their nature into two kinds, ideographic and non-ideographic. Ideographic script means the Han characters that originated in China and are used in the countries of East Asia, chiefly those used in Taiwan, China, Japan, South Korea, Vietnam, Singapore, and Hong Kong and Macao. All scripts other than Han characters are classed as non-ideographic, and the great majority of them are alphabetic. The BMP of ISO 10646, that is Unicode, contains non-ideographic scripts, symbols and ideographs alike. But the scripts and symbols of the world, ancient and modern, are so numerous that the BMP cannot hold them all. Of the non-ideographic scripts and symbols that WG2 has collected and organized so far, those not already in the BMP are placed in plane 1; because the items are too varied, this article does not cover them. Of the ideographs, those not already coded in the BMP are placed in plane 2, whose content is:

(1) CJK Unified Ideographs Extension B: 42,807 Chinese, Japanese, Korean and Vietnamese Han characters in all, coded 0002-0100 ~ 0002-A836H.

(2) CNS 11643 compatibility characters: 527 CNS 11643 characters that had been unified away, coded 0002-F800 ~ 0002-FA16H.

To build the ISO 10646 coded character set, each WG2 member body or observer first collects the scripts and symbols of its own country and submits a proposal to WG2 (Note 7). WG2 meets every six months to review the character set proposals, and either codes them at once or waits until more characters have been collected before coding. Non-ideographic scripts and symbols, because their repertoires are small or used by only one country, are usually discussed directly at WG2 meetings. Han character sets, however, are large and are shared by several countries and regions, so WG2 set up the Ideographic Rapporteur Group (IRG) to collect the Han characters of each country, compare them for unification, and organize them as a whole before proposing them to WG2. WG2 has always accepted the character sets recommended by the IRG and coded them directly. The Han characters proposed by the IRG members all derive from China, so identical or nearly identical glyphs are inevitable. To keep duplicate characters in the ISO 10646 code tables from troubling users, the IRG drew up unification rules: all Han characters that meet the rules are merged into one character and given one code. Nevertheless, out of respect for each country's authority over its own characters, WG2 lists the differing national glyph shapes side by side in the ISO 10646 code tables.

The unification rules apply not only to Han characters from different sources but also to Han characters from the same source. For example, our national standard for Chinese coding, CNS 11643, assigns two different codes, 1-6837H and 6-5B5BH, to two extremely similar forms of the character 圖. Under the unification rules these two forms must be merged into one, so the latter is unified into the former. In other words, CNS 1-6837H and 6-5B5BH both correspond to Unicode 5716H (or ISO 10646 0000-5716H), but Unicode 5716H corresponds only to CNS 1-6837H. When we convert data from CNS code to Unicode and then convert it back from Unicode to CNS code, the result 6-5B5BH → 5716H → 1-6837H occurs. This phenomenon is called a round-trip conversion error. The only remedy is for the Unicode or ISO 10646 character set to include the unified-away characters as well and give them separate codes (called compatibility characters), so that the CNS character set and the ISO 10646 character set map one to one (though not onto). The CNS compatibility character set held in plane 2 of ISO 10646 is the result our country fought for in order to achieve correct round-trip conversion. In other words, all existing CNS 11643 Han characters are now fully coded in the ISO 10646 code tables.
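The round-trip conversion error described above can be illustrated with two small mapping tables. This is a sketch only: the CNS codes and 5716H come from the text, while 0x2F800 is merely a placeholder taken from the start of the compatibility range, not the real code assigned to CNS 6-5B5BH.

```python
def round_trip(cns_code, to_ucs, to_cns):
    """Convert a CNS code to UCS and back again."""
    return to_cns[to_ucs[cns_code]]

# Unification only: two CNS codes share one UCS code (values from the text).
unified_to_ucs = {"1-6837": 0x5716, "6-5B5B": 0x5716}
unified_to_cns = {0x5716: "1-6837"}

# With a compatibility character. PLACEHOLDER_COMPAT is hypothetical;
# the actual assignment is not given in this article.
PLACEHOLDER_COMPAT = 0x2F800
compat_to_ucs = {"1-6837": 0x5716, "6-5B5B": PLACEHOLDER_COMPAT}
compat_to_cns = {ucs: cns for cns, ucs in compat_to_ucs.items()}

print(round_trip("6-5B5B", unified_to_ucs, unified_to_cns))  # 1-6837
print(round_trip("6-5B5B", compat_to_ucs, compat_to_cns))    # 6-5B5B
```

Because the second pair of tables is one to one, the reverse table can be built by simply inverting the forward table, which is impossible for the first pair.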

III. UTF-16 and UTF-8

The coding space of ISO 10646 is large enough to hold all the scripts and symbols humanity has used from ancient times to the present, but most of the scripts and symbols in current use are already in the BMP, and their frequency of use may exceed 99% or even 99.99%. In other words, for more than 99% of users or use cases, 16-bit Unicode is already sufficient, and the 32-bit canonical ISO 10646 code is like using an ox cleaver to kill a chicken. Compared with 16-bit coding, 32-bit coding not only takes twice the memory or storage space but also takes longer to transmit over a network and to process. For economic reasons, future computers and networks are clearly more likely to choose Unicode than the 32-bit canonical form of ISO 10646.

The problem is: in a Unicode world, what should be done when characters from plane 1, plane 2 or even higher planes of ISO 10646 must be used? The solution proposed by the Unicode Consortium is called UTF-16, or the surrogate method (Surrogates). UTF is the abbreviation of "UCS (or Unicode) Transformation Format", and UTF-16 means converting the original 32-bit ISO 10646 character code into 2 or more 16-bit Unicode codes. Current practice is to combine two Unicode codes to represent one ISO 10646 character, as shown in [Figure 2], hence the name surrogate method. Of the two surrogate codes, the one in front (on the left) is called the high surrogate and must be chosen from D800 ~ DBFFH; the one behind (on the right) is called the low surrogate and must be chosen from DC00 ~ DFFFH. The high and low surrogates each have 1,024 = 4 × 256 code positions, so UTF-16 provides (4 × 256) × (4 × 256) = 16 × 65,536 code positions, that is, 16 planes.

The BMP of course needs no UTF-16 conversion, so UTF-16 is mainly used for planes 1 to 14 of ISO 10646 (planes 15 and 16 are private use planes, which WG2 does not code). The rule for converting ISO 10646 characters (code range 0001-0000 ~ 000E-FFFFH) through UTF-16 into a pair of Unicode codes is shown in [Figure 3], where, for ease of comparison with hexadecimal, the character code is shown in groups of four bits separated by spaces. The conversion is not difficult. From the original 32-bit ISO 10646 character code, take the 10 rightmost bits (y10 ~ y1 in the figure) and prefix the 6 bits 110111 to form the low surrogate. Then take the next 6 bits to the left (y16 ~ y11 in the figure) as the lowest 6 bits of the high surrogate. Take 4 more bits to the left (z4 ~ z1 in the figure, corresponding to y20 ~ y17, whose value can represent planes 0 to 15), subtract 1 from their value, and place the result to the left of those 6 bits. Finally, prefix the fixed 6 bits 110110 to form the high surrogate. The above introduces only the concept of UTF-16 conversion; for the conversion and restoration formulas, please refer to the ISO 10646 standard or the Unicode standard.
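The bit manipulation described above can be sketched in Python as follows. The function names to_surrogates and from_surrogates are chosen here for illustration; the sketch accepts all 16 planes that UTF-16 can reach and is not a substitute for the formulas in the standards.

```python
def to_surrogates(code):
    """Convert a UCS-4 code in planes 1 to 16 into (high, low) 16-bit codes."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 are represented by surrogates")
    low = 0xDC00 | (code & 0x3FF)                  # 110111 + y10..y1
    mid6 = (code >> 10) & 0x3F                     # y16..y11
    plane_minus_1 = (code >> 16) - 1               # z4..z1
    high = 0xD800 | (plane_minus_1 << 6) | mid6    # 110110 + z4..z1 + y16..y11
    return high, low

def from_surrogates(high, low):
    """Restore the UCS-4 code from a (high, low) surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000

high, low = to_surrogates(0x20100)
print(f"{high:04X} {low:04X}")              # D840 DD00
print(f"{from_surrogates(high, low):X}")    # 20100
```

Adding 0x10000 on restoration is the counterpart of subtracting 1 from the plane number during conversion.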

Note in particular that UTF-16 is designed for a Unicode world. That is, once the Internet and most computers have adopted Unicode, UTF-16 can be used to represent the character codes of planes 1 to 14 of ISO 10646. Unfortunately, that is not the case: today's world is an ASCII world, not a Unicode world. Once a Unicode code leaves its greenhouse (an operating system or application that uses Unicode internally), network devices immediately cut it into two 8-bit bytes and strictly check whether either byte is a C0 control code. At the very least, such Unicode codes will arrive incomplete. To make Unicode and ISO 10646 character codes acceptable in the ASCII world, the Unicode Consortium proposed UTF-8 to solve the problem. UTF-8 means converting the original 32-bit ISO 10646 code or the original 16-bit Unicode code into multiple 8-bit bytes.
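The hazard described above is easy to demonstrate: cut a 16-bit code into two bytes and look for values in the C0 range. This is an illustrative sketch; c0_hits is a helper name introduced here, the three sample characters are arbitrary, and Python's built-in UTF-8 encoder is used for the comparison.

```python
C0_MAX = 0x1F   # C0 control codes are 0 to 31 (see Note 2)

def c0_hits(data):
    """Return the bytes in data that fall in the C0 control code range."""
    return [b for b in data if b <= C0_MAX]

text = "A\u4e0a\u5716"   # 'A', U+4E0A and U+5716

for ch in text:
    pair = ord(ch).to_bytes(2, "big")   # a 16-bit code cut into two bytes
    print(f"{ord(ch):04X} -> {pair.hex()}  C0 bytes: {[hex(b) for b in c0_hits(pair)]}")

# The same text in UTF-8 contains no C0 control bytes.
print(c0_hits(text.encode("utf-8")))    # []
```

Each of the three 16-bit codes contains one byte (00H, 0AH or 16H) that a device could mistake for a control code.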

The UTF-8 conversion rules turn a Unicode or ISO 10646 character code into a code of 1 to 4 bytes, as shown in [Figure 4]. The rules are simple. If the original character code lies within 0000 ~ 007FH (or 0000-0000 ~ 0000-007FH), simply take the rightmost byte; the result is in fact the ASCII code. If the original character code is greater than 007FH (or 0000-007FH), that is, beyond the ASCII range, it must be converted into 2 to 4 bytes. UTF-8 specifies that the number of consecutive leading bits set to 1 in the first byte of the result indicates how many bytes the result has: 110 means 2 bytes, 1110 means 3 bytes, and 11110 means 4 bytes. The bit set to 0 that follows separates the marker bits from the character code bits. The first two bits of bytes 2 to 4 are set to 10 for identification, and the remaining 6 bits serve as character code bits. In total, the 2-byte UTF-8 code leaves 11 character code bits, enough to convert original codes 0080 ~ 07FFH; the 3-byte code leaves 16 character code bits, enough for original codes 0800 ~ FFFFH; and the 4-byte code leaves 21 character code bits, enough for original codes 0001-0000 ~ 001F-FFFFH (that is, planes 1 to 31 of ISO 10646). Note that although the 4-byte UTF-8 code could also hold what 1 to 3 bytes can hold, the 3-byte code what 1 to 2 bytes can hold, and the 2-byte code what 1 byte can hold, UTF-8 conversion must always choose the shortest form. In other words, the ASCII range may only be converted to a single byte, original codes 0080 ~ 07FFH only to 2 bytes, and so on.
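The rules above translate almost directly into code. The following is a minimal sketch of the 1 to 4 byte forms described in this article; utf8_encode is a name chosen here, and testing the ranges from smallest to largest is what guarantees the shortest form.

```python
def utf8_encode(code):
    """Encode one character code as 1 to 4 UTF-8 bytes (shortest form)."""
    if code < 0:
        raise ValueError("character code must not be negative")
    if code <= 0x7F:                       # 0xxxxxxx: plain ASCII
        return bytes([code])
    if code <= 0x7FF:                      # 110xxxxx 10xxxxxx (11 bits)
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code <= 0xFFFF:                     # 1110xxxx + 2 trailing bytes (16 bits)
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    if code <= 0x1FFFFF:                   # 11110xxx + 3 trailing bytes (21 bits)
        return bytes([0xF0 | (code >> 18),
                      0x80 | ((code >> 12) & 0x3F),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    raise ValueError("beyond the range covered by the 4-byte form")

print(utf8_encode(0x41).hex())      # 41
print(utf8_encode(0x5716).hex())    # e59c96
print(utf8_encode(0x20100).hex())   # f0a08480
```

For ordinary characters the output agrees with Python's own encoder, for example chr(0x5716).encode("utf-8").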

Notes:

Note 1: ISO is the English abbreviation of the International Organization for Standardization; IEC is the English abbreviation of the International Electrotechnical Commission; JTC1 is the first joint technical committee (Joint Technical Committee One) formed by ISO and IEC, responsible for international standards related to information processing and information technology; JTC1/SC2 is the second subcommittee (Subcommittee Two) under JTC1; JTC1/SC2/WG2 is the second working group (Working Group Two) under JTC1/SC2. In our country, the Bureau of Standards, Metrology and Inspection of the Ministry of Economic Affairs has a body corresponding to ISO/IEC JTC1/SC2, called the National Standards Drafting Committee for Information and Communication, and a body corresponding to ISO/IEC JTC1/SC2/WG2, called the Chinese Information Standards Subcommittee. The author is a member of both bodies.

Note 2: The C0 control code area is the 32 control codes coded 0 to 31 (bit 8 is 0); the C1 control code area is the 32 control codes coded 128 to 159 (bit 8 is 1). For details of the C0 and C1 control codes, see ISO 6429 or CNS 13479.

Note 3: Some people in China translate Unicode as "unified code".

Note 4: ISO 10646-1 is Part 1 of the standard. Its formal title is "ISO/IEC 10646-1: Universal Multiple-Octet Coded Character Set - UCS - Part 1: Architecture and Basic Multilingual Plane (BMP)". It contains only the coding structure of ISO 10646 and the code tables of plane 0.

Note 5: ISO 10646-2 is Part 2 of the standard. Its formal title is "ISO/IEC 10646-2: Universal Multiple-Octet Coded Character Set - UCS - Part 2: CJK Unified Ideographs Supplementary Plane, General Scripts and Symbols Plane, General Purpose Plane". It mainly contains the code tables for the scripts and symbols of plane 1 and for the CJK unified ideographs and compatibility characters of plane 2.

Note 6: Here H denotes a hexadecimal number. Each digit has a value of 0 to 15, written as 0 to 9 and A (10), B (11), C (12), D (13), E (14) and F (15).

Note 7: Our country is an observer in WG2 and a member of the IRG; the author is currently our country's representative to both bodies.
