Converting between UCS-2 and UTF-8 in C

zhaozj, 2021-02-16

This article briefly introduces UCS, Unicode, and UTF-8, and uses C to implement conversion in both directions between UTF-8 and UCS-2.

1. What are UCS and ISO 10646?

International Standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character sets and guarantees round-trip compatibility with them: converting text to UCS and back loses no information. UCS characters U+0000 to U+007F are identical to US-ASCII.

2. What is Unicode?

Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO). The other was the Unicode project, organized by a consortium of (mostly US) manufacturers of multilingual software. Fortunately, the participants of both projects realized that the world does not need two different unified character sets. They merged their results and worked together on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate closely on any future extensions.

3. What is UTF-8 (a transfer and storage format)?

UCS and Unicode assign an integer to each character, but they do not specify how that integer is stored. Several encoding methods therefore exist. Two of them store each character in a fixed two bytes or four bytes and are named UCS-2 and UCS-4 respectively. To convert an ASCII file into a UCS-2 file, insert one 0x00 byte before each byte; to convert it into UCS-4, insert three 0x00 bytes before each byte.

A large share of the information on the Internet is ASCII, and storing it with two bytes per character would waste a lot of space. Using UCS-2 or UCS-4 under UNIX and Linux also causes serious problems, because strings in these encodings contain bytes such as 0x00 that have a special meaning in file names and C library functions. UTF-8 (defined in ISO 10646-1) was introduced for this reason. UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit) lossless encoding of Unicode characters.
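The ASCII-to-UCS-2 rule described above can be sketched in a few lines of C. This is only an illustration: the function name ascii_to_ucs2be is hypothetical, and it assumes big-endian byte order (the 0x00 byte comes first) and an output buffer of at least 2 * len bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the original article): widen an ASCII
 * buffer to big-endian UCS-2 by inserting a 0x00 byte before each byte.
 * The caller must provide an output buffer of at least 2 * len bytes.
 * Returns the number of bytes written.
 */
size_t ascii_to_ucs2be(const unsigned char *ascii, size_t len, unsigned char *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        out[2 * i] = 0x00;          /* high byte is always zero for ASCII */
        out[2 * i + 1] = ascii[i];  /* low byte is the ASCII code itself */
    }
    return 2 * len;
}
```

The two-byte ASCII text "Hi" grows to four bytes this way (00 48 00 69), which is the storage overhead that UTF-8 avoids.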

The correspondence between Unicode (UCS) and UTF-8:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
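To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. The name ucs4_to_utf8 is hypothetical and not part of the original article. The sketch covers the full 31-bit range listed above, although RFC 3629 later restricted UTF-8 to code points up to U+10FFFF (at most four bytes).

```c
/*
 * Hypothetical helper: encode one UCS-4 value (0 .. 0x7FFFFFFF) into UTF-8
 * according to the table above. Returns the number of bytes written to out
 * (1 to 6), or 0 if the value is out of range.
 */
int ucs4_to_utf8(unsigned long c, unsigned char out[6])
{
    int n, i;

    if (c < 0x80UL) {
        out[0] = (unsigned char)c;   /* 0xxxxxxx: plain ASCII */
        return 1;
    } else if (c < 0x800UL) {
        n = 2;
    } else if (c < 0x10000UL) {
        n = 3;
    } else if (c < 0x200000UL) {
        n = 4;
    } else if (c < 0x4000000UL) {
        n = 5;
    } else if (c <= 0x7FFFFFFFUL) {
        n = 6;
    } else {
        return 0;
    }

    /* Fill the continuation bytes (10xxxxxx) from last to first. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }

    /* Lead byte: n one-bits, a zero bit, then the remaining high bits. */
    out[0] = (unsigned char)((0xFF << (8 - n)) | c);
    return n;
}
```

With this sketch, U+00A9 (copyright sign) becomes the two bytes C2 A9, and U+2260 (not equal to) becomes the three bytes E2 89 A0.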

In a multi-byte sequence, the number of leading '1' bits in the first byte equals the number of bytes in the whole sequence.
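This leading-bit rule can be checked with a short helper. The function below is a sketch under that rule only; its name, utf8_sequence_length, is hypothetical, and it treats a byte with exactly one leading '1' bit (a 10xxxxxx continuation byte) or more than six as not being a valid first byte.

```c
/*
 * Hypothetical helper: derive the length of a UTF-8 sequence from its
 * first byte by counting the leading 1 bits.
 * Returns 1 for an ASCII byte, 2..6 for a multi-byte lead byte, and 0 for
 * a byte that cannot start a sequence (10xxxxxx, 0xFE, 0xFF).
 */
int utf8_sequence_length(unsigned char first)
{
    int ones = 0;

    /* Count consecutive 1 bits starting at the most significant bit. */
    while (ones < 8 && (first & (0x80 >> ones))) {
        ones++;
    }

    if (ones == 0) {
        return 1;              /* 0xxxxxxx: single-byte character */
    }
    if (ones == 1 || ones > 6) {
        return 0;              /* continuation byte or invalid lead byte */
    }
    return ones;               /* 110xxxxx -> 2, 1110xxxx -> 3, ... */
}
```

A decoder can call such a helper first to learn how many continuation bytes to expect before it reads them.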

The correspondence between UCS-2 and UTF-8 is listed below, followed by a C implementation of the conversion.

-------------------------------------------------------------------
| UCS2             | UTF-8                                        |
-------------------------------------------------------------------
|                  | Code        | 1st Byte | 2nd Byte | 3rd Byte |
-------------------------------------------------------------------
| 000000000aaaaaaa | 0000 - 007F | 0aaaaaaa |          |          |
-------------------------------------------------------------------
| 00000bbbbbaaaaaa | 0080 - 07FF | 110bbbbb | 10aaaaaa |          |
-------------------------------------------------------------------
| ccccbbbbbbaaaaaa | 0800 - FFFF | 1110cccc | 10bbbbbb | 10aaaaaa |
-------------------------------------------------------------------

Here I implement only the conversion of a single character; converting a string works the same way.

1. Convert a UTF-8 character into a UCS-2 character. The function returns 1 if the conversion succeeds. If the UTF-8 character is not recognized, it returns 0 and stores a black box (U+22E0) in the location that ucs2_code points to.

typedef unsigned short UINT16;
typedef unsigned char UINT8;
typedef unsigned char BOOL;
#define TRUE ((BOOL)1)
#define FALSE ((BOOL)0)

BOOL UTF8toUCS2Code(const UINT8 *utf8_code, UINT16 *ucs2_code)
{
    UINT16 temp1, temp2;
    BOOL is_recognized = FALSE;
    const UINT8 *in = utf8_code;

    /* Reject null pointers. */
    if (!utf8_code || !ucs2_code) {
        return is_recognized;
    }

    if (0x00 == (*in & 0x80)) {
        /* 1-byte UTF-8 character: 0aaaaaaa */
        *ucs2_code = (UINT16)*in;
        is_recognized = TRUE;
    } else if (0xc0 == (*in & 0xe0) && 0x80 == (*(in + 1) & 0xc0)) {
        /* 2-byte UTF-8 character: 110bbbbb 10aaaaaa */
        temp1 = (UINT16)(*in & 0x1f);
        temp1 <<= 6;
        temp1 |= (UINT16)(*(in + 1) & 0x3f);
        *ucs2_code = temp1;
        is_recognized = TRUE;
    } else if (0xe0 == (*in & 0xf0) && 0x80 == (*(in + 1) & 0xc0) && 0x80 == (*(in + 2) & 0xc0)) {
        /* 3-byte UTF-8 character: 1110cccc 10bbbbbb 10aaaaaa */
        temp1 = (UINT16)(*in & 0x0f);
        temp1 <<= 12;
        temp2 = (UINT16)(*(in + 1) & 0x3f);
        temp2 <<= 6;
        temp1 = temp1 | temp2 | (UINT16)(*(in + 2) & 0x3f);
        *ucs2_code = temp1;
        is_recognized = TRUE;
    } else {
        /* Unrecognized byte: store the black box. */
        *ucs2_code = 0x22E0;
        is_recognized = FALSE;
    }

    return is_recognized;
}

2. Convert a UCS-2 character into a UTF-8 character. The function returns the length of the UTF-8 sequence in bytes (1 to 3). If the target pointer is null, it returns 0.

UINT8 UCS2toUTF8Code(UINT16 ucs2_code, UINT8 *utf8_code)
{
    UINT8 length = 0;
    UINT8 *out = utf8_code;

    /* Reject a null target pointer. */
    if (!utf8_code) {
        return length;
    }

    if (0x0080 > ucs2_code) {
        /* 1-byte UTF-8 character. */
        *out = (UINT8)ucs2_code;
        length = 1;
    } else if (0x0800 > ucs2_code) {
        /* 2-byte UTF-8 character. */
        *out = ((UINT8)(ucs2_code >> 6)) | 0xc0;
        *(out + 1) = ((UINT8)(ucs2_code & 0x003F)) | 0x80;
        length = 2;
    } else {
        /* 3-byte UTF-8 character. */
        *out = ((UINT8)(ucs2_code >> 12)) | 0xE0;
        *(out + 1) = ((UINT8)((ucs2_code & 0x0FC0) >> 6)) | 0x80;
        *(out + 2) = ((UINT8)(ucs2_code & 0x003F)) | 0x80;
        length = 3;
    }

    return length;
}

Converting a whole string works the same way, one character at a time.
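As an illustration of the string case, the following sketch loops over a NUL-terminated string and applies the same per-character bit operations. The names decode_one, utf8_str_to_ucs2, and ucs2_str_to_utf8 are hypothetical, and so are the buffer-size conventions and the choice to skip an unrecognized byte one byte at a time. The single-character logic is repeated in compact form so that the block compiles on its own.

```c
#include <stddef.h>

typedef unsigned short UINT16;
typedef unsigned char UINT8;

/* Compact stand-in for the single-character decoder.
 * Returns the number of bytes consumed (1..3), or 0 if unrecognized. */
static size_t decode_one(const UINT8 *in, UINT16 *out)
{
    if ((in[0] & 0x80) == 0x00) {
        *out = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
        *out = (UINT16)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
        return 2;
    }
    if ((in[0] & 0xF0) == 0xE0 && (in[1] & 0xC0) == 0x80 &&
        (in[2] & 0xC0) == 0x80) {
        *out = (UINT16)(((in[0] & 0x0F) << 12) | ((in[1] & 0x3F) << 6) |
                        (in[2] & 0x3F));
        return 3;
    }
    return 0;
}

/* Hypothetical: convert a NUL-terminated UTF-8 string to UCS-2.
 * Writes at most max - 1 characters plus a terminating 0 and returns the
 * number of characters written. Unrecognized bytes become U+22E0. */
size_t utf8_str_to_ucs2(const UINT8 *utf8, UINT16 *ucs2, size_t max)
{
    size_t n = 0;

    if (!utf8 || !ucs2 || max == 0) {
        return 0;
    }
    while (*utf8 != 0 && n + 1 < max) {
        size_t used = decode_one(utf8, &ucs2[n]);
        if (used == 0) {
            ucs2[n] = 0x22E0;   /* black box for an unrecognized byte */
            used = 1;           /* assumption: skip one byte and go on */
        }
        utf8 += used;
        n++;
    }
    ucs2[n] = 0;
    return n;
}

/* Hypothetical: convert a 0-terminated UCS-2 string to UTF-8.
 * Writes at most max - 1 bytes plus a terminating NUL and returns the
 * number of bytes written. */
size_t ucs2_str_to_utf8(const UINT16 *ucs2, UINT8 *utf8, size_t max)
{
    size_t n = 0;

    if (!ucs2 || !utf8 || max == 0) {
        return 0;
    }
    for (; *ucs2 != 0; ucs2++) {
        UINT16 c = *ucs2;
        size_t need = (c < 0x0080) ? 1 : (c < 0x0800) ? 2 : 3;

        if (n + need >= max) {
            break;              /* keep room for the terminator */
        }
        if (need == 1) {
            utf8[n++] = (UINT8)c;
        } else if (need == 2) {
            utf8[n++] = (UINT8)(0xC0 | (c >> 6));
            utf8[n++] = (UINT8)(0x80 | (c & 0x3F));
        } else {
            utf8[n++] = (UINT8)(0xE0 | (c >> 12));
            utf8[n++] = (UINT8)(0x80 | ((c >> 6) & 0x3F));
            utf8[n++] = (UINT8)(0x80 | (c & 0x3F));
        }
    }
    utf8[n] = 0;
    return n;
}
```

Because a UCS-2 character needs at most three UTF-8 bytes, an output buffer of three bytes per input character plus one for the terminator is always large enough for ucs2_str_to_utf8.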

When reprinting, please cite the original address: https://www.9cbs.com/read-21914.html
