Code conversion technology

xiaoxiao2021-03-06  16

First, basic concepts

· GB code

The full name is GB2312-80, "Chinese Character Coded Character Set for Information Interchange, Basic Set". Published in 1980, it is the national standard for Chinese information processing, and it is the only Chinese encoding whose use is mandatory in mainland China and in overseas regions that use Simplified Chinese. P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 uses GBK as its basic Chinese character encoding, but remains compatible with GB2312. The GB code contains 6763 Simplified Chinese characters and 682 symbols. The Chinese characters are divided into 3755 level-1 characters, sorted by pinyin, and 3008 level-2 characters, sorted by radical. The formulation and application of this standard played a major role in advancing Chinese information processing. In 1990 the traditional-character encoding standard GB12345-90, "Chinese Character Coded Character Set for Information Interchange, First Auxiliary Set", was published. It is aimed at situations where traditional characters must be used, and at work on ancient texts. This standard contains 6866 Chinese characters (103 more than GB2312, which does not include those characters), of which about 2200 are purely traditional. (The GB2312 and GB12345 sets do not intersect: one is Simplified, the other Traditional.)

· BIG5

BIG5 is currently the encoding standard for Traditional Chinese characters in Taiwan and Hong Kong. It includes 440 symbols, 5401 level-1 Chinese characters, and 7652 level-2 Chinese characters, 13060 Chinese characters in total. BIG-5 is a double-byte encoding scheme: the first byte lies between hexadecimal A0 and FE, and the second byte lies between 40 and 7E or between A1 and FE. The highest bit of the first byte is therefore always 1, while the highest bit of the second byte may be either 1 or 0.
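The byte ranges above are enough to walk a BIG5 byte string character by character. The following is a minimal C++ sketch under that assumption; the function names `isBig5Lead`, `isBig5Trail`, and `countBig5Chars` are hypothetical helpers made up for illustration, not part of any library.

```cpp
#include <cstddef>
#include <string>

// True if b can be the first byte of a BIG5 character (A0..FE, per the text).
bool isBig5Lead(unsigned char b) {
    return b >= 0xA0 && b <= 0xFE;
}

// True if b can be the second byte of a BIG5 character (40..7E or A1..FE).
bool isBig5Trail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

// Counts characters in a BIG5 byte string. A valid first/second byte pair
// counts as one character; any other byte counts as one single-byte character.
std::size_t countBig5Chars(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b1 = static_cast<unsigned char>(s[i]);
        if (isBig5Lead(b1) && i + 1 < s.size() &&
            isBig5Trail(static_cast<unsigned char>(s[i + 1]))) {
            i += 2;  // double-byte character
        } else {
            i += 1;  // ASCII or an unpaired byte
        }
        ++count;
    }
    return count;
}
```

Because the second byte may fall in the ASCII range (40 to 7E), a scanner has to consume the pair as a unit after seeing a first byte; testing each byte on its own would misread the second half of a character as a letter or symbol.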

· Chinese Internal Code Specification

GBK encoding (commonly known as the "big character set") is a new national standard for extended Chinese encoding drawn up in mainland China and equivalent to UCS. The GBK working group was formed in October 1995 and completed the GBK specification in December of the same year. The standard is compatible with GB2312 and includes 21003 Chinese characters and 883 symbols, provides 1894 code points for user-defined characters, and holds Simplified and Traditional characters in a single set. The surface encoding of the font library in the Simplified Chinese version of Windows 95/98 is GBK, which is linked to the underlying font library through a one-to-one code table between GBK and UCS. The first byte lies between hexadecimal 81 and FE, and the second byte between 40 and FE, excluding the xx7F column.

· Universal Multiple-Octet Coded Character Set

The International Organization for Standardization set up the ISO/IEC JTC1/SC2/WG2 working group in April 1984 to develop a unified encoding for characters and symbols. In 1991 a group of US multinational companies founded the Unicode Consortium, which reached an agreement with WG2 in October 1991 to use the same code set. At present, Unicode is a 16-bit encoding system whose character set is the same as the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed DIS (Draft International Standard) in June 1992. The current version, V2.0, was announced in 1996 and includes 6811 symbols, 20902 Chinese characters, 11172 Hangul syllables, 6400 user-defined code points, and 20249 reserved code points, 65534 in total.

Second, some annotations

Here are some notes on the Chinese internal code conversion tools we commonly see:

1. The most common are the GB2BIG5 and BIG52GB conversion tools. Here GB refers to the GB2312 set.

2. GBK Simplified is compatible with the GB2312 character set and its encoding. Loosely speaking, GB can be understood as GBK Simplified.

3. "Traditional" is not the same thing as BIG5. GBK also contains traditional characters, and the GB12345 set is traditional as well, but the three encode Chinese characters differently. The Simplified edition of Windows 95/98/NT/2000 uses the GBK character set, and the Traditional edition uses the BIG5 character set. The Simplified edition has no BIG5 character set and cannot display BIG5 characters; the Traditional edition cannot display GB characters.

4. When IE opens a BIG5-encoded website (for example, a Taiwanese site) and BIG5 character set support is installed, IE converts the BIG5 page into GBK Traditional for display, with no garbled text. While IE is displaying the page in GBK, Chinese characters typed into the page should be GBK Traditional. When the page is displayed as raw BIG5 (garbled), type BIG5-coded characters instead. (How do you type garbled text? First type GBK Simplified, that is, GB code, then use a small tool to convert it to BIG5, copy it, and paste it in.)

5. Among the common small tools, some convert BIG5 into GBK, but few convert between GBK Simplified and GBK Traditional. The reason is that they rely on the correspondence between the GB2312 and BIG5 character sets.

Third, the internal code conversion principle and method

Principle: find the correspondence between the different character sets.

With GBK2BIG5 (you can

For example, the character 让 is encoded in GBK as C8C3. If we convert every character in the GBK code table into BIG5 format, then position C8C3 holds the BIG5 code of 让, which shows up as "琵" (that character is not meant as a GBK character; it is simply how the BIG5 bytes for 让 are displayed in a GBK environment). To convert a text, we read each character, find its position in the GBK code table (already converted into BIG5 format), take the character stored at that position, and replace the original character with it. Reading and writing are not a problem. The key points are how to locate a Chinese character in the code table file, and how to convert a pure GBK code table into a GBK code table expressed in BIG5 format.

Problem 1. Locating Chinese characters.

GBK code table (arranged in code order)

81-87 88-8F 90-97 98-9F A0-A7 A8-AF B0-B7 B8-BF

C0-C7 C8-CF D0-D7 D8-DF E0-E7 E8-EF F0-F7 F8-FE

81 0 1 2 3 4 5 6 7 8 9 A B C D E F
4  丂 丄 丅 丆 丏 丒 丗 丟 丠 両 丣 並 丩 丮 丯 丱
5  丳 丵 丷 丼 乀 乁 乂 乄 乆 乊 乑 乕 乗 乚 乛 乢
6  乣 乤 乥 乧 乨 乪 乫 乬 乭 乮 乯 乲 乴 乵 乶 乷
7  乸 乹 乺 乻 乼 乽 乿 亀 亁 亂 亃 亄 亅 亇 亊
8  亐 亖 亗 亙 亜 亝 亞 亣 亪 亯 亰 亱 亴 亶 亷 亸
9  亹 亼 亽 亾 仈 仌 仏 仐 仒 仚 仛 仜 仠 仢 仦 仧
A  仩 仭 仮 仯 仱 仴 仸 仹 仺 仼 仾 伀 伂 伃 伄 伅
B  伆 伇 伈 伋 伌 伒 伓 伔 伕 伖 伜 伝 伡 伣 伨 伩
C  伬 伭 伮 伱 伳 伵 伷 伹 伻 伾 伿 佀 佁 佂 佄 佅
D  佇 佈 佉 佊 佋 佌 佒 佔 佖 佡 佢 佦 佨 佪 佫 佭
E  佮 佱 佲 併 佷 佸 佹 佺 佽 侀 侁 侂 侅 來 侇 侊
F  侌 侎 侐 侒 侓 侕 侖 侘 侙 侚 侜 侞 侟 価 侢

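The table above arranges the GBK code table in code order: 126 rows in total (one per first byte, 81 to FE), with 190 Chinese characters per row (second bytes 40 to FE, without 7F). The position of a Chinese character is calculated as follows:

posit = (ch1 - 129) * 190 + (ch2 - 64) - (ch2 / 128); (index of the character in the table)

posit = posit * 2; (byte position)

The integer division ch2 / 128 yields 1 for every second byte above 7F, which compensates for the skipped xx7F column.

That settles the first problem.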
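The position calculation above can be sketched as a small C++ function. This is only an illustration under the assumption that the table file stores two bytes per GBK code point in code order; the names `isGbkPair`, `gbkIndex`, and `gbkByteOffset` are hypothetical.

```cpp
// True if (ch1, ch2) is a GBK double-byte pair as described in the text:
// first byte 81..FE, second byte 40..FE without 7F.
bool isGbkPair(unsigned char ch1, unsigned char ch2) {
    return ch1 >= 0x81 && ch1 <= 0xFE &&
           ch2 >= 0x40 && ch2 <= 0xFE && ch2 != 0x7F;
}

// Zero-based index of the character in a table laid out in code order
// (126 rows of 190 characters), or -1 if the pair is not valid.
long gbkIndex(unsigned char ch1, unsigned char ch2) {
    if (!isGbkPair(ch1, ch2)) {
        return -1;
    }
    // ch2 / 128 is 1 for second bytes above 7F, skipping the xx7F column.
    return (ch1 - 129) * 190L + (ch2 - 64) - (ch2 / 128);
}

// Byte offset of the character in the table, assuming 2 bytes per entry.
long gbkByteOffset(unsigned char ch1, unsigned char ch2) {
    long index = gbkIndex(ch1, ch2);
    return index < 0 ? -1 : index * 2;
}
```

For instance, for the first byte C8 and second byte C3 the index works out to (200 - 129) * 190 + (195 - 64) - 1 = 13620, so the entry starts at byte 27240. The last valid pair, FE FE, gives index 23939, which agrees with 126 * 190 = 23940 entries in total.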

Problem 2. Converting the GBK code table into BIG5 format.

We can use an existing tool, such as Oriental Express 3000, to convert the GBK code table into BIG5 format. In practice there is a problem: GBK has more Chinese characters than BIG5, so characters that exist in GBK but not in BIG5 may be dropped during conversion, and the converted code table can then no longer be indexed by position. However, I found a text file of the GBK code table expressed in BIG5 (possibly an official one) in which no characters are missing.

So this problem is solved as well.

In the same way we can also implement the

BIG52GBKT (Traditional), BIG52GBKS (Simplified), GBKS2GBKT, GBKT2GBKS, and GBK2BIG5 conversions. Here is the BIG5 code table format and a positioning algorithm:

BIG-5 code table (already converted into GBK)

A0-A7 A8-AF B0-B7 B8-BF C0-C7 C8-CF

D0-D7 D8-DF E0-E7 E8-EF F0-F7 F8-FE

[Sample page for first byte B0: columns 0-F, rows 4-7 and A-F. The character cells are not recoverable in this copy.]

Positioning method:

if ((ch2 >= 64) && (ch2 <= 126))
{
    posit = (ch1 - 160) * 157 + (ch2 - 64);
    posit = posit * 2 - 1;
}
else if ((ch2 >= 161) && (ch2 <= 254))
{
    posit = (ch1 - 160) * 157 + 62 + (ch2 - 160);
    posit = posit * 2 - 1;
}
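The positioning method above can be sketched as a C++ function that returns the zero-based index of a BIG5 character within the code table. The name `big5Index` is hypothetical. The sketch stops at the index on purpose: the final `posit * 2 - 1` byte adjustment in the text depends on how the author's table file is laid out, which is not shown here.

```cpp
// Zero-based index of a BIG5 character in a table laid out in code order.
// Each first byte (A0..FE) owns one row of 157 characters:
// 63 for second bytes 40..7E followed by 94 for second bytes A1..FE.
// Returns -1 if the pair is outside those ranges.
long big5Index(unsigned char ch1, unsigned char ch2) {
    if (ch1 < 160 || ch1 > 254) {
        return -1;
    }
    long rowStart = (ch1 - 160) * 157L;
    if (ch2 >= 64 && ch2 <= 126) {
        return rowStart + (ch2 - 64);        // columns 0..62
    }
    if (ch2 >= 161 && ch2 <= 254) {
        return rowStart + 62 + (ch2 - 160);  // columns 63..156
    }
    return -1;
}
```

The constant 62 in the second branch comes from the 63 characters of the 40 to 7E range: second byte A1 must land on column 63, and 62 + (161 - 160) = 63.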

A GBK2BIG5 program for C++ Builder is given here:

// fGBK2BIG5 is assumed to be a FILE * declared elsewhere
fGBK2BIG5 = fopen("Puregbk2big5byOrder.txt", "rb");
unsigned long i, posit; // convert GB code to GBKT
unsigned char ch1, ch2;
String sContext;
char chr;

sContext = Memo1->Lines->Text;
i = 1; // C++ Builder strings are indexed from 1
while (i < sContext.Length())
{
    ch1 = sContext[i];
    ch2 = sContext[i + 1];
    if ((ch1 >= 129) && (ch1 <= 254)) // valid GBK first byte
    {
        if (((ch2 >= 64) && (ch2 < 127)) || ((ch2 > 127) && (ch2 <= 254)))
        {
            // position of the character in the code table file
            posit = (ch1 - 129) * 190 + (ch2 - 64) - (ch2 / 128);
            posit = posit * 2;
            if ((posit > 23940 * 2) || (posit < 0))
            {
                i++;
                continue;
            }
            // seek relative to the current position, then read two bytes
            fseek(fGBK2BIG5, posit - ftell(fGBK2BIG5), SEEK_CUR);
            fread((void *)(&chr), sizeof(char), 1, fGBK2BIG5);
            sContext[i] = chr;
            fread((void *)(&chr), sizeof(char), 1, fGBK2BIG5);
            sContext[i + 1] = chr;
            i += 2;
        }
        else
        {
            i++;
        }
    }
    else
    {
        i++;
    }
}
Memo1->Lines->Text = sContext;
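For readers without C++ Builder, the same loop can be sketched in standard C++ with the code table held in memory instead of being read with fseek/fread. This is a variant written for illustration, not the author's program: the function name `convertWithTable` is hypothetical, and `table` is assumed to hold two bytes per GBK code point in code order (126 rows of 190 entries), for example the contents of the code table file loaded into a vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replaces every GBK double-byte character in text with the two bytes stored
// at the same position in table. Other bytes are copied through unchanged.
std::string convertWithTable(const std::string& text,
                             const std::vector<unsigned char>& table) {
    std::string out = text;
    std::size_t i = 0;
    while (i + 1 < out.size()) {
        unsigned char ch1 = static_cast<unsigned char>(out[i]);
        unsigned char ch2 = static_cast<unsigned char>(out[i + 1]);
        bool lead = ch1 >= 0x81 && ch1 <= 0xFE;
        bool trail = ch2 >= 0x40 && ch2 <= 0xFE && ch2 != 0x7F;
        if (!(lead && trail)) {
            ++i;  // single-byte character or invalid pair: skip one byte
            continue;
        }
        // Same position formula as in the text, times 2 bytes per entry.
        std::size_t offset =
            2 * ((ch1 - 0x81) * 190u + (ch2 - 0x40) - (ch2 / 128));
        if (offset + 1 < table.size()) {
            out[i] = static_cast<char>(table[offset]);
            out[i + 1] = static_cast<char>(table[offset + 1]);
            i += 2;
        } else {
            ++i;  // entry lies outside the table: skip one byte, as above
        }
    }
    return out;
}
```

Keeping the table in memory costs under 50 KB (23940 entries of 2 bytes) and avoids one seek and two reads per character. The other conversions mentioned earlier would need their own table and, for BIG5 input, the BIG5 positioning formula instead of the GBK one.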

Much of the information above comes from the reference material of Network Lighthouse (http://www.haiyan.com/steelk/navigator/ref/gbindex1.htm).

Attached: 5 code tables and the source program.

When reposting, please cite the original address: https://www.9cbs.com/read-46843.html
