Chinese character coding standard and identification

xiaoxiao2021-03-06 37

※ Source: BBS Shuimu Tsinghua Station (smth.org) [From: 162.105.10.32]

Code page

This section is based on the following articles; readers are encouraged to study what these experts have written.

Reference 1: <<...>> Zhang Zhoucai (张轴材), <<Computer World>> weekly, 97-1-17

Reference 2: <<张轴材 ...>> reporters Huang Weimin and Xiao Chunjiang, 99-8-30

Reference 3: <<Chinese platform puts "Root" Retain>> Wu Jian, <<China Computer News>>, published 1998-12-21, issue 348 overall, issue 51 of the year

Reference 4: <<for all of the UNIX Chinese platform>> Sun Yufang, <<China Computer User>>, published 1998-07-06, issue 323 overall, issue 26 of the year

Reference 5: cjk.inf: ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf

Because I am only an amateur and not an expert, there are many terms in the reference material that I do not understand, and I have never had the official text of the standards, so errors and vagueness are inevitable. At the same time, because the relevant state departments have not done enough to publicize, promote, and implement the national standards, individuals like me and small businesses in this field, lacking resources, are at a competitive disadvantage.

When ASCII was designed, multilingual support was not considered, least of all support for ideographic scripts such as Chinese characters. Many solutions have been proposed since. The code page system (ISO2022) is the one generally implemented, but ISO10646 / GB13000 / Unicode is the direction of future development.

China's Chinese character encoding standard GB2312 is a 7-bit standard, specifically a double 7-bit byte standard. ASCII is a single 7-bit byte standard, so how does the computer tell them apart? One way is to set the eighth (high) bit to 1, which tells the computer to switch to double-byte decoding. This is the most common implementation and is called EUC (Extended Unix Code). The other way is to use special markers to tell the computer to switch to double-byte decoding. HZ encoding, for example, opens a double-byte region with ~{ and closes it with ~}. Both are implementations of GB2312.

For ideographic scripts such as Chinese, code pages are based on national, regional, or industry standards and are encoded according to EUC. Code pages are backward compatible with ASCII and are variable-width encodings. This adds complexity to program code and is also the cause of the garbled text that appears when code pages are switched.
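The two GB2312 transports described above can be compared with Python's built-in codecs. This is only a small sketch: the codec names `gb2312` and `hz` are Python's names for EUC-encoded GB2312 and HZ, and the sample string is an arbitrary choice.

```python
# Encode the same text with the two GB2312 transports.
text = "A中"

euc = text.encode("gb2312")  # EUC: both bytes of a Chinese character have the high bit set
hz = text.encode("hz")       # HZ: 7-bit bytes wrapped between ~{ and ~}

print(" ".join(f"{b:02x}" for b in euc))
print(hz)

# The HZ payload is the EUC byte pair with the eighth bit cleared.
euc_pair = "中".encode("gb2312")
seven_bit = bytes(b & 0x7F for b in euc_pair)
```

The ASCII letter is a single byte in both forms; only the Chinese character changes representation.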

Unicode is a fixed-width multi-byte encoding. ISO10646 / GB13000 / Unicode are currently identical at the UCS2 level, that is, what has been implemented so far is a double-byte encoding standard. Wherever ISO10646 / GB13000 / Unicode is discussed below, only this UCS2 form is meant. Unicode handles ASCII by putting a zero byte in front of each code. For example, the ASCII code of "A" is 0x41, and its Unicode code is 0x00, 0x41.
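The zero-byte prefix can be checked directly. The sketch below uses Python's `utf-16-be` codec as a stand-in for big-endian UCS2, which is an assumption that holds for the characters shown here (all in the basic plane).

```python
# "A" keeps its ASCII value but gains a leading zero byte in UCS2.
ascii_a = "A".encode("ascii")
ucs2_a = "A".encode("utf-16-be")

# A Chinese character is also exactly two bytes.
ucs2_zhong = "中".encode("utf-16-be")

print(ascii_a.hex(), ucs2_a.hex(), ucs2_zhong.hex())
```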

Here Unicode is approached mainly from the side of the national standard (GB) series. If I had not read Reference 5 (in English), I would still not know which national standards for Chinese character encoding exist. It is a rather helpless feeling for a Chinese person to have to learn about Chinese character encoding standards from English material.

Common Chinese coding standards (source: cjk.inf)

GB2312-1980 (GB0) (Simplified)        GB7589-1987 (GB2) (Simplified)     GB7590-1987 (GB4) (Simplified)     GB13000-1993
GB6345.1-1986 (GB0 correction)
GB8565.2-1988 (GB8, GB0 expansion)
GB/T12345-90 (GB1) (Traditional)      GB/T13131-9X (GB3) (Traditional)   GB/T13132-9X (GB5) (Traditional)

Reading across gives the different character set series; reading down gives the successive standards within each series. GB2312 is the basic set and the most common standard. GB7589 / GB7590 are extension sets; they may not be able to coexist with GB2312 in use, so switching is needed. GB7589 / GB7590 are arranged by radical and stroke count, but which characters they contain, how exactly they are arranged, and in which fields they are used is unclear to me. After two rounds of correction and expansion, the GB2312 series already differs in part from the GB2312-1980 standard (see Reference 5). Because I have no standard text, I do not know which standard the fonts now in use belong to. According to the latest Unicode 3.0, the latest national standard is GB16500-95, and I do not know which series it belongs to.

ISO / IEC 10646 is equivalent to the national standards GB13000-1993 / JIS X 0221-1995 / KS C 5700-1995. Its goal is to include the scripts of every language, and Chinese characters make up the largest part of it (Unicode 2.0 has 20902 Chinese characters). The characteristics of the standard are described in Reference 1, and the ups and downs of its creation in Reference 2. In short, this is an international standard in which China participated and played a leading role.

GBK is an intermediate product of the transition from GB2312 to GB13000. It is a large extension of GB2312: its encoding is backward compatible with the EUC encoding of GB2312, and its character repertoire is the same as that of GB13000, about three times the size of GB2312. GBK therefore also contains the characters of BIG5, Shift-JIS, and KSC. Note that only the repertoire is included; the code values differ from those of the original standards. In practice, a GBK font can display GB2312, BIG5, Shift-JIS, and KSC strings, but every string except a GB2312 string has to be converted first.

Because the sources are vague, it is unclear who led the development of GBK. Some English material says that Microsoft developed GBK, while the domestic material does not say. All I can gather from the references is this: in 1994, with ISO / IEC 10646 released, Microsoft was developing the Chinese version of Windows 95 and needed an extended Chinese encoding. In 1996 the "Chinese Character Internal Code Extension Specification" GBK was published (see References 1 to 3). Judging from the release, the specification was probably completed late in the previous year, that is, in 1995. The Chinese version of Windows 95 and all later versions support GBK.

The EUC coding range of GB2312 is 0xA1 ~ 0xFE for the first byte (in practice only up to 0xF7) and 0xA1 ~ 0xFE for the second byte. GBK extends this. The first byte is 0x81 ~ 0xFE, and the second byte is split into two parts, 0x40 ~ 0x7E and 0x80 ~ 0xFE. In the area shared with GB2312, the characters are identical. The extension part was probably taken from GB13000 in radical and stroke order and placed into GBK. GBK is therefore not GB13000: the repertoire is the same, but the encoding systems differ. One belongs to the variable-width ISO2022 family, the other is fixed-width, and the code areas are different too. Note that GBK is not actually a national standard. Before it there is the GB2312 basic set, and after it the more advanced GB13000.
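The byte ranges quoted above translate directly into a range check. This is a minimal sketch; the function name `classify_pair` and its return labels are made up for illustration, and it tests only the code space, not whether a character is actually assigned at that position.

```python
def classify_pair(b1: int, b2: int) -> str:
    """Classify a two-byte code using the ranges quoted in the text."""
    # GB2312 in EUC form: first byte 0xA1-0xF7, second byte 0xA1-0xFE.
    if 0xA1 <= b1 <= 0xF7 and 0xA1 <= b2 <= 0xFE:
        return "gb2312"
    # GBK: first byte 0x81-0xFE, second byte 0x40-0x7E or 0x80-0xFE.
    if 0x81 <= b1 <= 0xFE and (0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE):
        return "gbk-extension"
    return "invalid"


print(classify_pair(0xD6, 0xD0))
print(classify_pair(0x81, 0x40))
print(classify_pair(0x81, 0x7F))
```

Every pair labeled `gb2312` is also a valid GBK code, which is what backward compatibility means here.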

GBK is only a transitional extension specification. That is why Unicode provides GB2312->Unicode and GB12345->Unicode conversion tables but no GBK->Unicode conversion table. Only Microsoft's Code Page 936 (cp936.txt) can count as a GBK->Unicode conversion table. Keep in mind, however, that this document was produced by a commercial company rather than by a national or international standards organization, so it may not agree with the standard in every detail. I recently found some useful standard files on the Founder website; interested readers can download them. Note, though, that GBK-BIG5.TAB and GB-BIG5.TAB are a little odd.

http://www.founderpku.com/fontweb/download/gbk-big5.tab

http://www.founderpku.com/fontweb/download/gb-big5.tab

http://www.founderpku.com/fontweb/gb2312.htm

http://www.founderpku.com/fontweb/gbk.htm

If these conversion tables are used to build conversion tables between other standards, the result will differ from the traditional conversion tables. For example, a GBK <=> BIG5 table built from GBK <=> Unicode <=> BIG5 will differ from the traditional GB <=> BIG5 table. The main reason is that Chinese characters have simplified and traditional forms. The former maps GBK (Traditional) <=> BIG5 (Traditional), the latter maps GB (Simplified) <=> BIG5 (Traditional). Some table-drawing symbols differ as well. Readers interested in conversion between Chinese character sets can see

http://www.basistech.com/articles/c2c.html

http://www.cjk.org
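The difference between a Unicode pivot and a simplified-to-traditional table can be seen with two characters. The sketch below relies on Python's built-in `gbk` and `big5` codecs rather than on the tables listed above, and the helper name `gbk_to_big5` is hypothetical.

```python
from typing import Optional


def gbk_to_big5(data: bytes) -> Optional[bytes]:
    """Convert GBK bytes to BIG5 through Unicode; None if a character has no BIG5 code."""
    try:
        return data.decode("gbk").encode("big5")
    except UnicodeEncodeError:
        return None


traditional = "漢".encode("gbk")  # traditional form, present in both GBK and BIG5
simplified = "汉".encode("gbk")   # simplified form, absent from BIG5

print(gbk_to_big5(traditional))
print(gbk_to_big5(simplified))
```

A pivot through Unicode maps each character to the identical character only, so the simplified form has no target. A traditional GB <=> BIG5 table would instead map it to the traditional form.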

// ****************************************

The relationship between internal codes and fonts

Although I have no standard text, it is still possible to find out which characters the common standards contain. TLC4.0 ships PCF fonts for the GB2312, GB12345, BIG5, and GBK standards, which can be viewed with the xfd utility. At Http://www.debian.org/chine there is a 16-dot Unicode PCF font. If FreeType is installed, the XMBDFed software can be used to view TTF fonts. With MS Word it may be even simpler.

In daily use, what we are actually familiar with is the internal code. In a Chinese environment we enter two 8-bit bytes and get a Chinese character, so we come to think that this byte pair corresponds to the glyph. This is wrong. To the font, the internal code is just an index for looking up the glyph. If another font is substituted, the same string will show different glyphs, that is, garbled text.

I have seen TTF fonts for GB2312, BIG5, and ISO10646 / GB13000. For operating systems and applications, the favorite is naturally the ISO10646 / GB13000 TTF font, because then one set of code and one set of fonts is enough, and changing an external configuration file makes the system usable in a different language environment. This is internationalization and localization. One trick is that an ISO10646 / GB13000 TTF font can be used as a font for the other standards by going through conversion tables such as GBK->Unicode and BIG5->Unicode.

Upgrading a system to support Unicode 3.0 is both easy and hard. The easy part is that only the conversion tables need to be modified (such as /Windows/nls*.*). The hard part is upgrading the fonts. Developing a font is very difficult; the steps are described on the Founder website. Win9x uses the TTF fonts of the Beijing Zhongyi company, since MS cannot develop a Chinese font on its own. Of the ISO10646 / GB13000 TTF fonts I have seen, the latest is the 1999 edition of the Founder font, at Unicode 2.1. To see all the glyphs of Unicode 3.0, we have to wait for the professional font developers. Anyone who wants to see them now can only ask Zhang Zhoucai, because every new standard comes with 48x48 high-precision glyphs of all its Chinese characters.

Using TTF fonts has always been a tempting topic, but I do not know much about it yet and can only discuss generating BDF / PCF fonts from TTF fonts.

There are very few PCF fonts at present, only four typefaces: Song, Fangsong, Hei, and Kai. One way to get more is to use the FreeType library: generate a BDF font with the TTFTOBDF program, then generate a PCF font from it with the BDFTOPCF program. Fonts generated this way look comparatively rough and are hard to control. This may be because information is lost in the TTF->BDF conversion, and the aspect ratio also differs from the standard. Machine-generated fonts are mechanical and cannot match hand-drawn ones. At the same time, because TTF technology is mature, there is no need to keep developing more PCF fonts. X Window will accept and use TTF fonts on a large scale, and PCF fonts will mainly serve for standard typefaces (such as Song) at small sizes and for fast download over the Internet.

Only after using Unicode and TTF fonts in X Window does one realize that they are both a capability and a burden, because whatever the format of the font file, it is finally converted to a fixed bitmap in memory. At 16x16 dots, one Chinese character takes 32 bytes. Unicode 3.0 has 27786 Chinese characters, which needs at least 868KB of memory. With both Chinese and English installed, and a large number of Chinese fonts, the memory requirement is easy to imagine. With TTF, extra computation and storage are needed on top of that. So even though X Window provides a font cache and deferglyphs, they do not help much.

The Chinese characters we actually use are very few. According to statistics, the 165 most common characters have a cumulative frequency above 50% and the first 1000 above 95%. According to primary school teaching experience, knowing 900 characters is enough to read books and newspapers and to write. According to the primary school syllabus, a primary school graduate knows 2,500 characters. The cumulative frequency of the level-1 characters of GB2312 is already above 99%. I estimate that I can read about 4000 ~ 5000 characters, which makes me look illiterate next to the Chinese characters in Unicode :-). So whether to use GB2312 or GB13000 is a dilemma, and we pay for whichever we choose.
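The memory estimate for bitmap fonts is simple arithmetic and can be reproduced as follows. The character counts are the ones quoted in this article (27786 for Unicode 3.0, 6,763 for GB2312); the one-bit-per-dot layout is an assumption about how the bitmap is stored.

```python
# One bit per dot, 16 x 16 dots per glyph.
BYTES_PER_GLYPH = 16 * 16 // 8

HANZI_UNICODE_3_0 = 27786
HANZI_GB2312 = 6763

unicode_bytes = BYTES_PER_GLYPH * HANZI_UNICODE_3_0
gb2312_bytes = BYTES_PER_GLYPH * HANZI_GB2312

print(BYTES_PER_GLYPH)
print(unicode_bytes, round(unicode_bytes / 1024, 1))
print(gb2312_bytes, round(gb2312_bytes / 1024, 1))
```

Under these assumptions a single 16x16 Unicode font takes roughly four times the memory of a GB2312 font of the same size.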

Finally, the role of UTF8 can be discussed in terms of the relationship between internal codes and fonts. UTF8 is a transition scheme that takes existing ASCII systems to Unicode. It keeps ASCII compatibility while extending toward large character sets, and it is a solution recommended by Unicode. Because it was designed to solve the problem of ASCII systems, it is not a good solution for existing Chinese systems. CJK character encoding standards currently use two bytes per character. The range of Chinese characters in UCS2 is U+4E00 ~ U+9FFF. Under the UTF8 coding rules, each of these characters takes three bytes, so one third of the encoded size is overhead compared with two bytes. UTF8 is also incompatible with existing CJK systems. To use UTF8, a CJK system first has to convert to UCS2 and then to UTF8. The steps after that are even more trouble, because from the point of view of the font, the character code is only the index of a glyph in the font. UTF8 is a variable-length code and cannot be used as an index directly; it has to be converted to UCS2 before the font can be used.
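The size difference and the extra decoding step can be illustrated in a few lines. This sketch uses Python's `utf-16-be` codec as a stand-in for UCS2, and the hand-written decoder `utf8_to_code_point` is a hypothetical helper that handles only the three-byte form used for this range.

```python
def utf8_to_code_point(seq: bytes) -> int:
    """Decode one three-byte UTF8 sequence (1110xxxx 10xxxxxx 10xxxxxx) by hand."""
    b1, b2, b3 = seq
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


ch = "中"
sizes = {enc: len(ch.encode(enc)) for enc in ("gb2312", "utf-16-be", "utf-8")}
print(sizes)

# The font index (the UCS2 value) has to be recovered from the UTF8 bytes first.
index = utf8_to_code_point(ch.encode("utf-8"))
print(hex(index))
```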

With the development of GUIs, fonts are gradually moving to TTF. The coding standards used for TTF fonts are the GB2312 / GB2312 EUC standard, the BIG5 standard, and the ISO10646 standard. I have not seen a UTF8 TTF font, and I do not know which CJK systems use UTF8 encoding.

One characteristic of Unicode is that it can serve as the core code. At the user interface the original coding standard can still be used, while UCS2 is used inside the system for processing. The system can use user environment variables, flags, or modules to identify the coding standard the user needs and then convert. In this way the system only has to provide one set of ISO 10646 TTF fonts and can support Chinese, Japanese, and Korean without modifying its internal code. The Chinese version of Windows95 and its successors adopt this solution. Some X Window TTF servers, X-TT and XFSFT, can now use this solution too. The former is already used in the TurboLinux Chinese version, and I tried the latter with good results.

Another interesting case is the 12-dot PCF font of Hongqi Linux 1.1, /usr/x11r6/lib/x11/fonts/misc/gb12st.pcf.gz. It is not a strictly GB2312-encoded font. Viewed with the xfd utility, it looks as if it was converted from a Unicode-encoded TTF font, and it contains some GBK characters, which is a pity. It would be good if they could provide a PCF font in the GBK coding standard.

Moving a CJK system to UCS2 and moving an ASCII system to UTF8 require a comparable amount of code modification. The former just needs more conversion tables, which take memory. An ASCII system that adopts UCS2, however, doubles its storage, so 50% of the space holds nothing but zero bytes. Since most of the information in computers today is still ASCII, this looks like a problem as well.

// *********************************************

Sources and production of internal code conversion tables

For historical and geographical reasons, many Chinese coding standards coexist in computers and on the Internet. That is the reality, and it is why internal code conversion exists. There are many programs for this now, but most of them are MS Windows versions and have many problems, so it is necessary to build a complete internal code conversion table.

Sources

Since the Unicode / ISO10646 / GB13000 standards appeared, this work has become simple but tedious. The guideline for building a conversion table is: take international and national standards as the basis, and use the conversion tables of commercial companies, individuals, and small software packages as references. The sources of information are:

1) International and national standards organizations

The international standards organization Unicode

http://www.unicode.org

GB <=> Unicode conversion table:

ftp://ftp.unicode.org/public/mappings/eastasia/gb

BIG5 <=> Unicode conversion table:

ftp://ftp.unicode.org/public/mappings/eastasia/other

JIS <=> Unicode conversion table:

ftp://ftp.unicode.org/public/mappings/eastasia/jis

KSC <=> Unicode conversion table:

ftp://ftp.unicode.org/public/mappings/eastasia/ksc

Because GBK is not a national standard, Unicode does not provide a GBK <=> Unicode conversion table, only a copy of Microsoft's code pages:

ftp://ftp.unicode.org/public/mappings/vendors/micsft/Windows/cp{936,950}.txt
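Mapping files of this kind are plain text and easy to load. The sketch below assumes the usual layout of those files (source code, Unicode code, then a `#` comment, separated by whitespace); the sample lines are a hypothetical excerpt written for this example, not copied from the real file, and `parse_mapping` is a made-up helper name.

```python
from typing import Dict, Iterable


def parse_mapping(lines: Iterable[str]) -> Dict[int, int]:
    """Parse 'source-code  unicode-code  #comment' lines into a dict."""
    table: Dict[int, int] = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or an entry with no Unicode value
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical excerpt in the assumed layout.
sample = [
    "# Name: sample code page to Unicode table",
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0x81\t      \t#DBCS LEAD BYTE",
    "0xD6D0\t0x4E2D\t#CJK UNIFIED IDEOGRAPH",
]

table = parse_mapping(sample)
print({hex(k): hex(v) for k, v in table.items()})
```

Inverting the dictionary gives the Unicode-to-code-page direction, provided the table is one-to-one.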

The entry fee for the China National Standards website is too high, 8,000 yuan per person, so I did not obtain the official GB2312-1980 and GB13000-1993 standards.

2) Commercial companies

2.1 Founder Group font department

http://www.founderpku.com/fontweb/

Because Founder combines production, teaching, and research, and has worked for many years in typesetting and fonts, it has a very special status. The conversion tables it provides are almost equal to national standards.

GB2312 standard:

http://www.founderpku.com/fontweb/gb2312.htm

GBK standard:

http://www.founderpku.com/fontweb/gbk.htm

GB <=> BIG5 conversion table:

http://www.founderpku.com/fontweb/download/gb-big5.tab

GBK <=> BIG5 conversion table:

http://www.founderpku.com/fontweb/download/gbk-big5.tab

2.2 Microsoft

http://www.microsoft.com/

Nobody can ignore Microsoft. Sometimes, even when they are wrong, they end up being right. In some English material, GBK is attributed to Microsoft. Approaching the matter from a business perspective, Microsoft provides code pages:

GBK glyph table:

http://www.microsoft.com/TyPography/unicode/936gif.zip

GBK <=> Unicode conversion table:

http://www.microsoft.com/TyPography/unicode/936.txt

BIG5 glyph table:

http://www.microsoft.com/TyPography/unicode/950gif.zip

BIG5 <=> Unicode conversion table:

http://www.microsoft.com/typography/unicode/950.txt

The Chinese versions of Windows97 / 98 also contain relevant files:

GBK standard: /Windows/gbk.txt

Code pages: /windows/system/cp{932,936,949,950}.nls

3) Individuals and shareware

Many individuals and small groups have also explored this area.

3.1 TEXTPRO

Http://person.zj.cninfo.net/~buddha

Because of its authors' special needs, TextPro is unique in its BIG5 => GBK / GB conversion. It also has a GBK (Traditional) => GB (Simplified) conversion table, which is quite distinctive. Because Traditional => Simplified is a many-to-one mapping, it is difficult to build a Simplified => Traditional conversion table. A character-to-character conversion in particular is impossible. Some people are currently working on word-level mapping based on dictionaries and context. Interested readers can see

Http://www.basistech.com/articles/c2c.html
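Why the reverse direction is hard shows up as soon as a many-to-one table is inverted. The sketch below uses a tiny hand-picked table of well-known pairs; it is an illustration only and is not taken from TextPro's data.

```python
from collections import defaultdict

# Traditional => Simplified: several traditional characters share one simplified form.
trad_to_simp = {
    "發": "发",  # to emit, to send out
    "髮": "发",  # hair
    "歷": "历",  # history
    "曆": "历",  # calendar
}

# Inverting the table yields a list of candidates rather than a single answer.
simp_to_trad = defaultdict(list)
for trad, simp in trad_to_simp.items():
    simp_to_trad[simp].append(trad)

for simp, candidates in simp_to_trad.items():
    print(simp, "->", candidates)
```

Choosing among the candidates needs the surrounding word, which is why the dictionary and context based approaches mentioned above work at the word level.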

3.2 Stone Chi

http://stonec.yeah.net

Provides internal code conversion tables based on Richwin. The author has collected a lot of material and has a deep understanding of the internal code standards. There is also a Chinese search program there that is worth trying.

3.3 NJStar

http://www.njstar.com

and Magicwin

http://www.magicwin.com.my

They have been in this field for some time, but their conversion tables are not very complete.

Production

The table is built according to the guideline and the ordering above. Where a higher level leaves a blank, it is filled from the next level down; where there is a conflict, the higher level prevails.

1) Build the GB <=> BIG5 conversion table from Unicode's GB <=> Unicode and BIG5 <=> Unicode conversion tables.

2) Build the GBK <=> BIG5 conversion table from Microsoft's GBK <=> Unicode and BIG5 <=> Unicode conversion tables.

At this point the standard conversion is actually complete. The characteristic of Unicode is one code per character and one character per code. The Chinese characters of the various countries and regions have been unified under Unicode, and characters that share the same Unicode code are called CJK unified ideographs. Some Chinese characters, however, could not be unified for various reasons. For these characters, a conversion table can only be made practical by allowing many-to-one mappings.

3) Use Founder's GBK <=> BIG5 conversion table to fill the GB <=> BIG5 conversion table from (1).

4) Use Microsoft's GBK <=> BIG5 conversion table to fill the GBK <=> BIG5 conversion table from (3).

5) Use the TextPro and StoneC GBK <=> BIG5 conversion tables to fill the GBK <=> BIG5 conversion table from (4).

6) NJStar's conversion table is not very complete, but in its BIG5 => GBK conversion table the second half of section C6 and sections C7 and C8 are quite complete. The tables above are either blank here or contain very few mappings. Perhaps this area is an extension symbol area that is optional. To be safe, use NJStar's table to fill this area.

7) Check. A computer check of the code tables found that the conflicts are basically caused by different interpretations of the table-drawing symbols.

8) Visual verification, that is, checking character by character with the naked eye. This is the most important step, but because my knowledge and energy are limited, it has not been done.
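The first steps of this procedure, chaining two tables through Unicode and then filling blanks from lower-priority sources, can be sketched as below. The function names and the small integer codes are hypothetical placeholders; real tables would be loaded from the mapping files listed earlier.

```python
from typing import Dict, List


def compose(a_to_u: Dict[int, int], u_to_b: Dict[int, int]) -> Dict[int, int]:
    """Chain A->Unicode and Unicode->B into A->B, skipping codes with no target."""
    return {a: u_to_b[u] for a, u in a_to_u.items() if u in u_to_b}


def merge_by_priority(tables: List[Dict[int, int]]) -> Dict[int, int]:
    """Merge tables ordered from most to least authoritative."""
    merged: Dict[int, int] = {}
    for table in tables:
        for src, dst in table.items():
            # Fill blanks only: on a conflict the earlier (higher) level wins.
            merged.setdefault(src, dst)
    return merged


# Placeholder codes, for illustration only.
gb_to_unicode = {0x01: 0x100, 0x02: 0x200}
unicode_to_big5 = {0x100: 0x11}
standard = compose(gb_to_unicode, unicode_to_big5)  # step 1: built from the standard tables
vendor = {0x01: 0x99, 0x02: 0x22}                   # step 3: a lower-priority source

final = merge_by_priority([standard, vendor])
print(final)
```

Here the vendor table fills the blank for `0x02`, but its conflicting entry for `0x01` is ignored because the standard-derived table has priority.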

// ***************************************

Chinese character encoding identification

For historical and regional reasons, Chinese characters have many coding standards. The most common are GB2312 and BIG5. Until Unicode is fully accepted, they will coexist for a long time, so in practice it is necessary to tell them apart. This is encoding identification.

Much of the software now available on the Windows platform can identify and display GB2312 and BIG5 strings quite accurately. But because of commercial interests, these algorithms are not disclosed. So far I have seen only two algorithms:

1) Algorithm 1

http://www.mandarintools.com

2) Algorithm 2

Http://202.38.128.58/~yumj/www/chrecog.html

The principles are explained on the inventors' home pages. Both algorithms were derived statistically from a large number of articles, whereas in practical use a single line has to be identified, so their effectiveness on short sentences and phrases needs to be verified. A convenient way to do this is to measure the identification rate on commonly used phrases, since sentences mostly consist of such meaningful phrases. Because the two sides differ not only in encoding but also in usage habits, 1.3MB of GB phrases and 900KB of BIG5 phrases were collected separately. The comparison turned up some interesting things.

1) Algorithm 1 uses a lot of memory and is slower, but its identification rate is higher and it is stable, with an error of 8.6%. Algorithm 2 is exactly the opposite, with an error of 17.6%. Combining the two improves identification somewhat.

Identification errors of the two algorithms

             Algorithm 1   Algorithm 2   Combined
GB file      5%            2.6%          0.7%
BIG5 file    3.6%          15%           5%

2) The average value of 184 mentioned in Algorithm 2 does exist. But the best algorithm is not the author's second-byte algorithm; it is one that adds the first byte and the second byte. All three algorithms give a normal distribution on the GB phrases. The first-byte algorithm peaks at 195 with a steep slope, which shows that the averages are concentrated. The second-byte algorithm peaks at 207 with a gentle slope, which shows that the averages are dispersed. The double-byte sum algorithm lies between the two.

Analysis of the BIG5 phrases:

The first-byte algorithm peaks at 174, but the slope is much gentler.

The second-byte algorithm peaks at 160 and is gentler still, close to a rectangular distribution; that is, the second bytes of commonly used BIG5 phrases are spread evenly over the coding range.

The double-byte sum algorithm again lies between the two.

So a better algorithm is:

FLAG = (A * C1 + C2) / (A + 1)

where C1 and C2 are the first and second byte, and A = 5 ~ 7 works best.

With the threshold at 184, 5% of the GB phrases have an average below 184 and 15% of the BIG5 phrases have an average above 184, and the combined error is 17.6%. In other words, Algorithm 2 rarely mistakes a GB string for a BIG5 string. If the BIG5 text were a GB file converted to BIG5 code, the error should be somewhat lower; the 15% seems to come from the difference in vocabulary as well as from the encoding.
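The weighted formula can be tried out with a few lines of code. This is a rough sketch under stated assumptions, not either author's implementation: C1 and C2 are taken to be the first and second byte of each double-byte character, A defaults to 6, and the threshold of 184 is borrowed from the discussion of Algorithm 2. The function names are made up.

```python
def flag(data: bytes, a: float = 6.0) -> float:
    """Average of (a*C1 + C2) / (a + 1) over all double-byte characters in data."""
    values = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and i + 1 < len(data):
            c1, c2 = data[i], data[i + 1]
            values.append((a * c1 + c2) / (a + 1))
            i += 2  # consume both bytes of the character
        else:
            i += 1  # skip single-byte (ASCII) text
    if not values:
        raise ValueError("no double-byte characters found")
    return sum(values) / len(values)


def guess(data: bytes, threshold: float = 184.0) -> str:
    """Guess the encoding; the threshold is an assumption carried over from Algorithm 2."""
    return "GB" if flag(data) > threshold else "BIG5"


# Synthetic byte pairs placed at the distribution peaks quoted above.
gb_like = bytes([195, 207])
big5_like = bytes([174, 160])
print(flag(gb_like), guess(gb_like))
print(flag(big5_like), guess(big5_like))
```

Weighting the first byte more heavily follows from the observation that its distribution is the steeper and therefore the more discriminating one.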

3) Improving the identification rate of Algorithm 1

The GB2312 standard has 6,763 Chinese characters, and BIG5 has more. Algorithm 1 assigns weights to only about 600 characters, which seems too few. The weights run regularly from 1 to 600, which does not seem to reflect the law of character frequency. For GB2312, the usual 2:8 rule would give 1200 characters; according to the primary school syllabus, a primary school graduate knows 2,500 characters; according to primary school teaching experience, 900 characters are enough to read and write. So the weighted range should be around 900 to 1000 characters. But which characters to include and how much weight to give each is a question for our language experts.

4) A possible new algorithm

The two sides of the strait differ not only in the characters they use but also in their common words. An analysis based on common words should therefore show a larger difference and give a higher recognition rate. Unfortunately I have no data, so for now this is only a hope, not an algorithm. I also hope that more people will contribute better algorithms in the spirit of the GPL and the bazaar.
