Chinese character encoding standards and identification ※ Source: BBS Shuimu Tsinghua Station Smth.org [From: 162.105.10.32]
Code page
This section is based on the following articles. Readers are encouraged to study these experts' writings carefully.
Reference 1: << … >>, Zhang Zhoucai (张轴材)
<<Computer World>> weekly, 97-1-17
Reference 2: << 张轴材 … 历程 >>
Reporters Huang Weimin and Xiao Chunjiang, 99-8-30
Reference 3: <<Keeping the "Root" of the Chinese Platform>>, Wu Jian, <<China Computer News>>
Published 1998-12-21, cumulative issue 348, issue 51 of the year
Reference 4: <<A Chinese Platform for All of UNIX>>, Sun Yufang, <<China Computer User>>
Published 1998-07-06, cumulative issue 323, issue 26 of the year
Reference 5: cjk.inf:
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
I am only an amateur, not an expert. There are quite a few terms in the reference
materials that I do not fully understand, and I have never had the official text of the
standards, so errors and vagueness are inevitable. At the same time, the relevant state
departments have not done enough to publicize, promote, and implement the national
standards, so individuals like me and small businesses in this field lack information
and are at a competitive disadvantage.
When ASCII was developed, multilingual support was not considered, least of all support
for ideographic scripts such as Chinese characters. Many solutions were proposed to fix
this. The code page system (ISO2022) is the one generally implemented, but
ISO10646 / GB13000 / Unicode is the direction of future development.
China's Chinese character encoding standard GB2312 is a 7-bit standard, specifically a
double 7-bit-byte standard. ASCII is a single 7-bit-byte standard, so how does the computer
tell them apart? One way is to set the eighth bit to "1" to tell the computer to switch to
double-byte coding. This is the most common implementation, also called EUC
(Extended Unix Code). The other way is to use special markers to tell the computer to
switch to double-byte coding. HZ coding, for example, uses a start marker "~{" and an end
marker "~}" to delimit a double-byte region. Both are implementations of GB2312.
For ideographic scripts such as Chinese characters, code pages are based on national,
regional, or industry standards and are encoded according to EUC. A code page is backward
compatible with ASCII and is an unequal-length (variable-width) encoding. This makes the
code more complex and also leads to garbled text caused by code page switching.
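As a minimal sketch of the two GB2312 implementations described above, the following uses Python's built-in "gb2312" (EUC) and "hz" codecs. The sample string and variable names are only illustrative and are not from the original post.

```python
# Two ways of carrying the same GB2312 double 7-bit codes.
text = "中文"

# EUC: every byte of a double-byte character has the eighth bit set.
euc = text.encode("gb2312")

# HZ: the same 7-bit codes, wrapped in "~{" and "~}" markers.
hz = text.encode("hz")

# Clearing the eighth bit of the EUC bytes gives the raw 7-bit GB2312 codes.
seven_bit = bytes(b & 0x7F for b in euc)

print(euc.hex())   # EUC bytes, all >= 0x80
print(hz)          # 7-bit bytes between the HZ markers
print(seven_bit)   # identical to the payload between the markers
```

The point of the comparison is that the character codes themselves are the same in both cases. Only the way the switch to double-byte mode is signalled differs.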
Unicode is an equal-length multi-byte encoding. ISO10646 / GB13000 / Unicode as currently
implemented agree on UCS2, that is, a double-byte coding standard. The discussion of
ISO10646 / GB13000 / Unicode below refers only to this UCS2 case. For ASCII, Unicode
adopts the policy of prefixing a "0" byte. For example, the ASCII code of "A" is 0x41,
and its Unicode code is 0x00, 0x41.
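The zero-byte prefix can be checked with a short sketch. Python has no codec named UCS2, so "utf-16-be" is used here as a stand-in, which is an assumption that holds for characters in the two-byte range discussed in this post.

```python
# ASCII "A" is one byte; in UCS2 (big-endian) it gains a leading zero byte.
ascii_bytes = "A".encode("ascii")
ucs2_bytes = "A".encode("utf-16-be")  # stand-in for UCS2

print(ascii_bytes.hex())  # 41
print(ucs2_bytes.hex())   # 0041

# A Chinese character also takes exactly two bytes in UCS2.
han_bytes = "中".encode("utf-16-be")
print(len(han_bytes))
```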
Here Unicode is mainly approached from the viewpoint of the national standard (GB) series.
Had I not read reference 5 (in English), I would still not know which national standards
exist for Chinese character encoding. For a Chinese person to have to learn the Chinese
character coding standards from English material is a very helpless thing.
Common Chinese coding standards (source: cjk.inf)
GB2312-1980 (GB0) (Simplified)      GB7589-1987 (GB2) (Simplified)      GB7590-1987 (GB4) (Simplified)      GB13000-1993
GB6345.1-1986 (GB0 correction)
GB8565.2-1988 (GB8, GB0 expansion)
GB/T12345-90 (GB1) (Traditional)    GB/T13131-9X (GB3) (Traditional)    GB/T13132-9X (GB5) (Traditional)
In the table, the horizontal direction lists the character set series and the vertical
direction lists the successive standards within each series. GB2312 is the basic set and
the most commonly used standard. GB7589 / GB7590 are extension sets. They probably cannot
coexist with GB2312 in use and require switching. GB7589 / GB7590 are arranged by radical
and stroke count, but which characters they contain, how they are arranged, and in which
areas is unclear to me. After two rounds of correction and expansion, the GB2312 series
has changed parts of the GB2312-1980 standard (see reference 5). Because I have no
standard text, I do not know which standard the fonts now in use belong to. According to
the latest Unicode 3.0, the latest national standard is GB16500-95, and I do not know
which series it belongs to. ISO / IEC 10646 is equivalent to the national standards
GB13000-1993 / JIS0221-1995 / KSC5700-1995. Its goal is to cover the scripts of every
language, of which Chinese characters are the most numerous (Unicode 2.0 has 20902
Chinese characters). The characteristics of the standard are described in reference 1,
and the ups and downs of its development in reference 2. In short, this is an
international standard that China took part in and helped lead.
GBK is an intermediate product in the transition from GB2312 to GB13000. It is a large
extension of GB2312: the encoding is backward compatible with the EUC encoding of GB2312,
while the repertoire (character set) is the same as GB13000, about 3 times that of GB2312.
GBK therefore also contains the characters of BIG5, Shift-JIS, and KSC. Note that only
the repertoire is included. The encoding differs from those original standards. In
practice, a GBK font can display GB2312, BIG5, Shift-JIS, and KSC strings, but all of
them except GB2312 strings must be converted first.
Because the accounts are vague, it is unclear who led the development of GBK. Some
English material says that Microsoft developed GBK, and the state has not clarified the
matter. From the reference materials at hand I only know that ISO / IEC 10646 was
released in 1994, that Microsoft, developing the Chinese version of Windows 95, needed a
Chinese extended encoding, and that the "Chinese Character Extended Code Specification"
(GBK) was released in 1996 (see references 1 to 3). Given how standards are released,
the actual work was probably done a year earlier, in 1995.
Windows 95 and later Chinese versions support GBK.
The EUC coding range of GB2312 is first byte 0xA1 ~ 0xFE (in practice only up to 0xF7),
second byte 0xA1 ~ 0xFE. GBK extends this. The first byte is 0x81 ~ 0xFE, and the second
byte falls in two parts, 0x40 ~ 0x7E and 0x80 ~ 0xFE. In the area shared with GB2312, the
characters are identical. The extension part was probably taken from GB13000 in radical
and stroke order and placed into GBK. Therefore GBK is not GB13000: although the
repertoire is the same, the encoding systems differ. One belongs to the ISO2022 family of
unequal-length encodings, the other is an equal-length encoding, and the code areas also
differ. Note that GBK is not actually a national standard.
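The byte ranges above can be turned into a small classifier. This is only a sketch under the ranges quoted in this post. The function name `classify` and the three labels are hypothetical, and the GB2312 first byte is limited to 0xF7 as noted above.

```python
def classify(b1: int, b2: int) -> str:
    """Classify a byte pair using the ranges quoted in the post."""
    # GB2312 EUC area: first byte 0xA1-0xF7, second byte 0xA1-0xFE.
    if 0xA1 <= b1 <= 0xF7 and 0xA1 <= b2 <= 0xFE:
        return "gb2312"
    # GBK: first byte 0x81-0xFE, second byte 0x40-0x7E or 0x80-0xFE.
    if 0x81 <= b1 <= 0xFE and (0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE):
        return "gbk-extension"
    return "invalid"


# A simplified character from the GB2312 basic set.
print(classify(*"中".encode("gb2312")))
# A traditional character that exists in GBK but not in GB2312.
print(classify(*"漢".encode("gbk")))
```

A pair labelled "gb2312" here only falls inside the GB2312 code area. Whether a character is actually assigned at that position still has to be checked against a code table.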
Before it there is the GB2312 basic set, and after it the more advanced GB13000.
GBK is just a transitional extension specification. So the Unicode site has
GB2312->Unicode and GB12345->Unicode conversion tables but no GBK->Unicode conversion
table. Only Microsoft's Code Page 936 (cp936.txt) can count as a GBK->Unicode
conversion table. But note that this is a document made by a commercial company rather
than by a national or international standards organization, so it may be inconsistent
with the standards. Recently I found some useful standard files on the Founder website.
Download them if you are interested. But note that GBK-BIG5.TAB and GB-BIG5.TAB are a
bit odd.
http://www.founderpku.com/fontweb/download/gbk-big5.tab
http://www.founderpku.com/fontweb/download/gb-big5.tab
http://www.founderpku.com/fontweb/gb2312.htm
http://www.founderpku.com/fontweb/gbk.htm
Conversion tables between other standards that are built from these tables will differ
from traditional conversion tables. If you build a GBK <=> BIG5 table by way of
GBK <=> Unicode <=> BIG5, it will differ from a traditional GB <=> BIG5 table, mainly
because Chinese characters have simplified and traditional forms. The former maps
GBK (traditional) <=> BIG5 (traditional). The latter maps GB (simplified) <=> BIG5 (traditional).
There are also differences for some table-drawing symbols. Readers interested in
simplified/traditional conversion can see
http://www.basistech.com/articles/c2c.html
http://www.cjk.org
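The difference between the two kinds of table can be seen with a short sketch that pivots through Unicode using Python's built-in "gbk", "gb2312", and "big5" codecs. The helper name `convert` is hypothetical, and Python's codecs stand in for the conversion tables discussed above.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two encodings by pivoting through Unicode."""
    return data.decode(src).encode(dst)


# Traditional characters exist in both GBK and BIG5, so the pivot works.
traditional = "漢字".encode("gbk")
big5 = convert(traditional, "gbk", "big5")
print(big5.decode("big5"))

# Simplified characters have no BIG5 code, so a pure pivot fails.
# A traditional GB <=> BIG5 table would instead map simplified to traditional.
simplified = "汉字".encode("gb2312")
try:
    convert(simplified, "gb2312", "big5")
    pivot_failed = False
except UnicodeEncodeError:
    pivot_failed = True
print(pivot_failed)
```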
// ****************************************
The relationship between internal code and font
Although I have no standard text, it is still possible to see which characters the common
standards contain. The fonts in TLC4.0 include PCF fonts for the GB2312, GB12345, BIG5,
and GBK standards, which can be viewed with the xfd utility.
http://www.debian.org/chine has a 16-dot Unicode PCF font.
If FreeType is installed, TTF fonts can be viewed with the xmbdfed software.
Using MS Word may be simpler.
In daily use, what we are actually familiar with is the character code (internal code).
On a Chinese system, when we enter two bytes with the eighth bit set we get a Chinese
character, and we come to think that this double byte corresponds to the shape of that
character. This is wrong. To the font, the internal code is only an index used to look up
the glyph. If another font is substituted, the same string will show different glyphs,
that is, garbled text. I have seen TTF fonts in GB2312, BIG5, and ISO10646 / GB13000
encodings. For operating systems and applications, the natural favorite is the
ISO10646 / GB13000 TTF font, because then one set of code and one set of fonts is enough,
and modifying external configuration files makes it usable in different language
environments. This is internationalization and localization. The trick is that an
ISO10646 / GB13000 TTF font can be used as a font of another standard through conversion
tables such as GBK->Unicode and BIG5->Unicode.
Upgrading a system to support Unicode 3.0 is both easy and hard. The easy part is simply
modifying the conversion tables (such as /Windows/nls*.*). The hard part is upgrading the
fonts. Developing fonts is very difficult. See the font development steps on the Founder
website. Win9x uses the TTF fonts of Beijing Zhongyi Company, since MS could not develop
a font set on its own. Of the ISO10646 / GB13000 TTF fonts I have seen, the latest is the
1999 edition of the Founder font library, for Unicode 2.1. To see all the glyphs of
Unicode 3.0, we have to wait for these professional font developers to produce them. If
you want to see them now, you can only ask Zhang Zhoucai, because every new standard
comes with 48x48 high-precision glyphs of all its Chinese characters.
Using TTF fonts has always been a tempting topic. But I do not know much yet, so I can
only talk about generating BDF / PCF fonts from TTF fonts.
At present there are very few PCF fonts, only four typefaces: Song, Fangsong, Hei, and
Kai. To have more fonts, one way is to use the FreeType library: generate a BDF font with
the TTFTOBDF program, then generate a PCF font with the bdftopcf program. But fonts
generated this way are rather rough at small sizes and hard to control. This may be
because information is lost in the TTF->BDF conversion and the aspect ratio differs from
the standard. What a machine generates is mechanical and cannot match hand-drawn fonts.
At the same time, because TTF technology is mature, there is no need to keep developing
more PCF fonts. X Window will accept and use TTF fonts in large numbers, and PCF fonts
will in future mainly be used for standard typefaces (such as Song), small bitmaps, and
fast download over the Internet.
Only after using Unicode and TTF fonts in X Window does one realize that Unicode and TTF
are both a capability and a burden. Whatever the format of the font file, it is finally
converted to a fixed dot matrix in memory. At 16x16 dots, one Chinese character takes
32 bytes. Unicode 3.0 has 27786 Chinese characters, which needs at least 868KB of memory.
If you want both Chinese and English and also install a large number of Chinese fonts,
the memory needed can be imagined. If you use TTF, additional memory is needed for
computation and storage. So even though X Window provides a font cache and deferglyphs,
it does not help much. And the Chinese characters we actually use are very few. According
to frequency statistics, the 165 most common Chinese characters account for > 50% of
usage and the first 1000 for > 95%. According to primary school teaching experience,
knowing 900 characters is basically enough to read books and newspapers and to write.
According to the primary school syllabus, a pupil knows 2,500 characters on graduation.
The level-1 characters of GB2312 already cover > 99% by frequency. I think I know about
4000 ~ 5000 characters myself, and compared with the Chinese characters in Unicode I seem
like an illiterate :-). So whether to use GB2312 or GB13000 is a dilemma, and we pay for
whichever we choose.
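The memory figure above follows from simple arithmetic, sketched below. The function name `bitmap_font_bytes` is hypothetical, and the calculation assumes one bit per dot with no per-glyph overhead.

```python
def bitmap_font_bytes(glyph_count: int, width: int = 16, height: int = 16) -> int:
    """Memory for a fixed dot-matrix font at one bit per dot."""
    bytes_per_glyph = width * height // 8
    return glyph_count * bytes_per_glyph


per_glyph = bitmap_font_bytes(1)          # 16x16 dots -> 32 bytes
unicode30 = bitmap_font_bytes(27786)      # all Chinese characters in Unicode 3.0
gb2312 = bitmap_font_bytes(6763)          # GB2312 Chinese characters, for comparison

print(per_glyph)
print(unicode30 // 1024, "KB")
print(gb2312 // 1024, "KB")
```

The GB2312 line is added only for comparison and uses the count of 6,763 Chinese characters given later in this post.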
Finally, the role of UTF8 is discussed in light of the relationship between internal code
and font. UTF8 is a transition path from existing ASCII systems to Unicode. UTF8 keeps
ASCII compatibility while extending toward large character sets. This is a solution
recommended by Unicode. Because it was designed to solve that problem, it is not a good
solution for existing Chinese systems.
CJK character encoding standards currently use two bytes per character. The range of
Chinese characters in UCS2 is U+4E00 ~ U+9FFF. Under the UTF8 coding rules, a character
takes three bytes instead of two, so one byte in three is added overhead.
It is also incompatible with existing CJK systems. To use UTF8, a CJK system must first
convert to UCS2 and then to UTF8, which is simply one more step. From the point of view
of the font, the character code is only an index into the font. UTF8 is a variable-length
code and cannot be used as an index directly. It has to be converted to UCS2 before the
font can be used.
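A short sketch of the size difference and of the extra decoding step: the helper `utf8_to_code_point` is hypothetical and handles only the three-byte form used for this range, and "utf-16-be" again stands in for UCS2.

```python
def utf8_to_code_point(seq: bytes) -> int:
    """Decode one three-byte UTF8 sequence (1110xxxx 10xxxxxx 10xxxxxx)."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


ch = "中"  # U+4E2D, inside U+4E00 ~ U+9FFF
utf8 = ch.encode("utf-8")
ucs2 = ch.encode("utf-16-be")  # stand-in for UCS2

print(len(ucs2), len(utf8))            # two bytes versus three bytes
print(hex(utf8_to_code_point(utf8)))   # the index a Unicode font needs
print("A".encode("utf-8"))             # ASCII is unchanged in UTF8
```

The UCS2 bytes already are the font index, while the UTF8 bytes have to pass through the bit manipulation in `utf8_to_code_point` first.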
With the development of GUIs, fonts are gradually turning to TTF. The coding standards
used for TTF fonts are GB2312 / GB2312 EUC, BIG5, and ISO10646. I have not seen a UTF8
TTF font, and I do not know which CJK systems use UTF8 encoding.
One feature of Unicode is that it can serve as the core code. The user-facing side can
continue to use the original coding standard while the system uses UCS2 internally for
its operations. The system can use a user environment variable, a flag, or a module to
identify the coding standard the user needs and then convert. In this way, the system
only needs to provide one set of ISO 10646 TTF fonts and can support Chinese, Japanese,
and Korean without modifying the internal code. The Chinese version of Windows 95 and
later versions adopt this solution. Some X Window TTF servers, X-TT and XFSFT, can now
also use this solution.
The former has been implemented in the TurboLinux Chinese version. I tried the latter,
and the result is not bad. Another interesting phenomenon is the 12-dot PCF font in
Hongqi Linux 1.1,
/usr/x11r6/lib/x11/fonts/misc/gb12st.pcf.gz. This is not a strictly GB2312-encoded font.
Viewed with the xfd utility, it looks as if it was converted from a Unicode-encoded TTF
font and contains some GBK characters, which is a pity. It would be good if they could
provide some PCF fonts in the GBK coding standard.
A CJK system moving to UCS2 and an ASCII system moving to UTF8 need about the same amount
of code modification. The former just needs more conversion tables, which take memory.
However, an ASCII system that uses UCS2 needs twice the space.
At present most of the information in computers is still ASCII, so this is also a problem.
// *********************************************
Source and production of internal code conversion tables
Due to historical and geographical reasons, many Chinese character standards coexist on
computers and on the Internet. This is reality. So internal code conversion is needed.
There are many programs for this now, but most are for MS Windows and have many problems,
so it is necessary to make a complete internal code conversion table.
Sources
Since the Unicode / ISO10646 / GB13000 standards appeared, this work has become simple
and tedious. There is one guiding principle when making a conversion table: rely on
international and national standards first, and use the conversion tables of commercial
companies, individuals, and small software as reference. The sources of information are:
1) International and national standards organizations
The international standards organization Unicode
(http://www.unicode.org)
GB <=> Unicode conversion table:
ftp://ftp.unicode.org/public/mappings/eastasia/gb
BIG5 <=> Unicode conversion table:
ftp://ftp.unicode.org/public/mappings/eastasia/other
JIS <=> Unicode conversion table:
ftp://ftp.unicode.org/public/mappings/eastasia/jis
KSC <=> Unicode conversion table:
ftp://ftp.unicode.org/public/mappings/eastasia/ksc
Because GBK is not a national standard, Unicode does not provide a GBK <=> Unicode
conversion table, only a copy of Microsoft's code pages:
ftp://ftp.unicode.org/public/mappings/vendors/micsft/Windows/cp{936,950}.txt
The entry fee of the China National Standards website is too high, 8,000 yuan per
individual, so I did not obtain the official GB2312-1980 and GB13000-1993 standards.
2) Commercial companies
2.1 Founder Group Font Department
http://www.founderpku.com/fontweb/
Because Founder combines production, teaching, and research and has worked for many years
in the typesetting and font fields, it holds a very special position. The conversion
tables it provides are almost equal to national standards. GB2312 standard:
http://www.founderpku.com/fontweb/gb2312.htm
GBK standard:
http://www.founderpku.com/fontweb/gbk.htm
GB <=> BIG5 conversion table:
http://www.founderpku.com/fontweb/download/gb-big5.tab
GBK <=> BIG5 conversion table:
http://www.founderpku.com/fontweb/download/gbk-big5.tab
2.2 Microsoft
http://www.microsoft.com/
No one can ignore Microsoft. Sometimes, even when they are wrong, they end up being
right. In some English material, GBK is attributed to Microsoft. From a business
perspective, Microsoft provides code pages:
GBK glyph table:
http://www.microsoft.com/typography/unicode/936gif.zip
GBK <=> Unicode conversion table:
http://www.microsoft.com/typography/unicode/936.txt
BIG5 glyph table:
http://www.microsoft.com/typography/unicode/950gif.zip
BIG5 <=> Unicode conversion table:
http://www.microsoft.com/typography/unicode/950.txt
There is also information in the Windows97 / 98 Chinese version:
GBK standard: /Windows/gbk.txt
Code pages: /windows/system/cp{932,936,949,950}.nls
3) Individuals and shareware
Many individuals and small groups have also explored this area.
3.1 TextPro
http://person.zj.cninfo.net/~buddha
Because of their special needs, TextPro is indeed unique in BIG5 => GBK / GB conversion.
It also has a GBK (traditional) => GB (simplified) conversion table, which is very
distinctive. Because traditional => simplified is a many-to-one mapping, a
simplified => traditional conversion table is hard to build. A conversion based purely on
character-to-character mapping is impossible. Currently some people map word to word,
based on dictionaries and context. If interested, see
http://www.basistech.com/articles/c2c.html
3.2 Stone Chi
http://stonec.yeah.net
Provides internal code conversion tables based on RichWin. He has collected a lot of
material and has a deep understanding of internal code standards. There is also a Chinese
search program that is worth trying.
3.3 NJStar
http://www.njstar.com and
MagicWin
http://www.magicwin.com.my
They have been in this field for some time. However, their conversion tables are not
very complete.
Production
The table is made according to the guideline and the ordering above. Where a higher level
leaves a blank, the next level fills it. Where there is a conflict, the higher level
prevails.
1) From Unicode's GB <=> Unicode and BIG5 <=> Unicode conversion tables, make a
GB <=> BIG5 conversion table.
2) From Microsoft's GBK <=> Unicode and BIG5 <=> Unicode conversion tables, make a
GBK <=> BIG5 conversion table.
At this point, the standard conversion is actually complete. The characteristic of
Unicode is one code per character and one character per code. The Chinese characters of
the various countries and regions have been unified in Unicode, and characters that share
the same Unicode code are called CJK unified ideographs. But some Chinese characters
could not be unified, for various reasons. For these characters, a conversion table that
is to be practical has to allow many-to-one mappings.
3) Use Founder's GBK <=> BIG5 conversion table to fill the GB <=> BIG5 table from (1).
4) Use the GBK <=> BIG5 table made from Microsoft's tables in (2) to fill the table from (3).
5) Use the TextPro and StoneC GBK <=> BIG5 conversion tables to fill the GBK <=> BIG5
table from (4).
6) NJStar's conversion table is not very complete, but in its BIG5 => GBK table the
second half of section C6 and sections C7 and C8 are quite complete. The tables above are
either blank here or have very few conversions. Perhaps this area is an extended symbol
area that is optional. To be safe, use NJStar to fill this area.
7) Checking. Checking the code tables by computer shows that, within the Chinese
character encoding, conflicts are basically caused by different understandings of the
table-drawing symbols.
8) Visual verification, that is, checking character by character with the naked eye.
This is the most important step, but because my knowledge and energy are limited, it has
not been done.
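The layered procedure above can be sketched as follows. Step 1 pivots through Unicode, with Python's built-in "gb2312" and "big5" codecs standing in for the Unicode conversion tables, and the fill steps merge lower-priority tables without overriding higher ones. The function names `pivot_table` and `merge_levels` and the toy fallback entry are hypothetical.

```python
def pivot_table(chars: str, src: str = "gb2312", dst: str = "big5") -> dict:
    """Build a src => dst table for the characters both encodings share."""
    table = {}
    for ch in chars:
        try:
            table[ch.encode(src)] = ch.encode(dst)
        except UnicodeEncodeError:
            pass  # no code on one side: leave a blank for a lower level to fill
    return table


def merge_levels(levels: list) -> dict:
    """Merge tables in priority order: earlier levels win, later ones fill blanks."""
    merged = {}
    for table in levels:
        for src_code, dst_code in table.items():
            merged.setdefault(src_code, dst_code)
    return merged


# Level 1: standards-based pivot. "汉" has no BIG5 code, so it stays blank.
level1 = pivot_table("中汉")
# Level 2: a toy vendor table that maps simplified "汉" to traditional "漢"
# and also disagrees with level 1 about "中".
level2 = {"汉".encode("gb2312"): "漢".encode("big5"),
          "中".encode("gb2312"): b"\x00\x00"}

final = merge_levels([level1, level2])
print({k.hex(): v.hex() for k, v in final.items()})
```

Using `setdefault` is one simple way to express "the higher level prevails". A real table would also record where conflicts occurred so that they can be reviewed in the checking steps.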
// ***************************************
Chinese character encoding identification
For historical and regional reasons, Chinese characters have many coding standards. The
most common are GB2312 and BIG5. Until Unicode is fully accepted, they will coexist for a
long time, so in practice it is necessary to tell them apart. This is encoding
identification.
Many programs on the Windows platform can now identify and display GB2312 and BIG5
strings quite accurately. But because of the business opportunity involved, these
algorithms are not disclosed. I have seen only two algorithms so far:
1) Algorithm 1
http://www.mandarintools.com
2) Algorithm 2
http://202.38.128.58/~yumj/www/chrecog.html
The specific principles can be found on the inventors' home pages. Both algorithms were
derived statistically from a large number of articles, while practical use means
identifying a single line, so their effectiveness on short sentences and phrases needs to
be verified. Here the identification rate is analyzed on commonly used phrases, since
sentences mostly consist of such meaningful phrases. Because the two sides differ not
only in encoding but also in usage habits, 1.3MB of GB phrases and 900KB of BIG5 phrases
were collected separately. The comparison revealed some interesting things.
1) Algorithm 1 takes a lot of memory and is slower, but its identification rate is higher
and stable. Its error is 8.6%. Algorithm 2 is exactly the opposite, with an error of
17.6%. Combining the two improves identification somewhat.
Identification errors of the two algorithms
             Algorithm 1   Algorithm 2   Combined
GB file      5%            2.6%          0.7%
BIG5 file    3.6%          15%           5%
2) The mean value of 184 mentioned in algorithm 2 does exist. But the best algorithm is
not the author's second-byte algorithm. It is the one that adds the first byte and the
second byte. Analyzing the GB phrases with the three algorithms gives a normal
distribution in each case. The first-byte algorithm peaks at 195 with a steep slope,
which indicates concentrated means. The second-byte algorithm peaks at 207 with a gentle
slope, which indicates dispersed means. The double-byte sum algorithm lies between
the two.
Analysis of the BIG5 phrases:
The first-byte algorithm peaks at 174, but the slope is much gentler.
The second-byte algorithm peaks at 160 and is gentler still, close to a rectangular
distribution. That is, the second bytes of commonly used BIG5 phrases are evenly
distributed over the coding range.
The double-byte sum algorithm lies between the two.
So a better algorithm is:
FLAG = (A * C1 + C2) / (A + 1), where C1 and C2 are the first and second byte and A = 5 ~ 7 is preferable.
With the threshold at 184, 5% of the GB phrases have a mean below 184 and 15% of the BIG5
phrases have a mean above 184, and the combined error is 17.6%. That is, with algorithm 2
a string in GB code is not easily confused with a string in BIG5 code. For a GB file
converted to BIG5 code the error should be lower. The 15% seems to come from differences
in encoding and in word usage compared with GB.
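The weighted formula can be sketched as follows. This is not the code of either published algorithm. The function names, the default A = 6, and the decision direction (a mean above 184 is read as GB, below as BIG5) are assumptions inferred from the peaks reported above.

```python
THRESHOLD = 184  # the mean value discussed above


def byte_pairs(data: bytes):
    """Yield (first byte, second byte) for each double-byte character."""
    i = 0
    while i + 1 < len(data):
        if data[i] >= 0x80:       # lead byte of a double-byte character
            yield data[i], data[i + 1]
            i += 2
        else:                     # plain ASCII byte, skipped
            i += 1


def flag(data: bytes, a: int = 6):
    """Mean of (A * C1 + C2) / (A + 1) over all double-byte characters."""
    values = [(a * c1 + c2) / (a + 1) for c1, c2 in byte_pairs(data)]
    return sum(values) / len(values) if values else None


def guess(data: bytes, a: int = 6) -> str:
    value = flag(data, a)
    if value is None:
        return "ascii"
    return "GB" if value > THRESHOLD else "BIG5"


print(guess("中文".encode("gb2312")))
print(guess("中文".encode("big5")))
```

Weighting the first byte more heavily reflects the observation above that first-byte means are more concentrated than second-byte means. As the error figures in this post show, a single threshold will still misjudge a share of short phrases.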
3) Improving the identification rate of algorithm 1
The GB2312 standard has 6,763 Chinese characters, and BIG5 has more. Algorithm 1 assigns
weights to only about 600 characters, which seems too few. The weights run regularly from
1 to 600, which does not seem to reflect the law of character frequency. For GB2312, the
usual 2:8 rule suggests about 1200 characters. According to the primary school syllabus,
a pupil knows 2,500 characters on graduation. According to primary school teaching
experience, 900 characters are basically enough for reading and writing. So the weighted
range should be around 900 to 1000 characters. But which characters, and how large the
weights should be, is for our language experts to say.
4) Possible new algorithms
The two sides of the strait differ in the characters they use and in their common usage.
So an analysis based on common phrases should find larger differences and reach a higher
recognition rate. Unfortunately I have no material, so for now this is only a hope and
not an algorithm. I also hope that more people will contribute more and better algorithms
in the bazaar spirit of the GPL.