Basic knowledge of Simplified Chinese with Java

zhaozj2021-02-08  418

Chinese character coding standard and identification (1)

Code page

This section is based on the following articles. I recommend studying what these experts have written carefully.

Reference 1: <> Zhang Zhoucai, << Computer World >> Weekly, 97-1-17

Reference 2: << 张轴材 ... 历程 >>, reporters Huang Weimin and Xiao Chunjiang, 99-8-30

Reference 3: << Chinese platform puts "Root" Retain >> Wu Jian, << China Computer News >>, published 1998-12-21, issue 348 overall, issue 51 of the year

Reference 4: << for all of the UNIX Chinese platform >> Sun Yufang, << China Computer User >>, published 1998-07-06, issue 323 overall, issue 26 of the year

Reference 5: cjk.inf, ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf

Because I am only an amateur and not an expert, there are many terms in the reference materials that I do not understand, and I have never had the official text of the standards, so errors and vagueness are inevitable. At the same time, because the relevant state departments have not done enough to publicize, promote, and implement the national standards, individuals like me and small businesses in this field, short of resources, are left at a competitive disadvantage.

When ASCII was developed, multilingual support was not considered, least of all support for ideographic scripts such as Chinese characters. Many solutions were proposed to fill the gap. The code page system (ISO 2022) is the generally implemented scheme, but ISO10646 / GB13000 / Unicode is the direction of future development.

China's Chinese character encoding standard GB2312 is a 7-bit standard, specifically a standard of two 7-bit bytes per character. ASCII is a standard of one 7-bit byte per character, so how does the computer tell the two apart? One way is to set the eighth bit to "1", which tells the computer to switch to double-byte decoding. This is the most common implementation, known as EUC (Extended Unix Code) encoding. The other way is to use special markers to tell the computer to switch to double-byte decoding. HZ encoding, for example, marks a double-byte region with "~{" at the start and "~}" at the end. Both are implementations of GB2312. For ideographic scripts such as Chinese characters, code pages are based on national, regional, or industry standards and are encoded according to EUC. Code pages are backward compatible with ASCII and are variable-width encodings. This complicates the code and also causes the garbled text that results from code page switching.
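The two ways of carrying GB2312 described above can be compared directly. The following is a minimal sketch using the "gb2312" (EUC) and "hz" codecs from Python's standard library; the sample text and variable names are chosen only for illustration.

```python
# Encode the same text as EUC (eighth bit set) and as HZ (7-bit with markers).
text = "中文"

euc = text.encode("gb2312")  # EUC form of GB2312
hz = text.encode("hz")       # HZ form of GB2312

# In EUC, every byte of a Chinese character has the eighth bit set to 1.
euc_high_bits = [bool(b & 0x80) for b in euc]

# Clearing the eighth bit recovers the underlying double 7-bit GB2312 code.
seven_bit = bytes(b & 0x7F for b in euc)

print(euc.hex(), euc_high_bits)
print(hz, seven_bit)
```

The HZ output is the same 7-bit code wrapped in the "~{" and "~}" markers, so both forms carry identical GB2312 content.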

Unicode is a fixed-width multi-byte encoding. ISO10646 / GB13000 / Unicode are currently consistent at the UCS2 level of implementation, that is, what has been implemented is a double-byte encoding standard. The ISO10646 / GB13000 / Unicode discussed below refers only to this UCS2 case. Unicode handles ASCII by prepending a "0" byte. For example, the ASCII code of "A" is 0x41, and its Unicode code is 0x00, 0x41.
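The zero-byte rule can be checked in a few lines. This is a small sketch, not a full converter: the helper name ascii_to_ucs2 is made up for illustration, and Python's "utf-16-be" codec stands in for big-endian UCS2, with which it agrees for these characters.

```python
def ascii_to_ucs2(data: bytes) -> bytes:
    """Prepend a 0x00 byte to every ASCII byte (big-endian UCS2)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("not an ASCII byte: %#x" % b)
        out += bytes([0x00, b])
    return bytes(out)

ascii_code = "A".encode("ascii")
ucs2_code = ascii_to_ucs2(ascii_code)

print(ascii_code.hex(), ucs2_code.hex())
```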

Here I approach Unicode mainly from the national standard (GB) series. Had I not read reference 5 (in English), I still would not know which national standards for Chinese character encoding exist. For a Chinese person to have to learn about Chinese character encoding standards from English sources is a rather helpless situation.

Common Chinese encoding standards (source: CJK.inf)

GB2312-1980 (GB0) (Simplified)        GB7589-1987 (GB2) (Simplified)      GB7590-1987 (GB4) (Simplified)      GB13000-1993
GB6345.1-1986 (GB0 correction)
GB8565.2-1988 (GB8, GB0 expansion)
GB/T12345-90 (GB1) (Traditional)      GB/T13131-9X (GB3) (Traditional)    GB/T13132-9X (GB5) (Traditional)

Reading across gives the character set series. Reading down gives the successive standards developed within each series. Among them,

GB2312 is the basic set and the most commonly used standard. GB7589 / GB7590 are extension sets. They may not be able to coexist with GB2312 in use, so switching is required. GB7589 / GB7590 are arranged by radical and stroke count, but which characters they contain, how they are arranged, and which code areas they occupy is unclear to me. After two rounds of correction and expansion, the GB2312 series has changed parts of the GB2312-1980 standard (see reference 5). Because I do not have the standard texts, I do not know which standard the fonts now in use belong to. According to the latest Unicode 3.0, the latest national standard is GB16500-95, and I do not know which series it belongs to. ISO/IEC 10646 is equivalent to the national standards GB13000-1993 / JIS X 0221-1995 / KS C 5700-1995. Its goal is to include the scripts of every language, and Chinese characters make up the largest share (Unicode 2.0 has 20,902 Chinese characters). The characteristics of the standard are described in reference 1, and the ups and downs of its development in reference 2. In short, this is an international standard in which China took part and played a leading role.

GBK is an intermediate product of the transition from GB2312 to GB13000. It is a large expansion of GB2312: its encoding is backward compatible with the EUC encoding of GB2312, while its character repertoire is the same as that of GB13000, about 3 times that of GB2312. GBK therefore also contains the characters of BIG5, Shift-JIS, and KSC. Note that only the characters are included, and their encoding differs from the original standards. In practice, a GBK font can display GB2312, BIG5, Shift-JIS, and KSC strings, but every string other than a GB2312 string has to be converted first.
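The need to convert non-GB2312 strings before showing them with a GBK font can be sketched as follows. This is an illustration under stated assumptions: the helper big5_to_gbk is hypothetical, and it relies on the "big5" and "gbk" codecs of Python's standard library, converting by way of Unicode.

```python
def big5_to_gbk(raw: bytes) -> bytes:
    """Re-encode a BIG5 byte string as GBK by way of Unicode."""
    return raw.decode("big5").encode("gbk")

traditional = "漢字"
big5_bytes = traditional.encode("big5")
gbk_bytes = big5_to_gbk(big5_bytes)

# The traditional character is in the GBK repertoire but not in GB2312.
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(big5_bytes.hex(), gbk_bytes.hex(), in_gb2312)
```

The same characters end up with different byte values in the two encodings, which is why the BIG5 bytes cannot be handed to a GBK font unchanged.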

Because documentation is scarce, it is unclear who led the development of GBK. Some English sources say that Microsoft developed GBK, and I have found no statement from the state on the matter. From the reference materials at hand I know only this: ISO/IEC 10646 was released in 1994; Microsoft, while developing the Chinese version of Windows 95, needed an extended Chinese encoding; and in 1996 the "Chinese Character Internal Code Extension Specification" (GBK) was released (see references 1 to 3). Since a standard is usually published some time after it is drawn up, my guess is that GBK was drawn up in 1995. The Chinese versions of Windows 95 and later support GBK.

The EUC encoding range of GB2312 is 0xA1 ~ 0xFE for the first byte (in practice only up to 0xF7) and 0xA1 ~ 0xFE for the second byte. GBK extends this. The first byte is 0x81 ~ 0xFE, and the second byte falls into two parts, 0x40 ~ 0x7E and 0x80 ~ 0xFE. In the area shared with GB2312, the characters are identical. The extension part was probably taken from GB13000 in radical and stroke order and placed into GBK. So GBK is not GB13000: the two have the same characters, but their encoding systems differ. One is a variable-width encoding of the ISO 2022 family, the other is a fixed-width encoding, and their code areas also differ. Note that GBK is actually not a national standard. Before it there is the GB2312 basic set, and after it the more advanced GB13000.
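The byte ranges quoted above translate into a simple range check. The following is a minimal sketch: the function names are made up for illustration, the limits are the ones given in the text, and the "gbk" codec of Python's standard library supplies the test bytes.

```python
def in_gb2312_area(b1: int, b2: int) -> bool:
    """EUC range of GB2312: first byte 0xA1-0xF7, second byte 0xA1-0xFE."""
    return 0xA1 <= b1 <= 0xF7 and 0xA1 <= b2 <= 0xFE

def in_gbk_range(b1: int, b2: int) -> bool:
    """GBK range: first byte 0x81-0xFE, second byte 0x40-0x7E or 0x80-0xFE."""
    return 0x81 <= b1 <= 0xFE and (0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE)

def classify(ch: str) -> str:
    """Classify one Chinese character by where its GBK code falls."""
    b1, b2 = ch.encode("gbk")
    if in_gb2312_area(b1, b2):
        return "GB2312 area"
    if in_gbk_range(b1, b2):
        return "GBK extension"
    return "outside GBK"

print(classify("中"), classify("漢"))
```

A simplified character from the basic set lands in the GB2312 area, while a traditional character added by GBK lands in the extension part.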

GBK is only a transitional extension specification. That is why Unicode provides GB2312->Unicode and GB12345->Unicode conversion tables but no GBK->Unicode conversion table. Only Microsoft's Code Page 936 (cp936.txt) can be counted as a GBK->Unicode conversion table. But note that this is a document produced by a commercial company, not by a national or international standards body, so it may be inconsistent with the standards. Recently I found some useful standards files on the Founder font website, and those interested can download them. But note that GBK-BIG5.TAB and GB-BIG5.TAB are a bit odd.

http://www.founderpku.com/fontweb/download/gbk-big5.tab

http://www.founderpku.com/fontweb/download/gb-big5.tab

http://www.founderpku.com/fontweb/gb2312.htm

http://www.founderpku.com/fontweb/gbk.htm

Conversion tables between other standards that are built from these conversion tables will differ from the traditional ones. For example, a GBK <=> BIG5 table built by way of GBK <=> Unicode <=> BIG5 will differ from a traditional GB <=> BIG5 table, mainly because Chinese characters have simplified and traditional forms. The former maps GBK (traditional characters) <=> BIG5 (traditional characters), while the latter maps GB (simplified) <=> BIG5 (traditional). Some symbols also differ. Readers interested in Chinese character conversion can see

http://www.basistech.com/articles/c2c.html

http://www.cjk.org
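The difference between a table built by way of Unicode and a traditional GB <=> BIG5 table can be sketched as follows. This is only an illustration: via_unicode and gb_to_big5 are hypothetical helpers, and the one-entry SIMPLIFIED_TO_TRADITIONAL dictionary is a stand-in for a real simplified-to-traditional table, which would be far larger.

```python
def via_unicode(raw: bytes, src: str, dst: str) -> bytes:
    """Code-for-code conversion: the character itself is never changed."""
    return raw.decode(src).encode(dst)

# A traditional character in GBK maps straight to the same character in BIG5.
big5_trad = via_unicode("漢".encode("gbk"), "gbk", "big5")

# A simplified character has no BIG5 code, so the same route fails.
try:
    via_unicode("汉".encode("gbk"), "gbk", "big5")
    simplified_converts = True
except UnicodeEncodeError:
    simplified_converts = False

# Hypothetical one-entry stand-in for a traditional GB <=> BIG5 table,
# which replaces simplified forms with traditional ones.
SIMPLIFIED_TO_TRADITIONAL = {"汉": "漢"}

def gb_to_big5(raw: bytes) -> bytes:
    text = raw.decode("gb2312")
    mapped = "".join(SIMPLIFIED_TO_TRADITIONAL.get(c, c) for c in text)
    return mapped.encode("big5")

print(big5_trad.hex(), simplified_converts, gb_to_big5("汉".encode("gb2312")).hex())
```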

The relationship between internal code and font

Although I do not have the standard texts, it is still possible to find out which characters the common standards contain. TLC4.0 ships PCF fonts for the GB2312, GB12345, BIG5, and GBK standards, and they can be viewed with the xfd utility. There is a 16-dot Unicode PCF font at http://www.debian.org/chinese. If FreeType is installed, TTF fonts can be viewed with the XmBDFEd software. Using MS Word may be simpler.

In daily use, what we are actually familiar with is the character code (internal code). In a Chinese environment, we enter a pair of eight-bit bytes and get a Chinese character, so we come to think that this pair of bytes corresponds to the shape of the character. This is wrong. In fact, the internal code belongs to the font and is only an index for looking up a glyph. If another font is substituted, the same string presents different glyphs, in other words, garbled text. I have seen TTF fonts for GB2312, BIG5, and ISO10646 / GB13000. For an operating system and its applications, the natural favorite is the ISO10646 / GB13000 TTF font, because then it is enough to provide one set of code and one set of fonts and to modify external configuration files in order to run in different language environments. This is internationalization and localization. One trick is that an ISO10646 / GB13000 TTF font can be used as a font of another standard through conversion tables such as GBK->Unicode and BIG5->Unicode.

Upgrading a system to support Unicode 3.0 has an easy part and a difficult part. The easy part is that only the conversion tables need to be modified (such as the conversion table files in Windows).
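The idea that an internal code is only an index into a font, and that a conversion table lets one Unicode font serve several standards, can be sketched as follows. This is an illustration under stated assumptions: glyph_index is a hypothetical helper, and Python's built-in "gbk" and "big5" codecs play the role of the GBK->Unicode and BIG5->Unicode conversion tables.

```python
def glyph_index(code: bytes, encoding: str) -> int:
    """Turn a two-byte internal code into an index into a Unicode-ordered font."""
    return ord(code.decode(encoding))

gbk_code = "中".encode("gbk")
big5_code = "中".encode("big5")

# Different internal codes, same glyph index once each table is applied.
same_glyph = glyph_index(gbk_code, "gbk") == glyph_index(big5_code, "big5")

# The same bytes looked up with the wrong table give a different result,
# which is what appears on screen as garbled text.
misread = gbk_code.decode("big5", errors="replace")

print(gbk_code.hex(), big5_code.hex(), same_glyph, misread)
```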

The difficult part is upgrading the fonts. Developing fonts is very hard, as the steps of font development described on the Founder website show. Win9x uses the TTF fonts of Beijing Zhongyi, since MS could hardly develop a set of Chinese fonts itself. The ISO10646 / GB13000 TTF fonts I have seen, the latest being the 1999 edition for Unicode 2.1, come from the Founder font library. To see all the glyphs of Unicode 3.0, one has to wait until these professional font developers have made them. To see them now, the only option is to ask Zhang Zhoucai, because every new standard comes with 48x48 high-precision glyphs of all its Chinese characters. Using TTF fonts has always been a tempting topic. But for now I do not know much, and can only discuss generating BDF / PCF fonts from TTF fonts.

There are very few PCF fonts at present, with only four typefaces: Song, Fangsong, Hei, and Kai. To get more typefaces, one workable way is to use the FreeType library: generate a BDF font with the ttftobdf program, then generate a PCF font from it with the bdftopcf program. But glyphs produced this way are rather rough after scaling and are hard to control. This may be because information is lost in the TTF -> BDF conversion, and the aspect ratio also differs from the standard. Machine-generated glyphs are mechanical and cannot compare with hand-drawn fonts. At the same time, because TTF technology is mature, there is no need to keep developing more PCF fonts. X Window will accept and use TTF fonts on a large scale. In the future PCF fonts will mainly be used for standard typefaces (such as Song) at small sizes and for fast download over the Internet.

Only after using Unicode and TTF fonts in X Window does one realize that using Unicode and TTF is both a capability and a burden. Whatever the format of the font file, it is finally converted to a fixed bitmap in memory. At 16x16 dots, one Chinese character takes 32 bytes. Unicode 3.0 has 27,786 Chinese characters, which need at least 868 KB of memory. If you want mixed Chinese and English text and also install a large number of Chinese fonts, the memory demand is easy to imagine. With TTF, rasterizing needs additional computation and storage. So even though X Window provides a font cache and deferglyphs, they do not help much.

And the Chinese characters we actually use are very few. According to frequency statistics for commonly used Chinese characters, the first 165 characters have a cumulative frequency above 50% and the first 1,000 above 95%. According to primary school teaching experience, knowing 900 characters is basically enough to read books and newspapers and to write. According to the primary school syllabus, a pupil learns 2,500 characters by graduation. The cumulative frequency of the level-1 characters of GB2312 is already above 99%. I estimate that I know about 4,000 ~ 5,000 characters, so compared with the number of Chinese characters in Unicode I look like an illiterate :-). So whether to use GB2312 or GB13000 is a dilemma, and we pay for whichever choice we make.
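The memory estimate above (a 16x16 bitmap, 32 bytes per character, 27,786 characters) can be checked with simple arithmetic. This is a small sketch with made-up names, assuming one bit per dot and no per-glyph overhead.

```python
def bitmap_bytes(width: int, height: int) -> int:
    """Bytes needed for one monochrome glyph bitmap, one bit per dot."""
    return width * height // 8

UNICODE_3_0_HANZI = 27786  # character count quoted in the text

per_char_16 = bitmap_bytes(16, 16)
total_16 = per_char_16 * UNICODE_3_0_HANZI
total_16_kib = total_16 / 1024

# The 48x48 high-precision glyphs mentioned earlier are much heavier.
per_char_48 = bitmap_bytes(48, 48)

print(per_char_16, total_16, round(total_16_kib, 1), per_char_48)
```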

Finally, the role of UTF8 is discussed in light of the relationship between internal code and font. UTF8 is a transition scheme from existing ASCII systems to Unicode systems. UTF8 guarantees ASCII compatibility while extending toward a large character set. It is the scheme recommended by Unicode. Because it was designed to solve that problem, it is not a good solution for existing Chinese systems. CJK character encoding standards currently use two bytes per character. The range of Chinese characters in UCS2 is U+4E00 ~ U+9FFF. Under the UTF8 encoding rules each of these characters takes three bytes, so one byte in three is added overhead. UTF8 is also incompatible with existing CJK systems. For a CJK system to use UTF8, it first has to convert to UCS2 and then convert to UTF8. The second step is simply redundant, because from the font's point of view a character code is only the index of a glyph in the font. UTF8 is a variable-length code and cannot be used directly as an index, so it has to be converted back to UCS2 before the font can be used.
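The three-byte cost and the conversion back to a UCS2 index can be seen on a single character. This is a minimal sketch: utf8_3byte_to_code_point is a hypothetical helper that handles only the three-byte form 1110xxxx 10xxxxxx 10xxxxxx, and "utf-16-be" stands in for big-endian UCS2.

```python
def utf8_3byte_to_code_point(raw: bytes) -> int:
    """Decode one three-byte UTF8 sequence by hand into its code point."""
    b1, b2, b3 = raw
    if (b1 & 0xF0) != 0xE0 or (b2 & 0xC0) != 0x80 or (b3 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF8 sequence")
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

ch = "中"
gbk_code = ch.encode("gbk")          # two bytes in the existing CJK system
utf8_code = ch.encode("utf-8")       # three bytes in UTF8
ucs2_code = ch.encode("utf-16-be")   # two bytes in UCS2

index = utf8_3byte_to_code_point(utf8_code)

print(len(gbk_code), len(utf8_code), len(ucs2_code), hex(index))
```

Only the recovered code point, not the UTF8 byte sequence itself, is usable as an index into a Unicode-ordered font.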

With the development of GUIs, fonts are gradually turning to TTF. TTF fonts come in these encoding standards: GB2312 / GB2312 EUC, BIG5, and ISO10646. I have not seen a UTF8 TTF font, and I do not know which CJK systems use UTF8 encoding.

One characteristic of Unicode is that it can serve as the core code. The user interface can keep using the original encoding standard, while the system uses UCS2 internally for processing. The system can use user environment variables, flags, or modules to identify the encoding standard the user needs and then convert. In this way the system only has to provide one set of ISO10646 TTF fonts and, without modifying its internal code, can support Chinese, Japanese, and Korean for multiple users. The Chinese versions of Windows 95 and later adopt this scheme. Some current X Window TTF servers, X-TT and xfsft, can also use this scheme. The former has been implemented in the TurboLinux Chinese version, and the latter I have tried myself with good results.
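The core-code arrangement described above can be sketched in a few functions. This is an illustration only: USER_ENCODING is a hypothetical environment variable, the function names are made up, and Python's str type (Unicode code points) stands in for the internal UCS2 representation.

```python
import os

def user_encoding(environ=None) -> str:
    """Read the user's external encoding from a hypothetical USER_ENCODING variable."""
    environ = os.environ if environ is None else environ
    return environ.get("USER_ENCODING", "gb2312")

def to_internal(raw: bytes, encoding: str) -> str:
    """Convert from the user's encoding to the internal Unicode form."""
    return raw.decode(encoding)

def to_external(text: str, encoding: str) -> bytes:
    """Convert from the internal Unicode form back to the user's encoding."""
    return text.encode(encoding)

def process(raw: bytes, encoding: str) -> bytes:
    """Do the work on Unicode text, whatever encoding the user sees."""
    text = to_internal(raw, encoding)
    text = text.strip()  # stand-in for any internal operation
    return to_external(text, encoding)

samples = {"gb2312": "中文", "big5": "中文", "shift_jis": "日本語", "euc_kr": "한국어"}
results = {enc: process(("  " + s).encode(enc), enc) for enc, s in samples.items()}
print(results)
```

The internal operation is written once against Unicode, and only the two conversion steps at the edges depend on the user's encoding.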

An interesting phenomenon is the 12-dot PCF font of Hongqi Linux 1.1, /usr/X11R6/lib/X11/fonts/misc/gb12st.pcf.gz. It is not a strictly GB2312-encoded font. Viewed with the xfd utility, it looks as if it was converted from a Unicode-encoded TTF font, and it contains some GBK characters, which is a pity. It would be good if they could provide some PCF fonts in the GBK encoding standard.

Moving a CJK system to UCS2 and moving an ASCII system to UTF8 take a comparable amount of code modification. The former merely needs more conversion tables, which take up memory. An ASCII system that moves to UCS2, however, spends 50% of its space on padding bytes, which doubles its storage. At present most of the information in computers is still ASCII information, so this too looks like a problem.
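The space trade-off can be measured directly. This is a small sketch with made-up sample strings and a hypothetical sizes helper, using the standard codecs and "utf-16-be" as a stand-in for UCS2.

```python
def sizes(text: str, legacy: str) -> dict:
    """Byte counts of the same text in a legacy encoding, UTF8, and UCS2."""
    return {
        "legacy": len(text.encode(legacy)),
        "utf8": len(text.encode("utf-8")),
        "ucs2": len(text.encode("utf-16-be")),
    }

english = sizes("Hello, world", "ascii")
chinese = sizes("中文编码", "gbk")

print(english)
print(chinese)
```

ASCII text keeps its size in UTF8 and doubles in UCS2, while Chinese text keeps its size in UCS2 and grows by half in UTF8.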

When reposting, please cite the original address: https://www.9cbs.com/read-891.html
