A must-read for Linux programmers: Chinese support and the GB18030 standard

xiaoxiao, 2021-03-06

Author: Leon

Chinese-language support is a problem Linux has to solve before it can truly take hold in China. Since XTeam released the world's first Chinese Linux, Linux in China has moved from localization (L10n) to internationalization (I18n), and Linux products that process Chinese smoothly are now available.

Localization (L10n) addresses how to present the information in a system in the local language. For Linux, that means putting application interfaces and prompt messages into Chinese. Internationalization (I18n) addresses how to handle many languages transparently: an application can display, accept as input, and process text in various languages without being modified. At present, I18n is the best available approach to handling the world's languages.

Implementing I18n on Linux involves the following:

Make the Linux kernel support I18n.

Make GLIBC support I18n. GLIBC is the lowest-level support library in a Linux system, and applications can implement I18n through the locale mechanism it provides.

Make the X Window System support I18n. X Window is the most common graphical interface system on Linux, and it provides I18n support to applications through the XLocale mechanism.

Make other applications, such as Java and Mozilla, support I18n. Cross-platform applications such as these provide their own I18n support.

Currently, except for the Linux kernel's own display and input of characters in various languages, the other parts can all support I18n.
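As a rough illustration of the locale mechanism mentioned above, the sketch below asks the C library for the locale configured in the environment and reports the character encoding that goes with it. It uses Python's standard locale module, which wraps the C library's setlocale(); on a typical Linux system that library is GLIBC, but that is an assumption about the platform. The function name describe_locale is made up for this example.

```python
import locale


def describe_locale():
    """Adopt the locale from the environment and report what the C library chose."""
    try:
        # An empty string asks the C library to read LANG / LC_* from the environment.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale is not installed; the previous locale stays active.
        pass
    # Calling setlocale() with only a category queries the current setting.
    return locale.setlocale(locale.LC_CTYPE), locale.getpreferredencoding(False)


name, encoding = describe_locale()
print("LC_CTYPE locale:", name)
print("preferred encoding:", encoding)
```

The output depends entirely on the machine: a system configured for a GB18030 locale would report that encoding here, while most current installations report UTF-8.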

Besides I18n, the other key factor in Chinese information processing on Linux is the character encoding. In China, the government is responsible for defining Chinese character encodings and supervising their implementation, so that all systems encode text consistently and can interoperate. Since the early days of computing, China has issued a number of Chinese encoding standards; the commonly used ones are GB2312-1980, GB12345, GB13000 (GBK), and the latest standard, GB18030. Notably, GB18030 is to be enforced as a mandatory standard, and software that does not support it will not be allowed to be sold as a product.

Starting with GB2312-1980, Chinese characters are encoded in two bytes. To keep them distinct from the basic ASCII character set, the high bit of each byte of a GB2312 Chinese character is set to 1. For example, "啊" is encoded as 0xB0A1. The GB2312 rule for Chinese characters is: the first byte lies between 0xB0 and 0xF7, and the second byte between 0xA1 and 0xFE. GB12345 and GB13000 extend GB2312-1980: they include everything in GB2312 and add more code points. The coding rule is roughly: the first byte lies between 0x81 and 0xFE, and the second byte between 0x40 and 0xFE (excluding 0x7F), so in this extended form only the first byte is guaranteed to have its high bit set. This extension of GB2312 is what is commonly called GBK.
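The two-byte rules above can be checked with a short Python sketch. It relies on the gb2312 and gbk codecs that ship with Python; the helper names is_gb2312_hanzi and is_gbk_pair are invented for this example, and the GB2312 second-byte range is taken as 0xA1 to 0xFE, as in the usual EUC-CN form of GB2312.

```python
def is_gb2312_hanzi(pair):
    """True if a two-byte sequence lies in the GB2312 Chinese-character area."""
    return (len(pair) == 2
            and 0xB0 <= pair[0] <= 0xF7
            and 0xA1 <= pair[1] <= 0xFE)


def is_gbk_pair(pair):
    """True if a two-byte sequence lies in the GBK two-byte range."""
    return (len(pair) == 2
            and 0x81 <= pair[0] <= 0xFE
            and 0x40 <= pair[1] <= 0xFE
            and pair[1] != 0x7F)


ah = "啊".encode("gb2312")
print(ah.hex())                          # b0a1
print(all(b & 0x80 for b in ah))         # True: both bytes have the high bit set
print(is_gb2312_hanzi(ah), is_gbk_pair(ah))

# A traditional-form character that GBK adds on top of GB2312.
dragon = "龍".encode("gbk")
print(is_gb2312_hanzi(dragon), is_gbk_pair(dragon))
```

Every GB2312 Chinese character also passes the GBK check, which is what makes GBK a superset at the byte level.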

GB18030 is also an extension of GB2312, and its encoding length changes from a fixed two bytes to one, two, or four bytes:

Single byte: values from 0x00 to 0x7F.

Two bytes: the first byte ranges from 0x81 to 0xFE, and the second byte from 0x40 to 0xFE (excluding 0x7F).

Four bytes: the first byte ranges from 0x81 to 0xFE, the second byte from 0x30 to 0x39, the third byte from 0x81 to 0xFE, and the fourth byte from 0x30 to 0x39.
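These three forms are enough to cut a GB18030 byte stream into characters: the first byte separates ASCII from multi-byte sequences, and the second byte (a digit 0x30 to 0x39 or not) separates the four-byte form from the two-byte form. The following is a minimal sketch of that idea, not a full validator; the function name split_gb18030 is made up for this example, and the sample text is encoded with Python's built-in gb18030 codec.

```python
def split_gb18030(data):
    """Split GB18030 bytes into per-character chunks using the byte-range rules."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            width = 1                                   # single-byte (ASCII) form
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                width = 4                               # four-byte form
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                width = 2                               # two-byte form
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) != width:
            raise ValueError(f"truncated sequence at offset {i}")
        if width == 4 and not (0x81 <= chunk[2] <= 0xFE and 0x30 <= chunk[3] <= 0x39):
            raise ValueError(f"invalid four-byte sequence at offset {i}")
        chars.append(chunk)
        i += width
    return chars


sample = "A啊\U00020000".encode("gb18030")
parts = split_gb18030(sample)
print([len(p) for p in parts])    # [1, 2, 4]
```

The sample mixes an ASCII letter, a GB2312 character, and a character outside the Basic Multilingual Plane, which GB18030 can only represent in the four-byte form.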

As these ranges show, GB18030 has a very large capacity, about 1.6 million code points in total. It is also compatible with the GB13000 standard, so all GB13000-based software can run unmodified on a platform that supports GB18030. On Linux, implementing GB18030 poses some difficulty because of the standard's complexity. Fortunately, thanks to the joint efforts of many Linux developers, current Linux systems have largely implemented the GB18030 standard:
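The figure of about 1.6 million follows directly from the byte ranges. The sketch below counts the positions those ranges allow (not the number of characters actually assigned) and also spot-checks the compatibility claim by comparing the bytes produced by Python's gbk and gb18030 codecs for a short sample. The function names are made up for this example, and the comparison is only a spot check on ordinary Chinese text, not a proof for every code point.

```python
def gb18030_capacity():
    """Count the code positions permitted by the GB18030 byte ranges."""
    single = 0x7F - 0x00 + 1            # 128 single-byte values
    lead = 0xFE - 0x81 + 1              # 126 possible first (and third) bytes
    trail = (0xFE - 0x40 + 1) - 1       # 190 second bytes, 0x7F excluded
    digit = 0x39 - 0x30 + 1             # 10 possible second and fourth bytes
    double = lead * trail
    quad = lead * digit * lead * digit
    return {"single": single, "double": double, "quad": quad,
            "total": single + double + quad}


def same_bytes_in_gbk_and_gb18030(text):
    """Spot check: does this text encode to identical bytes in both encodings?"""
    return text.encode("gbk") == text.encode("gb18030")


print(gb18030_capacity())
# {'single': 128, 'double': 23940, 'quad': 1587600, 'total': 1611668}
print(same_bytes_in_gbk_and_gb18030("中文"))   # True
```

Almost all of the capacity comes from the four-byte form; the two-byte area contributes fewer than 24,000 positions.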

GLIBC already includes a GB18030 locale and conversion handler, so applications can correctly recognize and process GB18030-encoded text.

For X Window, the XFree86 project does not yet provide GB18030 support, but domestic vendors have been actively working on it. For example, in the latest XteamLinux 4.0, not only does the X Window system support GB18030, but the commonly used KDE and GNOME desktops do as well. On KDE you can even print GB18030 files directly. XteamLinux 4.0 also includes the latest Chinese input method for GB18030.

As for other applications: because Java's code is relatively closed, the state of its GB18030 support is not clear. However, since Java uses Unicode internally, supporting GB18030 should not be a problem. Mozilla supports GB18030 in its own way: it splits GB18030 into a two-byte part and a four-byte part and handles each separately. This approach also requires some additional code. Currently, Mozilla in XteamLinux 4.0 can process GB18030 correctly, for example by automatically displaying a GB18030-encoded web page.
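The reason a Unicode-based program should cope with GB18030 is that the encoding only matters at the boundary: bytes are decoded to Unicode on input and encoded again on output, and everything in between works on Unicode text. The sketch below shows that pattern in Python rather than Java, using Python's built-in gb18030 and utf-8 codecs; the function names are made up for this example, and it says nothing about how Java or Mozilla actually implement their support.

```python
def gb18030_to_utf8(raw):
    """Decode GB18030 bytes to Unicode text, then re-encode as UTF-8."""
    text = raw.decode("gb18030")      # boundary in: bytes -> Unicode
    return text.encode("utf-8")       # boundary out: Unicode -> bytes


def utf8_to_gb18030(raw):
    """The reverse direction: UTF-8 bytes to GB18030 bytes."""
    return raw.decode("utf-8").encode("gb18030")


original = "Linux 中文 \U00020000"
as_gb = original.encode("gb18030")
as_utf8 = gb18030_to_utf8(as_gb)
print(as_utf8.decode("utf-8") == original)      # True
print(utf8_to_gb18030(as_utf8) == as_gb)        # True
```

The round trip is lossless for this sample, including the character outside the Basic Multilingual Plane.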

The implementation of GB18030 involves many details; for more, refer to the text of the GB18030 standard.

Excerpt from:

Blue forest

When reposting, please credit the original address: https://www.9cbs.com/read-104321.html
