UTF-8 and Unicode FAQ

By Markus Kuhn

Translated by Xlonestar, China Linux Forum Translation Group, February 2000

This article describes what you need to know to use Unicode/UTF-8 on POSIX systems (Linux, Unix). Unicode is close to replacing ASCII and Latin-1 in the near future. It not only lets you handle text in practically any language on Earth, but also provides a comprehensive set of mathematical and technical symbols, which simplifies the exchange of scientific information.

The UTF-8 encoding provides a simple and backward-compatible way for operating systems designed entirely around ASCII, such as Unix, to use Unicode. UTF-8 is the way Unicode is used under Unix, Linux, and similar systems. Now is the time to get to know it.

What is UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then back to the original encoding, no information is lost.

UCS contains the characters needed to represent practically all known languages. This covers not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese, and Korean ideographs, as well as Hiragana, Katakana, Bengali, Gurmukhi, Malayalam, Tamil, Kannada, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many others. Scripts that have not been added yet are still being studied to find the best way to encode them, and they will be added eventually. These include Tibetan, Khmer, Runic (the ancient Nordic script), Ethiopic, other hieroglyphic scripts, and various Indo-European languages, and even selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also includes a large number of graphical, typographical, mathematical, and scientific symbols, including all those provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, OCR fonts, and many other word processing and publishing systems.

ISO 10646 formally defines a 31-bit character set. Of this huge code space, however, characters have so far been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). The characters that will be encoded outside the 16-bit BMP all belong to rather exotic scripts (such as hieroglyphs) that only specialists use for historic and scientific purposes. According to current plans, no characters will ever be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, which will define characters encoded outside the BMP, is in preparation, but it may take several years to complete. New characters are still being added to the BMP continuously, but the existing characters are stable and will not change any more.

UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is usually preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those of US-ASCII (ISO 646), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, as well as larger ranges outside the BMP, is reserved for private use.

What are combining characters?

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the preceding character, so any accent can be placed on any character. The most important accented characters, such as those used in the orthographies of common languages, have code positions of their own in UCS to ensure backward compatibility with older character sets. Accented characters that have their own code position, but could also be represented as a base character followed by a combining character, are called precomposed characters. Precomposed characters exist in UCS for backward compatibility with older encodings that have no combining characters, such as ISO 8859. The combining character mechanism allows accents and other diacritical marks to be added to any character. This is especially useful for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where one or more diacritical marks may be needed on a single base character.

Combining characters follow the character that they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can be represented either by the precomposed UCS code U+00C4 or by a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when multiple accents have to be stacked, or when combining marks are needed both above and below the base character. In the Thai script, for example, a single base character can carry up to two combining characters.

What is the UCS implementation level?

Not all systems need to support all the advanced mechanisms of UCS, such as combining characters. Therefore ISO 10646 specifies the following three implementation levels:

Level 1

Combining characters and Hangul Jamo characters are not supported. (Hangul Jamo is a special, more complicated encoding of the Korean script, in which a Hangul syllable is coded as two or three subcharacters.)

Level 2

Like level 1, except that a fixed list of combining characters is allowed for some scripts (for example Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least this minimal set of combining characters, UCS cannot represent these scripts adequately.

Level 3

All UCS characters are supported, so that, for example, a mathematician can place a tilde or an arrow (or both) on any character.

What is Unicode?

Historically, there have been two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of (initially mostly US) manufacturers of multilingual software. Fortunately, the participants of both projects realized that the world does not need two different unified character sets. They merged their results and worked together on a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible, and they closely coordinate any future extensions.

So what is the difference between Unicode and ISO 10646?

The Unicode Standard published by the Unicode Consortium contains exactly the Basic Multilingual Plane at implementation level 3 of ISO 10646-1. All characters are at the same positions and have the same names in both standards. The Unicode Standard additionally defines much more semantics associated with the characters and is in general the better reference for implementors of high-quality typographic publishing systems. Unicode specifies in detail algorithms for rendering the presentation forms of some scripts (such as Arabic), for handling bidirectional text (for example, mixed Latin and Hebrew), for sorting and string comparison, and much more.

The ISO 10646 standard, on the other hand, is little more than a simple character set table, much like the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and specifies how UCS is to be used together with other ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards, for instance ISO 14651 on the sorting of UCS strings.

The Unicode Standard has a name that is easy to remember, is available in any good bookshop for a small fraction of the price of the ISO version, and contains more helpful additional information, so it is not surprising that it has become the more widely used reference. However, the fonts used to print the ISO 10646-1 standard are generally believed to be of higher quality than those used for Unicode 2.0. Professional font designers are always advised to implement both standards, because some of the sample glyphs they provide differ significantly. ISO 10646-1 shows ideographs such as those of Chinese, Japanese, and Korean (CJK) in four different style variants, whereas the Unicode 2.0 tables show only a Chinese variant. This has led to the widespread belief that Unicode is not acceptable for Japanese users, which is wrong.

What is UTF-8?

UCS and Unicode are first of all just code tables that assign an integer to each character. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious encodings store Unicode text as a sequence of 2-byte or 4-byte units. The official names of these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). An ASCII or Latin-1 file can be converted into a UCS-2 file simply by inserting a 0x00 byte before every ASCII byte. To convert it into UCS-4, three 0x00 bytes must be inserted before every ASCII byte instead.
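The conversion described above can be sketched in a few lines of C. This is only an illustration: the function name latin1_to_ucs2be is made up for this example, and the caller is assumed to supply an output buffer of at least 2 * n bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Convert n Latin-1 (or ASCII) bytes to big-endian UCS-2.
 * out must have room for 2 * n bytes.
 * Returns the number of bytes written. */
size_t latin1_to_ucs2be(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[2 * i]     = 0x00;   /* high byte is zero for U+0000..U+00FF */
        out[2 * i + 1] = in[i];  /* low byte is the Latin-1 code itself */
    }
    return 2 * n;
}
```

A UCS-4 variant would write three zero bytes instead of one before each input byte.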

Using UCS-2 (or UCS-4) under Unix would cause very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as '\0' or '/', which have a special meaning in file names and other C library function parameters. In addition, most Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables, and similar places.

The UTF-8 encoding defined in ISO 10646-1 Annex R and RFC 2279 does not have these problems. It is clearly the way to use Unicode under Unix-style operating systems.

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character consists of. All further bytes of the multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

All possible 2^31 UCS codes can be encoded.

UTF-8 encoded characters can be up to six bytes long, but 16-bit BMP characters are at most three bytes long.

The sorting order of big-endian UCS-4 byte strings is preserved.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. Which sequence is used depends on the code number of the character in Unicode:
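The resynchronization property can be illustrated with a small C sketch. The helper name utf8_resync is hypothetical. It assumes only what the properties above state, namely that continuation bytes are exactly those in the range 0x80 to 0xBF, so a decoder that lands in the middle of a character can skip forward to the next character start.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first byte at or after pos that starts a
 * character, i.e. the first byte that is not a continuation byte
 * (0x80..0xBF). Returns len if there is no such byte. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

For the bytes 0xE2 0x89 0xA0 0x41, starting at index 1 (inside the three-byte character) yields index 3, the position of the ASCII letter.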

U-00000000 - U-0000007F: 0xxxxxxx

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least significant bit. Only the shortest possible multibyte sequence that can represent the code number of the character may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the entire sequence.

Example: the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as

11100010 10001001 10100000 = 0xE2 0x89 0xA0
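The table above translates fairly directly into code. The following C function is a minimal sketch, not a reference implementation: the name utf8_encode and its interface are assumptions made for this example, and it follows the 31-bit definition described in this FAQ, so it will also emit the 5-byte and 6-byte forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code (up to 31 bits) as UTF-8, following the table.
 * buf must have room for 6 bytes. Returns the number of bytes written,
 * or 0 if ucs does not fit into 31 bits. */
size_t utf8_encode(uint32_t ucs, unsigned char *buf)
{
    size_t len, i;
    unsigned char lead;

    assert(buf != NULL);
    if (ucs < 0x80) {
        buf[0] = (unsigned char)ucs;   /* plain ASCII, one byte */
        return 1;
    }
    else if (ucs < 0x800)      { len = 2; lead = 0xC0; }
    else if (ucs < 0x10000)    { len = 3; lead = 0xE0; }
    else if (ucs < 0x200000)   { len = 4; lead = 0xF0; }
    else if (ucs < 0x4000000)  { len = 5; lead = 0xF8; }
    else if (ucs < 0x80000000) { len = 6; lead = 0xFC; }
    else return 0;

    /* Fill the continuation bytes from the right, six bits each. */
    for (i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (ucs & 0x3F));
        ucs >>= 6;
    }
    buf[0] = (unsigned char)(lead | ucs);  /* remaining high bits */
    return len;
}
```

Because the length is chosen from the smallest range that contains the code, the function always produces the shortest possible sequence, as the rule above requires. Calling it with 0x00A9 and 0x2260 reproduces the two worked examples.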

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not use any other spelling (such as UTF8 or UTF_8) in documentation, unless of course you are referring to a variable name and not to the encoding itself.

What programming languages support Unicode?

Most modern programming languages developed after around 1993 have a special data type for Unicode/ISO 10646-1 characters. It is called Wide_Character in Ada95 and char in Java.

ISO C also specifies mechanisms for handling multibyte encodings and wide characters, and these were extended in Amendment 1 to ISO C in September 1994. These mechanisms were designed primarily with the various East Asian encodings in mind, and they are considerably more elaborate than what handling UCS requires. UTF-8 is an example of what the ISO C standard calls a multibyte encoding, and the wchar_t type can be used to hold Unicode characters.

How should Unicode be used under Linux?

Before UTF-8, Linux users in different regions used a wide variety of language-specific ASCII extensions. The most common were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 in Russia, EUC and Shift-JIS in Japan, and so on. This made the exchange of files very difficult, and applications had to pay special attention to the differences between these encodings.

Eventually, Unicode will replace all of these encodings, mainly in the form of UTF-8. UTF-8 will be used in

text files (source code, HTML files, email messages, etc.)

file names

standard input and standard output, pipes

environment variables

cut and paste selection buffers

telnet, modem, and serial connections to terminal emulators

and any other place where byte sequences used to be interpreted as ASCII

In UTF-8 mode, a terminal emulator such as xterm or the Linux console driver converts every keystroke into the corresponding UTF-8 sequence and sends it to the stdin of the foreground process. Similarly, anything a process writes to stdout is sent to the terminal emulator, where it passes through a UTF-8 decoder and is then displayed with a 16-bit font.

Full Unicode functionality can only be expected from sophisticated multilingual word processing packages. What Linux will use on a broad basis to replace ASCII and the other 8-bit character sets is much simpler. As a first step, Linux terminal emulators and command line tools will only switch to UTF-8. This means that only implementation level 1 of ISO 10646-1 is used (no combining characters), which supports only Latin, Greek, Cyrillic, and many scientific symbols. At this level, UCS support is very similar to ISO 8859 support. The only significant differences are that thousands of characters are now available and that a character can be represented by a multibyte sequence.

One day Linux will certainly support combining characters as well, but even then, precomposed characters should remain preferred over combining character sequences wherever they are available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

In the more distant future, people may consider adding support for the double-width characters used in Japanese and Chinese (which is relatively simple), for combining characters, and perhaps even for languages written from right to left, such as Hebrew (which is not that simple). But support for these advanced features should not hinder the rapid adoption of plain level 1 UTF-8 for Latin, Greek, Cyrillic, and scientific symbols, both as a replacement for the many European 8-bit encodings and as a way to provide a decent set of scientific symbols.

How do I have to modify my software?

There are two ways to support UTF-8, which I will call soft conversion and hard conversion. With soft conversion, data is kept in its UTF-8 form everywhere, and only very little software has to be modified. With hard conversion, the program converts any UTF-8 data it reads into wide-character arrays and handles it in that form internally. Strings are converted back to UTF-8 only on output.
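As an illustration of the input half of hard conversion, the following C sketch decodes a UTF-8 byte string into an array of 32-bit UCS codes. The function name utf8_to_ucs4 and its error convention are assumptions for this example. A real program would more likely rely on the C library's multibyte functions, which are discussed later in this FAQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n bytes of UTF-8 into UCS codes ("hard conversion" input).
 * Stores at most max codes in out and returns how many were stored,
 * or (size_t)-1 on a malformed, truncated, or overlong sequence. */
size_t utf8_to_ucs4(const unsigned char *in, size_t n,
                    uint32_t *out, size_t max)
{
    /* Smallest code that legitimately needs a sequence of each length. */
    static const uint32_t min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0, count = 0;

    assert(in != NULL && out != NULL);
    while (i < n && count < max) {
        unsigned char b = in[i++];
        uint32_t ucs;
        size_t len, k;

        if (b < 0x80)      { len = 1; ucs = b; }
        else if (b < 0xC0) return (size_t)-1;   /* stray continuation byte */
        else if (b < 0xE0) { len = 2; ucs = b & 0x1F; }
        else if (b < 0xF0) { len = 3; ucs = b & 0x0F; }
        else if (b < 0xF8) { len = 4; ucs = b & 0x07; }
        else if (b < 0xFC) { len = 5; ucs = b & 0x03; }
        else if (b < 0xFE) { len = 6; ucs = b & 0x01; }
        else return (size_t)-1;                 /* 0xFE and 0xFF never occur */

        for (k = 1; k < len; k++) {
            if (i >= n || (in[i] & 0xC0) != 0x80)
                return (size_t)-1;              /* truncated sequence */
            ucs = (ucs << 6) | (in[i++] & 0x3F);
        }
        if (ucs < min_code[len])
            return (size_t)-1;                  /* not the shortest form */
        out[count++] = ucs;
    }
    return count;
}
```

Once the text is in this form, every array element is exactly one character, which is what makes internal processing simpler than with soft conversion.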

Most applications work very well with soft conversion alone. This is what makes the introduction of UTF-8 on Unix feasible at all. For example, programs such as cat and echo do not need to be modified. They can remain completely unaware of whether their input and output is ISO 8859-2 or UTF-8, because they only pass byte streams along without processing them. They recognize only ASCII characters and control codes such as '\n', which do not change at all under UTF-8. For these applications, UTF-8 encoding and decoding is therefore done entirely in the terminal emulator.

Programs that determine the number of characters by counting bytes need a small modification. In UTF-8 mode, they must not count bytes in the range 0x80 to 0xBF, because these are only continuation bytes and not characters of their own. For example, the ls program has to be modified, because it counts the characters in file names to lay out the directory table it shows to the user. Similarly, all programs that assume their output is displayed in a fixed-width font and format it accordingly have to learn how to count characters in UTF-8 text. Editor functions such as deleting a single character have to be modified slightly so that they delete all bytes that may belong to that character. This affects editors (vi, emacs, etc.) as well as programs that use the ncurses library.
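The counting rule above fits in a few lines of C. This is a sketch under the assumption that the input is valid UTF-8. The name utf8_strlen is made up for this example. It counts characters, not display columns, so it matches the column count only for level 1 text without wide or combining characters.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (0x80..0xBF). Assumes s is valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For pure ASCII input the result equals that of strlen(), so a program can use this function unconditionally once it is in UTF-8 mode.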

The Linux kernel also works very well with soft conversion and needs only very minor modifications to support UTF-8. Most kernel functions that handle strings (for example file names, environment variables, etc.) are not affected. Modifications may be necessary in the following places:

The console display and keyboard driver (another VT100 emulator) has to encode and decode UTF-8 and must support at least some subset of the Unicode character set. These features have been available since Linux 1.2.

External file system drivers such as VFAT and WinNT have to convert file name character encodings. UTF-8 has been added to the list of available conversion options, so the mount command has to tell the kernel driver that user processes want to see UTF-8 file names. Since VFAT and WinNT already use Unicode anyway, UTF-8 has the advantage here of guaranteeing a lossless conversion.

The tty driver of a POSIX system supports a "cooked" mode that offers some primitive line editing functions. For the character-erase function to work properly, stty has to set a UTF-8 mode in the tty driver so that it does not count continuation bytes in the range 0x80 to 0xBF as characters. Bruno Haible has already provided some Linux patches for stty and the kernel tty driver.

C support for Unicode and UTF-8

Starting with GNU glibc 2.1, the wchar_t type is officially intended to hold only 32-bit ISO 10646 values, independent of the current locale. glibc 2.2 will fully support the multibyte conversion functions of ISO C (wprintf(), mbstowcs(), etc.). These functions can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8.

For example, you can write

    wprintf(L"Schöne Grüße!\n");

Your software will then print this text in whatever encoding the user has selected with the environment variable LC_CTYPE (for example en_US.UTF-8 or de_DE.ISO_8859-1). Your compiler must be run in a locale that matches the encoding of the source file, so that the wide string is stored in the object file as a correct wchar_t string. On output, the run-time library converts the wchar_t string back into the encoding of the locale that is in effect at execution time.

Note that an operation such as

    char c = L'a';

is only allowed for characters from U+0000 to U+007F (7-bit ASCII). Non-ASCII characters cannot be converted directly from wchar_t to char.

Functions such as readline() can now also work under a UTF-8 locale.

How to activate UTF-8 mode?

If your application supports both the traditional 8-bit character sets (ISO 8859-*, KOI-8, etc.) and UTF-8, then it has to know somehow whether it should run in UTF-8 mode. Hopefully, within a few years people will use only UTF-8, so that you can make it the default. Until then, you will have to support both the traditional 8-bit character sets and UTF-8.

Current applications use many different command line switches and settings to activate their respective UTF-8 modes, for example:

xterm: command line option "-u8" and X resource "XTerm*utf8: 1"

gnat/gcc: command line option "-gnatW8"

stty: command line option "iutf8"

mined: command line option "-u"

xemacs: elisp package that converts between UTF-8 and the internal MULE encoding

vim: 'fileencoding' option

less: environment variable LESSCHARSET=utf-8

Remembering a separate command line option or other configuration mechanism for every application is tedious, so a standard method is urgently needed.

If you use hard conversion in your application and call the locale-specific C library functions to convert between the external character encoding and the internal wchar_t encoding, then the C library takes care of the mode switching for you. You only have to set the environment variable LC_CTYPE correctly, for example to en.UTF-8 if you use UTF-8 or to en.ISO_8859-1 if you use Latin-1.

However, the maintainers of most existing applications will choose soft conversion instead and will not use the libc wide-character functions, not only because these are not yet widely available, but also because using them would require extensive modifications to the software. In this case, your application has to find out by itself when to use UTF-8 mode. One way to do this is the following: examine the environment variables LC_ALL, LC_CTYPE, and LANG in this order and take the first one that has a value. If this value contains the substring UTF-8 (possibly in lower case or without the "-"), then make UTF-8 mode the default (which a command line switch can still override), because this value indicates reliably and appropriately that the C library has been asked to use a UTF-8 locale.

Providing a command line option (or an X resource, for X clients) is still useful as a way to override the default selected by environment variables such as LC_CTYPE.
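The environment check described above could look roughly like this in C. It is a sketch, not an established API: the names contains_ci, utf8_mode_from, and utf8_mode_default are invented for this example, and "has a value" is interpreted here as "is set and not empty".

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive substring test; needle must be lower-case ASCII. */
static int contains_ci(const char *hay, const char *needle)
{
    size_t n = strlen(needle);

    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) == needle[i])
            i++;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Decide from the three variable values (NULL means "not set").
 * The first non-empty value in the order LC_ALL, LC_CTYPE, LANG wins. */
int utf8_mode_from(const char *lc_all, const char *lc_ctype, const char *lang)
{
    const char *vals[3];
    size_t i;

    vals[0] = lc_all; vals[1] = lc_ctype; vals[2] = lang;
    for (i = 0; i < 3; i++)
        if (vals[i] != NULL && vals[i][0] != '\0')
            return contains_ci(vals[i], "utf-8") ||
                   contains_ci(vals[i], "utf8");
    return 0;
}

/* Default UTF-8 mode for this process, read from the environment. */
int utf8_mode_default(void)
{
    return utf8_mode_from(getenv("LC_ALL"), getenv("LC_CTYPE"),
                          getenv("LANG"));
}
```

Splitting the decision from the getenv() calls keeps the rule easy to test. A command line switch would simply be evaluated after utf8_mode_default() and take precedence over its result.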

How can I get the UTF-8 version of xterm?

The xterm version that comes with XFree86 has recently been extended with UTF-8 support. To use it, get xterm patch #119 (1999-10-16) or newer, compile it with "./configure --enable-wide-chars ; make", and then start xterm with the command line option -u8 so that it converts input and output to UTF-8. Use a *-iso10646-1 font in UTF-8 mode. You can also use *-iso10646-1 fonts in ISO 8859-1 mode, because ISO 10646-1 fonts are fully backward compatible with ISO 8859-1 fonts.

The new xterm version with UTF-8 support, as well as some ISO 10646-1 fonts, will be included in XFree86 4.0.

Does xterm support combining characters?

Xterm currently supports only implementation level 1 of ISO 10646-1, which means there is no support for combining characters. At the moment, combining characters are treated like spacing characters. Future revisions of xterm are very likely to add some simple support for combining characters, by simply overstriking the glyph of the base character with the glyphs of one or more combining characters. This gives acceptable results for accents below the baseline and for accents on top of small characters. It also works very well for scripts such as Thai, for which fonts with specially designed overstriking glyphs exist. With some fonts, however, and especially with the "fixed" font family, the results for accents on top of tall characters are not fully satisfactory. Precomposed characters should therefore continue to be preferred wherever they are available.

Does xterm support half-width and full-width CJK fonts?

Xterm currently supports only fonts in which all glyphs have the same width. Future revisions are likely to add half-width and full-width character support for CJK languages, similar to what kterm provides. If the selected normal font is X × Y pixels in size and wide-character mode is turned on, xterm will try to load an additional font of 2X × Y pixels (the same XLFD, except for a doubled value of the AVERAGE_WIDTH property). It will use this font to display all Unicode characters that have been assigned the East Asian Wide (W) or East Asian FullWidth (F) property in Unicode Technical Report #11. The following C function tests whether a Unicode character is wide and has to be displayed with a glyph that covers two character cells:

    /* This function tests, whether the ISO 10646/Unicode character code
     * ucs belongs into the East Asian Wide (W) or East Asian FullWidth
     * (F) category as defined in Unicode Technical Report #11. In this
     * case, the terminal emulator should represent the character using
     * a glyph from a double-wide font that covers two normal (Latin)
     * character cells. */
    int iswide(int ucs)
    {
      if (ucs < 0x1100)
        return 0;
      return
        (ucs >= 0x1100 && ucs <= 0x115f) || /* Hangul Jamo */
        (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
         ucs != 0x303f) ||                  /* CJK ... Yi */
        (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
        (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
        (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
        (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
        (ucs >= 0xffe0 && ucs <= 0xffe6);
    }

Some C libraries also provide the functions

    #include <wchar.h>
    int wcwidth(wchar_t wc);
    int wcswidth(const wchar_t *pwcs, size_t n);

which determine the number of column positions required for the wide character wc, or for the n wide-character codes in the string pointed to by pwcs (or fewer than n, if a null wide-character code is encountered before n codes have been processed). These functions are defined in the Single UNIX Specification of The Open Group. A Latin, Greek, Cyrillic, or similar character requires one column position, a CJK ideograph requires two, and a combining character requires zero.

Will xterm eventually support right-to-left writing?

At the moment, there are no plans to add right-to-left functionality to xterm. Hebrew and Arabic users therefore have to rely on applications that reverse the direction of Hebrew and Arabic strings before sending them to the terminal. In other words, bidirectional processing has to be done in the application and not in xterm. At least Hebrew and Arabic are better off than under ISO 8859, because precomposed glyphs and presentation forms are available. It has not yet been decided whether xterm should support bidirectional text and how this would work. Both ISO 6429 = ECMA-48 and the Unicode bidi algorithm provide possible starting points. See also ECMA Technical Report TR/53.

Xterm also does not handle the formatting algorithms for Arabic, Hangul, or Indic scripts, and it is not yet clear whether this should be done in a VT100 emulator at all or left to the application software. If you plan to support bidirectional text output in your application, have a look at FriBidi, Dov Grobgeld's free implementation of the Unicode bidi algorithm.

Where do I find ISO 10646-1 X11 fonts?

Quite a number of Unicode fonts have become available over the past few months, and the list is still growing rapidly.

Markus Kuhn is working together with many other volunteers on extending the old -misc-fixed-*-iso8859-1 fonts to cover all European character repertoires (Latin, Greek, Cyrillic, the International Phonetic Alphabet, mathematical and technical symbols, and in some fonts even Armenian, Georgian, Katakana, etc.). For more information, see the Unicode fonts and tools for X11 page. These fonts will be distributed together with XFree86. For example, the font

-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1

(an extension of the old fixed default font of xterm, covering more than 3000 characters) is already part of the XFree86 3.9 snapshot.

Markus has also made available ISO 10646-1 versions of all the Adobe and B&H BDF fonts in the X11R6.4 distribution. These fonts already contained the full PostScript font repertoire (about 30 additional characters, most of which are also used by CP1252 under MS-Windows, such as smart quotes and dashes), but these characters were not accessible under ISO 8859-1. They are all available in the ISO 10646-1 versions.

XFree86 4.0 will come with an integrated TrueType font engine, which lets your X applications use any Apple/Microsoft font in the ISO 10646-1 encoding.

Future XFree86 releases are likely to remove most of the old BDF fonts from the distribution and replace them with ISO 10646-1 encoded versions. The X server will get an automatic encoding converter that creates a font in an encoding such as ISO 8859-* from the ISO 10646-1 font file only when old 8-bit software requests one. Modern software should prefer the ISO 10646-1 font encoding.

ClearlyU (cu12) is a very useful 12 point, 100 dpi proportional ISO 10646-1 BDF font for X11 with more than 3700 characters, provided by Mark Leisher (sample image).

Roman Czyborra's GNU Unicode font project is collecting a complete and free 8×16/16×16 pixel Unicode font. It currently contains 34,000 characters.

etl-unicode is an ISO 10646-1 BDF font by Primoz Peterlin.

Unicode X11 font names end in -ISO10646-1. This is the officially registered value of the CHARSET_REGISTRY and CHARSET_ENCODING fields of the X Logical Font Descriptor (XLFD) for all Unicode and ISO 10646-1 fonts. Each *-ISO10646-1 font contains only some subset of the entire Unicode character set, and users have to make sure that the font they select covers the subset of characters they need.

*-ISO10646-1 fonts usually specify a DEFAULT_CHAR value that points to a special non-Unicode glyph, which is used to represent characters that are not available in the font (usually a dashed box the size of an H, located at 0x1F or 0xFFFE). This way, users can at least see that there is an unsupported character. The small fixed-width fonts of xterm, such as 6x13, will never be able to cover all of Unicode, because many scripts can only be represented at considerably larger pixel sizes than those widely used by European users. Typical Unicode fonts for European use will contain only a subset of about 1000 to 3000 characters.

How can I find out which glyphs an X font contains?

The X protocol unfortunately provides no simple way for an application to find out which glyphs a cell-spaced font contains, because it offers no such metric for fonts. Mark Leisher and Erik van der Poel (Netscape) have therefore specified a new _XFREE86_GLYPH_RANGES BDF property, which tells applications which Unicode subset a BDF font covers. Mark Leisher provides sample code for generating and scanning this property, and xmbdfed 3.9 and later adds it automatically to every BDF file it produces.

What are the issues related to UTF-8 terminal emulators?

VT100 terminal emulators accept ISO 2022 (= ECMA-35) ESC sequences for switching between different character sets.

In the sense of ISO 2022, UTF-8 is an "other coding system" (see section 15.4 of ECMA-35). UTF-8 is outside the ISO 2022 SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8, all SS2/SS3/G0/G1/G2/G3 states become meaningless until you leave UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding: a self-terminating short byte sequence determines completely which character is meant, independent of any switching state. G0 and G1 in ISO 10646 are the same as in ISO 8859-1, and G2/G3 do not exist in ISO 10646, because every character has a fixed position and no switching takes place. In UTF-8 mode, your terminal cannot end up switched into a strange graphics character mode because you accidentally dumped a binary file to it. This makes a terminal much more robust in UTF-8 mode than in ISO 2022 mode, and it is therefore useful to have a way of locking a terminal into UTF-8 mode so that it cannot accidentally return to the ISO 2022 world.

The ISO 2022 standard specifies a range of ESC % sequences for leaving the ISO 2022 world (designation of other coding systems, DOCS), and several such sequences for UTF-8 have been registered in the ISO 2375 International Register of Coded Character Sets:

ESC% GA activates an UTF-8 mode that does not specify an implementation level from ISO 2022 and allows the ISO 2022 to be returned.

ESC% @ 从 从 u i i 20 2022, the condition is UTF-8 entered by ESC% G

ESC% / g switches into the UTF-8 level 1 and does not return.

Esc% / h switches into the UTF-8 level 2 and does not return.

ESC% / I switches into the UTF-8 level 3 and does not return.

While a terminal emulator is in UTF-8 mode, any ISO 2022 escape sequences, for example those for switching G2 / G3, are ignored. The only ISO 2022 sequence on which a terminal emulator in UTF-8 mode acts is ESC %@, which returns from UTF-8 to the ISO 2022 scheme.
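
As a rough sketch of how a program might use the returnable pair of sequences, the Python fragment below wraps a piece of text in ESC %G and ESC %@. The constant names and the function utf8_mode_wrap are hypothetical, and whether a given terminal emulator actually honours these sequences is an assumption that has to be checked for each terminal.

```python
import sys

ESC = "\x1b"

# ISO 2022 "designate other coding system" sequences for UTF-8.
ENTER_UTF8 = ESC + "%G"            # enter UTF-8, return to ISO 2022 allowed
LEAVE_UTF8 = ESC + "%@"            # return from UTF-8 to ISO 2022
LOCK_UTF8_LEVEL1 = ESC + "%/G"     # UTF-8 Level 1, no return
LOCK_UTF8_LEVEL2 = ESC + "%/H"     # UTF-8 Level 2, no return
LOCK_UTF8_LEVEL3 = ESC + "%/I"     # UTF-8 Level 3, no return


def utf8_mode_wrap(text: str) -> bytes:
    """Return bytes that enter UTF-8 mode, send text, then leave it again.

    Hypothetical helper: the escape sequences themselves are pure ASCII,
    only the payload is UTF-8 encoded.
    """
    return (ENTER_UTF8.encode("ascii")
            + text.encode("utf-8")
            + LEAVE_UTF8.encode("ascii"))


if __name__ == "__main__":
    # Write raw bytes so Python's own output encoding does not interfere.
    sys.stdout.buffer.write(utf8_mode_wrap("Gr\u00fc\u00dfe\n"))
```

The three "no return" sequences are defined only as constants here, because sending one of them is meant to lock the terminal into UTF-8 mode.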

UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before it interprets any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F.
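
The "decode first, interpret controls afterwards" rule can be sketched in Python as follows. This is a simplified illustration, not a real terminal input loop, and the function name classify is hypothetical. The sample input contains the byte 0x9F as part of the two-byte sequence for U+00DF, followed by a genuine CSI (U+009B), which UTF-8 encodes as the two bytes C2 9B.

```python
CSI = "\u009b"  # C1 control character: Control Sequence Introducer


def classify(stream: bytes):
    """Decode UTF-8 first, then report which characters are controls.

    Returns a list of (is_control, character) pairs. C0 controls, DEL and
    C1 controls (U+0080..U+009F) count as control characters.
    (Hypothetical helper, for illustration only.)
    """
    result = []
    for ch in stream.decode("utf-8"):
        cp = ord(ch)
        result.append((cp < 0x20 or 0x7F <= cp <= 0x9F, ch))
    return result


# U+00DF encodes as C3 9F: the raw byte 0x9F lies in the C1 byte range,
# but after decoding it is clearly part of an ordinary letter.
data = "\u00df".encode("utf-8") + CSI.encode("utf-8") + b"1m"

print(CSI.encode("utf-8").hex())
for is_control, ch in classify(data):
    print(is_control, hex(ord(ch)))
```

A terminal that looked for C1 controls in the raw byte stream would misread the second byte of U+00DF as a control character; decoding first avoids that.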

Which applications already support UTF-8?

Yudit is Gaspar Sinai's free X11 Unicode editor.

Mined 98 by Thomas Wolff is a text editor that can handle UTF-8.

less version 346 or later supports UTF-8.

C-Kermit 7.0 supports UTF-8 as a transfer, terminal, and file character set.

Sam is the Plan9 UTF-8 editor, similar to vi, and is also available for Linux and Win32. (Plan9 was the first operating system to switch completely to UTF-8 as its character encoding.)

9term by Matty Farrow is a Unix port of the Plan9 operating system's Unicode / UTF-8 terminal emulator.

Wily is a Unix implementation of the Plan9 Acme editor.

ucm-0.1 is Juliusz Chroboczek's Unicode character map, a small utility that lets you select any Unicode character and paste it into your application.

What patches are available for improving UTF-8 support?

Robert Brady provides a patch for less 340 (now incorporated into less 344).

Bruno Haible provides various patches for stty, the Linux kernel tty, etc.

Otfried Cheong wrote the Unicode encoding for GNU Emacs package, which allows Mule to handle UTF-8 files.

What are the PostScript glyph names for UCS codes?

See Adobe's Unicode and Glyph Names guide.

How does cut and paste work with UTF-8?

See Juliusz Chroboczek's Inter-client exchange of Unicode text, a new proposal for the ICCCM. It introduces a new atom, UTF8_STRING, which can be used as a property type and as a selection target to handle UTF-8 selections.

Are there any free libraries for processing Unicode?

IBM Classes for Unicode.

Mark Leisher's UCData Unicode character property library and wchar_t support test code.

What is the state of Unicode support in the various X widget libraries?

GScript - Unicode and Complex Text Processing is a project to add full-featured Unicode support to GTK+.

Qt 2.0 now supports the use of *-ISO10646-1 fonts.

FriBidi is Dov Grobgeld's free implementation of the Unicode bidirectional algorithm.

Are there any good mailing lists on this topic?

You should certainly subscribe to the unicode@unicode.org mailing list, which is the best way to find out what the authors of the standard and many other experts have to say. To subscribe, send a message to unicode-request@unicode.org with "subscribe" as the subject and "subscribe YOUR@EMAIL.ADDRESS unicode" as the body.

There is also a mailing list dedicated to improving UTF-8 support in applications on GNU / Linux systems, linux-utf8@nl.linux.org. To subscribe, send a subscription request to majordomo@nl.linux.org. You can also browse the linux-utf8 archive.

Other relevant lists are the XFree86 group's "font" and "i18n" lists, but you must be an official developer to subscribe.

Further references

Bruno Haible's Unicode HOWTO.

The Unicode Standard, Version 2.0

Unicode Technical Reports

Mark Davis' Unicode FAQ

ISO / IEC 10646-1:1993

Frank Tang's Iñtërnâtiônàlizætiøn Secrets

Unicode Support in the Solaris 7 Operating Environment

The USENIX paper by Rob Pike and Ken Thompson on the introduction of UTF-8 under Plan9 reports on the first operating system to migrate completely to UTF-8, as early as 1992 (when it was still called UTF-2).

Li18nux is a project initiated by several Linux distributors to enhance Unicode support for Linux.

The online Single Unix Specification contains definitions of all the ISO C Amendment 1 functions, plus extensions such as wcwidth().

The Open Group's summary of ISO C Amendment 1.

GNU libc

The Linux Console Tools

The Unicode Consortium character database and character set conversion tables are an essential resource for anyone developing Unicode-related tools.

Other conversion tables are available from Microsoft and from Keld Simonsen's WG15 archive.

Michael Everson's ISO10646-1 archive contains online versions of many of the more recent ISO 10646-1 amendments, plus many other goodies. See also his Roadmaps to the Universal Character Set.

An introduction to the Universal Character Set (UCS).

Otfried Cheong's essay on Han unification in Unicode

The AMS STIX project is working on revising and extending the mathematical characters for Unicode 4.0 and ISO 10646-2.

Jukka Korpela's Soft hyphen (SHY) - a hard problem? is an excellent discussion of the controversy surrounding U+00AD.

James Briggs' Perl, Unicode and I18N FAQ.

I add new material to this document continually, so please check back regularly. Suggestions for improvement are very welcome, as are announcements from the free software community about improved UTF-8 support. The use of UTF-8 under Linux is still quite new, so expect a lot of progress over the coming months.

Special thanks to Ulrich Drepper and Bruno Haible for their valuable comments.

Markus Kuhn <markus.kuhn@cl.cam.ac.uk>

Created on 1999-06-04 - last updated on 2000-01-15 - http://www.cl.cam.ac.uk/~mgk25/unicode.html
