UTF-8 and Unicode FAQ

By Markus Kuhn

China Linux Forum Translation Group Xlonestar [Translation] February 2000

This article describes what you need to know to use Unicode/UTF-8 on POSIX systems (Linux, Unix). Unicode is well on its way to taking the place of ASCII and Latin-1. It not only lets you handle text in practically any language used on earth, but also provides a comprehensive set of mathematical and technical symbols, which simplifies the exchange of scientific information.

The UTF-8 encoding provides a simple and backward-compatible way to use Unicode on operating systems, such as Unix, that were designed entirely around ASCII. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. Now is the time to get to know it.

What are UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then back to the original encoding, no information is lost.

UCS contains the characters needed to represent practically all known languages. This covers not only Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian, but also the Chinese, Japanese, and Korean ideographs, as well as Hiragana, Katakana, Bengali, Gurmukhi, Tamil, Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many others. Scripts that have not yet been added will be added eventually; research into how best to encode them is still under way. These include Tibetan, Khmer, Runic, Ethiopic, other hieroglyphic scripts, and various Indo-European scripts, as well as selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also covers a large number of graphical, typographical, mathematical, and scientific symbols, including those provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, OCR fonts, and many other word processing and publishing systems.

ISO 10646 defines a 31-bit character set. However, of this huge code space, only the first 65534 code positions (0x0000 to 0xFFFD) have been assigned so far. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). The characters that will later be encoded outside the 16-bit BMP are all very specialized characters (such as hieroglyphs) that only experts will use in historical and scientific fields. According to current plans, no characters will ever be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, covering characters encoded outside the BMP, is in preparation, but may take several years to complete. New characters are still being added to the BMP, but the existing characters are stable and will not change.

UCS assigns each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is usually preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those of US-ASCII (ISO 646), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, and also larger ranges outside the BMP, are reserved for private use.

What are combining characters?

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the preceding character. This way, an accent can be placed on any character. The most important accented characters, such as those used in the orthographies of common languages, have code positions of their own in UCS to ensure backward compatibility with older character sets. Accented characters that have their own code position, but could also be represented as a base character followed by a combining character, are known as precomposed characters. Precomposed characters are available in UCS for backward compatibility with older encodings, such as ISO 8859, that had no combining characters. The combining character mechanism allows accents and other diacritical marks to be added to any character. This is especially useful for scientific notation such as mathematical formulae and the International Phonetic Alphabet, where a base character may need one or more diacritical marks.

Combining characters follow the character that they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can be represented either by the precomposed UCS code U+00C4, or by a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or to add combining marks both above and below the base character. For example, in the Thai script, a single base character can carry up to two combining characters.
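The two representations of Ä described above can be sketched in C as arrays of UCS code values. This is only an illustration: the names a_precomposed, a_decomposed, is_combining_mark, and count_base_chars are hypothetical, and the test for combining characters is an assumption that looks only at the Combining Diacritical Marks block (U+0300 to U+036F). A real implementation would consult the Unicode character database.

```c
#include <stddef.h>

/* "Ä" as one precomposed character. */
const unsigned long a_precomposed[] = { 0x00C4 };

/* "Ä" as base letter A followed by a combining diaeresis. */
const unsigned long a_decomposed[] = { 0x0041, 0x0308 };

/* Simplifying assumption: only U+0300..U+036F are treated as combining. */
int is_combining_mark(unsigned long ucs)
{
    return ucs >= 0x0300 && ucs <= 0x036F;
}

/* Count the characters a reader would see: every code value that is
 * not a combining mark starts a new visible character. */
size_t count_base_chars(const unsigned long *s, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (!is_combining_mark(s[i]))
            count++;
    return count;
}
```

Both arrays describe one visible character, although the second one holds two code values. A level 1 implementation would accept only the first form.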

What are UCS implementation levels?

Not all systems are expected to support all the advanced mechanisms of UCS, such as combining characters. ISO 10646 therefore specifies the following three implementation levels:

Level 1

Combining characters and Hangul Jamo characters (a special, more complicated encoding of the Korean script, in which a Hangul syllable is coded as two or three subcharacters) are not supported.

Level 2

Like level 1, but for some scripts a fixed list of combining characters is allowed (for example for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least this minimal set of combining characters, UCS cannot represent these scripts adequately.

Level 3

All UCS characters are supported, so that, for example, a mathematician can place a tilde or an arrow (or both) on any character.

What is Unicode?

Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO). The other was the Unicode project, organized by a consortium of (mostly US) manufacturers of multilingual software. Fortunately, the participants of both projects realized that the world does not need two different unified character sets. They merged their results and worked together on a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and to coordinate any future extensions closely.

So what is the difference between Unicode and ISO 10646?

The Unicode Standard published by the Unicode Consortium contains exactly the Basic Multilingual Plane of ISO 10646-1 at implementation level 3. All characters are at the same positions and have the same names in both standards. The Unicode Standard additionally defines much more of the semantics associated with the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies in detail algorithms for rendering the presentation forms of some scripts (such as Arabic), for handling bidirectional text (for example, a mixture of Latin and Hebrew), and for sorting and string comparison, among many other things.

The ISO 10646 standard, on the other hand, is, like the well-known ISO 8859 standard, little more than a simple character set table. It specifies some terminology related to the standard, defines some encoding alternatives, and specifies how to use UCS in connection with other ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards, for example ISO 14651 on sorting UCS strings.

Considering that the Unicode Standard has an easy-to-remember name, is available in any good bookshop for a fraction of the price of the ISO version, and contains much more helpful additional information, it is no surprise that it has become the more widely used reference. However, it is generally thought that the fonts used for printing the ISO 10646-1 standard are of higher quality than those used for Unicode 2.0. Professional font designers are always advised to implement both standards, because some of the example glyphs differ significantly. The ISO 10646-1 standard shows four different style variants for ideographs such as those of Chinese, Japanese, and Korean (CJK), whereas the Unicode 2.0 tables show only a Chinese variant. This has led to the widespread belief that Unicode is not acceptable to Japanese users, which is wrong.

What is UTF-8?

First of all, UCS and Unicode are just code tables that assign integer numbers to characters. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of 2-byte or 4-byte units. The official names of these encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). An ASCII or Latin-1 file can be converted into a UCS-2 file simply by inserting a 0x00 byte before every ASCII byte. To convert it into UCS-4, three 0x00 bytes must be inserted before every ASCII byte.
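A minimal sketch of the Latin-1 to UCS-2 conversion just described, assuming big-endian output. The function name latin1_to_ucs2be is hypothetical, and the caller is assumed to supply an output buffer of at least 2 * n bytes.

```c
#include <stddef.h>

/* Convert n Latin-1 (or ASCII) bytes to big-endian UCS-2 by writing a
 * 0x00 byte in front of every input byte. Returns the number of bytes
 * written, which is always 2 * n. */
size_t latin1_to_ucs2be(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        out[2 * i]     = 0x00;   /* most significant byte first */
        out[2 * i + 1] = in[i];  /* Latin-1 value equals the UCS value */
    }
    return 2 * n;
}
```

Every second byte of the result is 0x00 for such input, which is exactly what causes trouble for C string functions.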

Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain bytes such as '\0' or '/' as parts of wide characters, and these bytes have a special meaning in file names and other C library function parameters. In addition, most Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables, and similar places.

The UTF-8 encoding defined in ISO 10646-1 Annex R and in RFC 2279 does not have these problems. It is clearly the way to use Unicode under Unix-style operating systems.

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the sequence for this character contains. All further bytes of a multi-byte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against lost bytes.

All possible 2^31 UCS codes can be encoded.

UTF-8 encoded characters can be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long.

The sorting order of big-endian UCS-4 byte strings is preserved.

The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. Which sequence is used depends on the UCS code number of the character:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least significant bit. Only the shortest possible multi-byte sequence that can represent the code number of the character may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte is equal to the number of bytes in the entire sequence.

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as

11100010 10001001 10100000 = 0xE2 0x89 0xA0
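The table and the two worked examples above can be checked with a small encoder. This is a minimal sketch rather than a reference implementation. The function name utf8_encode is hypothetical, and the caller is assumed to supply a buffer of at least 6 bytes.

```c
#include <stddef.h>

/* Encode one UCS code value as UTF-8 according to the table above.
 * Writes 1 to 6 bytes into out and returns the sequence length, or 0
 * if ucs lies outside the 31-bit UCS code space. */
size_t utf8_encode(unsigned long ucs, unsigned char *out)
{
    size_t len, i;

    if (ucs <= 0x7FUL) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if      (ucs <= 0x7FFUL)      len = 2;
    else if (ucs <= 0xFFFFUL)     len = 3;
    else if (ucs <= 0x1FFFFFUL)   len = 4;
    else if (ucs <= 0x3FFFFFFUL)  len = 5;
    else if (ucs <= 0x7FFFFFFFUL) len = 6;
    else return 0;

    /* Fill the continuation bytes (10xxxxxx) from right to left. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (ucs & 0x3F));
        ucs >>= 6;
    }
    /* First byte: len leading 1 bits, a 0 bit, then the remaining bits. */
    out[0] = (unsigned char)(((0xFFu << (8 - len)) & 0xFFu) | ucs);
    return len;
}
```

Because the shortest possible length is always chosen, the function never produces an overlong sequence.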

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write it in any other way (such as UTF8 or UTF_8) in documentation, unless of course you are referring to a variable name and not to the encoding itself.

Which programming languages support Unicode?

Most modern programming languages developed after about 1993 have a special data type for Unicode/ISO 10646-1 characters. It is called Wide_Character in Ada95 and char in Java.

ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, and more were added with Amendment 1 to ISO C in September 1994. These mechanisms were designed mainly with the various East Asian encodings in mind and are considerably more powerful than what is needed just to handle UCS. UTF-8 is an example of what the ISO C standard calls a multi-byte encoding, and the type wchar_t can be used to store Unicode characters.

How should Unicode be used under Linux?

Before UTF-8, Linux users in different regions used a wide variety of ASCII extensions. The most common were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 in Russia, EUC and Shift-JIS in Japan, and so on. This made the exchange of files difficult, and application software had to pay special attention to the differences between these encodings. Eventually, Unicode will replace all of them, primarily in the form of UTF-8. UTF-8 will be used in

text files (source code, HTML files, email messages, etc.)
file names
standard input, standard output, and pipes
environment variables
cut-and-paste selection buffers
telnet, modem, and serial connections to terminal emulators
any other place where byte strings used to be interpreted as ASCII

In UTF-8 mode, a terminal emulator such as xterm or the Linux console driver transforms every keystroke into the corresponding UTF-8 sequence and sends it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed by a UTF-8 decoder and then displayed using a 16-bit font.

Full Unicode functionality with all its refinements can only be expected from sophisticated multilingual word processing packages. What Linux will use to replace ASCII and the other 8-bit character sets is much simpler. In the first step, Linux terminal emulators and command-line tools will merely switch to UTF-8. This means that only ISO 10646-1 implementation level 1 is used (no combining characters), and that only scripts that need no further processing, such as Latin, Greek, Cyrillic, and many scientific symbols, are supported. At this level, UCS support is comparable to ISO 8859 support. The only significant difference is that many more characters are available and that a character can be represented by a multi-byte sequence.

One day Linux will certainly support combining characters as well, but even then precomposed characters should still be preferred over combining sequences where they are available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15.

In the future, one can consider adding support for the double-width characters used in Japanese and Chinese (which is relatively simple), support for combining characters, and perhaps even support for languages written from right to left, such as Hebrew (which is not that simple). But support for these advanced features should not delay the quick adoption of plain UTF-8 for Latin, Greek, Cyrillic, and scientific symbols, which replaces the multitude of European 8-bit encodings and provides a decent set of scientific symbols.

How do I have to modify my software?

There are two ways to add UTF-8 support, which I call soft conversion and hard conversion. With soft conversion, data is kept in its UTF-8 form everywhere, so very little software has to be modified. With hard conversion, the program converts the UTF-8 data that it reads into wide-character arrays and processes them internally in that form. On output, the strings are converted back to UTF-8.

Most applications work very well with soft conversion alone. This is what makes the introduction of UTF-8 on Unix feasible at all. For example, programs such as cat and echo do not have to be modified. They can remain entirely ignorant of whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They recognize only ASCII characters and control codes such as '\n', which do not change under UTF-8. The UTF-8 encoding and decoding for these applications is therefore done entirely in the terminal emulator.

Programs that determine the number of characters by counting bytes need a small modification. In UTF-8 mode, they must not count bytes in the range 0x80 to 0xBF, because these are only continuation bytes and not characters of their own. For example, the ls program must be modified, because it lays out the columns of the directory listing by counting the characters in file names. Similarly, all programs that assume a fixed-width font and format their output accordingly must learn how to count characters in UTF-8 text. Editor functions such as deleting a single character must be modified slightly so that they delete all bytes that belong to that character. This affects editors (vi, emacs, etc.) and programs that use the ncurses library. The Linux kernel also works well with soft conversion and needs only very minor modifications to support UTF-8. Most kernel functions that handle strings (for example file names and environment variables) are not affected. The following places have to be modified:

The console display and keyboard driver (another VT100 emulator) has to encode and decode UTF-8 and must support at least some subset of the Unicode character set. These features have been available since Linux 1.2.

External file system drivers such as VFAT and WinNT have to convert the character encoding of file names. UTF-8 has been added to the list of available conversion options, so the mount command has to tell the kernel driver that user processes want to see UTF-8 file names. Since VFAT and WinNT use Unicode anyway, UTF-8 has the advantage of guaranteeing that no information is lost in the conversion.

The tty driver of a POSIX system supports a "cooked" mode that offers some primitive line editing functions. For the character-erase function to work properly, stty has to set a UTF-8 mode in the tty driver, so that it does not count continuation bytes in the range 0x80 to 0xBF as characters. Bruno Haible has some Linux patches for stty and the kernel tty driver.
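The counting rule used above by ls, by editors, and by the tty driver (a byte in the range 0x80 to 0xBF never starts a character) can be sketched as follows. The function name utf8_count_chars is hypothetical, and the input is assumed to be a well-formed, NUL-terminated UTF-8 string.

```c
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string by skipping
 * continuation bytes, which all have the bit pattern 10xxxxxx. */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

Note that this counts characters, not display columns. Wide CJK characters and combining characters need the separate width handling discussed in the xterm sections.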

C support for Unicode and UTF-8

Starting with GNU glibc 2.1, the type wchar_t is officially intended to hold only 32-bit ISO 10646 values, independent of the current locale. glibc 2.2 will fully support the multi-byte conversion functions of ISO C (wprintf(), mbstowcs(), etc.). These functions can be used to convert between wchar_t and any locale-dependent multi-byte encoding, including UTF-8.

For example, you can write

wprintf(L"Schöne Grüße!\n");

Your software will then print this text in the encoding that your user has selected with the environment variable LC_CTYPE (for example, en_US.UTF-8 or de_DE.ISO_8859-1). Your compiler must be run in a locale that matches the encoding of the source file, so that the wide string is stored in the object file as a wchar_t string. On output, the run-time library converts the wchar_t string back into the encoding of the locale that is in effect at execution time.
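The run-time half of this process can be sketched with the ISO C functions setlocale() and wcstombs(). The helper names use_environment_locale and wide_to_locale are hypothetical. Which characters can be converted depends entirely on the locales installed on the system, so this is a sketch under that assumption.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the character encoding that the user selected through
 * LC_ALL, LC_CTYPE, or LANG. Returns 1 on success and 0 on failure.
 * Until this is called, a C program runs in the "C" locale. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Convert a wide string into the multi-byte encoding of the current
 * locale. Returns the number of bytes stored (without the final NUL),
 * or (size_t)-1 if a character cannot be represented in that encoding
 * or the buffer is too small. */
size_t wide_to_locale(const wchar_t *ws, char *out, size_t outsize)
{
    size_t n = wcstombs(out, ws, outsize);

    if (n == (size_t)-1 || n >= outsize)
        return (size_t)-1;   /* unconvertible, or no room for the NUL */
    return n;
}
```

A program would typically call use_environment_locale() once at startup. In a UTF-8 locale a non-ASCII wide string then comes out as UTF-8 bytes, and in an ISO 8859-1 locale as Latin-1 bytes.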

Note that in a statement like this:

char c = L'a';

only characters from U+0000 to U+007F (7-bit ASCII) are allowed. For non-ASCII characters, there is no direct conversion from wchar_t to char.

Functions such as readline() can now also work in UTF-8 locales.

How should the UTF-8 mode be activated?

If your application supports both the traditional 8-bit character sets (ISO 8859-*, KOI-8, etc.) and UTF-8, it has to know whether it should run in UTF-8 mode. Hopefully, in a few years people will use only UTF-8, so you can make it the default. Until then, you will have to support both the traditional 8-bit character sets and UTF-8.

Current applications use many different command-line switches to activate their respective UTF-8 modes, for example:

xterm: command-line option "-u8" and X resource "XTerm*utf8: 1"
gnat/gcc: command-line option "-gnatW8"
stty: command-line option "iutf8"
mined: command-line option "-U"
xemacs: elisp package that translates between UTF-8 and the internal MULE encoding
vim: 'fileencoding' option
less: environment variable LESSCHARSET=utf-8

Remembering a separate command-line option or other configuration method for every application is very tedious, so a standard method is urgently needed.

If you use hard conversion in your application and rely on the relevant C library functions to convert between the external character encoding and the internal wchar_t encoding, then the C library takes care of the mode switching for you. You only have to set the environment variable LC_CTYPE correctly, for example to en.UTF-8 if you use UTF-8, or to en.ISO_8859-1 if you use Latin-1.

However, most maintainers of existing applications choose soft conversion rather than libc's wide-character functions, not only because those functions are not yet widely available, but also because using them would require extensive modifications to the software. In this case, your application has to find out by itself when to use UTF-8 mode. One way to do this is the following:

Look for the first environment variable that has a value, in the order LC_ALL, LC_CTYPE, LANG. If this value contains the substring UTF-8 (possibly in lower case or without the "-"), default to UTF-8 mode (which can still be overridden by a command-line switch), because this value reliably indicates that the C library has been asked to use a UTF-8 locale.

It is still useful to provide a command-line option (or, for X clients, an X resource) that overrides the default determined from environment variables such as LC_CTYPE.
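The environment check described above can be sketched as follows. The function names mentions_utf8 and locale_wants_utf8 are hypothetical. A real application would let its own command-line option override the result.

```c
#include <ctype.h>
#include <stdlib.h>

/* Return 1 if s contains "UTF-8" or "UTF8" in any letter case, else 0. */
int mentions_utf8(const char *s)
{
    for (; *s != '\0'; s++) {
        if (tolower((unsigned char)s[0]) == 'u' &&
            tolower((unsigned char)s[1]) == 't' &&
            tolower((unsigned char)s[2]) == 'f') {
            const char *p = s + 3;

            if (*p == '-')      /* the hyphen is optional */
                p++;
            if (*p == '8')
                return 1;
        }
    }
    return 0;
}

/* Inspect LC_ALL, LC_CTYPE, and LANG in that order. The first variable
 * that is set to a non-empty value decides the default mode. */
int locale_wants_utf8(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);

        if (v != NULL && *v != '\0')
            return mentions_utf8(v);
    }
    return 0;   /* nothing set: stay in the traditional 8-bit mode */
}
```

Only the first variable that has a value is examined, which mirrors the precedence the C library itself uses for these variables.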

How do I get a UTF-8 version of xterm?

The xterm version that comes with XFree86 has recently been extended with UTF-8 support. To use it, get xterm patch #119 (1999-10-16) or later, compile it with "./configure --enable-wide-chars ; make", and then call xterm with the command-line option -u8 so that it converts its input and output to UTF-8. Use a *-ISO10646-1 font in UTF-8 mode. A *-ISO10646-1 font can also be used in ISO 8859-1 mode, because ISO 10646-1 fonts are fully backward compatible with ISO 8859-1 fonts.

The new xterm version with UTF-8 support, as well as some ISO 10646-1 fonts, will be included in XFree86 4.0.

Does xterm support combining characters?

Xterm currently supports only ISO 10646-1 implementation level 1, which means that combining characters are not supported. At present, combining characters are treated as space characters. Future revisions of xterm are quite likely to add simple combining character support that merely overstrikes the base character with one or more combining characters. This gives acceptable results for accents below the baseline and for accents on top of small characters. It also works well for scripts such as Thai, for which fonts with specially designed overstriking characters exist. However, for some fonts, especially the "fixed" family, the results for accents on taller characters are not fully satisfactory. Precomposed characters should therefore continue to be preferred wherever they are available.

Does xterm support half-width and full-width CJK fonts?

Xterm currently supports only fonts in which all glyphs have the same width. Future revisions are likely to add half-width and full-width character support for the CJK languages, similar to what kterm offers. If the selected normal font is X×Y pixels in size and the wide-character mode is switched on, xterm will try to load an additional font of 2X×Y pixels (the same XLFD, except for a doubled value of the AVERAGE_WIDTH property). It will use this font to display all Unicode characters that have been assigned the East Asian Wide (W) or East Asian FullWidth (F) property in Unicode Technical Report #11. The following C function tests whether a Unicode character is wide and has to be displayed with a glyph that covers two character cells:

/* This function tests whether the ISO 10646/Unicode character code
 * ucs belongs to the East Asian Wide (W) or East Asian FullWidth
 * (F) category as defined in Unicode Technical Report #11. In this
 * case, the terminal emulator should represent the character using
 * a glyph from a double-wide font that covers two normal (Latin)
 * character cells. */
int iswide(int ucs)
{
  if (ucs < 0x1100)
    return 0;
  return
    (ucs >= 0x1100 && ucs <= 0x115f) || /* Hangul Jamo */
    (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
     ucs != 0x303f) ||                  /* CJK ... Yi */
    (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
    (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
    (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
    (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
    (ucs >= 0xffe0 && ucs <= 0xffe6);
}

Some C libraries also provide the functions

#include <wchar.h>
int wcwidth(wchar_t wc);
int wcswidth(const wchar_t *pwcs, size_t n);

which determine the number of column positions required for the wide character wc, or for the n wide-character codes (or fewer than n, if a null wide character is encountered before n codes have been read) in the string pointed to by pwcs. These functions are defined in the Single UNIX Specification of The Open Group. A Latin, Greek, Cyrillic, or similar character requires one column position, a CJK ideograph requires two, and a combining character requires zero.

Will xterm eventually support right-to-left writing?

There are currently no plans to add right-to-left functionality to xterm. Hebrew and Arabic users therefore have to send Hebrew and Arabic strings to the terminal already arranged in left-to-right order. In other words, the bidirectional processing has to be done in the application, not in xterm. The situation for Hebrew and Arabic is at least better than under ISO 8859, because precomposed glyphs and presentation forms are available. It is far from decided whether xterm should support bidirectional text and how that would work. Both ISO 6429 = ECMA-48 and the Unicode bidi algorithm offer alternative starting points. See also ECMA Technical Report TR/53. Xterm also does not implement the formatting algorithms for Arabic, Hangul, or Indic scripts, and it is not yet clear whether doing so in a VT100 emulator is feasible and worthwhile, or whether this should be left to the application. If you plan to support bidirectional text output in your application, have a look at FriBidi, Dov Grobgeld's free implementation of the Unicode bidi algorithm.

Where can I find ISO 10646-1 X11 fonts?

Quite a number of Unicode fonts have become available over the past few months, and their number is growing rapidly.

Markus Kuhn is working together with many other volunteers on extending the old -misc-fixed-*-iso8859-1 fonts to cover all European character repertoires (Latin, Greek, Cyrillic, International Phonetic Alphabet, mathematical and technical symbols, and in some fonts even Armenian, Georgian, Katakana, etc.). For more information, see the Unicode fonts and tools for X11 page. These fonts will be distributed together with XFree86. For example, the font

-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1

(an extension of the old xterm fixed default font that contains more than 3000 characters) is already part of the XFree86 3.9 snapshot.

Markus has also prepared ISO 10646-1 versions of all the Adobe and B&H BDF fonts in the X11R6.4 distribution. These fonts already contained the full PostScript font repertoire (about 30 additional characters, most of which are also used by CP1252 MS-Windows, such as smart quotes and dashes), which was not accessible under ISO 8859-1. These characters are now fully available in the ISO 10646-1 versions.

XFree86 4.0 will come with an integrated TrueType font engine, which allows your X applications to use any Apple/Microsoft font in the ISO 10646-1 encoding.

Future XFree86 releases are likely to remove most of the old BDF fonts from the distribution and replace them with ISO 10646-1 encoded versions. The X server will be extended with an automatic encoding converter that creates fonts in encodings such as ISO 8859-* from the ISO 10646-1 font files, but only when old 8-bit software requests such a font. Modern software should preferably use the ISO 10646-1 font encoding.

ClearlyU (cu12) is a very useful 12 point, 100 dpi proportional ISO 10646-1 BDF font for X11 with more than 3700 characters, provided by Mark Leisher.

Roman Czyborra's GNU Unicode font project is collecting a complete and free 8×16/16×16 pixel Unicode font. It currently covers 34000 characters.

etl-unicode is an ISO 10646-1 BDF font provided by Primoz Peterlin.

The names of Unicode X11 fonts end in -ISO10646-1. This is the officially registered value of the X Logical Font Descriptor (XLFD) fields CHARSET_REGISTRY and CHARSET_ENCODING for all Unicode and ISO 10646-1 fonts. Each *-ISO10646-1 font contains some subset of the entire Unicode character set, and users have to make sure that the font they select covers the characters they need. The *-ISO10646-1 fonts usually also specify a DEFAULT_CHAR value that points to a special non-Unicode glyph, which is used to represent characters that are not available in the font (usually a dashed box the size of an H, located at 0x1f or 0xffe). This way, users can at least see that there is an unsupported character. The small fixed-width fonts for xterm, such as 6x13, will never cover all of Unicode, because many scripts, such as Japanese Kanji, can only be represented at pixel sizes considerably larger than those commonly used by European users. Typical Unicode fonts for European use will contain only a subset of about 1000 to 3000 characters.

How can I find out which glyphs an X font contains?

The X protocol offers no easy way for an application to find out which characters a cell-spaced font provides. Mark Leisher and Erik van der Poel (Netscape) have therefore specified a new _XFREE86_GLYPH_RANGES BDF property, which tells applications which Unicode subset a BDF font covers. Mark Leisher provides sample code to generate and scan this property, and xmbdfed 3.9 and later automatically adds it to every BDF file that it produces.

What issues are related to UTF-8 terminal emulators?

VT100 terminal emulators accept ISO 2022 (= ECMA-35) ESC sequences for switching between different character sets.

In the sense of ISO 2022, UTF-8 is an "other coding system" (see section 15.4 of ECMA-35). UTF-8 is outside the ISO 2022 SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8, all SS2/SS3/G0/G1/G2/G3 states become meaningless until you leave UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding: a self-terminating short byte sequence completely determines which character is meant, independent of any switching state. G0 and G1 in ISO 10646 are the same as in ISO 8859-1, and G2/G3 do not exist in ISO 10646, because every character has a fixed position and no switching takes place. In UTF-8 mode, your terminal cannot end up in a strange graphics-character mode just because you accidentally dumped a binary file to it. This makes a terminal much more robust in UTF-8 mode than in ISO 2022 mode, and it is therefore useful to have a way of locking a terminal into UTF-8 mode so that it cannot accidentally return to the ISO 2022 world.

The ISO 2022 standard specifies a range of ESC % sequences for leaving the ISO 2022 world (designation of other coding systems, DOCS). A number of such sequences have been registered for UTF-8 in the ISO 2375 International Register of Coded Character Sets:

ESC %G activates UTF-8 with an unspecified implementation level from ISO 2022 and allows a return to ISO 2022.
ESC %@ returns from UTF-8 to ISO 2022, provided that UTF-8 was entered via ESC %G.
ESC %/G switches to UTF-8 level 1 with no return.
ESC %/H switches to UTF-8 level 2 with no return.
ESC %/I switches to UTF-8 level 3 with no return.

While a terminal emulator is in UTF-8 mode, all ISO 2022 escape sequences, such as those for switching G2/G3, are ignored. The only ISO 2022 sequence that a terminal emulator in UTF-8 mode acts on is ESC %@, which returns from UTF-8 to the ISO 2022 scheme. UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before interpreting any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F.
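The decode-first rule can be illustrated with a small decoder followed by a C1 check. This is a minimal sketch, not xterm's actual code, and the function names utf8_decode and is_c1_control are hypothetical. It rejects overlong sequences, in line with the shortest-sequence rule of UTF-8.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from the first n bytes of s. Stores the
 * code value in *ucs and returns the number of bytes consumed, or 0 if
 * the input is truncated, malformed, or not the shortest encoding. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *ucs)
{
    /* Smallest code value that legitimately needs len bytes. */
    static const unsigned long min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000UL, 0x200000UL, 0x4000000UL };
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *ucs = s[0];
        return 1;
    }
    if      ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else if ((s[0] & 0xFC) == 0xF8) { len = 5; c = s[0] & 0x03; }
    else if ((s[0] & 0xFE) == 0xFC) { len = 6; c = s[0] & 0x01; }
    else return 0;               /* stray continuation byte, 0xFE, or 0xFF */

    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_code[len])
        return 0;                /* overlong encoding */
    *ucs = c;
    return len;
}

/* C1 control characters occupy U+0080 to U+009F after decoding. */
int is_c1_control(unsigned long ucs)
{
    return ucs >= 0x80 && ucs <= 0x9F;
}
```

For example, CSI (U+009B) arrives as the two bytes 0xC2 0x9B. A lone 0x9B byte is a stray continuation byte and must not be treated as CSI.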

Which applications support UTF-8?

Yudit is Gaspar Sinai's free X11 Unicode editor.
Mined 98 by Thomas Wolff is a text editor that can handle UTF-8.
less version 346 or later supports UTF-8.
C-Kermit 7.0 supports UTF-8 as a transfer, terminal, and file character set.
Sam is the Plan9 UTF-8 editor, similar to vi, and is also available for Linux and Win32. (Plan9 was the first operating system to switch completely to UTF-8 as its character encoding.)
9term by Matty Farrow is a Unix port of the Plan9 terminal emulator.
Wily is a Unix implementation of the Plan9 Acme editor.
ucm-0.1 is Juliusz Chroboczek's Unicode character map, a small utility that lets you select any Unicode character and paste it into your application.

What patches for improving UTF-8 support are available?

Robert Brady provided a patch for less 340 (now incorporated into less 344).
Bruno Haible provided various patches for stty, the Linux kernel tty, and more.
Otfried Cheong wrote the Unicode encoding for GNU Emacs toolkit, which enables Mule to handle UTF-8 files.

How do PostScript glyph names relate to UCS codes?

Refer to Adobe's Unicode and Glyph Names Guide.

How does cut and paste work with UTF-8?

See Juliusz Chroboczek's "Inter-client exchange of Unicode text", a proposal for an extension to the ICCCM that handles UTF-8 selections with a new atom, UTF8_STRING, which can be used as a property type and as a selection target.

Are there any free libraries for processing Unicode?

IBM Classes for Unicode.
Mark Leisher's UCData Unicode character property library and wchar_t support test code.

What is the status of Unicode support in the various X widget libraries?

GScript - Unicode and Complex Text Processing is a project to add full-featured Unicode support to GTK.
Qt 2.0 now supports the use of *-ISO10646-1 fonts.
FriBidi is Dov Grobgeld's free implementation of the Unicode bidirectional algorithm.

Are there any good mailing lists on this topic?

You should certainly subscribe to the Unicode@unicode.org mailing list, which is the best way to follow what the authors of the standard and many other experts have to say. To subscribe, send a message to Unicode-Request@unicode.org with "subscribe" as the subject and "subscribe your@email.address unicode" as the body.

There is also a mailing list dedicated to improving UTF-8 support in applications on GNU/Linux systems, Linux-UTF8@nl.linux.org. To subscribe, send a subscription request to Majordomo@nl.linux.org. You can also browse the linux-utf8 archive. In addition, there are the XFree86 group "fonts" and "i18n" lists, but you must be an official developer to subscribe.

Further references

Bruno Haible's Unicode HOWTO.
The Unicode Standard, Version 2.0.
Unicode Technical Reports.
Mark Davis' Unicode FAQ.
ISO/IEC 10646-1:1993.
Frank Tang's Iñtërnâtiônàlizætiøn Secrets.
Unicode Support in the Solaris 7 Operating Environment.
The USENIX paper by Rob Pike and Ken Thompson on the introduction of UTF-8 under Plan9 reports on the first operating system to migrate completely to UTF-8, as early as 1992 (at the time it was still called UTF-2).
Li18nux is a project initiated by several Linux distributors to enhance Unicode support for Linux.
The online Single Unix Specification contains definitions of all the ISO C Amendment 1 functions, plus extensions such as wcwidth().
The Open Group's summary of ISO C Amendment 1.
GNU libc.
The Linux Console Tools.
The Unicode Consortium character database and character set conversion tables are an essential resource for anyone developing Unicode-related tools.
Other conversion tables are available from Microsoft and from Keld Simonsen's WG15 archive.
Michael Everson's ISO10646-1 archive contains online versions of many of the more recent ISO 10646-1 amendments, plus many other goodies. See also his Roadmaps to the Universal Character Set.
An introduction into The Universal Character Set (UCS).
Otfried Cheong's essay on Han Unification in Unicode.
The AMS STIX project is working on revising and extending the mathematical characters for Unicode 4.0 and ISO 10646-2.
Jukka Korpela's "Soft hyphen (SHY) - a hard problem?" is an excellent discussion of the controversy surrounding U+00AD.
James Briggs' Perl, Unicode and I18N FAQ.

I continually add new material to this document, so please check back regularly. Suggestions for improvement are welcome, as is advocacy for better UTF-8 support within the free software community. UTF-8 is still new on Linux, so expect a lot of progress over the coming months.

Thanks to Ulrich Drepper and Bruno Haible for their valuable comments.

Markus Kuhn <Markus.kuhn@cl.cam.ac.uk>, created 1999-06-04, last modified 2000-01-15, http://www.cl.cam.ac.uk/~mgk25/unicode.html

