I18n and L10n

xiaoxiao2021-03-06 94

I18n is short for internationalization (there are 18 characters between the first letter "i" and the last letter "n"). It is the process of modifying code so that it is completely independent of any culture-specific information. That information is stored in external files and loaded when the program runs. Some people think that moving every culture-specific string out of the program is all that internationalization requires. In fact, there is more to consider, mainly:

Extract strings, icons, and images from the program.
Specify the code page of the text, and define code page conversions where they are needed.
Modify all text handling so that it adapts to the code page.
Modify the logic of all functions whose output depends on formats (such as date, time, currency, and numbers).
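As a minimal sketch of the first item (moving strings out of the code), the Java example below looks up a message by key instead of hard-coding it. The class name ExternalStrings, the key "greeting", and the message texts are hypothetical; in a real program the property text would live in per-language files loaded at run time rather than in inline constants.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical example: user-visible text is looked up by key, never hard-coded.
class ExternalStrings {
    // Stand-ins for the contents of two external, per-language property files.
    static final String EN = "greeting=Hello, {0}!\n";
    static final String DE = "greeting=Hallo, {0}!\n";

    // Parses property text into a bundle, as a file-based loader would.
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The code only knows the key; the wording comes from the bundle.
    static String greet(ResourceBundle bundle, String name) {
        return MessageFormat.format(bundle.getString("greeting"), name);
    }

    public static void main(String[] args) {
        System.out.println(greet(load(EN), "Ana")); // Hello, Ana!
        System.out.println(greet(load(DE), "Ana")); // Hallo, Ana!
    }
}
```

Translating the program then means adding another property file, with no change to the code.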

What is a code page?

As we know, a computer only understands numbers. To have it handle text, each character of a language is assigned a specific value. Simply put, such a table that maps characters to values is called a code page. In this context you will often hear terms such as charset, charmap, encoding, and coded character set. Although there are subtle differences among them, you can treat them all as referring to a mapping table between the characters of a language and numeric values. ASCII is a well-known example: it maps the English alphabet and some control characters to specific values.

What problems come with code pages?

ASCII maps 128 characters, so 7 bits are enough to represent all of them. Programs typically process text in buffers of 8-bit bytes. This becomes a problem for the code pages of other languages. Japanese, for example, has thousands of characters, while 8 bits can represent only 256 distinct values, so a single byte cannot uniquely identify every Japanese character. People therefore use several bytes to represent one Japanese character. That raises another problem: the number of bytes in a buffer no longer equals the number of characters in it, and even simple string operations have to assemble characters from bytes.

Recognizing this complexity, developers adopted a technique called wide characters to handle such strings. A wide character is basically a 16-bit or 32-bit data type, large enough to meet the needs of Asian languages. String processing then no longer uses an 8-bit buffer (char *) but a 16-bit buffer (unsigned short *), so each time you advance the pointer you are guaranteed to move by a whole character (rather than possibly by half of one, as before).
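The difference between counting bytes and counting wide characters can be seen in a short Java sketch. It is only an illustration under two assumptions: UTF-8 stands in for the multi-byte encoding (a legacy Japanese code page may not be installed in every Java runtime), and Java's 16-bit char plays the role of the wide character. The class name ByteVsChar is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustration: the same text measured in bytes and in 16-bit characters.
class ByteVsChar {
    // Size of the text in an 8-bit buffer (multi-byte encoding, here UTF-8).
    static int byteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of the text in a 16-bit (wide character) buffer.
    static int charCount(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // three Japanese characters
        System.out.println(byteCount(text)); // 9: three bytes per character
        System.out.println(charCount(text)); // 3: one 16-bit unit per character
    }
}
```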

Different developers using different code pages causes confusion. The same Japanese character may be represented by the two bytes 0x95 and 0x5C on one machine and by 0xC9 and 0xBD on another. As a result, data must be converted every time it is exchanged (this is called codeset conversion).

What is Unicode? How does it solve this problem?

Having a different code page for each language increases the complexity of software that must support several languages. A worldwide standard called Unicode (http://www.unicode.org) was therefore developed. Unicode assigns a unique value to every character, no matter what the platform, the program, or the language. In other words, all characters used in the world are listed, and each one is given its own unique value.

What is UTF-8? How is it related to Unicode?

Unicode's initial goal was to map more than 65,000 characters with a single 16-bit encoding. That turned out to be insufficient: it could not cover all historical scripts, and it caused implementation headaches, especially in network-based applications, because some software had to do a lot of work to handle 16-bit data. Unicode therefore set aside some reserved values and defined three encoding forms: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, using one or more bytes per character. Its greatest advantage is that ASCII characters are encoded exactly as they are in ASCII; for example, "A" is 0x41 in both UTF-8 and ASCII.
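The ASCII compatibility of UTF-8 is easy to check. The Java sketch below prints the bytes of a few characters in hexadecimal; the class name Utf8AsciiCompat and the helper hex are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration: ASCII characters keep their single-byte value in UTF-8,
// while other characters take two or more bytes.
class Utf8AsciiCompat {
    // Formats bytes as space-separated upper-case hexadecimal pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII)));   // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));      // 41
        System.out.println(hex("\u00E9".getBytes(StandardCharsets.UTF_8))); // C3 A9
        System.out.println(hex("\u65E5".getBytes(StandardCharsets.UTF_8))); // E6 97 A5
    }
}
```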

UTF-16 and UTF-32 are the 16-bit and 32-bit encoding forms of Unicode. Because of the original design goal, "Unicode" often means UTF-16. When discussing Unicode, it is important to be clear about which encoding form is meant. See http://www.unicode.org/unicore/standard/principles.html.
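Why the encoding form matters can be sketched in Java by measuring one character inside the original 16-bit range and one outside it. The class name EncodingForms and the two sample characters are choices made for this illustration; UTF-32 is not called here, since it always uses four bytes per character.

```java
import java.nio.charset.StandardCharsets;

// Illustration: the size of a character depends on the encoding form.
class EncodingForms {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Number of Unicode characters (code points), independent of encoding.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String inside = "\u65E5";                                // U+65E5
        String outside = new String(Character.toChars(0x1D11E)); // U+1D11E
        for (String s : new String[] { inside, outside }) {
            System.out.println(codePoints(s) + " character, UTF-8 bytes: "
                    + utf8Length(s) + ", UTF-16 bytes: " + utf16Length(s));
        }
        // U+65E5:  1 character, 3 bytes in UTF-8, 2 bytes in UTF-16
        // U+1D11E: 1 character, 4 bytes in UTF-8, 4 bytes in UTF-16
    }
}
```

The second character needs two 16-bit units in UTF-16, so even with 16-bit buffers one unit is not always one character.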

What does "culture-specific information" include?

Hard-coded strings: No culture-specific string may be embedded in the program. Such strings are placed in external files so that they can be translated into various languages.

Character classification: How are characters classified? In English, for example, characters can be divided into uppercase and lowercase. A C programmer can check this with isupper() and islower(). When dealing with many languages, more classification schemes have to be considered, and in some languages classification by case is meaningless.

Number and currency formats: Currency symbols and the way digits are grouped differ from country to country.

Date and time formats: Which comes first: the year, the month, or the day?

Sorting and ordering: To compare the characters "a" and "B", you can simply compare their ASCII values. In other code pages, however, the numeric values may not reflect the expected order, so character order must be determined by special rules.
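Three of these points (classification, number grouping, and ordering) can be sketched with standard Java classes. The class name CultureSpecific and the sample values are hypothetical; the example assumes the default locale data shipped with the Java runtime.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Locale;

// Illustration of culture-dependent behavior using standard library classes.
class CultureSpecific {
    // Digit grouping and the decimal separator depend on the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Ordering by raw character values, as a comparison of ASCII codes would do.
    static int compareByValue(String a, String b) {
        return a.compareTo(b);
    }

    // Ordering by the rules of a locale.
    static int compareByLocale(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.5, Locale.US));      // 1,234,567.5
        System.out.println(formatNumber(1234567.5, Locale.GERMANY)); // 1.234.567,5

        // By value, "a" (97) sorts after "B" (66); by locale rules it sorts before.
        System.out.println(compareByValue("a", "B") > 0);             // true
        System.out.println(compareByLocale("a", "B", Locale.US) < 0); // true

        // A Japanese ideograph is a letter, but it is neither upper nor lower case.
        char kanji = '\u65E5';
        System.out.println(Character.isLetter(kanji));    // true
        System.out.println(Character.isUpperCase(kanji)); // false
        System.out.println(Character.isLowerCase(kanji)); // false
    }
}
```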

What is a locale?

The main goal of i18n is to extract all culture-specific information from the code. This means the data must be loaded at run time so that the software can work properly in the target language. In developers' terms, a locale is simply a handle to this dynamically loaded data. In practice, a locale is made up of a language, a territory, and a code page.
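The three components are visible in POSIX-style locale names such as ja_JP.eucJP. The Java sketch below splits such a name into its parts; the class name LocaleParts and the method parsePosixLocale are hypothetical, and the optional "@modifier" suffix of POSIX names is deliberately ignored. Note that Java's own Locale class carries the language and territory only, so the code page is kept as a separate string.

```java
import java.util.Locale;

// Illustration: splitting a POSIX-style locale name into its components.
class LocaleParts {
    // Returns { language, territory, codeset }; missing parts are empty strings.
    static String[] parsePosixLocale(String name) {
        String codeset = "";
        int dot = name.indexOf('.');
        if (dot >= 0) {
            codeset = name.substring(dot + 1);
            name = name.substring(0, dot);
        }
        String territory = "";
        int underscore = name.indexOf('_');
        if (underscore >= 0) {
            territory = name.substring(underscore + 1);
            name = name.substring(0, underscore);
        }
        return new String[] { name, territory, codeset };
    }

    // Builds a java.util.Locale from the language and territory parts.
    static Locale toJavaLocale(String[] parts) {
        return new Locale.Builder().setLanguage(parts[0]).setRegion(parts[1]).build();
    }

    public static void main(String[] args) {
        String[] parts = parsePosixLocale("ja_JP.eucJP");
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]); // ja / JP / eucJP
        System.out.println(toJavaLocale(parts).toLanguageTag());            // ja-JP
    }
}
```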

What is code page conversion? Why is it needed?

Because there are multiple code pages (for Japanese, for example, EUC, Shift-JIS, UTF-8, and so on), code page conversion is sometimes needed when data is exchanged between two systems. Another case is a distributed system that stores its data in a Unicode encoding: the data must be converted to the local encoding when it is displayed, and converted back when user input is received. Code page conversion is usually done with lookup tables, which is expensive. Only a few conversions, such as UTF-16 to UTF-8, can be done with an algorithm.
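To show what "done with an algorithm" means, here is a hand-written sketch of the UTF-16 to UTF-8 conversion in Java. It needs no lookup table, only bit shifts. The class name Utf16ToUtf8 is hypothetical, and the sketch assumes well-formed input (no unpaired surrogates); production code would normally use the library converter instead.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: table-free conversion of UTF-16 text (a Java String) to UTF-8 bytes.
class Utf16ToUtf8 {
    static byte[] encode(String utf16) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < utf16.length(); ) {
            int cp = utf16.codePointAt(i); // joins a surrogate pair into one value
            i += Character.charCount(cp);
            if (cp < 0x80) {               // 1 byte: 0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "A\u00E9\u65E5" + new String(Character.toChars(0x1D11E));
        // The hand-written result matches the library's own UTF-8 encoder.
        System.out.println(Arrays.equals(encode(text), text.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```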

What is a shift-sensitive code page?

In some Asian code pages, special byte sequences are used to switch between single-byte and double-byte characters. In the Japanese ISO-2022-JP code page, for example, the byte sequence \x1b\x24\x40 marks a switch from single-byte to double-byte characters, and the byte sequence \x1b\x28\x42 marks a switch from double-byte back to single-byte characters. In other words, when you parse a string and encounter \x1b\x24\x40, you know that double-byte characters follow. One of these marker sequences is called the shift-in sequence and the other the shift-out sequence. These switch points make string parsing harder, because you have to keep track of the current state in order to interpret the bytes that follow. ISO-2022-JP and ISO-2022-KR are two well-known shift-sensitive code pages for Japanese and Korean, and they are used in many HTML pages.

What is an EBCDIC code page?

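EBCDIC (Extended Binary Coded Decimal Interchange Code) is a family of code pages used mainly on IBM mainframe systems. Its disadvantage is that even English characters do not have the same values as in ASCII. Comparison tables between EBCDIC and ASCII are available.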
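Returning to shift-sensitive code pages: the state tracking they require can be sketched in Java as follows. The example counts the characters in an ISO-2022-JP byte stream by remembering whether the parser is currently in single-byte or double-byte mode. The class name ShiftStateScanner is hypothetical. Besides the two sequences quoted above, the sketch also accepts \x1b\x24\x42 and \x1b\x28\x4a, which ISO-2022-JP (RFC 1468) defines as well; it is a simplified illustration, not a full decoder.

```java
// Sketch: counting characters in ISO-2022-JP data by tracking the shift state.
class ShiftStateScanner {
    static int countCharacters(byte[] data) {
        boolean doubleByte = false; // the state that must survive between bytes
        int count = 0;
        int i = 0;
        while (i < data.length) {
            if (data[i] == 0x1B && i + 2 < data.length) {
                int b1 = data[i + 1];
                int b2 = data[i + 2];
                if (b1 == 0x24 && (b2 == 0x40 || b2 == 0x42)) { // ESC $ @ or ESC $ B
                    doubleByte = true;
                    i += 3;
                    continue;
                }
                if (b1 == 0x28 && (b2 == 0x42 || b2 == 0x4A)) { // ESC ( B or ESC ( J
                    doubleByte = false;
                    i += 3;
                    continue;
                }
            }
            i += doubleByte ? 2 : 1; // the same bytes mean different things per state
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = {
            'A',
            0x1B, 0x24, 0x42,       // switch to double-byte characters
            0x46, 0x7C, 0x4B, 0x5C, // two double-byte characters
            0x1B, 0x28, 0x42,       // switch back to single-byte characters
            'B'
        };
        System.out.println(countCharacters(data)); // 4
    }
}
```

Without the state, the four bytes 0x46 0x7C 0x4B 0x5C would be read as the ASCII text "F|K\" instead of two Japanese characters.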

What is a good reference book on Asian character sets for beginners?

Ken Lunde's CJKV Information Processing.

What is internationalization QA?

It is QA that focuses on testing a product's language compatibility. This includes testing the product's ability to detect its language environment, to initialize language processing accordingly, and to adapt to that environment. White-box tests should typically check that the code follows I18n compatibility standards (for example, that it uses the correct APIs). Black-box tests should typically include regression tests of all product functions under different locales and localized languages. Culture-specific information (such as the display of dates and times) also needs to be checked.

What is localization QA?

Localization QA is performed after the software has been localized (translated). It focuses not only on functionality but also on whether the translated text is appropriate in its GUI context. It also includes checking the GUI layout to make sure that no text is truncated. It is usually carried out by a native speaker of the target language.

I only know English. Can I do internationalization QA?

Of course. The I18n part of QA does not require knowledge of any particular language, although some familiarity will be useful sooner or later, especially when setting up the language environment. When you do I18n QA, the product has not yet been translated, so its interface is still displayed in English. Some error messages from the environment may appear in a language you cannot read, but you can usually look up the corresponding English text by error code in an English message catalog.

In an Asian language environment, what problem often causes software to fail?

In many Asian code pages, the second byte of a multi-byte character can fall within the ASCII range. Software easily breaks on such characters.
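A typical case is a search for the backslash (0x5C) in Shift-JIS data, where 0x5C can also be the second byte of a character such as 0x95 0x5C. The Java sketch below contrasts a naive byte search with one that skips the second byte of double-byte characters. The class name TrailByteSearch and its methods are hypothetical; the lead-byte ranges used are those of Shift-JIS (0x81 to 0x9F and 0xE0 to 0xFC).

```java
// Sketch: why byte-wise searching breaks on double-byte characters.
class TrailByteSearch {
    static boolean isShiftJisLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive search: treats every byte as a complete character.
    static int naiveIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            if ((text[i] & 0xFF) == target) {
                return i;
            }
        }
        return -1;
    }

    // Lead-byte-aware search: skips the second byte of each double-byte character.
    static int shiftJisIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            int b = text[i] & 0xFF;
            if (isShiftJisLeadByte(b)) {
                i++; // the next byte belongs to this character
                continue;
            }
            if (b == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = { (byte) 0x95, 0x5C, 'a' }; // one double-byte character, then "a"
        System.out.println(naiveIndexOf(text, 0x5C));    // 1 (false match inside a character)
        System.out.println(shiftJisIndexOf(text, 0x5C)); // -1 (there is no real backslash)
    }
}
```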

How do I convert a string from one encoding to another?

Use the iconv command in a *NIX environment, or download jiconv.

How do I see the hexadecimal code of a character?

Use od in a *NIX environment, or hod.exe in a Windows environment.

What is jchardet?

jchardet is a Java port of Mozilla's automatic character set detection algorithm; its source code can be downloaded from SourceForge. The original author of the algorithm is Frank Tang. The original source code is at http://www.infomall.cn/cgi-bin/mallgate/20040514/Http://lxr.mozilla.org/mozilla/source/intl/chardet/, and more information about the algorithm is available at http://www.infomall.cn/cgi-bin/mallgate/20040514/http://www.mozilla.org/projects/intl/chardet.html.

How do I use jchardet?

Compilation and application

After extracting the downloaded chardet.zip, go to the ~/mozilla/intl/chardet/java/ directory and run ant to generate chardet.jar in the dist/lib directory. Add this JAR to your classpath, and then run:

Command: java org.mozilla.intl.chardet.HtmlCharsetDetector http://hedong.3322.org

Result: Charset = GB18030

Command: java org.mozilla.intl.chardet.HtmlCharsetDetector http://www.wesnapcity.com/

Result: Charset = ASCII

Command: java org.mozilla.intl.chardet.HtmlCharsetDetector http://www.wesnapcity.com/blog/

Result: Charset = UTF-8

Programming

The following walks through HtmlCharsetDetector.java from chardet.jar:

// Implement the nsICharsetDetectionObserver interface, which has a single Notify() method.
// Notify() is called when the jchardet engine believes it has identified the character set
// (whether the guess is right or wrong).
nsICharsetDetectionObserver cdo = new nsICharsetDetectionObserver() {
    public void Notify(String charset) {
        HtmlCharsetDetector.found = true;
        System.out.println("Charset = " + charset);
    }
};

/**
 * Initialize the nsDetector.
 * lang is an integer language hint. The hints that can be given are:
 * Japanese, Chinese, Simplified Chinese, Traditional Chinese, Korean,
 * and Don't know (the default).
 */
nsDetector det = new nsDetector(lang);

// Set an observer.
det.Init(cdo);

BufferedInputStream imp = new BufferedInputStream(url.openStream());
byte[] buf = new byte[1024];
int len;
boolean done = false;    // Has a character set been determined yet?
boolean isAscii = true;  // Assume the data is ASCII until a non-ASCII byte is seen.

while ((len = imp.read(buf, 0, buf.length)) != -1) {
    // Check whether the data is all ASCII. As soon as one byte is not ASCII,
    // the data as a whole is not ASCII-encoded.
    if (isAscii) isAscii = det.isAscii(buf, len);

    // If the data is not ASCII and no character set has been determined yet, call DoIt().
    if (!isAscii && !done) done = det.DoIt(buf, len, false);
}
det.DataEnd();  // Call this method last; Notify() may be called at this point.

if (isAscii) {
    System.out.println("Charset = ASCII");
    found = true;
}

if (!found) {  // Nothing was identified, so list the most probable character sets.
    String prob[] = det.getProbableCharsets();
    for (int i = 0; i < prob.length; i++) {
        System.out.println("Probable Charset = " + prob[i]);
    }
}

What problem does jchardet solve?

Java's String (and char) classes store data as Unicode. When processing international text from external sources, we must supply the encoding of the text so that it can be converted to Unicode correctly. This means you have to know the encoding of every file your Java code processes. Many Internet-based Java applications handle data from arbitrary sources, and for much of that data the encoding cannot be known for certain. For example, if an HTML page carries no meta tag that declares its charset, it is hard to determine its encoding reliably, and errors occur when the data is converted to a Java Unicode string.
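What goes wrong when the encoding is guessed incorrectly can be sketched in a few lines of Java. The class name WrongEncoding and the sample word are hypothetical; the point is only that the same bytes produce different strings depending on the charset supplied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration: decoding the same bytes with the right and the wrong charset.
class WrongEncoding {
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, stored as UTF-8: 63 61 66 C3 A9
        byte[] data = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        String right = decode(data, StandardCharsets.UTF_8);
        String wrong = decode(data, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: the original four characters
        System.out.println(wrong.length()); // 5: the two bytes C3 A9 became two characters
        System.out.println(wrong.equals("caf\u00C3\u00A9")); // true
    }
}
```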

How does this algorithm work?

Browsers deal with this problem by examining the data byte by byte to guess the character set (this is what happens when you choose View -> Auto-Select or Auto-Detect from the menu). The algorithm (originally developed by Frank Tang) checks the byte sequence and, based on the value of each byte, narrows down the candidates by elimination until the character set is determined. If that is still inconclusive, a second method is used: guessing the character set from statistics on how frequently characters occur in a language.
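The elimination idea can be sketched in Java with the standard charset decoders. This is a deliberately simplified illustration, not the Mozilla algorithm itself: each candidate charset is asked to decode the bytes strictly, and candidates that report an invalid byte sequence are dropped. The class name EliminationSketch and the candidate list are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch of detection by elimination (not the jchardet implementation).
class EliminationSketch {
    // True if the bytes form a valid sequence in the given charset.
    static boolean isValid(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the names of the candidates that were not eliminated.
    static List<String> survivors(byte[] data, List<Charset> candidates) {
        List<String> result = new ArrayList<>();
        for (Charset candidate : candidates) {
            if (isValid(data, candidate)) {
                result.add(candidate.name());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(survivors(utf8, candidates));   // [UTF-8, ISO-8859-1]
        System.out.println(survivors(latin1, candidates)); // [ISO-8859-1]
    }
}
```

ISO-8859-1 accepts every byte sequence, so elimination alone can never rule it out. This is the kind of case where the second, frequency-based method is needed.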

Both iconv and jiconv can convert text from one encoding to another.

iconv is a utility on *NIX. Its basic usage is:

iconv -f encoding -t encoding inputfile

iconv recognizes the following character sets:

437, 500, 500v1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4, 8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1: 1993, 10646-1:

1993 / UCS4, ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110, ARABIC, ARABIC7, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5, BIG- FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BS_4730, CA, CN-BIG5, CN-GB, CN, CP-AR, CP-Gr, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437, CP813, CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP868, CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP903, CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949, CP950, CP1004, CP1026, CP1046, CP1047, CP1070, CP1079, CP1081, CP1084, CP1089, CP1124, CP1129, CP1250, CP1251, CP1252, cP1253, CP1254, CP1255, CP1256, CP1257, CP1258, CP1361, CP10007, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500, CSA_Z243.4-1985-1, CSA_Z243.41985-2, CSA_Z243.419852, CSDecmcs, C SEBCDICATDE, CSEBCDICATDEA, CSEBCDICCAFR, CSEBCDICDKNO, CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA, CSEBCDICESS, CSEBCDICFISE, CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT, CSEBCDICUK, CSEBCDICUS, CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8, CSIBM037, CSIBM038, CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278, CSIBM280, CSIBM281, CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423, CSIBM424, CSIBM500, CSIBM851, CSIBM855, CSIBM856, CSIBM857, CSIBM860, CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870, CSIBM871, CSIBM880, CSIBM891, CSIBM903, CSIBM904, CSIBM905, CSIBM918, CSIBM922, CSIBM930, CSIBM932, CSIBM933, CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1026, CSIBM1124, CSIBM1129, CSISO4UNITEDKINGDOM,

CSISO10SWEDISH, CSISO11SWEDISHFORNAMES, CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH, CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH, CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC, CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2, CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN, CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS, CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2, CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150, CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH, CSISO2022CN, CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033, CSISO5427CYRILLIC, CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX, CSISOLATIN1, CSISOLATIN2, Csisolatin3, Csisolatin4, Csisolatin5, Csisolatin6, Csisolatinarabic, Csisolatincyrilllic, Csisolatingreek, Csisolatinhebrew, cskoi8r, csksc5636, csmacintosh , CSNATSDANO, CSNATSSEFI, CSN_369103, CSPC8CODEPAGE437, CSPC775BALTIC, CSPC850MULTILINGUAL, CSPC862LATINHEBREW, CSPCP852, CSSHIFTJIS, CSUCS4, CSUNICODE, CUBA, CWI-2, CWI, CYRILLIC, DE, DECMCS, DEC, DECMCS, DIN_66003, DK, DS2089, DS_2089 , E13B, EBCDIC-AT-DE-A, EBCDIC-AT-DE, EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR, EBCDIC-CP-AR1, EBCDIC-CP-AR2, EBCDIC-CP-BE, EBCDIC -CP-CA, EBCDIC-CP-CH, EBCDIC-CP-DK, EBCDIC-CP-ES, EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB, EBCDIC-CP-GR, EBCDIC-CP -He, EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL, EBCDIC-CP-NO, EBCDIC-CP-ROECE, EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US , EBCDIC-CP-WT, EBCDIC-CP-YU, EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO, EBCDIC-ES-A, EBCDIC-ES-S, EBCDIC-ES, EBCDIC-FI -SE-a, EBCDIC-FI-SE, EBCDIC-FR,

Ebcdic-Greek, Ebcdic-IS-friss, ebcdic-it, ebcdic-jp-e, ebcdic-jp-kana, ebcdic-pt, ebcdic-uk, ebcdic-us, ebcdicatde, ebcdicatdea, EBCDICCAFR, EBCDICDKNO, EBCDICDKNOA, EBCDICES, EBCDICESA, EBCDICESS, EBCDICFISE, EBCDICFISEA, EBCDICFR, EBCDICISFRISS, EBCDICIT, EBCDICPT, EBCDICUK, EBCDICUS, ECMA-114, ECMA-118, ECMA-128, ECMACYRILLIC, ECMACYRILLIC, ELOT_928, ES, ES2, EUC-CN, EUC-JP, EUC-KR, EUC-TW, EUCCN, EUCJP, EUCKR, EUCTW, FI, FR, GB, GB2312, GB13000, GB18030, GBK, GB_1988-80, GB_198880, Georgian-Academy, GEORGIAN-PS, GOST_19768-74, GOST_19768, GOST_1976874, GREEKCCITT, GREEK, GREEK7OLD, GREEK7, GREEK7OLD, GREEK8, GREEKCCITT, HEBREW, HPROMAN8, HPROMAN8, HU, IBM-856, IBM-922, IBM- 930, IBM-932, IBM-933, IBM-935, IBM-937, IBM-939, IBM-943, IBM-1046, IBM-1124, IBM-1129, IBM037, IBM038, IBM256, IBM273, IBM274, IBM275, IBM277, IBM278, IBM280, IBM281, IBM284, IBM285, IBM290, IBM297, IBM367, IBM420, IBM423, IBM424, IBM437, IBM500, IBM775, IBM813, IBM819, IBM850, IBM8 51, IBM852, IBM855, IBM856, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM874, IBM875, IBM880, IBM891, IBM903, IBM904, IBM905, IBM912, IBM915, IBM916, IBM918, IBM920, IBM922, IBM930, IBM932, IBM933, IBM935, IBM937, IBM939, IBM943, IBM1004, IBM1026, IBM1046, IBM1047, IBM1089, IBM1124, IBM1129, IEC_P27-1, IEC_P271, INIS-8, INIS-CYRILLIC, INIS, INIS8, INISCYRILLIC, ISIRI-3342, ISIRI3342, ISO-2022-CN-EXT, ISO-2022-CN, ISO-2022-JP-2, ISO-2022-JP, ISO-2022-KR, ISO-8859- 1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-14, ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-10646, ISO-10646 / UCS2, ISO-10646 / UCS4,

ISO-10646 / UTF-8, ISO-10646 / UTF8, ISO-CELTIC, ISO-IR-4, ISO-IR-6, ISO-IR-8-1, ISO-IR-9-1, ISO-IR- 10, ISO-IR-11, ISO-IR-14, ISO-IR-IR-IR-IR-IR-ISO-IR-17, ISO-IR-18, ISO-IR-19, ISO-IR-21, ISO-IR-25, ISO-IR-IR-27, ISO-IR-37, ISO-IR-49, ISO-IR-50, ISO-IR-IR-IR-IR-IR-54, ISO-IR-55, ISO- IR-57, ISO-IR-60, ISO-IR-61, ISO-IR-69, ISO-IR-84, ISO-IR-85, ISO-IR-86, ISO-IR-88, ISO-IR- 89, ISO-IR-90, ISO-IR-92, ISO-IR-98, ISO-IR-99, ISO-IR-100, ISO-IR-101, ISO-IR-103, ISO-IR-109, ISO-IR-110, ISO-IR-111, ISO-IR-121, ISO-IR-122, ISO-IR-126, ISO-IR-127, ISO-IR-138, ISO-IR-139, ISO- IR-141, ISO-IR-143, ISO-IR-144, ISO-IR-148, ISO-IR-150, ISO-IR-151, ISO-IR-153, ISO-IR-155, ISO-IR- 156, ISO-IR-157, ISO-IR-166, ISO-IR-179, ISO-IR-193, ISO-IR-197, ISO-IR-199, ISO-IR-203, ISO-IR-209, ISO-IR-226, ISO646-CA, ISO646-CA2, ISO646-CN, ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES, ISO646-ES2, ISO646-FI, ISO646-FR, ISO646-FR1, ISO646-GB, ISO646-HU, ISO646-IT, ISO646-JP-OCR-B, ISO646-JP, ISO646-KR, ISO646-NO, ISO646-NO2, IS O646-PT, ISO646-PT2, ISO646-SE, ISO646-SE2, ISO646-US, ISO646-YU, ISO2022CN, ISO2022CNEXT, ISO2022JP, ISO2022JP2, ISO2022KR, ISO6937, ISO8859-1, ISO8859-2, ISO8859-3, ISO8859- 4, ISO8859-5, ISO8859-6, ISO8859-7, ISO8859-8, ISO8859-9, ISO8859-10, ISO8859-13, ISO8859-14, ISO8859-15, ISO8859-16, ISO88591, ISO88592, ISO88593, ISO88594, iSO88595, ISO88596, ISO88597, iSO88598, ISO88599, ISO885910, ISO885913, ISO885914, ISO885915, ISO885916, ISO_646.IRV: 1991, ISO_2033-1983, ISO_2033, ISO_5427EXT, ISO_5427, ISO_5427: 1981, ISO_5427EXT, ISO_5428, ISO_5428: 1980, ISO_6937-2: 1983, ISO_6937, ISO_6937: 1992, ISO_8859-1, ISO_8859-1: 1987, ISO_8859-2, ISO_8859-2: 1987, ISO_8859-3, ISO_8859-3: 1988, ISO_8859-4, ISO_8859-4: 1988, ISO_8859-5,

ISO_8859-5: 1988, ISO_8859-6, ISO_8859-6: 1987, ISO_8859-7, ISO_8859-7: 1987, ISO_8859-8, ISO_8859-8: 1988, ISO_8859-9, ISO_8859-9: 1989, ISO_8859-10, ISO_8859-10: 1992, ISO_8859-14, ISO_8859-14: 1998, ISO_8859-15: 1998, ISO_9036, ISO_10367BOX, ISO_10367BOX, ISO_69372, IT, JIS_C6220-1969-RO, JIS_C6229-1984-B, JIS_C62201969RO, JIS_C62291984B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8, KOI8-R, KOI8-T, KOI8-U, KOI8, KOI8R, KOI8U, KSC5636, L1, L2, L3, L4, L5, L6, L7, L8, L10, LATIN-GREEK-1, Latin-Greek, Latin1, Latin2, Latin3, Latin4, Latin5, Latin6, Latin7, Latin8, Latin10, Latingreek, Latingreek1, Mac-Cyrillic, Mac-IS, Mac-Sami, Mac-UK, Mac, Maccyrillic, Macintosh, Macis, Macuk, Macukrainian, MS-ANSI, MS-ARAB, MS-CYRL, MS-EE, MS-GREEK, MS-HEBR, MS- Mac-cyrillic, ms-Turk, MSCP949, MSCP1361, MSMACCYRILLIC, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO, NATS-SEFI, NATSDANO, NATSSEFI, NC_NC0010, NC_NC00-10, NC_NC00-10: 81, NF_Z_62-010, NF_Z_62-010_ (1973), NF_Z_62-010_1973, NF_Z_62010, NF_Z_62010_ 1973, NO, NO2, NS_4551-1, NS_4551-2, NS_45511, NS_45512, OS2LATIN1, OSF00010001, OSF00010002, OSF00010003, OSF00010004, OSF00010005, OSF00010006, OSF00010007, OSF00010008, OSF00010009, OSF0001000A, OSF00010020, OSF00010100, OSF00010101, OSF00010102, OSF00010104, OSF00010105, OSF00010106, OSF00030010, OSF0004000A, OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8, OSF100201B5, OSF100201F4, OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D, OSF1002035E, OSF1002035F, OSF1002036B, OSF1002037B, OSF10010001, OSF10020025, OSF10020111, OSF10020115, OSF10020116, OSF10020118, OSF10020122, OSF10020129, OSF10020354, OSF10020357, OSF10020357, OSF10020359, OSF10020360, OSF10020364, OSF10020365, OSF10020366, OSF10020367, OSF10020370, OSF10020387,

OSF10020388, OSF10020396, OSF10020402, OSF10020417, PT, PT2, R8, ROMAN8, SE, SEN_850200_B, SEN_850200_C, SHIFT-JIS, SHIFT_JIS, SJIS, SS636127, ST_SEV_358-88, T.61-8BIT, T.61, T.618BIT, TIS-620, TIS620-0, TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM, WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256, WINDOWS-1257, WINDOWS-1258, WINSAMI2, WS2, YU

As far as is known, Perl, PHP, and other languages have interfaces to iconv. In Java it is even easier: the conversion can be done without external programs or packages.

// Read text in the source encoding and write it back out in the target encoding.
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, fromEncoding));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out, toEncoding));
String str;
while ((str = in.readLine()) != null) {
    out.write(str, 0, str.length());
    out.newLine();
    out.flush();
}
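For reference, a self-contained variant of the same idea might look like the sketch below. The class name Transcoder and its convert methods are hypothetical. Unlike the line-based loop above, this version copies characters through a buffer, so the line terminators of the input are preserved as they are.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: converting a stream from one encoding to another with the standard library.
class Transcoder {
    static void convert(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from); // bytes -> Unicode characters
        Writer writer = new OutputStreamWriter(out, to); // Unicode characters -> bytes
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    // Convenience wrapper for in-memory data.
    static byte[] convert(byte[] data, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            convert(new ByteArrayInputStream(data), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 }; // "caf" + e-acute in ISO-8859-1
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5: the last character now takes two bytes
    }
}
```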

Java supported character sets:

Basic Encoding Set (Contained in rt.jar)

Canonical Name         Description
ASCII                  American Standard Code for Information Interchange
Cp1252                 Windows Latin-1
ISO8859_1              ISO 8859-1, Latin alphabet No. 1
UnicodeBig             Sixteen-bit Unicode Transformation Format, big-endian byte order, with byte-order mark
UnicodeBigUnmarked     Sixteen-bit Unicode Transformation Format, big-endian byte order
UnicodeLittle          Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
UnicodeLittleUnmarked  Sixteen-bit Unicode Transformation Format, little-endian byte order
UTF8                   Eight-bit Unicode Transformation Format
UTF-16                 Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark

Extended Encoding Set (contained in i18n.jar)

Canonical Name    Description
Big5              Big5, Traditional Chinese
Big5_HKSCS        Big5 with Hong Kong extensions, Traditional Chinese
Cp037             USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia
Cp273             IBM Austria, Germany
Cp277             IBM Denmark, Norway
Cp278             IBM Finland, Sweden
Cp280             IBM Italy
Cp284             IBM Catalan/Spain, Spanish Latin America
Cp285             IBM United Kingdom, Ireland
Cp297             IBM France
Cp420             IBM Arabic
Cp424             IBM Hebrew
Cp437             MS-DOS United States, Australia, New Zealand, South Africa
Cp500             EBCDIC 500V1
Cp737             PC Greek
Cp775             PC Baltic
Cp838             IBM Thailand extended SBCS
Cp850             MS-DOS Latin-1
Cp852             MS-DOS Latin-2
Cp855             IBM Cyrillic
Cp856             IBM Hebrew
Cp857             IBM Turkish
Cp858             Variant of Cp850 with Euro character
Cp860             MS-DOS Portuguese
Cp861             MS-DOS Icelandic
Cp862             PC Hebrew
Cp863             MS-DOS Canadian French
Cp864             PC Arabic
Cp865             MS-DOS Nordic
Cp866             MS-DOS Russian
Cp868             MS-DOS Pakistan
Cp869             IBM Modern Greek
Cp870             IBM Multilingual Latin-2
Cp871             IBM Iceland
Cp874             IBM Thai
Cp875             IBM Greek
Cp918             IBM Pakistan (Urdu)
Cp921             IBM Latvia, Lithuania (AIX, DOS)
Cp922             IBM Estonia (AIX, DOS)
Cp930             Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933             Korean mixed with 1880 UDC, superset of 5029
Cp935             Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937             Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939             Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942             IBM OS/2 Japanese, superset of Cp932
Cp942C            Variant of Cp942
Cp943             IBM OS/2 Japanese, superset of Cp932 and Shift-JIS
Cp943C            Variant of Cp943
Cp948             OS/2 Chinese (Taiwan), superset of 938
Cp949             PC Korean
Cp949C            Variant of Cp949
Cp950             PC Chinese (Hong Kong, Taiwan)
Cp964             AIX Chinese (Taiwan)
Cp970             AIX Korean
Cp1006            IBM AIX Pakistan (Urdu)
Cp1025            IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR)
Cp1026            IBM Latin-5, Turkey
Cp1046            IBM Arabic - Windows
Cp1097            IBM Iran (Farsi)/Persian
Cp1098            IBM Iran (Farsi)/Persian (PC)
Cp1112            IBM Latvia, Lithuania
Cp1122            IBM Estonia
Cp1123            IBM Ukraine
Cp1124            IBM AIX Ukraine
Cp1140            Variant of Cp037 with Euro character
Cp1141            Variant of Cp273 with Euro character
Cp1142            Variant of Cp277 with Euro character
Cp1143            Variant of Cp278 with Euro character
Cp1144            Variant of Cp280 with Euro character
Cp1145            Variant of Cp284 with Euro character
Cp1146            Variant of Cp285 with Euro character
Cp1147            Variant of Cp297 with Euro character
Cp1148            Variant of Cp500 with Euro character
Cp1149            Variant of Cp871 with Euro character
Cp1250            Windows Eastern European
Cp1251            Windows Cyrillic
Cp1253            Windows Greek
Cp1254            Windows Turkish
Cp1255            Windows Hebrew
Cp1256            Windows Arabic
Cp1257            Windows Baltic
Cp1258            Windows Vietnamese
Cp1381            IBM OS/2, DOS People's Republic of China (PRC)
Cp1383            IBM AIX People's Republic of China (PRC)
Cp33722           IBM-eucJP - Japanese (superset of 5050)
EUC_CN            GB2312, EUC encoding, Simplified Chinese
EUC_JP            JIS X 0201, 0208, 0212, EUC encoding, Japanese
EUC_JP_LINUX      JIS X 0201, 0208, EUC encoding, Japanese
EUC_KR            KS C 5601, EUC encoding, Korean
EUC_TW            CNS11643 (Plane 1-3), EUC encoding, Traditional Chinese
GBK               GBK, Simplified Chinese
ISO2022CN         ISO 2022 CN, Chinese (conversion to Unicode only)
ISO2022CN_CNS     CNS 11643 in ISO 2022 CN form, Traditional Chinese (conversion from Unicode only)
ISO2022CN_GB      GB 2312 in ISO 2022 CN form, Simplified Chinese (conversion from Unicode only)
ISO2022JP         JIS X 0201, 0208 in ISO 2022 form, Japanese
ISO2022KR         ISO 2022 KR, Korean
ISO8859_2         ISO 8859-2, Latin alphabet No. 2
ISO8859_3         ISO 8859-3, Latin alphabet No. 3
ISO8859_4         ISO 8859-4, Latin alphabet No. 4
ISO8859_5         ISO 8859-5, Latin/Cyrillic alphabet
ISO8859_6         ISO 8859-6, Latin/Arabic alphabet
ISO8859_7         ISO 8859-7, Latin/Greek alphabet
ISO8859_8         ISO 8859-8, Latin/Hebrew alphabet
ISO8859_9         ISO 8859-9, Latin alphabet No. 5
ISO8859_13        ISO 8859-13, Latin alphabet No. 7
ISO8859_15_FDIS   ISO 8859-15, Latin alphabet No. 9
JIS0201           JIS X 0201, Japanese
JIS0208           JIS X 0208, Japanese
JIS0212           JIS X 0212, Japanese
JISAutoDetect     Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only)
Johab             Johab, Korean
KOI8_R            KOI8-R, Russian
MS874             Windows Thai
MS932             Windows Japanese
MS936             Windows Simplified Chinese
MS949             Windows Korean
MS950             Windows Traditional Chinese
MacArabic         Macintosh Arabic
MacCentralEurope  Macintosh Latin-2
MacCroatian       Macintosh Croatian
MacCyrillic       Macintosh Cyrillic
MacDingbat        Macintosh Dingbat
MacGreek          Macintosh Greek
MacHebrew         Macintosh Hebrew
MacIceland        Macintosh Iceland
MacRoman          Macintosh Roman
MacRomania        Macintosh Romania
MacSymbol         Macintosh Symbol
MacThai           Macintosh Thai
MacTurkish        Macintosh Turkish
MacUkraine        Macintosh Ukraine
SJIS              Shift-JIS, Japanese
TIS620            TIS620, Thai

Note: the US-only version of the J2SDK supports only the character sets in the first table. The international version (which includes lib/i18n.jar) supports the character sets in both tables.

