Java Chinese Processing Learning Notes - Hello Unicode [Repost]


http://www.chedong.com/tech/hello_unicode.html  Author: Che Dong (车东)

Copyright notice: Feel free to reprint, but please be sure to indicate the original source and author with a hyperlink and to include this statement. http://www.chedong.com/tech/hello_unicode.html

Keywords: Linux Java Multibyte Encoding Locale i18n l10n Chinese ISO-8859-1 GB2312 BIG5 GBK Unicode

Abstract: Have you ever had this feeling: why do PHP applications rarely run into garbled text, while building web applications in Java is such a hassle? Why can a Simplified Chinese query on Google turn up Traditional Chinese, and even Japanese, results? And why does Google automatically show its Chinese interface according to the language my browser uses? Many internationalized applications taught me one thing: Unicode was designed to make applications easier to build, and Java's core character handling is based on Unicode. This mechanism gives applications control over Chinese text at the level of "characters" rather than bytes. But if you do not read the specification carefully, this freedom becomes a burden and produces even more garbled-text problems.

Contents:
Some basic concepts of character sets
Test 1: displaying the system environment settings and the supported encodings
Test 2: how the system default encoding affects the input and output of Java applications
Test 3: character set problems in the input and output of web applications

Background knowledge on character sets: ISO-8859-1 GB2312 BIG5 GBK GB18030 Unicode

Why are there so many character set encodings? Note: the explanations below are not strict definitions; some metaphors serve only as aids to understanding. Imagine each character as a piece on a chessboard with a fixed coordinate: to tell all characters apart, the board needs enough squares to hold all the different "pieces".

Single-byte charsets (SingleByte Charsets) for English and European languages: think of the ISO-8859 family as a 2^8 = 16 × 16 = 256-square board. All Western (English) characters basically fit on such a 16 × 16 coordinate system; English itself uses only the part below 128 (0x80). Defining the space above 128 in different ways produced the extended character sets for the other European languages: ISO-8859-2, ISO-8859-4, and so on. For example:

ISO-8859-1: English below 0x80, other Western European characters (such as é, è) above 0x80
ISO-8859-7: English below 0x80, Greek characters (such as μ, γ) above 0x80

Multi-byte charsets (Multibyte Charsets) GB2312, BIG5, SJIS for Asian languages: there are far too many Chinese characters to fit on a 256-square board. The solution adopted for the thousands of Chinese characters is to locate a "character" on the board with 2 bytes (two coordinates), extending the rule above: if the first byte is below 128 (0x80), it remains compatible with the English character set; if the first byte is above 128 (0x80), it is the first byte of a Chinese character and, together with the byte that follows, forms a single Chinese character. The result is equivalent to nesting another 16 × 16 (256-square) board inside each square above 128, so the number of positions on the board grows to 128 + 128 × 256. Following a similar approach, the GB2312 standard for Simplified Chinese, the BIG5 character set for Traditional Chinese, and the Japanese SJIS character set were defined; the GB2312 character set contains roughly six thousand common Simplified Chinese characters.
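The byte-level difference between these encodings is easy to see from Java. The following is a minimal sketch, not part of the original article, that prints the bytes produced by a mixed Chinese/English string under several charsets; the class name EncodingDemo and the sample string are just illustrative, and it assumes a JDK whose runtime ships the GB2312 and GBK charsets (standard JDK builds do).

    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String text = "中文ABC"; // two Chinese characters followed by three ASCII letters

            for (String charsetName : new String[] {"ISO-8859-1", "GB2312", "GBK", "UTF-8"}) {
                byte[] bytes = text.getBytes(charsetName);
                StringBuilder hex = new StringBuilder();
                for (byte b : bytes) {
                    hex.append(String.format("%02X ", b & 0xFF));
                }
                System.out.printf("%-10s %d bytes: %s%n", charsetName, bytes.length, hex);
            }
            // In the GB2312/GBK output every byte of a Chinese character is >= 0x80,
            // while A, B, C keep their single-byte ASCII values in every charset.
            // ISO-8859-1 cannot represent Chinese at all, so those characters degrade to '?' (0x3F).
        }
    }

Running it makes the "nested board" idea visible: the same two Chinese characters take 2 bytes each in GB2312/GBK and 3 bytes each in UTF-8, the ASCII part is byte-for-byte identical everywhere, and only ISO-8859-1 loses the Chinese characters.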

GB2312: English below 0x80, Simplified Chinese above 0x80
SJIS: English below 0x80, Japanese above 0x80
BIG5: English below 0x80, Traditional Chinese above 0x80

All of these are encodings extended from ASCII in the same way: the English part is compatible, but the extended parts are mutually incompatible. Although many characters are written identically in all three systems (the two characters "中文", for example), they sit at different positions in the respective character sets, so a page written in GB2312 becomes unreadable when interpreted as BIG5. Likewise, the garbled Chinese that sometimes appears while browsing pages from other non-English countries (pages containing German, for instance) is caused by encoding conflicts between these different extensions.

I think of GBK and GB18030 as a small Unicode: the GBK character set is an extension (the "K") of GB2312 containing roughly twenty thousand characters; besides staying compatible with GB2312, it can also display Traditional Chinese characters and even Japanese kana. GB18030-2000 is an even more complex character set that uses a variable-length byte encoding to support still more characters. A more detailed specification of Chinese character encodings can be found at: http://www.unihan.com.cn/cjk/ana17.htm

ASCII (English) ==> Western European charsets ==> Eastern European charsets (Russian, Greek, etc.) ==> East Asian charsets (GB2312, BIG5, SJIS, etc.) ==> extended charsets (GBK, GB18030). This progression mirrors how character set standards developed, but over time, and especially as the exchange of information in different languages over the Internet grew, having to accommodate so many local-language encoding standards made applications very expensive to build. Just try writing a document that contains both French and Simplified Chinese. So the general conclusion was: if a single universal character set could represent the characters of every language, building applications would be far easier, and reaching that goal would be worth sacrificing some space and program efficiency. Unicode is that universal solution.

Unicode (DoubleByte Charsets) is a double-byte character set. You can picture it like this: every character, English included, is represented by two bytes (2 × 8 bits), giving a 2^(8*2) = 256 × 256 = 65,536-square board. On this big board, the characters of Chinese (Simplified and Traditional), Japanese and Korean (and Vietnamese as well) are placed together in one region as the CJK character set, and to reduce duplication the characters that are identical across these languages share a single "piece"; for their exact locations see Appendix A. Western European characters, other European characters, and the CJK characters thus all live on the same board.

Why, then, is UTF-8 still needed? After all, more than 70% of the information out there is still in English; if every character is stored in 2 bytes (UCS-2), isn't a lot of space wasted? UTF-8 is a character set transformation format designed to make storing English more efficient: Unicode Transformation Format, 8-bit form.
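To make the space trade-off behind UTF-8 concrete, here is another minimal sketch, again not from the original article, comparing byte counts for English and Chinese text under UTF-8 and a fixed two-byte encoding (UTF-16BE stands in for the UCS-2 idea); it uses only the standard java.nio.charset classes, and the class name Utf8Demo and sample strings are arbitrary examples.

    import java.nio.charset.StandardCharsets;

    public class Utf8Demo {
        public static void main(String[] args) throws Exception {
            String english = "hello";
            String chinese = "中文";

            // ASCII letters cost 1 byte each in UTF-8 but 2 bytes each in UTF-16.
            System.out.println("UTF-8  \"hello\": " + english.getBytes(StandardCharsets.UTF_8).length + " bytes");
            System.out.println("UTF-16 \"hello\": " + english.getBytes(StandardCharsets.UTF_16BE).length + " bytes");

            // CJK characters cost 3 bytes each in UTF-8 and 2 bytes each in UTF-16.
            System.out.println("UTF-8  \"中文\": " + chinese.getBytes(StandardCharsets.UTF_8).length + " bytes");
            System.out.println("UTF-16 \"中文\": " + chinese.getBytes(StandardCharsets.UTF_16BE).length + " bytes");

            // Inside a Java String the text is always Unicode; a charset only matters when
            // converting to or from bytes at file, network, or terminal boundaries.
            byte[] gbkBytes = chinese.getBytes("GBK");
            System.out.println("Round trip through GBK bytes: " + new String(gbkBytes, "GBK"));
        }
    }

For mostly English data, UTF-8 roughly halves the storage compared with a fixed two-byte encoding, which is exactly the efficiency argument made above; for CJK-heavy text the trade-off goes the other way.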

