Garbled characters encountered in Java development
To understand why Chinese text turns into garbled characters during JSP development, start with the Unicode encoding. Unicode (the "unified code"), as the name suggests, is a single code that covers the world's writing systems. The Unicode Consortium was founded by major US computer vendors. Its goal was to promote one worldwide encoding system that covers all scripts in common use, thereby reducing the problems computer companies face when developing for overseas markets. To gather tens of thousands of characters under one encoding mechanism while staying economical, Unicode originally represented every character, Eastern or Western, in two bytes. That gives 2^16 = 65,536 distinct combinations, which was considered enough for the great majority of uses at the time.
Fundamentally, computers deal only with numbers. They store letters and other characters by assigning a number to each one. Before Unicode was created, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: the European Union alone, for example, requires several different encodings to cover all its languages. Even for a single language such as English, no single encoding covered all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another. That is, two encodings may use the same number for two different characters, or different numbers for the same character. Any given computer (especially a server) has to support many different encodings, yet whenever data passes between different encodings or platforms, it runs the risk of corruption.
Unicode provides a unique number for every character, no matter what the platform, the program, or the language. The Unicode standard has been adopted by industry leaders such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and many other companies.
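The corruption described above, where data crosses between encodings and comes out garbled, can be reproduced in a few lines of Java. The following is a minimal sketch, not taken from the original article: the class name MojibakeDemo and its two helper methods are hypothetical, and it assumes the common JSP scenario in which UTF-8 bytes are mistakenly decoded as ISO-8859-1. It uses only the charsets that every Java platform is required to provide.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulate the mistake: encode text as UTF-8, then decode those bytes
    // with the wrong charset (ISO-8859-1), as a misconfigured server might.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: ISO-8859-1 maps every byte to exactly one character,
    // so the original bytes can be recovered and decoded again as UTF-8.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587), written as escapes so the
        // example does not depend on the encoding of the source file.
        String original = "\u4E2D\u6587";
        String garbled = misdecode(original);

        // Each character became three UTF-8 bytes, and each byte was then
        // read as a separate Latin-1 character.
        System.out.println(garbled.length());                  // 6
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

The repair only works because ISO-8859-1 decoding is lossless byte for byte. If the wrong charset had been one that rejects or replaces some byte values, the original bytes could not be recovered this way.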
Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML, and it is the official way to implement ISO/IEC 10646. It is supported by many operating systems, all modern browsers, and many other products. The emergence of the Unicode standard, and the availability of tools that support it, is one of the most important recent trends in global software technology. Incorporating Unicode into client-server or multi-tier applications saves money compared with using legacy character sets. Unicode allows a single software product or a single website to target multiple platforms, languages, and countries without re-engineering, and it allows data to pass through many different systems without corruption.
Two terms appear frequently in technical documents about Unicode: ISO 10646 and UCS. ISO is the abbreviation of the International Organization for Standardization, based in Switzerland. UCS, the Universal Character Set, is the character set that ISO published as standard 10646. UCS uses four bytes per character, with the aim of taking in every official and commercial encoding in the world in one sweep. The Unicode Consortium has worked closely with the ISO UCS group since 1991 to keep Unicode and ISO 10646 consistent. As a result, starting with version 2.0, Unicode uses the same encoding as ISO 10646-1.
The Kangxi Dictionary alone contains more than 40,000 Chinese characters. Add the simplified characters and the Japanese variant forms, and the 60,000-odd positions of two-byte Unicode would be stretched thin by Chinese characters alone, to say nothing of Thai, Arabic, and other scripts.
To deal with this problem, Unicode and UCS adopted CJK Unification (Chinese, Japanese, Korean): a character that is the same across the three languages is represented by a single code. The unified Han characters in Unicode are called Unihan. The complete Unihan data for Unicode 4.0 can be downloaded from http://www.unicode.org/public/unidata/unihan.txt.
UTF stands for Unicode/UCS Transformation Format. Unicode recommends the UTF-8 and UTF-16 formats, where 8 and 16 refer to bits rather than bytes. UTF-16 is basically the two-byte implementation of Unicode plus an extension mechanism (rarely used) to cope with future growth. UTF-8 is a variable-length encoding: English letters and digits (ASCII characters) keep their original single-byte form and are completely unaffected, so they need no conversion, while other characters such as Chinese must be converted by a program and grow in size, because each of them needs one or two bytes more than an ASCII character.
In the UCS character set encodings UCS-2 and UCS-4, the 2 and 4 refer to bytes, in contrast to UTF-8 and UTF-16, where the numbers are bits. UCS-2 is almost the same as two-byte Unicode. UCS-4 uses four bytes per character, and the UCS-4 code of a UCS-2 character is obtained by prepending two zero bytes.
The allocation of the Unicode code space: the first 256 Unicode characters (U+0000 to U+00FF) are identical to ISO-8859-1 (the Western European alphabet), and the first 128 of those are ASCII. Each ISO-8859-1 code becomes the corresponding Unicode code by prepending one zero byte (0x00). Unihan characters are mainly distributed between U+3400 and U+F9FF; the characters of GB2312 and BIG5 mainly fall between U+4E00 and U+9FFF.
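The size differences between UTF-8 and UTF-16, and the zero-byte relationship between ISO-8859-1 and two-byte Unicode, can be checked directly in Java. The sketch below is illustrative only: the class name EncodedLengthDemo and its helper methods are hypothetical. It uses big-endian UTF-16 without a byte order mark as a stand-in for two-byte Unicode, which is an assumption that holds only for characters in the U+0000 to U+FFFF range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodedLengthDemo {
    // Number of bytes the string occupies in UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    public static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Widen ISO-8859-1 bytes to two-byte big-endian codes by putting a
    // zero byte (0x00) in front of each one.
    public static byte[] latin1ToTwoByte(byte[] latin1) {
        byte[] out = new byte[latin1.length * 2];
        for (int i = 0; i < latin1.length; i++) {
            out[2 * i] = 0x00;
            out[2 * i + 1] = latin1[i];
        }
        return out;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String chinese = "\u4E2D"; // one Chinese character, U+4E2D

        System.out.println(utf8Length(ascii));    // 1
        System.out.println(utf16Length(ascii));   // 2
        System.out.println(utf8Length(chinese));  // 3
        System.out.println(utf16Length(chinese)); // 2

        // "caf" followed by U+00E9, a character outside ASCII but inside ISO-8859-1.
        String cafe = "caf\u00E9";
        byte[] widened = latin1ToTwoByte(cafe.getBytes(StandardCharsets.ISO_8859_1));
        byte[] utf16 = cafe.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.equals(widened, utf16)); // true
    }
}
```

For mostly ASCII text UTF-8 is the more compact of the two, while for mostly Chinese text UTF-16 is smaller, which is the trade-off the paragraph above describes.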
UTF-8 encoding principles and characteristics: now that we know where Western European characters and Chinese characters sit in Unicode, let's look at how UTF-8 encodes them.
U+0000 ~ U+007F    0_______                      (7 bits)
U+0080 ~ U+07FF    110_____ 10______             (11 bits)
U+0800 ~ U+FFFF    1110____ 10______ 10______    (16 bits)
In each of the three formats, the free bits (the underscores) are exactly enough to hold the Unicode codes in that range. When a program processes a UTF-8 file, how does it know where one character ends and the next begins, and which of the three forms it is looking at? In every UTF-8 encoded character, whether one, two, or three bytes long, the leading bits of the first byte state the total number of bytes. For example, 110 contains two 1s, so the character uses the second form and consists of two bytes. 1110 contains three 1s, so the character uses the third form and consists of three bytes. Every multi-byte UTF-8 sequence has one more feature in common: the second and third bytes always start with the two bits 10. Because the highest bit of every byte in a multi-byte sequence is 1, such bytes are easy to tell apart from single-byte ASCII characters, whose highest bit is 0. Thanks to these design characteristics, conversion between UTF-8 and Unicode is easy in both directions and loses no information.
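The three bit patterns above translate almost directly into shift and mask operations. The following is a minimal sketch under stated assumptions, not a replacement for the JDK's own encoder: the class name Utf8Sketch and both methods are hypothetical, only code points from U+0000 to U+FFFF are handled, and surrogate values (U+D800 to U+DFFF) are not given any special treatment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Encode one code point in the range U+0000..U+FFFF using the three
    // patterns: 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx.
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("only U+0000..U+FFFF is handled in this sketch");
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),          // 110 + top 5 bits
                (byte) (0x80 | (cp & 0x3F))         // 10  + low 6 bits
            };
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),             // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),     // 10   + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))             // 10   + low 6 bits
        };
    }

    // Read the leading bits of a byte to find how long its sequence is.
    // Returns 0 for a continuation byte (10xxxxxx) or for a lead byte of a
    // longer form that this sketch does not cover.
    public static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 0;
    }

    public static void main(String[] args) {
        int cp = 0x4E2D; // a Chinese character
        byte[] manual = encode(cp);
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(manual, jdk));   // true
        System.out.println(sequenceLength(manual[0]));    // 3
        System.out.println(sequenceLength(manual[1]));    // 0
    }
}
```

Because a continuation byte can never be mistaken for a lead byte, a decoder that starts in the middle of a stream can skip bytes until sequenceLength returns a nonzero value and resume from there.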