1. Unicode did originally cover only U+0000 to U+FFFF, but that space was not enough (Chinese characters alone number more than 70,000 in Unicode). So starting with V2.0, 16 supplementary planes were added. The first supplementary plane (U+10000 to U+1FFFF) holds additional scripts and symbols, the second supplementary plane (U+20000 to U+2FFFF) holds Han characters, and the sixteenth supplementary plane (U+100000 to U+10FFFF) is a private use area.

2. Big5 runs from 0x8140 to 0xFEFE. The first byte ranges over 81-FE, 126 values in total; the second byte ranges over 40-7E and A1-FE, 157 values in total. 126 multiplied by 157 gives 19,782 code positions.

3. Many people here have only ever seen Big5, so they confuse two separate things: the character set and the encoding. (Big5 is actually a very poorly designed character code!) Put simply, Unicode is a character list, while UTF-8, UTF-16, and UTF-32 are encoding methods. Take Japanese as an example: Japanese has the character sets JIS X 0208, JIS X 0212, JIS X 0213, and others, each organized as a 94-by-94 code table (why 94 will be discussed later). The Japanese character "亜" ("Asia"), for instance, sits at code position 16-01 in JIS X 0208. The encoding methods for Japanese include EUC-JP, JIS, Shift_JIS, and so on.

The Unicode list is the character list from U+0000 to U+10FFFF. UTF-16 stores characters in 16-bit units: one unit for U+0000 to U+FFFF and a pair of units (a surrogate pair) for the supplementary planes. The older fixed 16-bit form, UCS-2, cannot represent the supplementary planes at all. UTF-32 (also known as UCS-4) stores every character in a fixed 32 bits. UTF-8 uses one byte for English letters (U+0000 to U+007F), two bytes for U+0080 to U+07FF, three bytes for U+0800 to U+FFFF, and four bytes for characters in the supplementary planes (the original design allowed up to six bytes, but the current standard stops at four).

UTF-16, UTF-32, and UTF-8 can be converted into one another by pure arithmetic. Unlike converting Big5 to GB, which needs lookup tables, there is no need to worry about conversion speed, and no character is lost because one encoding has it and the other does not.
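The figures in the three points above can be checked with a short Python sketch. The helper names (plane_of, big5_code_count, kuten_to_euc_jp, utf16_units) are made up for illustration, and the sketch assumes the standard Python codecs "euc_jp" and "shift_jis" are available. Note that current UTF-16 represents supplementary-plane characters with two 16-bit units (a surrogate pair), which is the arithmetic shown in utf16_units.

```python
def plane_of(code_point: int) -> int:
    """Plane number: 0 is U+0000..U+FFFF, 1 to 16 are the supplementary planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside U+0000..U+10FFFF")
    return code_point >> 16


def big5_code_count() -> int:
    """Count the two-byte combinations in the Big5 range described above."""
    lead_bytes = range(0x81, 0xFE + 1)                # 126 values
    trail_bytes = [*range(0x40, 0x7E + 1),            # 63 values
                   *range(0xA1, 0xFE + 1)]            # 94 values, 157 in total
    return len(lead_bytes) * len(trail_bytes)


def kuten_to_euc_jp(row: int, cell: int) -> bytes:
    """EUC-JP bytes for a JIS X 0208 position: each coordinate plus 0xA0."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0xA0 + row, 0xA0 + cell])


def utf16_units(code_point: int) -> list:
    """16-bit units of one code point in UTF-16, computed by arithmetic only."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]                           # one unit for U+0000..U+FFFF
    offset = code_point - 0x10000                     # 20 bits remain
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


if __name__ == "__main__":
    print(plane_of(0x20000))                          # 2
    print(big5_code_count())                          # 19782
    asia = "\u4e9c"                                   # the character at 16-01 in JIS X 0208
    print(asia.encode("euc_jp").hex())                # b0a1
    print(asia.encode("shift_jis").hex())             # 889f
    print([hex(u) for u in utf16_units(0x20000)])     # ['0xd840', '0xdc00']
```

The same table position 16-01 comes out as different bytes under EUC-JP and Shift_JIS, which is exactly the character set versus encoding distinction made in point 3.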
UTF-16, UTF-32, and UTF-8 each have situations where they work best. UTF-8 uses 1 byte for an English letter but 3 bytes for a Chinese character, so Chinese text stored as UTF-8 takes more space. UTF-16 uses 2 bytes per character for U+0000 to U+FFFF, so Chinese text takes less space, but English text takes twice as much. Microsoft uses UTF-16 because it is easier to compute with. Web pages use UTF-8 because most web content is written in Western European languages, so storing it as UTF-8 saves space, and U+0000 to U+007F stays compatible with ASCII. A web page can be stored as UTF-8 or as UTF-16, depending on whether you need ASCII compatibility and support for older browsers (UTF-8) or want easier processing when writing programs (UTF-16).

Unicode (unified code / standard universal code): Introduction
Unicode (unified code) was originally expressed in 2 bytes, giving 65,536 combinations in total. It is a subset of the ISO 10646 UCS (Universal Character Set), and 4144 units are included in V4.0.0. The different versions of the Unicode standard are available at ftp.unicode.org; the latest version, numbered 4.0.0, is at ftp.unicode.org/unidata. The Unicode character listing can be obtained from ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt (830k).

Unicode code points are written in hexadecimal with the prefix "U+", as in U+4E00 (note: not 0x4E00). Unicode only defines the characters and their internal codes; it does not define how they are actually stored on a computer. The Unicode Consortium therefore defined a complete set of ways for computers to store Unicode, designed with compatibility with other encodings in mind. These are called UTF (Unicode/UCS Transformation Format). The commonly used formats are UTF-8 and UTF-16.

UTF-16 is basically an implementation of the double-byte Unicode encoding, plus an extension mechanism (rarely used) to meet future expansion needs. UTF-8 is a variable-length encoding that may need 1, 2, or 3 bytes per character in the range U+0000 to U+FFFF. ASCII text needs no conversion and stays as it is, but text in other languages must be converted, and it grows because each character needs 1 to 2 extra bytes.

Conversion to UTF-8 (Unicode -> UTF-8):

U+0000 ~ U+007F (1 byte, 128 code points): 0xxxxxxx [7 bits, 2^7 = 128]
U+0080 ~ U+07FF (2 bytes, 1,920 code points): 110xxxxx 10xxxxxx [5 + 6 = 11 bits, 2^11 = 2,048]
U+0800 ~ U+FFFF (3 bytes, 63,488 code points): 1110xxxx 10xxxxxx 10xxxxxx [4 + 6 + 6 = 16 bits, 2^16 = 65,536]
--------------------------------
U+0000 ~ U+FFFF: 65,536 in total
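
The bit layout in the table above can be turned into code directly. The following is a minimal Python sketch, not a replacement for the built-in codec; the function name utf8_encode is made up for illustration. The four-byte branch for the supplementary planes and the rejection of surrogate values go beyond the table and follow the current UTF-8 definition.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encoded in UTF-8")

    if code_point <= 0x7F:
        # 0xxxxxxx: 7 payload bits, identical to ASCII
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Not in the table above: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    print(utf8_encode(0x41).hex())      # 41
    print(utf8_encode(0x4E00).hex())    # e4b880
    # Row sizes from the table: 128, 1920, 63488
    print(0x80, 0x800 - 0x80, 0x10000 - 0x800)
```

Comparing the result with Python's own chr(cp).encode("utf-8") for a few code points is a quick way to confirm the sketch matches the standard codec.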
=======================================================================================================================================================
Unihan (unified Han characters)