1. UTF-8 basics
UTF-8 is a variable-length encoding: different characters occupy different numbers of bytes. It still provides a unified encoding, but in some cases (such as English text) it saves space compared with other unified encodings (such as 16-bit Unicode), so UTF-8 is often used for communication between clients and servers.
English letters plus the symbols found on a keyboard need only seven binary bits, and a byte has eight bits, so UTF-8 uses a single byte for English letters and keyboard symbols. But once we have an encoded byte, how do we know what it is part of? It may be the single byte of an English letter, or it may be one of the three bytes of a Chinese character. For this reason, UTF-8 uses flag bits.
When the content to be represented fits in 7 bits, one byte is used: 0*******. The leading 0 is the flag bit, and the remaining bits can represent ASCII 0-127.
When the content needs 8 to 11 bits, two bytes are used: 110***** 10******. The 110 of the first byte and the 10 of the second byte are flag bits.
When the content needs 12 to 16 bits, three bytes are used: 1110**** 10****** 10******. Likewise, the 1110 of the first byte and the 10 of the second and third bytes are flag bits, and the remaining bits can represent Chinese characters.
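The three patterns above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `utf8_encode_bmp` is hypothetical, the sketch covers code points of at most 16 bits, and it does not reject the surrogate range as a real encoder would.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point of at most 16 bits with the 1-, 2- and 3-byte patterns.

    Hypothetical helper for illustration; it does not reject surrogates.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers code points up to 16 bits")
    if code_point < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 8 to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 12 to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    print(utf8_encode_bmp(ord("中")).hex(" "))
```

For the Chinese character "中" (U+4E2D) this yields the three bytes E4 B8 AD, which matches what Python's built-in `"中".encode("utf-8")` produces.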
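By analogy:
Four bytes: 11110*** 10****** 10****** 10******
Five bytes: 111110** 10****** 10****** 10****** 10******
Six bytes: 1111110* 10****** 10****** 10****** 10****** 10******
The five- and six-byte forms belong to the original UTF-8 design; the current standard (RFC 3629) stops at four bytes.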
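Going the other way, the flag bits of a byte tell a decoder how long the sequence is. A minimal sketch, assuming a hypothetical function name `utf8_sequence_length` and following the one- to six-byte patterns listed above (a strict modern decoder would accept at most four bytes):

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a byte's flag bits.

    Returns 0 for a continuation byte (10xxxxxx), which cannot start a
    sequence. Hypothetical helper for illustration only.
    """
    if (first_byte & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        return 0
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110xxx
        return 4
    if (first_byte & 0xFC) == 0xF8:   # 111110xx
        return 5
    if (first_byte & 0xFE) == 0xFC:   # 1111110x
        return 6
    raise ValueError(f"0x{first_byte:02X} never appears in UTF-8")


if __name__ == "__main__":
    for byte in "a中".encode("utf-8"):
        print(f"0x{byte:02X} -> {utf8_sequence_length(byte)}")
```

This answers the question raised earlier: a byte starting with 0 is a complete English letter, a byte starting with 10 is an inner byte of a longer character, and any other byte starts a multi-byte character.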
See
http://www.newebug.com/article/cpp/2221.shtml
2. Basic concepts of some other character sets (GB, BIG5, GBK, Unicode)
First, basic concepts
· GB code
The full name is GB2312-80, "Chinese Character Coded Character Set for Information Interchange: Basic Set". Published in 1980, it is the national standard for Chinese information processing, and it is the only Chinese encoding whose use is mandatory in mainland China and in overseas regions that use Simplified Chinese. P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 uses GBK as its basic Chinese character encoding but remains compatible with GB2312. The GB code contains 6763 Simplified Chinese characters and 682 symbols. The Chinese characters are divided into 3755 level-1 characters, sorted by pinyin, and 3008 level-2 characters, sorted by radical. The formulation and application of this standard played a major role in promoting the computerization of Chinese. In 1990 a coding standard for Traditional characters was added: GB12345-90, "Chinese Character Coded Character Set for Information Interchange: First Supplementary Set", intended for situations where Traditional characters must be used and for work on ancient books. This standard contains 6866 Chinese characters (103 more than GB2312, which does not include them), of which about 2,200 are purely Traditional. (The GB2312 and GB12345 sets do not overlap: one is Simplified, the other Traditional.)
· BIG5
It is currently the coding standard for Traditional Chinese characters in Taiwan and Hong Kong. It contains 440 symbols, 5401 level-1 Chinese characters and 7652 level-2 Chinese characters, for a total of 13,053 Chinese characters. BIG5 is a double-byte coding scheme: the value of the first byte is between hexadecimal A0 and FE, and the second byte is between 40 and 7E or between A1 and FE. Therefore the highest bit of the first byte is always 1, while the highest bit of the second byte may be 1 or 0.
· GBK (Chinese Internal Code Specification)
GBK (commonly known as the big character set) is a new national standard for extended Chinese encoding, drawn up in mainland China and equivalent to UCS. The GBK working group was formed in October 1995 and completed the GBK specification in December of the same year. The standard is compatible with GB2312 and contains 21003 Chinese characters and 883 symbols, provides 1894 user-defined code points, and merges Simplified and Traditional characters into one set. The surface encoding of the font library in the Simplified Chinese version of Windows 95/98 is GBK; it is linked to the underlying font library through a one-to-one mapping table between GBK and UCS. The first byte has a value between hexadecimal 81 and FE, and the second byte between 40 and FE, excluding the xx7F line.
To determine whether a byte pair is a GBK code:
(0x81 <= char1 && char1 <= 0xfe) && ((0x40 <= char2 && char2 <= 0x7e) || (0x80 <= char2 && char2 <= 0xfe))
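The same range check can be sketched in Python, following the byte ranges given in the GBK description (first byte 81 to FE, second byte 40 to FE with 7F excluded). The function name `looks_like_gbk_pair` is hypothetical. Note that passing this test only means the pair lies in the GBK code space; it does not prove the text is actually GBK, since other double-byte encodings use overlapping ranges.

```python
def looks_like_gbk_pair(lead: int, trail: int) -> bool:
    """Check whether two byte values fall inside the GBK double-byte ranges.

    Hypothetical helper: a range test only, not a full validity check.
    """
    lead_ok = 0x81 <= lead <= 0xFE
    # The xx7F line is excluded, so the trail range has a hole at 0x7F.
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok


if __name__ == "__main__":
    lead, trail = "汉".encode("gbk")  # Python's built-in gbk codec
    print(hex(lead), hex(trail), looks_like_gbk_pair(lead, trail))
```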
· UCS (Universal Multiple-Octet Coded Character Set)
In April 1984 the International Organization for Standardization set up the ISO/IEC JTC1/SC2/WG2 working group to develop a unified encoding for characters and symbols. In 1991, US multinational companies founded the Unicode Consortium, which reached an agreement with WG2 in October 1991 to use the same code set. At present, Unicode is a 16-bit encoding system whose code set is identical to the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed DIS (Draft International Standard) in June 1992. The current version, V2.0, was published in 1996 and contains 6811 symbols, 20902 Chinese characters, 11172 Korean Hangul syllables, 6400 user-defined code points and 20249 reserved code points, 65534 in total.
· UCS-2 encoding, a subset of the Unicode encodings
See
Http://blog.9cbs.net/i_like_cpp/archive/2005/03/16/320606.aspx
3. Character set detection
When a program opens a text file, the first thing it has to do is determine which character set and encoding the text was saved in. Software generally uses three methods to do this:
detecting an identifier in the file header, prompting the user to choose, or guessing according to certain rules.
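As an illustration of the third method (guessing by rules), one very simple rule is trial decoding: try candidate encodings in order and take the first one that decodes without error. The sketch below is an assumption for illustration, not how any particular editor works; the function name `guess_encoding` and the candidate order are hypothetical. UTF-8 is tried first because its flag bits make a false match on non-UTF-8 text unlikely.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "gbk", "big5")):
    """Return the first candidate encoding that decodes data without error.

    Hypothetical heuristic: a successful decode is only a hint, because
    double-byte encodings such as GBK and BIG5 overlap heavily.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    print(guess_encoding("中文".encode("utf-8")))
    print(guess_encoding("中文".encode("gbk")))
```

Because the byte ranges of GBK and BIG5 overlap, this rule cannot reliably tell those two apart, which is why real software combines such guesses with the other two methods.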
The most standard way is to check the first few bytes of the text. The start bytes and the corresponding charset/encoding are shown in the following table:
Start bytes    Charset/encoding
EF BB BF       UTF-8
FE FF          UTF-16/UCS-2, big endian
FF FE          UTF-16/UCS-2, little endian
FF FE 00 00    UTF-32/UCS-4, little endian
00 00 FE FF    UTF-32/UCS-4, big endian
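A header check of this kind can be sketched with the byte order mark constants from Python's standard `codecs` module, in which FE FF marks big-endian UTF-16 and FF FE marks little-endian UTF-16. The function name `detect_bom` and the returned labels are hypothetical. The longer UTF-32 marks must be tested first, because the UTF-32 little-endian mark (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE).

```python
import codecs

# Ordered longest first so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
]


def detect_bom(data: bytes):
    """Return a label for the byte order mark at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbfhello"))
    print(detect_bom(b"hello"))
```

A result of `None` only means that no mark is present; many files, UTF-8 files in particular, are saved without one, so the other detection methods are still needed.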
See