This is a piece of light reading written by a programmer for programmers. By "light" I mean that it lets you pick up, in a relaxed way, some concepts that used to be unclear and build up your knowledge, much like leveling up in an RPG. Two questions motivated me to put this article together:
Question one:
Using "Save As" in Windows Notepad, a file can be converted between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all .txt files, so how does Windows identify the encoding?
I noticed that .txt files encoded as Unicode, Unicode big endian and UTF-8 have a few extra bytes at the beginning: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But what standard are these markers based on?
Question two:
I recently saw a ConvertF.c on the Internet that implements conversion among UTF-32, UTF-16 and UTF-8. I already knew the Unicode (UCS2), GBK and UTF-8 encodings, but this program left me somewhat confused: I could not recall how UTF-16 relates to UCS2.
After looking up the relevant material I finally got these questions straightened out, and learned some details of Unicode along the way. I have written them up as an article for friends who have had similar questions. I have tried to keep the article easy to follow, but the reader is expected to know what a byte is and what hexadecimal is.

0. Big endian and little endian

Big endian and little endian are two different ways a CPU handles multi-byte numbers. For example, the Unicode code of the character "汉" (han) is 6C49. When it is written to a file, should 6C be written first, or 49? If 6C is written first, that is big endian. If 49 is written first, that is little endian.

The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput arose from a dispute over whether eggs should be cracked open at the big end (Big-Endian) or the little end (Little-Endian). It led to six rebellions, in which one emperor lost his life and another lost his throne.

We generally translate "endian" as "byte order", and call big endian and little endian "big tail" and "little tail".

1. Character encodings and internal codes, with an introduction to Chinese character encodings

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In the Chinese character area, the high byte of the internal code ranges from B0 to F7 and the low byte from A1 to FE, which gives 72 * 94 = 6768 code positions. Five of them, D7FA-D7FE, are vacant.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21003 characters. GB18030, published in 2000, is the formal national standard that replaces GBK 1.0. It contains 27484 Chinese characters and also covers the scripts of the major ethnic minorities, such as Tibetan, Mongolian and Uyghur. Current PC platforms are required to support GB18030, while embedded products are not yet subject to this requirement, so mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in all of them, and each later standard supports more characters. English and Chinese text can be handled uniformly in these encodings. A Chinese code is recognized by the fact that the highest bit of its high byte is not 0. In programmers' terms, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS).

The default internal code of some Chinese versions of Windows is still GBK, which can be upgraded to GB18030 with the GB18030 upgrade package. However, ordinary users rarely need the characters that GB18030 adds to GBK, so we usually still use GBK to refer to the internal code of Chinese Windows.

There are a few more details here:
GB2312 itself is defined in terms of location codes (row and cell numbers). To go from a location code to the internal code, A0 must be added to both the high byte and the low byte.

In DBCS, the GB internal code is always stored big endian, that is, high byte first.

The highest bit of both bytes of a GB2312 code is 1, but only 128 * 128 = 16384 code positions satisfy this condition. For that reason, the highest bit of the low byte in GBK and GB18030 may be 0. This does not affect the parsing of a DBCS character stream: when reading the stream, as soon as a byte whose highest bit is 1 is encountered, that byte and the one after it can be taken together as a double-byte code, regardless of the highest bit of the low byte.

2. Unicode, UCS and UTF

As mentioned above, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of "汉" (han) is 6C49, while its GB code is BABA.

Unicode is also a character encoding, but one designed by international organizations to accommodate the scripts of all the world's languages. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as short for "Unicode Character Set".

According to Wikipedia (http://en.wikipedia.org/wiki/), there were historically two organizations that tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both sides realized that the world did not need two incompatible character sets. They began to merge their work and to cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and codes as ISO 10646-1.
At present both projects still exist and publish their standards independently. The latest version from the Unicode Consortium is currently Unicode 4.1.0 (2005), and the latest ISO standard is 10646-3:2003.

UCS specifies how to represent the various scripts with multiple bytes. How these codes are transmitted is specified by the UTF (UCS Transformation Format) specifications; common ones include UTF-8, UTF-7 and UTF-16.

The IETF's RFC2781 and RFC3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear, brisk and still rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs maintained by the IETF are the basis of all specifications on the Internet.

3. UCS-2, UCS-4 and BMP

UCS has two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (only 31 bits are actually used; the highest bit must be 0). Let us do some simple arithmetic:

UCS-2 has 2^16 = 65536 code positions, and UCS-4 has 2^31 = 2147483648 code positions.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose highest bit is 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte; the other bytes are the same.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP.
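The group / plane / row / cell division described above is just a byte-by-byte split of the 31-bit code. A minimal Python sketch follows; the function names `split_ucs4` and `in_bmp` are made up for this illustration and do not come from any standard or library.

```python
def split_ucs4(code):
    """Split a 31-bit UCS-4 code position into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("UCS-4 code positions use only 31 bits")
    group = (code >> 24) & 0x7F  # highest byte; its top bit is always 0
    plane = (code >> 16) & 0xFF  # second-highest byte
    row = (code >> 8) & 0xFF     # third byte
    cell = code & 0xFF           # last byte
    return group, plane, row, cell


def in_bmp(code):
    """True if the code position lies in plane 0 of group 0."""
    group, plane, _, _ = split_ucs4(code)
    return group == 0 and plane == 0


print(split_ucs4(0x6C49))  # (0, 0, 108, 73), i.e. row 6C, cell 49
print(in_bmp(0x6C49))      # True
print(in_bmp(0x10000))     # False
```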
Put another way, the code positions in UCS-4 whose two high bytes are 0 make up the BMP.

Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2. Adding two zero bytes in front of the two bytes of UCS-2 gives the BMP of UCS-4. In the current UCS-4 specification, no characters have been assigned outside the BMP.

4. UTF encodings

UTF-8 encodes UCS in units of 8 bits. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "汉" (han) is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting these bits in order for the x's in the template gives 11100110 10110001 10001001, that is, E6 B1 89.

Readers can use Notepad to check whether this encoding is correct.

UTF-16 encodes UCS in units of 16 bits. For UCS codes less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes not less than 0x10000, an algorithm is defined. However, since the codes of UCS-2, or of the BMP of UCS-4, are necessarily less than 0x10000, UTF-16 and UCS-2 can for now be regarded as basically the same. UCS-2, though, is only an encoding scheme, whereas UTF-16 is used for actual transmission, so it has to deal with byte order.

5. Byte order of UTF and the BOM

UTF-8 uses the byte as its encoding unit, so it has no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text, the byte order of each encoding unit must be established. For example, the Unicode code of "奎" (kui) is 594E and the Unicode code of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill of Material" BOM, but the Byte Order Mark. The BOM is a somewhat clever idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends that we transmit the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.

Thus, if the receiver gets FEFF, the byte stream is big-endian; if it gets FFFE, the byte stream is little-endian.
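As a rough illustration of sections 4 and 5, the following Python sketch applies the three templates from the UCS-2 to UTF-8 table and then uses a leading FE FF or FF FE to choose the byte order of a UTF-16 stream. The function names `utf8_encode_ucs2` and `utf16_units` are made up for this example. It is a simplified sketch, not a full converter: codes of 0x10000 and above, surrogate values and streams without a BOM are not handled.

```python
def utf8_encode_ucs2(code):
    """Encode one UCS-2 code (0x0000-0xFFFF) using the three templates."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("the table only covers UCS-2 codes")
    if code <= 0x7F:
        return bytes([code])                    # 0xxxxxxx
    if code <= 0x7FF:
        return bytes([0xC0 | (code >> 6),       # 110xxxxx
                      0x80 | (code & 0x3F)])    # 10xxxxxx
    return bytes([0xE0 | (code >> 12),          # 1110xxxx
                  0x80 | ((code >> 6) & 0x3F),  # 10xxxxxx
                  0x80 | (code & 0x3F)])        # 10xxxxxx


def utf16_units(data):
    """Return the 16-bit units of a byte stream that starts with a BOM."""
    if data[:2] == b"\xFE\xFF":
        order = "big"      # FEFF arrived as FE FF
    elif data[:2] == b"\xFF\xFE":
        order = "little"   # FEFF arrived byte-swapped as FF FE
    else:
        raise ValueError("no BOM, byte order unknown")
    body = data[2:]
    if len(body) % 2:
        raise ValueError("odd number of bytes after the BOM")
    return [int.from_bytes(body[i:i + 2], order)
            for i in range(0, len(body), 2)]


print(utf8_encode_ucs2(0x6C49).hex())                       # e6b189
print([hex(u) for u in utf16_units(b"\xFE\xFF\x59\x4E")])   # ['0x594e']
print([hex(u) for u in utf16_units(b"\xFF\xFE\x59\x4E")])   # ['0x4e59']
```

The same two payload bytes 59 4E come out as 594E or 4E59 depending only on the two bytes in front of them, which is the ambiguity the BOM is meant to resolve.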
Because it is used this way, the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that begins with EF BB BF, it knows the stream is UTF-8 encoded.

Windows uses the BOM in exactly this way to mark the encoding of text files.

6. Further reference material

The main reference for this article is "Short Overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two other references that look good, but since my original questions had already been answered, I did not read them: