On Unicode encoding: a brief explanation of UCS, UTF, BMP, BOM, and related terms
This is light reading for programmers. "Light" here means clearing up some previously fuzzy concepts in a relaxed way and gaining knowledge, much like leveling up in an RPG. Two questions motivated this article:
Question one:
With "Save As" in Windows Notepad you can convert between the GBK, Unicode, Unicode big endian, and UTF-8 encodings. These are all txt files, so how does Windows identify the encoding?
I noticed long ago that txt files encoded as Unicode, Unicode big endian, and UTF-8 begin with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?
Question two:
I recently saw a file named ConvertUTF.c on the Internet that converts between UTF-32, UTF-16, and UTF-8. I already knew about Unicode (UCS-2), GBK, and UTF-8, but this program confused me: I could not work out how UTF-16 relates to UCS-2.
After checking the relevant material I finally cleared up these questions, and learned some Unicode details along the way. I have written them up as an article for anyone with similar questions. I have tried to keep it easy to follow, but readers need to know what a byte is and what hexadecimal is.
0, Big endian and little endian
Big endian and little endian are two different ways a CPU handles multi-byte numbers. For example, the Unicode code of the character "汉" (han) is 6C49. When it is written to a file, does 6C come first, or 49? If 6C is written first, that is big endian. If 49 is written first, that is little endian.
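The two byte orders can be seen directly with a short Python sketch. It uses the standard `struct` module to write the 16-bit value 6C49 both ways; the variable names are only illustrative.

```python
import struct

code = 0x6C49  # Unicode code of "汉", as given in the text

# ">H" packs an unsigned 16-bit integer with the most significant byte first
big = struct.pack(">H", code)
# "<H" packs the same integer with the least significant byte first
little = struct.pack("<H", code)

print(big.hex())     # 6c49
print(little.hex())  # 496c
```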
The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput arose over whether an egg should be cracked from the big end (Big-Endian) or from the little end (Little-Endian). It led to six rebellions, in which one emperor lost his life and another lost his throne.
We generally translate "endian" as "byte order" and refer to big endian and little endian as "big end" and "little end".
1, Character encoding, internal code, and a brief introduction to Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In the Chinese character area the internal code has a high byte in the range B0-F7 and a low byte in the range A1-FE, which occupies 72 * 94 = 6768 code positions. Five of these, D7FA-D7FE, are unused.
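The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a restatement of the byte ranges quoted in the text; the variable names are illustrative.

```python
# Byte ranges of the GB2312 Chinese character area, as quoted in the text
high_bytes = range(0xB0, 0xF7 + 1)
low_bytes = range(0xA1, 0xFE + 1)

# Every (high, low) pair is one code position
positions = len(high_bytes) * len(low_bytes)

# D7FA-D7FE are the five unused positions
unused = range(0xD7FA, 0xD7FE + 1)
hanzi_count = positions - len(unused)

print(len(high_bytes), len(low_bytes), positions, hanzi_count)  # 72 94 6768 6763
```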
GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters.
From ASCII through GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards simply support more characters. In these encodings English and Chinese can be processed uniformly. A Chinese code is recognized by the fact that the highest bit of its high byte is not 0. In programmers' terms, GB2312 and GBK are double-byte character sets (DBCS).
GB18030, published in 2000, is the formal national standard that replaces GBK 1.0. It contains 27,484 Chinese characters as well as the scripts of major ethnic minorities such as Tibetan, Mongolian, and Uyghur. In terms of Chinese characters, GB18030 adds the 6,582 characters of CJK Extension A to the 20,902 characters of GB13000.1, for a total of 27,484. CJK stands for Chinese, Japanese, and Korean: to save code positions, Unicode encodes the characters of these three languages in a unified way. GB13000.1 is the Chinese version of ISO/IEC 10646-1 and is equivalent to Unicode 1.1.
GB18030 uses single-byte, double-byte, and four-byte codes. The single-byte and double-byte codes are fully compatible with GBK. The four-byte codes cover the 6,582 characters of CJK Extension A. For example, UCS 0x3400 should be encoded in GB18030 as 8139EF30, and UCS 0x3401 as 8139EF31.
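The three code lengths can be observed with Python's built-in `gb18030` codec. This is a minimal sketch, assuming that codec is an acceptable stand-in for the standard; the helper name `gb18030_hex` is made up for illustration. It prints whatever the codec produces, so the example values quoted above can be checked against an actual implementation.

```python
def gb18030_hex(ch: str) -> str:
    """Return the GB18030 bytes of one character as upper-case hex."""
    return ch.encode("gb18030").hex().upper()

# One ASCII letter, one GBK character ("汉"), two CJK Extension A characters
samples = ["A", "\u6c49", "\u3400", "\u3401"]

for ch in samples:
    encoded = gb18030_hex(ch)
    # Two hex digits per byte
    print(f"U+{ord(ch):04X} -> {encoded} ({len(encoded) // 2} byte(s))")
```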
Microsoft provides a GB18030 upgrade package, but the package only supplies a new font that supports the 6,582 characters of CJK Extension A, NSimSun-18030 (新宋体-18030), and does not change the internal code. The internal code of Windows is still GBK.
A few more details:

The original text of GB2312 actually defines zone-position codes (区位码). To get from a zone-position code to the internal code, add A0 to both the high byte and the low byte.

For any character encoding, the order of the code units is specified by the encoding scheme and has nothing to do with endianness. For example, the code unit of GBK is the byte, and a Chinese character is represented by two bytes. The order of these two bytes is fixed and is not affected by the CPU's byte order. The code unit of UTF-16 is the word (two bytes). The order of the words is specified by the encoding scheme, but the arrangement of the bytes inside each word is affected by endianness. UTF-16 is described later.

In GB2312 the highest bit of both bytes is 1. However, only 128 * 128 = 16384 code positions satisfy this condition, so in GBK and GB18030 the highest bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, as soon as a byte whose highest bit is 1 is encountered, that byte and the one after it can be taken together as one double-byte code, regardless of the highest bit of the low byte.
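The parsing rule for a DBCS stream can be sketched as follows. This is a simplified illustration that applies only the rule stated above (high bit set means "take two bytes"); it does not validate byte ranges and does not handle the four-byte codes of GB18030. The function name `split_dbcs` is hypothetical.

```python
def split_dbcs(data: bytes) -> list:
    """Split a GBK-style byte stream into single-byte and double-byte codes."""
    codes = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead & 0x80 and i + 1 < len(data):
            # High bit is 1: this byte and the next form one double-byte code,
            # whatever the high bit of the second byte is
            codes.append(data[i:i + 2])
            i += 2
        else:
            # High bit is 0 (ASCII), or a dangling lead byte at the very end
            codes.append(data[i:i + 1])
            i += 1
    return codes

# "A", a code whose low byte has its high bit set, "B",
# and a GBK code whose low byte (0x40) has its high bit clear
stream = b"A\xba\xbaB\x81\x40"
print([code.hex() for code in split_dbcs(stream)])  # ['41', 'baba', '42', '8140']
```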
2, Unicode, UCS, and UTF
As mentioned above, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of "汉" is 6C49, while its GB code is BABA.
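The difference between the two codes of "汉" can be confirmed with Python's built-in codecs. A small sketch; the variable names are illustrative.

```python
han = "\u6c49"  # the character "汉"

unicode_code = ord(han)           # its Unicode code
gb_code = han.encode("gbk")       # its GB code as bytes

print(f"{unicode_code:04X}")      # 6C49
print(gb_code.hex().upper())      # BABA

# Unicode keeps the ISO-8859-1 values: byte E9 in ISO-8859-1 is code 00E9
latin1_char = b"\xe9".decode("latin-1")
print(f"{ord(latin1_char):04X}")  # 00E9
```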
Unicode is also a character encoding method, but it was designed by international organizations and can accommodate the scripts of all the world's languages. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".
According to Wikipedia (http://en.wikipedia.org/wiki/), two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991, both sides realized that the world did not need two incompatible character sets. They began to merge their work and cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and code points as ISO 10646-1.
At present the two projects still exist and publish their own standards independently. The latest version from the Unicode Consortium is Unicode 4.1.0, published in 2005. The latest ISO standard is ISO 10646-3:2003.
UCS only specifies how characters are encoded. It does not specify how the codes are transmitted or stored. For example, the UCS code of "汉" is 6C49. I could transmit and store this code as 4 ASCII characters, or I could encode it in UTF-8 as the 3 consecutive bytes E6 B1 89. The key is that both communicating parties agree on the method. UTF-8, UTF-7, and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with ASCII. UTF stands for "UCS Transformation Format".
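To make the distinction between a code and its transmission format concrete, here is a small Python sketch that serializes the same code 6C49 in the two ways mentioned above. The variable names are only illustrative.

```python
code = 0x6C49  # UCS code of "汉"

# Option 1: send the code as four ASCII characters "6C49"
as_ascii_text = f"{code:04X}".encode("ascii")

# Option 2: send the code in UTF-8
as_utf8 = chr(code).encode("utf-8")

print(as_ascii_text)            # b'6C49'
print(as_utf8.hex().upper())    # E6B189

# Both forms carry the same code, as long as the receiver knows the format
assert int(as_ascii_text.decode("ascii"), 16) == ord(as_utf8.decode("utf-8"))
```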
The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear and lively, yet rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs that the IETF maintains are the basis of all specifications on the Internet.
2.1, Internal code and code page
The current Windows kernel supports the Unicode character set, so the kernel can handle the scripts of all the world's languages. However, because a large number of existing programs and documents use language-specific encodings such as GBK, Windows cannot drop support for the existing encodings and switch entirely to Unicode.
Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.
Microsoft also defines a code page for GB18030: CP54936. However, because GB18030 includes four-byte codes while Windows code pages support only single-byte and double-byte encodings, this code page cannot really be used.
3, UCS-2, UCS-4, BMP
UCS has two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 encodes with four bytes (actually only 31 bits are used, and the highest bit must be 0). Let us do some simple arithmetic:
UCS-2 has 2^16 = 65536 code positions, and UCS-4 has 2^31 = 2147483648 code positions.
UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose highest bit is 0. Each group is divided into 256 planes according to the second byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte; the other bytes are the same.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code positions of UCS-4 whose two high bytes are 0 make up the BMP.
Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2. Prepending two zero bytes to a UCS-2 code gives the corresponding BMP code of UCS-4. When this structure was first defined, no characters had been assigned outside the BMP, although later versions of the standard do assign characters in other planes.
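The group/plane/row/cell structure is just a way of reading the four bytes of a UCS-4 code. A minimal Python sketch of that decomposition follows; the function names `ucs4_parts` and `in_bmp` are made up for illustration.

```python
def ucs4_parts(code: int) -> tuple:
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code < 2**31:
        raise ValueError("UCS-4 codes use only 31 bits")
    group = (code >> 24) & 0x7F  # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF  # second byte
    row = (code >> 8) & 0xFF     # third byte
    cell = code & 0xFF           # last byte
    return group, plane, row, cell

def in_bmp(code: int) -> bool:
    """A code is in the BMP when its group and plane are both 0."""
    group, plane, _, _ = ucs4_parts(code)
    return group == 0 and plane == 0

print(ucs4_parts(0x00006C49))  # (0, 0, 108, 73)
print(in_bmp(0x6C49), in_bmp(0x10000))  # True False
```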
4, UTF encodings
UTF-8 encodes UCS in units of 8 bits. The encoding from UCS-2 to UTF-8 is as follows:
UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting this bit stream for the x's in the template, in order, gives 11100110 10110001 10001001, i.e. E6 B1 89. Readers can use Notepad to check whether this encoding is correct. Note that UltraEdit automatically converts UTF-8 text files to UTF-16 when it opens them, which can be confusing. This option can be turned off in the settings. A better tool is Hex Workshop.
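The template substitution described above can be written out as a short Python sketch. It applies the three templates from the table literally to any 16-bit code and is checked against Python's own UTF-8 encoder; the function name `ucs2_to_utf8` is hypothetical.

```python
def ucs2_to_utf8(code: int) -> bytes:
    """Encode one UCS-2 code (0000-FFFF) using the three UTF-8 templates."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a UCS-2 code")
    if code <= 0x7F:
        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])

print(ucs2_to_utf8(0x6C49).hex().upper())  # E6B189
```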
UTF-16 encodes UCS in units of 16 bits. For UCS codes less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes not less than 0x10000, an algorithm is defined. However, for the codes of UCS-2, or equivalently the BMP of UCS-4, which are all less than 0x10000, UTF-16 and UCS-2 can be regarded as basically the same. The difference is that UCS-2 is only a coding scheme, whereas UTF-16 is used for actual transmission, so it has to deal with byte order.
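The text does not spell out the algorithm for codes not less than 0x10000. The following Python sketch assumes it is the surrogate-pair scheme described in RFC 2781, which goes beyond what this article itself states; the function name `utf16_units` is made up for illustration.

```python
def utf16_units(code: int) -> list:
    """Return the 16-bit code units that UTF-16 uses for one UCS code."""
    if code < 0x10000:
        # Codes below 0x10000 are stored as a single 16-bit unit
        # (the surrogate range D800-DFFF is reserved and not checked here)
        return [code]
    if code > 0x10FFFF:
        raise ValueError("code is outside the range UTF-16 can represent")
    # Assumed algorithm (RFC 2781): subtract 0x10000, then split the
    # remaining 20 bits into two 10-bit halves
    value = code - 0x10000
    high = 0xD800 | (value >> 10)
    low = 0xDC00 | (value & 0x3FF)
    return [high, low]

print([f"{u:04X}" for u in utf16_units(0x6C49)])   # ['6C49']
print([f"{u:04X}" for u in utf16_units(0x10000)])  # ['D800', 'DC00']
```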
5, UTF byte order and the BOM
UTF-8 uses the byte as its code unit, so it has no byte order problem. UTF-16 uses two bytes as its code unit, so before a UTF-16 text can be interpreted, the byte order of each code unit must be known. For example, the Unicode code of "奎" (kui) is 594E, and the Unicode code of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
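The ambiguity can be reproduced with Python's `utf-16-be` and `utf-16-le` codecs, which decode the same two bytes under each byte order. A small sketch; the variable names are illustrative.

```python
raw = bytes.fromhex("594E")  # the two bytes as received

as_big_endian = raw.decode("utf-16-be")     # read as code 594E
as_little_endian = raw.decode("utf-16-le")  # read as code 4E59

print(f"{ord(as_big_endian):04X}", as_big_endian)
print(f"{ord(as_little_endian):04X}", as_little_endian)
```

Without extra information, the receiver has no way to tell which of the two readings the sender intended.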
The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill of Material" BOM but the Byte Order Mark. The BOM is a clever little idea:
UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.
Thus, if the recipient receives FEFF, the byte stream is big endian; if it receives FFFE, the byte stream is little endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but the BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the recipient receives a byte stream beginning with EF BB BF, it knows that the stream is UTF-8 encoded.
Windows uses the BOM to mark the encoding of text files.
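A BOM check of this kind can be sketched in Python with the BOM constants from the standard `codecs` module. This is an illustration of the idea only, not a description of how Notepad is actually implemented; the function name `sniff_bom` and the returned labels are hypothetical.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess the encoding of a byte stream from its leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE ("Unicode" in Notepad)
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF ("Unicode big endian")
        return "utf-16-be"
    # No BOM: the caller has to fall back on some default encoding
    return "unknown"

print(sniff_bom(b"\xff\xfe\x49\x6c"))  # utf-16-le
print(sniff_bom(b"\xfe\xff\x6c\x49"))  # utf-16-be
print(sniff_bom(b"\xef\xbb\xbf\xe6\xb1\x89"))  # utf-8
```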
6, Further references
The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-io10646-oview.html).
I also found two other documents that look good, but since I had already found the answers to my initial questions, I did not read them:
"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=iws-chapter03)
I have written a package that converts between UTF-8, UCS-2, and GBK, in versions that use the Windows API and versions that do not. If I have time in the future, I will tidy it up and put it on my personal home page (http://fmddlmyy.home4u.china.com).
I had worked out all the questions before I started writing, and I thought the article would take only a short while. I did not expect the wording and the checking of details to take so long: I ended up writing from 1:30 pm until 9:00 pm. I hope some readers benefit from it.
Appendix 1: More on zone-position code, GB2312, internal code, and code page
Some readers had questions about this sentence in the article: "The original text of GB2312 actually defines zone-position codes. To get from a zone-position code to the internal code, add A0 to both the high byte and the low byte."
Let me explain in detail:
"The original text of GB2312" refers to the 1980 Chinese national standard "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" (GB 2312-80). This standard encodes Chinese characters and Chinese symbols with two numbers. The first number is called the "zone" (区) and the second is called the "position" (位), so the code is also known as the zone-position code (区位码). Zones 1-9 hold Chinese symbols, zones 16-55 hold level-1 Chinese characters, and zones 56-87 hold level-2 Chinese characters. Windows still has a zone-position input method: for example, typing 1601 produces "啊" (a). (This input method also recognizes hexadecimal GB2312 codes automatically, so typing B0A1 also produces "啊".)
The internal code is the character encoding used inside the operating system. In early operating systems the internal code depended on the language. Today's Windows supports Unicode internally and adapts to individual languages through code pages, so the concept of "internal code" has become rather blurred. Microsoft generally calls the encoding specified by the default code page the internal code.
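The step from a GB2312 zone-position code to the internal code, adding A0 to each number, can be sketched in Python as follows. The function name `zone_position_to_internal` is made up for illustration, and Python's `gb2312` codec is used only to show which character the resulting bytes stand for.

```python
def zone_position_to_internal(zone: int, position: int) -> bytes:
    """Convert a GB2312 zone-position code to its two-byte internal code."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1..94")
    # Add A0 to the zone (high byte) and to the position (low byte)
    return bytes([zone + 0xA0, position + 0xA0])

internal = zone_position_to_internal(16, 1)  # zone-position code 1601
print(internal.hex().upper())                # B0A1
print(internal.decode("gb2312"))             # 啊
```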
The term "internal code" has no official definition, and "code page" is simply what Microsoft calls it. As programmers we only need to know what these things are. There is no need to scrutinize the terms themselves too closely.
A code page is the character encoding for the script of one language. For example, the code page of GBK is CP936, the code page of Big5 is CP950, and the code page of GB2312 is CP20936.
Windows has the concept of a default code page, that is, the encoding used by default to interpret characters. For example, suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?
Should it be interpreted as Unicode, as GBK, as Big5, or as ISO-8859-1? Interpreted as GBK, it yields the two characters "汉字" (hanzi, "Chinese characters"). Interpreted under another encoding, the bytes may have no corresponding characters, or may map to the wrong ones. "Wrong" here means inconsistent with what the author of the text intended. In other words, the text is garbled.
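The effect of the chosen encoding on the same four bytes can be tried out in Python. This sketch uses Python's built-in codecs as stand-ins for the Windows code pages, which is an assumption, and passes `errors="replace"` so that undecodable bytes show up as replacement characters instead of raising. The function name `interpretations` is hypothetical.

```python
raw = bytes([0xBA, 0xBA, 0xD7, 0xD6])  # the byte stream from the text

def interpretations(data: bytes) -> dict:
    """Decode the same bytes under several candidate encodings."""
    candidates = ("gbk", "big5", "latin-1", "utf-16-le")
    return {name: data.decode(name, errors="replace") for name in candidates}

for name, text in interpretations(raw).items():
    # Only the GBK reading matches what the author meant
    print(f"{name:10} {text!r}")
```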
The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in the Control Panel. Notepad's "Save As" dialog offers an ANSI option, which in fact saves the file in the encoding of the default code page.
Windows is Unicode internally and can support multiple code pages at the same time. As long as a file states which encoding it uses and the user has installed the corresponding code page, Windows can display it correctly. For example, an HTML file can specify its charset. Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such a file uses characters in the range 0x80-0xFF, Chinese Windows interprets them according to the default GBK, and the result is garbled text. In that case it is enough to add a statement specifying the charset to the HTML file, for example: