Chinese character encoding and related issues (Unicode, ANSI, GB2312)

xiaoxiao, 2021-03-06

Computers usually do not store characters as images. Each character is represented by a numeric code, and which code stands for which character depends on the character set (charset) in use.

In the early days there was only one character set in use on the Internet: ASCII. It uses 7 bits per character and can therefore represent 128 characters, including common ones such as English letters, digits, and punctuation. It was later extended to 8 bits per character, which gives 256 characters, with special symbols such as box-drawing characters added on top of the original 7-bit set.

Later, as more languages came into use, ASCII could no longer meet the needs of information exchange. To represent their own scripts, individual countries developed character sets based on ASCII. Because these were derived from the ANSI standard, they are customarily called ANSI character sets; their formal name is MBCS (Multi-Byte Character System). What these derived sets have in common is that they stay compatible with the 128 ASCII codes (0 to 127) and use byte values of 128 or above as a lead byte. The second (or even third) byte that follows the lead byte combines with it to form the actual code. There are many such character sets, and the familiar GB-2312 is one of them.

For example, in the GB-2312 character set the word "连通" ("connected") is encoded as C1 AC CD A8, where C1 and CD are lead bytes. The first 128 codes are kept as standard ASCII; for example, "0" is encoded as 30h (the h marks a hexadecimal value). When software reads the text and sees 30h, it knows the value is below 128 and is therefore standard ASCII, meaning "0". When it sees C1, which is 128 or above, it knows another byte follows, so C1 AC together form one code, which in GB-2312 represents "连".
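The lead-byte rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name split_gb2312 is made up for this example, and it assumes well-formed input in which every byte of 128 or above starts a two-byte character. It uses the bytes C1 AC CD A8 from the example (the two characters 连通), with an ASCII "0" in front.

```python
def split_gb2312(data):
    """Split a GB-2312 byte string into one byte group per character."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Below 128: a single-byte standard ASCII character.
            chars.append(data[i:i + 1])
            i += 1
        else:
            # 128 or above: a lead byte; the next byte completes the code.
            chars.append(data[i:i + 2])
            i += 2
    return chars


encoded = "0连通".encode("gb2312")
pieces = split_gb2312(encoded)
decoded = [p.decode("gb2312") for p in pieces]

print(encoded.hex(" ").upper())  # 30 C1 AC CD A8
print(decoded)                   # ['0', '连', '通']
```

Real decoders also validate the trailing byte and handle truncated input, which this sketch does not.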

Because every language defined its own character set, international exchange required frequent conversion between character sets, which was very inconvenient. The Unicode character set was therefore proposed. It was originally designed to use a fixed 16 bits (two bytes, one word) per character, giving 65,536 codes, enough to cover the commonly used characters of almost all the world's languages and make information exchange convenient. The standard 16-bit encoding of Unicode is called UTF-16; characters beyond the first 65,536 are represented in it by a pair of 16-bit units (a surrogate pair). Later, so that double-byte Unicode text could pass correctly through existing systems built around single-byte processing, UTF-8 appeared, which encodes Unicode with a variable number of bytes in a way similar to MBCS. Note that UTF-8 is an encoding, and it belongs to the Unicode character set. The Unicode character set has several encoding forms, whereas ASCII has only one, and most MBCS character sets (including GB-2312) also have only one.

For example, the UTF-16 (little-endian) encoding of "连通" is: DE 8F 1A 90 (in big-endian byte order it would be 8F DE 90 1A)

And its UTF-8 encoding is: E8 BF 9E E9 80 9A
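These byte sequences can be checked with Python's built-in codecs. The snippet below is just a quick verification aid, not part of the original explanation; it encodes the two characters 连通 in GB-2312, both byte orders of UTF-16, and UTF-8.

```python
text = "连通"
encodings = ("gb2312", "utf-16-le", "utf-16-be", "utf-8")

# Map each encoding name to the encoded bytes, shown as uppercase hex.
encoded = {name: text.encode(name).hex(" ").upper() for name in encodings}

for name, hex_bytes in encoded.items():
    print(f"{name:<10} {hex_bytes}")

# gb2312     C1 AC CD A8
# utf-16-le  DE 8F 1A 90
# utf-16-be  8F DE 90 1A
# utf-8      E8 BF 9E E9 80 9A
```

The same two characters take four bytes in GB-2312 and UTF-16 but six in UTF-8, and the two UTF-16 lines differ only in byte order.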

Finally, when software opens a text file, the first thing it has to do is determine which character set and which encoding the text was saved with. Software has three ways to do this:

The most standard way is to inspect the first few bytes of the text, as follows:

Opening bytes    Charset / Encoding
EF BB BF         UTF-8
FF FE            UTF-16 / UCS-2, little-endian
FE FF            UTF-16 / UCS-2, big-endian
FF FE 00 00      UTF-32 / UCS-4, little-endian
00 00 FE FF      UTF-32 / UCS-4, big-endian

For example, with the marker inserted, the UTF-16 (little-endian) and UTF-8 encodings of the two characters "连通" are:

FF FE DE 8F 1A 90
EF BB BF E8 BF 9E E9 80 9A

However, MBCS text has no such character set marker at the beginning, and, more unfortunately, some early or poorly designed software does not insert the marker when saving Unicode text. Software therefore cannot rely on this approach alone.

In that case the software can take a second, relatively safe approach to determining the character set and encoding: pop up a dialog box and ask the user. For example, if you drag the "连通" file into MS Word, Word shows such a dialog box.

If the software does not want to bother the user, or asking is not practical, it can only guess. It infers the most likely character set from the characteristics of the whole text, and the guess may well be wrong. This is what happens when Notepad opens the "连通" file.

We can demonstrate this. Type "连通" in Notepad and choose "Save As"; the last drop-down box shows "ANSI". Save the file. Then open the "连通" file again and click "File" -> "Save As": the last drop-down box now shows "UTF-8". This means Notepad believes the text currently open is UTF-8 encoded, even though we saved it with the ANSI character set. In other words, Notepad guessed the character set of the "连通" file and decided it looked more like UTF-8 text. The reason is that the GB-2312 encoding of the two characters "连通" happens to look like a UTF-8 byte sequence. This is a coincidence and does not happen with all text. If you use Notepad's "Open" dialog and select ANSI in the last drop-down box when opening the "连通" file, it is displayed correctly.
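The marker check and the guessing step can be sketched together in Python. This is a simplified stand-in for the kind of heuristic described here, not Notepad's or Word's actual algorithm: the function names and the "gb2312" fallback are assumptions made for this example. The guess only looks at UTF-8 bit patterns (110xxxxx 10xxxxxx and so on), which is enough to show why the GB-2312 bytes C1 AC CD A8 get mistaken for UTF-8.

```python
import codecs

# Byte-order marks, longest first so FF FE 00 00 is not taken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data):
    """Return the encoding named by a leading marker, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def looks_like_utf8(data):
    """Naive guess: do the bytes follow UTF-8 lead/continuation patterns?

    Only bit patterns are checked; overlong or otherwise invalid
    sequences are not rejected, so this check can be fooled.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            extra = 0
        elif b >> 5 == 0b110:
            extra = 1
        elif b >> 4 == 0b1110:
            extra = 2
        elif b >> 3 == 0b11110:
            extra = 3
        else:
            return False
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any(t >> 6 != 0b10 for t in tail):
            return False
        i += 1 + extra
    return True


def guess_encoding(data, fallback="gb2312"):
    """Use the marker if present; otherwise guess, else use the fallback."""
    return detect_bom(data) or ("utf-8" if looks_like_utf8(data) else fallback)


print(guess_encoding(bytes.fromhex("FF FE DE 8F 1A 90")))  # utf-16-le
print(guess_encoding(bytes.fromhex("C1 AC CD A8")))        # utf-8 (wrong guess)
print(guess_encoding("中文".encode("gb2312")))              # gb2312
```

A strict UTF-8 decoder would reject C1 AC, because C1 is never a valid UTF-8 byte, so a stricter check would not make this particular mistake. Short texts without a marker remain ambiguous in general, which is why asking the user is the safer option.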
Conversely, if the file is saved as UTF-8, it opens directly without any problem. If the "连通" file is dragged into MS Word, Word also thinks it is a UTF-8 file but cannot be sure, so it pops up a dialog box to ask the user; select "Simplified Chinese (GB2312)" and the file opens normally. Notepad's handling here is simpler, which is in keeping with what the program is meant to be. Thanks to the predecessors who gave us this explanation and a relatively clear understanding of the phenomenon.

When reposting, please cite the original address: https://www.9cbs.com/read-43610.html
