Notepad's encoding problem: when every byte in a document fits the pattern of a two-byte UTF-8 sequence, that is, a lead byte in the range C0-DF followed by a byte in the range 80-BF, Notepad cannot confirm the document's format and falls back to "displaying" it as UTF-8. "联通" (Unicom) saved as ANSI is C1 AA CD A8, which falls exactly within that range, so Notepad cannot display it correctly.
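A minimal sketch of that range check (only an illustration of the pattern, not Notepad's actual detection code):

# A two-byte UTF-8 sequence is 110xxxxx 10xxxxxx, i.e. a lead byte in C0-DF
# followed by a continuation byte in 80-BF.
def looks_like_two_byte_utf8(data: bytes) -> bool:
    if len(data) % 2 != 0:
        return False
    for i in range(0, len(data), 2):
        lead, cont = data[i], data[i + 1]
        if not (0xC0 <= lead <= 0xDF and 0x80 <= cont <= 0xBF):
            return False
    return True

# "联通" saved as ANSI (GB-2312) is C1 AA CD A8 -- every byte is in range,
# so a naive detector mistakes the file for UTF-8.
print(looks_like_two_byte_utf8(bytes([0xC1, 0xAA, 0xCD, 0xA8])))  # True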
That explanation is rather terse, however, so the author specifically consulted a senior engineer from the publishing technology network, who gave us a more detailed account:
Characters in a computer are normally not stored as images; each character is represented by a code, and which code represents which character depends on the character set (charset) in use.
In the early days there was only one character set in common use, ASCII, which used 7 bits per character and could therefore represent 128 characters, covering English letters, digits, punctuation and other common symbols. It was later extended to 8 bits per character, allowing 256 characters, and special symbols such as box-drawing characters were added on top of the original 7-bit set.
Later, as more and more languages came into play, ASCII could no longer meet the needs of information exchange. To represent their own scripts, various countries developed their own character sets on the basis of ASCII; these are habitually lumped together as "ANSI" character sets after the ANSI standard, although their proper name is MBCS (Multi-Byte Character Set, that is, a multi-byte character system). These derived character sets all remain compatible with the 127 codes of standard ASCII; codes above 128 are used as a leading byte, and the leading byte together with the second (or even third) byte that follows it forms the actual character code. There are many such character sets, and the familiar GB-2312 is one of them.
For example, in the GB-2312 character set, "连通" ("connected") is encoded as C1 AC CD A8, where C1 and CD are leading bytes. The first 127 codes are kept as standard ASCII; for example, the code for "0" is 30h (30h means hexadecimal 30). When software reads the text and sees 30h, it knows that a value below 128 is standard ASCII, meaning "0"; when it sees C1, which is greater than 128, it knows another byte follows, so C1 AC together form one code, which in the GB-2312 character set means "连".
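A small sketch of this reading logic, using Python's built-in GB2312 codec and the byte values quoted above (the sample data is only illustrative):

# A byte below 0x80 is plain ASCII; a byte of 0x80 or above is a leading
# byte that pairs with the next byte to form one GB-2312 code.
raw = bytes([0x30, 0xC1, 0xAC, 0xCD, 0xA8])   # "0" followed by "连通" in GB-2312

print(raw[:1].decode('ascii'))    # 0
print(raw[1:].decode('gb2312'))   # 连通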
Since every language established its own character set, frequent character-set conversion in international exchange became very inconvenient. The Unicode character set was therefore proposed: it uses a fixed 16 bits (two bytes, one word) per character and can thus represent 65,536 characters, covering the common characters of almost all the world's languages and making information exchange much easier. This standard two-byte form of Unicode is called UTF-16. Later, so that double-byte Unicode could also pass correctly through existing byte-oriented processing systems, UTF-8 appeared, which encodes Unicode in an MBCS-like fashion. Note that UTF-8 is an encoding, and it belongs to the Unicode character set: the Unicode character set has several encoding forms, whereas ASCII has only one, and most MBCS character sets (including GB-2312) also have only one.
For example, the UTF-16 (little endian) encoding of "连通" is: DE 8F 1A 90
And its UTF-8 encoding is: E8 BF 9E E9 80 9A
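These byte sequences can be reproduced with any Unicode-aware language; a small Python sketch:

text = '连通'   # the two characters discussed above

print(text.encode('utf-16-le').hex(' '))   # de 8f 1a 90
print(text.encode('utf-8').hex(' '))       # e8 bf 9e e9 80 9a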
Finally, when a piece of software opens a text file, the first thing it does is determine which character set and encoding the text was saved with. Software has three ways to determine the character set and encoding of a text:

The most standard way is to detect the first few bytes of the text, as follows:
[Image: table of character-set signatures (byte-order marks): http://pic.enorth.com.cn/01/11/1117082_719703.jpg]

With the signature inserted at the start, the UTF-16 (little endian) and UTF-8 encodings of "连通" become:
FF FE DE 8F 1A 90
EF BB BF E8 BF 9E E9 80 9A
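A minimal sketch of this first method, checking only the two signatures shown above plus the UTF-16 big-endian mark FE FF (the function name and the fallback label are our own):

# Detect a character-set signature (BOM) at the start of a file's bytes.
SIGNATURES = [
    (b'\xef\xbb\xbf', 'UTF-8'),
    (b'\xff\xfe',     'UTF-16 little endian'),
    (b'\xfe\xff',     'UTF-16 big endian'),
]

def sniff_bom(data: bytes) -> str:
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name
    return 'no signature (ANSI/MBCS or unmarked Unicode)'

print(sniff_bom(bytes.fromhex('FF FE DE 8F 1A 90')))           # UTF-16 little endian
print(sniff_bom(bytes.fromhex('EF BB BF E8 BF 9E E9 80 9A')))  # UTF-8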
But MBCS text carries no such character-set signature at the start, and, worse, some early or poorly designed software does not insert the signature when saving Unicode text either, so software cannot rely on this method alone. In that case the software can take a relatively safe approach to determining the character set and encoding: pop up a dialog box and ask the user. For example, if you drag the "连通" file onto MS Word, Word pops up such a dialog.
If the software does not want to bother the user, or it is not convenient to ask, it can only "guess": it examines the characteristics of the whole text and decides which character set it most likely belongs to and which it most likely does not. This is exactly what happens when Notepad opens the "连通" file.
We can verify this. Type "连通" in Notepad, choose "Save As", and you will see "ANSI" in the last drop-down box; save the file. When the "连通" file is reopened and shows garbled text, click "File" -> "Save As", and you will see "UTF-8" in the last drop-down box. This shows that Notepad believes the text it currently has open is UTF-8 encoded, even though we saved it as ANSI. In other words, Notepad guessed at the character set of the "连通" file and decided it looks more like UTF-8 text. This is because the GB-2312 encoding of the two characters "连通" happens to look like a valid UTF-8 sequence; it is a coincidence, and it does not happen with most other words. You can use Notepad's Open dialog instead and choose ANSI in the last drop-down box when opening the "连通" file, and it will display correctly. Conversely, if you save the file as UTF-8 in the first place, it will not be garbled when reopened.
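To see the coincidence concretely, here is a naive guesser in the same spirit (a rough sketch, not Notepad's real algorithm): the GB-2312 bytes of "连通" pass a UTF-8-style pattern check, while typical GB-2312 text does not.

def guess_encoding(data: bytes) -> str:
    """Crude guess: if every high byte starts what looks like a well-formed
    UTF-8 sequence, call the text UTF-8, otherwise ANSI."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # plain ASCII byte
            i += 1
        elif 0xC0 <= b <= 0xDF:             # looks like a two-byte UTF-8 lead
            if i + 1 >= len(data) or not 0x80 <= data[i + 1] <= 0xBF:
                return 'ANSI'
            i += 2
        elif 0xE0 <= b <= 0xEF:             # looks like a three-byte UTF-8 lead
            if i + 2 >= len(data) or not all(0x80 <= x <= 0xBF for x in data[i + 1:i + 3]):
                return 'ANSI'
            i += 3
        else:
            return 'ANSI'
    return 'UTF-8'

print(guess_encoding('连通'.encode('gb2312')))     # UTF-8  (the unlucky coincidence)
print(guess_encoding('中文编码'.encode('gb2312')))  # ANSI   (typical GB-2312 text)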
If you put the "connect" file in MS Word, Word will also think it is a UTF-8 encoded file, but it cannot be determined, so a dialog box will pop up with the user, then select "Simplified Chinese (GB2312)" It will open it normally. Notepad is more simplified at this point, which is consistent with the positioning of this program.
Our thanks to the senior engineer for the explanation, which gives us a much clearer understanding of this phenomenon.