Starting from "Microsoft has a grudge against Unicom"

xiaoxiao2021-03-06  32

There is a joke circulating on the Internet that Microsoft has a grudge against Unicom. It goes roughly like this:

1. If your operating system is Win2000 or WinXP, right-click on the desktop and choose New - Text Document.
2. Open "New Text Document", type "Mobile" (移动), then save and close it.
3. Reopen "New Text Document". What do you see? Is it the two characters for "Mobile" that you just typed?
4. Change "Mobile" to "Telecom" (电信) and then to "Netcom" (网通), repeating steps 1-3. No problem, right?
5. Now try "Unicom" (联通) and repeat steps 1-3. You will find that the word "Unicom" you just typed has been replaced by a burnt mobile phone battery (a symbol). It seems Microsoft really does have a grudge against Unicom!

A joke is of course a joke and should not be taken seriously. But why does this happen? Is it a Microsoft bug? It does look like one, but Microsoft is the world's top software company, and Notepad is probably the simplest application in Windows, so calling this a bug seems a bit unreasonable, doesn't it?

OK, now that I have given my subjective opinion, let us set out on the hard road of looking for the truth :).

You may not have noticed, but Notepad's Open and Save dialogs have one more option than the ordinary file dialog: an Encoding option, which specifies whether the file is Unicode, ANSI, or UTF-8. "Oh, I know," you may say, "this must be the fault of the Windows API IsTextUnicode. A text file does not store its own encoding, so when Notepad opens a file it calls IsTextUnicode to determine the encoding. IsTextUnicode works from the content of the text, so it is really just guessing the format. With only the two characters of 'Unicom' to go on, a wrong guess is forgivable. OK, problem solved."

To be honest, that is what I thought at first, but I later found that I had made two mistakes. 1. IsTextUnicode did not guess wrong; if you don't believe it, check the return value of IsTextUnicode("联通", 4, NULL). 2. Notepad may in fact save encoding information; more on that later.
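One way to check that return value yourself is to call the API through Python's ctypes. This is only a sketch: the helper name `probe_is_text_unicode` is made up here, it assumes the text was saved as ANSI on a system whose ANSI code page is GBK, and it only does real work on Windows (elsewhere it returns None).

```python
import ctypes
import sys


def probe_is_text_unicode(data: bytes):
    """Call the Win32 IsTextUnicode API on data.

    Returns True/False on Windows and None on any other platform,
    because the API lives in advapi32.dll.
    """
    if sys.platform != "win32":
        return None
    fn = ctypes.windll.advapi32.IsTextUnicode
    fn.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    fn.restype = ctypes.c_int  # BOOL
    # Passing None sends NULL for lpiResult, which asks for all available tests.
    return bool(fn(data, len(data), None))


# Assumption: the ANSI code page is GBK, so the two characters are 4 bytes.
raw = "联通".encode("gbk")
print(probe_is_text_unicode(raw))
```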

It turns out that, besides checking for Unicode, Notepad also has to decide whether the file is UTF-8. The encoding of "Unicom" (联通) is, in byte order from low to high: C1 AA CD A8, which in binary is: 11000001 10101010 11001101 10101000. Compare this with the UTF-8 encoding scheme (for details, please see http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html):

0000-007F: characters are not converted
0080-07FF: encoded as 110xxxxx 10xxxxxx
0800-FFFF: encoded as 1110xxxx 10xxxxxx 10xxxxxx

It is not hard to see that the encoding of "Unicom" fits the second case, so Notepad decides the file is UTF-8 and decodes it to 00000000 01101010 and 00000011 01101000. Note: the first two bytes decode to 0x006A, which is not between 0080 and 07FF, so it is treated as an invalid value and ignored. The last two bytes, once the byte order is adjusted, become 0x0368, and that is the burnt battery (depending on the font used).

PS:

1. If you specify an encoding other than ANSI when saving, Notepad records the encoding in the first few bytes of the file: Unicode corresponds to 0xFEFF, Unicode Big Endian to 0xFFFE, and UTF-8 to 0xBFBBEF. (These values are the leading bytes read as a little-endian integer; in the file itself the bytes are FF FE, FE FF, and EF BB BF respectively.) These values are called the BOM (byte order mark). If the file has a BOM, Notepad uses it directly to determine the encoding; otherwise it decides based on the file content.

2. During the analysis I used UltraEdit to view the file's hexadecimal content, but it automatically converts the encoding and adds a BOM, so what it shows does not match the actual file (a 4-byte file became 6 bytes in UltraEdit), and this sent me on some detours.
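The BOM lookup from note 1 and the misreading of C1 AA CD A8 as 110xxxxx 10xxxxxx pairs can both be reproduced with a small Python sketch. It is only an illustration of the behavior described in this article, not Notepad's actual code: the function names `detect_bom` and `lenient_two_byte_decode` are made up here, and the ANSI code page is assumed to be GBK.

```python
import codecs

# BOM byte sequences as they appear at the very start of a file.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE ("Unicode" in Notepad)
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF ("Unicode Big Endian")
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def lenient_two_byte_decode(data: bytes):
    """Decode consecutive 110xxxxx 10xxxxxx pairs without rejecting bad values.

    Returns a list of (code_point, kept) tuples. kept is False when the
    value falls outside 0x0080-0x07FF, i.e. when it would be discarded.
    A trailing odd byte, if any, is not examined.
    """
    out = []
    for i in range(0, len(data) - 1, 2):
        lead, trail = data[i], data[i + 1]
        if lead & 0b11100000 != 0b11000000 or trail & 0b11000000 != 0b10000000:
            raise ValueError("not a 110xxxxx 10xxxxxx pair")
        code_point = ((lead & 0b00011111) << 6) | (trail & 0b00111111)
        out.append((code_point, 0x0080 <= code_point <= 0x07FF))
    return out


# Assumption: an ANSI save on a GBK system writes these four bytes, no BOM.
raw = "联通".encode("gbk")
print(raw.hex(" "))     # c1 aa cd a8
print(detect_bom(raw))  # None, so the content has to be examined instead
for code_point, kept in lenient_two_byte_decode(raw):
    print(f"U+{code_point:04X}", "kept" if kept else "discarded")
```

Reading the file in binary mode, as this sketch does with raw bytes, also avoids the problem in note 2, because nothing converts the encoding or adds a BOM along the way.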

When reposting, please cite the original address: https://www.9cbs.com/read-39440.html
