Some notes on the various C++ string types: character sets and encodings

xiaoxiao, 2021-03-06

Foreword

You have probably seen a variety of string types, such as TCHAR, std::string, and BSTR, as well as the macros that begin with _tcs. There are also several encodings, for example ASCII, DBCS, and Unicode. This article first explains the basics of characters and encodings. All string classes are ultimately built on C-style strings.

In the operating system, characters are organized into code pages. A code page maps numeric codes to the symbols of a particular language, and different languages may use different code pages. For example, ANSI code page 1252 is used for English and most Western European languages, while ANSI code page 932 is used for Japanese, including Kanji. All of these ANSI code pages share the lowest 128 characters, which are the ASCII character set (0x00 to 0x7F).

There are three kinds of encoding to consider: single-byte (such as ASCII), DBCS, and Unicode. Standard ASCII defines only 128 characters (0 to 127); single-byte code pages extend this to 256 characters, represented by the numbers 0 to 255. These include uppercase and lowercase letters, digits, and a small number of other characters such as punctuation and currency symbols. That is enough for most languages written in the Latin alphabet. However, many Asian languages need far more than 256 characters, some of them more than 1,000. Unicode came about to break through this limit. As the term is used here, Unicode represents each character with two bytes, which gives a range large enough to map numeric codes to the characters of many languages.

The first encoding type is the single-byte character set (SBCS). In this encoding, every character is represented by exactly one byte. ASCII is an SBCS. A single byte with the value 0 marks the end of an SBCS string.

The second encoding type is the multi-byte character set (MBCS). An MBCS encoding contains some characters that are one byte long and others that are longer than one byte. The MBCS encodings used in Windows contain two kinds of characters: single-byte characters and double-byte characters.
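The single-byte case described above can be sketched in portable C++. This is only an illustration: the helper names is_ascii_byte and sbcs_length are hypothetical and do not come from any library. It assumes a zero-terminated string in some single-byte code page.

```cpp
#include <cassert>
#include <cstddef>

// True when the byte lies in 0x00-0x7F, the range that the ANSI code pages
// share with ASCII. Bytes above 0x7F depend on the active code page.
// (Hypothetical helper, for illustration only.)
bool is_ascii_byte(unsigned char b) {
    return b <= 0x7F;
}

// Counts the characters in a single-byte (SBCS) string.
// Every character is one byte, so the character count equals the byte count,
// and the string ends at the first byte with the value 0.
// (Hypothetical helper, for illustration only.)
std::size_t sbcs_length(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}
```

For a string such as "Hello", sbcs_length returns 5, which is both the number of characters and the number of bytes before the terminator. That equality is exactly what stops holding once multi-byte characters are involved.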
Because most multi-byte characters used in Windows are two bytes long, the term MBCS is often used interchangeably with DBCS (double-byte character set).

In DBCS encodings, certain values are reserved to indicate that they are the first byte of a double-byte character. For example, in Shift-JIS (a common Japanese encoding), a value in the range 0x81-0x9F or 0xE0-0xFC means "this is a double-byte character, and the next byte is part of it." Such values are called lead bytes, and they are always greater than 0x7F. The byte that follows a lead byte is called the trail byte. In DBCS, a trail byte can be any non-zero value. As with SBCS, the end of a DBCS string is marked by a single byte with the value 0.

The third encoding type is Unicode. In the sense used here, Unicode is an encoding in which every character is stored in two bytes. Unicode characters are sometimes called wide characters because they are wider (they take more storage) than single-byte characters. Note that Unicode is not considered an MBCS: what distinguishes an MBCS is that its characters are encoded with varying numbers of bytes. A Unicode string is terminated by two bytes with the value 0.

In Visual C++, MBCS always means DBCS; character sets with characters wider than two bytes are not supported.

How to choose among these three encodings.
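The lead-byte rule for Shift-JIS can be sketched as follows. This is a minimal portable illustration, not the Windows API: the names is_sjis_lead_byte and sjis_char_count are hypothetical, the lead-byte ranges are the ones quoted in the text, and any non-zero byte after a lead byte is treated as a trail byte.

```cpp
#include <cassert>
#include <cstddef>

// Lead-byte ranges for Shift-JIS as given in the text: 0x81-0x9F and 0xE0-0xFC.
// (Hypothetical helper, for illustration only.)
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a zero-terminated Shift-JIS string.
// (Hypothetical helper, for illustration only.)
std::size_t sjis_char_count(const char* s) {
    assert(s != nullptr);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t count = 0;
    while (*p != 0) {
        if (is_sjis_lead_byte(*p) && p[1] != 0) {
            p += 2;  // lead byte plus trail byte form one double-byte character
        } else {
            p += 1;  // single-byte character, or a dangling lead byte at the end
        }
        ++count;
    }
    return count;
}
```

Given the four bytes 0x93 0xFA 0x96 0x7B followed by a zero byte, sjis_char_count reports two characters, whereas a byte-counting function such as strlen would report four. Note that the second trail byte, 0x7B, is below 0x80, which is why a DBCS string cannot be scanned by looking at each byte in isolation.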

When reposting, please credit the original address: https://www.9cbs.com/read-125014.html
