C++ String Complete Guide - Win32 Character Encoding (1)


C++ String Complete Guide - Win32 Character Encoding (1) Author: Translation: Ripple Category: VC / VC.NET Date: 2003-1-6 14:35:46

http://www.zdnet.com.cn/developer/tech/story/0,2000081602,39098124-1,00.HTM

Preface

Strings come in many different forms, such as TCHAR, std::string, and BSTR, and sometimes you will also run into macros whose names start with _tcs. The purpose of this guide is to explain the various string types and what they are for, and to show how to convert between them when necessary.

The first part of the guide introduces three character encoding schemes. It is important to understand how these encodings work. Even if you already know that a string is an array of characters, please read this part; it will help you understand how the various string classes relate to one another. The second part of the guide covers the string classes themselves: when to use which class, and how to convert between them.

String Basics - ASCII, DBCS, Unicode

All string classes derive from the C-style string, and a C-style string is an array of characters, so we start with character types. There are three encoding schemes and three character types.

The first scheme is the single-byte character set, or SBCS, in which every character is exactly one byte long. ASCII is an SBCS. An SBCS string is terminated by a single zero byte.

The second scheme is the multi-byte character set, or MBCS, which contains both single-byte and multi-byte characters. The MBCS encodings used in Windows contain only two kinds of characters, single-byte and double-byte. Because the longest multi-byte character used in Windows is two bytes, the term double-byte character set, or DBCS, is commonly used in place of MBCS.

In a DBCS encoding, certain values are reserved to indicate that a byte is the first half of a double-byte character. For example, in Shift-JIS (a common Japanese encoding), the values 0x81-0x9F and 0xE0-0xFC mean: "This is a double-byte character, and the next byte is part of it."
Such values are called lead bytes, and they are always greater than 0x7F. The byte that follows a lead byte is called the trail byte. In DBCS, the trail byte can be any non-zero value. Like an SBCS string, a DBCS string is terminated by a single zero byte.

The third scheme is Unicode. In the Unicode encoding discussed here, every character is two bytes long. Unicode is sometimes called a wide character set because its characters are wider (and use more memory) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS encoding is that its characters vary in length. A Unicode string is terminated by two zero bytes (the encoding of the value 0 in a wide character).

Single-byte characters cover the Latin alphabet, accented characters, and the characters defined by the ASCII standard and the DOS operating system. Double-byte character sets are used for East Asian and Middle Eastern languages. Unicode is used internally by COM and Windows NT.

You are no doubt already familiar with single-byte characters, whose data type is char. Double-byte characters are also handled with the char type (one of the many oddities of double-byte character sets). Unicode characters use the wchar_t type. Unicode character and string literals are written with an L prefix, for example:

    wchar_t wch = L'1';             // 2 bytes, 0x0031
    const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters

Strings are stored in memory one character after another, followed by the terminating zero. The SBCS string "Bob" is stored as:

    42 6F 62 00
    B  o  b  EOS

In Unicode, L"Bob" is stored as:

    0042 006F 0062 0000
    B    o    b    EOS

with 0x0000 (the Unicode encoding of zero) ending the string. DBCS looks a bit like SBCS at first glance, but we will see subtle differences in string handling and pointer use.
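
The sizes and byte layouts described above can be checked with a short sketch. This is only an illustration: the helper names hex_bytes and literal_bytes are hypothetical, and char16_t is used as a stand-in for the 2-byte wchar_t that Windows uses, because wchar_t is 4 bytes on some other platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Returns the bytes of a buffer as space-separated hex, e.g. "42 6F 62 00".
// (Hypothetical helper, not part of any library.)
std::string hex_bytes(const void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Size in bytes of a string literal, terminator included.
// Works for char, wchar_t, and char16_t literals alike.
template <typename Char, std::size_t N>
constexpr std::size_t literal_bytes(const Char (&)[N]) {
    return N * sizeof(Char);
}
```

For instance, hex_bytes("Bob", sizeof("Bob")) gives "42 6F 62 00", and literal_bytes(u"Hello") gives 12 on platforms where char16_t is 2 bytes wide, which matches the "12 bytes, 6 wide characters" comment above. Dumping u"Bob" the same way shows 42 00 6F 00 62 00 00 00 on a little-endian machine such as x86.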

The string "nihongo" is stored as follows, with LB and TB marking lead bytes and trail bytes respectively:

    93 FA  96 7B  8C EA  00
    LB TB  LB TB  LB TB  EOS
    ni     hon    go

Note that the value of "ni" is not the WORD value 0xFA93. The bytes 93 and FA, in that order, together encode the character "ni". (So on a big-endian CPU, the bytes would still be stored in the order shown above.)

String Processing Functions

The C string functions such as strcpy(), sprintf(), and atol() can only be used with single-byte strings. There are corresponding versions that work only with Unicode strings, such as wcscpy(), swprintf(), and _wtol().

Microsoft added support for DBCS strings to its C runtime library (CRT). For each strxxx() function there is a corresponding _mbsxxx() function for DBCS. When processing strings that may contain DBCS characters (Japanese, Chinese, or another DBCS), use the _mbsxxx() functions. These functions also work on SBCS strings, because a DBCS string may contain only single-byte characters.

An example shows the difference between the string functions. Take the Unicode string L"Bob":

    42 00 6F 00 62 00 00 00
    B     o     b     EOS

Because x86 CPUs are little-endian, the value 0x0042 is stored as 42 00. If this string is passed to strlen(), the function sees the first byte 42, then 00, which it takes as the end of the string, so it returns 1. The reverse case, passing the SBCS string "Bob" to wcslen(), is worse. wcslen() first reads 0x6F42, then 0x0062, and then keeps reading past the end of the buffer until it happens to find a 00 00 sequence or the program faults.

So how do the strxxx() functions and their _mbsxxx() counterparts work? The difference between them is important, because it directly affects how to traverse a DBCS string correctly. We will look at string traversal first and then come back to strxxx() versus _mbsxxx().

Most of us grew up with SBCS strings, so we are used to walking through strings with the ++ and -- pointer operators, and sometimes to using array notation to access the characters in a string.
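
The two habits just mentioned can be sketched as follows for fixed-width characters. The function names are hypothetical and exist only for illustration. The templates accept char (SBCS) and wchar_t strings alike, on the assumption that every character occupies exactly one array element.

```cpp
#include <cassert>
#include <cstddef>

// Counts occurrences of `target` by walking a pointer forward with ++.
template <typename Char>
std::size_t count_with_pointer(const Char* s, Char target) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (const Char* p = s; *p != Char(0); ++p) {
        if (*p == target) ++count;
    }
    return count;
}

// Same count, using array notation instead of pointer arithmetic.
template <typename Char>
std::size_t count_with_index(const Char* s, Char target) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; s[i] != Char(0); ++i) {
        if (s[i] == target) ++count;
    }
    return count;
}

// Returns the last character by finding the terminator and stepping back with --.
// Returns 0 for an empty string.
template <typename Char>
Char last_char(const Char* s) {
    assert(s != nullptr);
    const Char* p = s;
    while (*p != Char(0)) ++p;   // p now points at the terminator
    if (p == s) return Char(0);  // empty string: nothing to step back to
    --p;                         // safe only when every character has the same width
    return *p;
}
```

Calling count_with_pointer("a:b:c", ':') or count_with_index(L"a:b:c", L':') returns 2 in both cases, and last_char("C:\\dir\\") returns the backslash.
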
Both methods work fine for SBCS and Unicode strings, because all characters have the same length and the compiler can correctly return the character we are after. For DBCS strings, however, this does not hold. There are two rules for accessing a DBCS string through a pointer, and breaking either of them causes errors:

1. Do not use the ++ operator unless you check for a lead byte each time.
2. Never use the -- operator to traverse a string.

Rule 2 is illustrated first, because it is easy to find a non-contrived example of code that breaks it. Suppose a program has a configuration file that it reads from its install directory at startup, for example C:\Program Files\MyCoolApp\config.bin. The file itself is fine. Suppose the file name is built with the following code:

    #include <windows.h>  // MAX_PATH
    #include <string.h>

    bool GetConfigFileName(char* pszName, size_t nBuffSize)
    {
        char szConfigFilename[MAX_PATH];

        // Read the install directory from the registry here; assume this succeeds.

        // If the path does not end with a backslash, append one.
        // First, get a pointer to the terminating zero:
        char* pLastChar = strchr(szConfigFilename, '\0');

        // Then step back one character:
        pLastChar--;

        if (*pLastChar != '\\')
            strcat(szConfigFilename, "\\");

        // Append the file name:
        strcat(szConfigFilename, "config.bin");

        // If the caller's buffer is large enough, return the file name:
        if (strlen(szConfigFilename) >= nBuffSize)
            return false;
        else
        {
            strcpy(pszName, szConfigFilename);
            return true;
        }
    }

This code is very defensive, yet it breaks on DBCS strings. Suppose the install directory has a Japanese name: C:\ヨウコソ. The string is stored in memory as:

    43 3A 5C 83 88 83 45 83 52 83 5C 00
    C  :  \  ヨ    ウ    コ    ソ    EOS

When GetConfigFileName() checks whether the path ends with a backslash, it gets the wrong answer and builds the wrong file name. What went wrong? Look at the two 0x5C bytes above. The first 0x5C is the character "\". The second is the trail byte of 83 5C, which encodes the character "ソ". The function, however, mistakes it for a backslash.

The correct approach is to use DBCS-aware functions to move the pointer to a proper character boundary, as shown below:

    #include <windows.h>   // MAX_PATH, CharPrev()
    #include <mbstring.h>  // _mbsxxx() functions
    #include <string.h>

    bool FixedGetConfigFileName(char* pszName, size_t nBuffSize)
    {
        char szConfigFilename[MAX_PATH];

        // Read the install directory from the registry here; assume this succeeds.

        // If the path does not end with a backslash, append one.
        // The _mbsxxx() functions take unsigned char*, hence the casts.
        unsigned char* pszConfig = (unsigned char*)szConfigFilename;

        // First, get a pointer to the terminating zero:
        char* pLastChar = (char*)_mbschr(pszConfig, '\0');

        // Then step back one whole character (one or two bytes):
        pLastChar = CharPrev(szConfigFilename, pLastChar);

        if (*pLastChar != '\\')
            _mbscat(pszConfig, (const unsigned char*)"\\");

        // Append the file name:
        _mbscat(pszConfig, (const unsigned char*)"config.bin");

        // If the caller's buffer is large enough, return the file name.
        // The buffer size is in bytes, so compare the byte length, not the
        // character count that _mbslen() would return.
        if (strlen(szConfigFilename) >= nBuffSize)
            return false;
        else
        {
            _mbscpy((unsigned char*)pszName, pszConfig);
            return true;
        }
    }

This improved function uses the CharPrev() API to move the pointer pLastChar back one character.
If the last character of the string is a double-byte character, the pointer moves back 2 bytes. The result is now correct, because the trail byte is no longer mistaken for a backslash.

Rule 1 should now be easy to picture. For example, when traversing a string to look for the character ":", if you advance with the ++ operator instead of CharNext(), you can get a false match whenever a trail byte happens to have the same value as ":".

Related to rule 2 is a rule about array indexes:

2a. Never subtract from an index into a string array.

The reason is the same as for rule 2.
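
To make the two rules concrete without depending on Windows, here is a minimal portable sketch. It assumes Shift-JIS input and uses only the lead-byte ranges quoted earlier (0x81-0x9F and 0xE0-0xFC). The helpers sjis_next and sjis_prev are hypothetical stand-ins that mimic what CharNext() and CharPrev() do; they are not the real APIs. The test path is a shortened, made-up variant of the one above that still ends in the bytes 83 5C.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Shift-JIS lead-byte ranges as quoted in the text: 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Rule 1: advance one whole character (1 or 2 bytes), checking for a lead byte.
// Stays put at the terminating zero.
const char* sjis_next(const char* p) {
    assert(p != nullptr);
    if (*p == '\0') return p;
    if (is_sjis_lead(static_cast<unsigned char>(p[0])) && p[1] != '\0') return p + 2;
    return p + 1;
}

// Rule 2: never use --. To step back one character, rescan from the start,
// because a byte seen in isolation cannot be classified as lead, trail, or single.
const char* sjis_prev(const char* start, const char* cur) {
    assert(start != nullptr && cur != nullptr && start <= cur);
    const char* prev = start;
    for (const char* p = start; p < cur; p = sjis_next(p)) {
        prev = p;
    }
    return prev;  // equals start when cur == start
}

// Naive check: looks only at the last byte, so a trail byte 0x5C fools it.
bool ends_with_backslash_naive(const char* path) {
    assert(path != nullptr);
    std::size_t len = std::strlen(path);
    return len > 0 && path[len - 1] == '\\';
}

// DBCS-aware check: looks at the first byte of the last whole character.
bool ends_with_backslash_dbcs(const char* path) {
    assert(path != nullptr);
    const char* end = path + std::strlen(path);
    const char* last = sjis_prev(path, end);
    return last != end && *last == '\\';
}
```

With the path bytes 43 3A 5C 83 5C (written "C:\\\x83\x5C" in source), ends_with_backslash_naive() reports a trailing backslash while ends_with_backslash_dbcs() does not. For an all-ASCII path such as "C:\\dir\\" the DBCS-aware check still returns true. Rescanning from the start makes stepping back linear in the length of the string, which is the price of variable-width characters.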

Please credit the original address when reposting: https://www.9cbs.com/read-90880.html
