The Complete Guide to C++ Strings, Part I - Win32 Character Encodings

Original author: Michael Dunn
Translation: Chengjie Sun
Original source: CodeProject, "The Complete Guide to C++ Strings, Part I"

Introduction

You have undoubtedly seen TCHAR, std::string, BSTR, and other string types, as well as the strange macros that start with _tcs, and you may have found yourself staring at the monitor in confusion. This guide summarizes the purpose of each character type, shows some simple usage, and explains how to convert between the various string types when necessary.

In Part I, we introduce three character encoding types. It is important to understand how each encoding scheme works. Even if you already know that a string is an array of characters, you should still read this part. Once you understand it, the relationship between the various string types becomes clear.

In Part II, we will cover the string classes individually: when to use each one and how to convert between them.

Character Basics - ASCII, DBCS, Unicode

All string classes are ultimately based on the C-style string, and a C-style string is an array of characters, so we begin with the character types. There are three encoding schemes and three corresponding character types.

The first encoding type is the single-byte character set, or SBCS. In this encoding, every character is represented by exactly one byte. ASCII is an SBCS. A single zero byte marks the end of an SBCS string.

The second encoding type is the multi-byte character set, or MBCS. An MBCS encoding contains some characters that are one byte long and others that are longer than one byte. The MBCS encodings used in Windows contain two kinds of characters: single-byte characters and double-byte characters. Because the largest multi-byte characters used in Windows are two bytes long, the term double-byte character set, or DBCS, is commonly used in place of MBCS.

In a DBCS encoding, certain byte values are reserved to indicate that they are part of a double-byte character.
For example, in Shift-JIS (a commonly used Japanese encoding), the values 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of this character." Such values are called lead bytes, and they are always greater than 0x7F. The byte that follows a lead byte is called the trail byte. In DBCS, the trail byte can be any non-zero value. As with SBCS, the end of a DBCS string is marked by a single zero byte.

The third encoding type is Unicode. As used in Windows, Unicode is an encoding in which every character is two bytes long. Unicode characters are sometimes called wide characters because they are wider (use more storage) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS is that its characters are encoded with different numbers of bytes. A Unicode string uses two zero bytes as its end marker.

Single-byte characters cover the Latin alphabet, accented characters, and the graphic characters defined by the ASCII standard and the DOS operating system. Double-byte characters are used for East Asian and Middle Eastern languages. Unicode is used in COM and internally in Windows NT.

You are certainly already familiar with single-byte characters: whenever you use char, you are dealing with a single-byte character.
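As a rough illustration of the lead-byte rule, the following C sketch tests a byte against the two Shift-JIS ranges quoted above and uses that test to count characters rather than bytes. The helper names sjis_is_lead_byte and sjis_char_count are hypothetical and exist only for this example. The sketch assumes well-formed input and does not validate trail bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if b falls in the Shift-JIS lead-byte ranges quoted in the
   text (0x81-0x9F or 0xE0-0xFC), otherwise 0. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters (not bytes) in a zero-terminated Shift-JIS string.
   Hypothetical helper for illustration; trail bytes are not validated. */
size_t sjis_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (sjis_is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;     /* lead byte + trail byte form one character */
        else
            s += 1;     /* single-byte character */
        ++count;
    }
    return count;
}
```

For a string made of the six bytes 93 FA 96 7B 8C EA (three double-byte characters), strlen() reports 6 while sjis_char_count() reports 3.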
Double-byte characters are also handled with the char type (this is the first of many oddities we will encounter with double-byte characters). Unicode characters are represented by the wchar_t type. Unicode character and string literals are written with an L prefix. For example:

wchar_t        wch = L'1';      // 2 bytes, 0x0031
const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters

How characters are stored in memory

A single-byte string is stored one character after another, one byte each, and ends with a single zero byte. For example, "Bob" is stored as follows:

42  6F  62  00
B   o   b   EOS

The Unicode form, L"Bob", is stored as follows:

42 00  6F 00  62 00  00 00
B      o      b      EOS

Here the end marker is a zero encoded in two bytes.

At first glance a DBCS string looks very similar to an SBCS string, but we will see later that DBCS strings have subtleties that cause string-handling functions and pointer-based traversal to produce unexpected results. The string "日本語" ("nihongo") is stored in memory as follows (LB and TB denote lead byte and trail byte):

93 FA  96 7B  8C EA  00
LB TB  LB TB  LB TB  EOS
日     本     語     EOS

Note that the value of "ni" must not be interpreted as the WORD value 0xFA93. Its encoding is the two byte values 93 and FA, in that order.

Using string-handling functions

We have all used C string functions such as strcpy(), sprintf(), atol(), and so on. These functions should be used only with single-byte strings. The library also provides versions that work on Unicode strings, such as wcscpy(), swprintf(), and _wtol().

Microsoft also added DBCS-aware versions to its CRT (C runtime library). Each str***() function has a corresponding DBCS version named _mbs***(). If you expect to encounter DBCS strings (which you will if your software is installed in countries that use DBCS encodings, such as China or Japan), you should always use the _mbs***() functions, because they also accept SBCS strings. (A DBCS string may consist entirely of single-byte characters, which is why the _mbs***() functions can handle SBCS strings as well.)

Let us look at a typical string to see why different versions of the string-handling functions are needed. We use the Unicode string L"Bob" from before:

42 00  6F 00  62 00  00 00
B      o      b      EOS

Because x86 CPUs are little-endian, the value 0x0042 is stored in memory as 42 00. Can you see what happens if this string is passed to strlen()? It sees the first byte, 42, and then 00, which it takes as the end of the string, so strlen() returns 1.

The reverse case, passing the single-byte string "Bob" to wcslen(), is even worse. wcslen() first sees 0x6F42, then 0x0062, and then keeps reading past the end of your buffer until it happens to find a 00 00 sequence or causes a GPF.

So far we have covered the str***() and wcs***() functions and the differences between them. What about the difference between str***() and _mbs***()? Understanding it is essential for traversing DBCS strings correctly. Below we first cover string traversal and then return to str***() versus _mbs***().

Traversing and indexing strings correctly

Because most of us grew up using SBCS strings, we are used to traversing strings with the ++ and -- operators on a pointer, and to accessing characters with array notation. Both methods work for SBCS and Unicode strings, because all characters have the same width and the compiler can return exactly the character we ask for.

With DBCS strings, however, we must give up these habits. There are two rules for traversing a DBCS string with a pointer. Break them and your program will have DBCS-related bugs.

1. Do not use ++ to move forward unless you check for a lead byte each time.
2. Never use -- to move backward.
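
The two rules can be sketched in portable C as follows. This is only an illustration under stated assumptions: it hard-codes the Shift-JIS lead-byte ranges given earlier, and the names dbcs_is_lead_byte, dbcs_next, and dbcs_prev are hypothetical. The backward step never decrements the pointer. It rescans from the start of the string instead, because a byte seen in isolation cannot be classified as a trail byte or a single-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-JIS lead-byte test (ranges 0x81-0x9F and 0xE0-0xFC). */
int dbcs_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Rule 1: step forward one character, checking for a lead byte first. */
const char *dbcs_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;                       /* already at the terminator */
    if (dbcs_is_lead_byte((unsigned char)*p) && p[1] != '\0')
        return p + 2;                   /* skip lead byte and trail byte */
    return p + 1;                       /* single-byte character */
}

/* Rule 2: instead of p--, rescan from the start of the string and
   remember where the character just before p began. */
const char *dbcs_prev(const char *start, const char *p)
{
    const char *cur = start;
    const char *prev = start;
    assert(start != NULL && p != NULL && p >= start);
    while (cur < p && *cur != '\0') {
        prev = cur;
        cur = dbcs_next(cur);
    }
    return prev;
}
```

Given a string that ends in a double-byte character whose trail byte is 0x5C, a plain p-- from the terminator lands on that trail byte and sees what looks like a backslash, while dbcs_prev() returns a pointer to the lead byte of the whole character. On Windows, the CharNext() and CharPrev() APIs provide this kind of stepping for the current code page.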

Let us illustrate rule 2 first, since it is easy to find a realistic example of code that breaks it. Suppose you have a program that stores a configuration file in its own directory, and you keep the installation directory in the registry. At run time, you read the installation directory from the registry, build the name of the configuration file, and then read that file. If the installation directory is C:\Program Files\MyCoolApp, the file name you build should be C:\Program Files\MyCoolApp\config.bin. The code works perfectly on your test machine.

Here is how you might write the code that builds the file name:

bool GetConfigFileName(char* pszName, size_t nBuffSize)
{
    char szConfigFilename[MAX_PATH];

    // Read install dir from registry... we'll assume it succeeds.

    // Add on a backslash if it isn't already there.
    // First, get a pointer to the terminating zero.
    char* pLastChar = strchr(szConfigFilename, '\0');

    // Now move it back one character.
    pLastChar--;

    if (*pLastChar != '\\')
        strcat(szConfigFilename, "\\");

    // Add on the name of the config file.
    strcat(szConfigFilename, "config.bin");

    // If the caller's buffer is big enough, return the filename.
    if (strlen(szConfigFilename) >= nBuffSize)
        return false;
    else
    {
        strcpy(pszName, szConfigFilename);
        return true;
    }
}

This is very defensive code, yet it breaks when it meets certain DBCS characters. Let us see why. Suppose a Japanese user installs your program in C:\ヨウコソ. Here is how that directory name is stored in memory:

43  3A  5C  83 88  83 45  83 52  83 5C  00
            LB TB  LB TB  LB TB  LB TB
C   :   \   ヨ     ウ     コ     ソ     EOS

When GetConfigFileName() checks for the trailing backslash, it looks at the last non-zero byte of the installation directory, sees that it equals '\\', and so does not append another backslash. The result is that the code returns the wrong file name.

So what went wrong? Look at the last two bytes of the string, 83 5C. The value of the backslash character is 0x5C. The value of 'ソ' is 83 5C. The code above mistakenly read a trail byte and treated it as a character in its own right.

The correct way to traverse backward is to use a function that is aware of DBCS characters and moves the pointer by the correct number of bytes. Here is the corrected code, which uses such a function to move the pointer:

bool FixedGetConfigFileName(char* pszName, size_t nBuffSize)
{
    char szConfigFilename[MAX_PATH];

    // Read install dir from registry... we'll assume it succeeds.

    // Add on a backslash if it isn't already there.
    // First, get a pointer to the terminating zero.
    // (The _mbs***() functions take unsigned char*, hence the casts.)
    char* pLastChar = (char*) _mbschr((unsigned char*) szConfigFilename, '\0');

    // Now move it back one double-byte character.
    pLastChar = CharPrev(szConfigFilename, pLastChar);

    if (*pLastChar != '\\')
        _mbscat((unsigned char*) szConfigFilename, (const unsigned char*) "\\");

    // Add on the name of the config file.
    _mbscat((unsigned char*) szConfigFilename, (const unsigned char*) "config.bin");

    // If the caller's buffer is big enough, return the filename.
    // nBuffSize counts chars, so compare it against the length in bytes.
    if (strlen(szConfigFilename) >= nBuffSize)
        return false;
    else
    {
        _mbscpy((unsigned char*) pszName, (const unsigned char*) szConfigFilename);
        return true;
    }
}

The function above uses the CharPrev() API to move pLastChar back by one character, which may be two bytes long. In this version the if condition works correctly, because a lead byte can never equal 0x5C.

You can probably imagine a way to break rule 1 as well. For example, you might want to check whether a file name entered by the user contains more than one ':' character. If you walk the string with ++ instead of CharNext(), you may report a spurious error whenever a trail byte happens to have the same value as ':'.

Related to rule 2 is this rule about string indexes:

2a. Never calculate a string index using subtraction.

Code that breaks this rule looks very similar to code that breaks rule 2. For example,

char* pLastChar = &szConfigFilename[strlen(szConfigFilename) - 1];

has exactly the same effect as moving a pointer backward by one byte.

Back to str***() versus _mbs***()

It should now be clear why the _mbs***() functions are needed. The str***() functions know nothing about DBCS characters, while the _mbs***() functions do. If you call strrchr("C:\\ヨウコソ", '\\'), the result may be wrong, whereas _mbsrchr() recognizes the double-byte character at the end and returns a pointer to the real backslash.

A final point about the string functions: the str***() functions count lengths in char, so if a string contains three double-byte characters, strlen() returns 6. Be careful with the _mbs***() versions, because not all of them count the same way: _mbslen() returns the number of characters (3 in this example), not the number of bytes, so its result must not be compared with a buffer size given in chars. The Unicode functions count lengths in wchar_t. For example, wcslen(L"Bob") returns 3.

MBCS and Unicode in the Win32 API

The two sets of APIs

Although you may never have noticed, every Win32 API and message that deals with strings comes in two versions. One version accepts MBCS strings and the other accepts Unicode strings. For example, there is no API called SetWindowText() at all. Instead, there are SetWindowTextA() and SetWindowTextW(). The A suffix indicates the MBCS version, and the W suffix indicates the Unicode version.
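
To make the A/W naming convention concrete, here is a minimal sketch of a function pair built the same way. DemoStorageBytesA and DemoStorageBytesW are hypothetical names, not Win32 APIs; each returns the number of bytes its string occupies, terminator included. Note that wchar_t is two bytes on Windows but may be wider on other platforms, so the sketch uses sizeof(wchar_t) rather than a hard-coded 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* "A" version: takes a char string (SBCS/MBCS). Returns the number of
   bytes the string occupies, including its one-byte terminator. */
size_t DemoStorageBytesA(const char *text)
{
    assert(text != NULL);
    return (strlen(text) + 1) * sizeof(char);
}

/* "W" version: takes a wide string. Returns the number of bytes the
   string occupies, including its wide (all-zero) terminator. */
size_t DemoStorageBytesW(const wchar_t *text)
{
    assert(text != NULL);
    return (wcslen(text) + 1) * sizeof(wchar_t);
}
```

The caller has to pick the version that matches the string type: DemoStorageBytesA("Bob") for a char string and DemoStorageBytesW(L"Bob") for a wide string. Passing the wrong kind of string to either one is a type error.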

When you build a Windows program, you can choose to use either the MBCS or the Unicode APIs. If you used the Visual C++ AppWizard and never changed the preprocessor settings, you are using the MBCS versions. So, if there is no SetWindowText() API, how is it that we can use that name? The winuser.h header file contains some #defines, like this:

BOOL WINAPI SetWindowTextA(HWND hWnd, LPCSTR lpString);
BOOL WINAPI SetWindowTextW(HWND hWnd, LPCWSTR lpString);

#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif

When you build with the MBCS APIs, UNICODE is not defined, so the preprocessor sees:

#define SetWindowText  SetWindowTextA

This macro definition turns every call to SetWindowText into a call to the real API function, SetWindowTextA. (You can of course call SetWindowTextA() or SetWindowTextW() directly, although you rarely need to.)

If you want to switch the default APIs to the Unicode versions, go to the preprocessor settings, remove _MBCS from the list of predefined symbols, and add UNICODE and _UNICODE. (You need both, because different headers use different symbols.) However, if you have been using plain char for your strings, you will run into a problem. Consider the following code:

HWND hwnd = GetSomeWindowHandle();
char szNewText[] = "We love Bob!";
SetWindowText(hwnd, szNewText);

After the preprocessor replaces the SetWindowText macro, the code becomes:

HWND hwnd = GetSomeWindowHandle();
char szNewText[] = "We love Bob!";
SetWindowTextW(hwnd, szNewText);

Do you see the problem? We are passing a single-byte string to a function that takes a Unicode string. The first solution to this problem is to use #ifdef around the definition of the string variable:

HWND hwnd = GetSomeWindowHandle();
#ifdef UNICODE
wchar_t szNewText[] = L"We love Bob!";
#else
char szNewText[] = "We love Bob!";
#endif
SetWindowText(hwnd, szNewText);

You can probably imagine the headache of doing this around every string in your code. The complete solution is to use TCHAR.

Using TCHAR

TCHAR is a character type that lets you use the same code for both MBCS and Unicode builds, without cluttering your code with preprocessor conditionals. The definition of TCHAR looks like this:

#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif

So TCHAR is a char in MBCS builds and a wchar_t in Unicode builds. There is also a macro, _T(), that takes care of the L prefix needed for Unicode string literals:

#ifdef UNICODE
#define _T(x) L##x
#else
#define _T(x) x
#endif

The ## is a preprocessor operator that pastes its two operands together. Whenever you need a string literal in your code, wrap it in the _T macro. In a Unicode build, the macro puts an L prefix in front of the literal.

TCHAR szNewText[] = _T("We love Bob!");

Just as there are macros that hide the SetWindowTextA/W distinction, there are macros you can use in place of the str***() and _mbs***() string functions. For example, you can use the _tcsrchr macro in place of strrchr(), _mbsrchr(), or wcsrchr(). _tcsrchr expands to the right function based on the predefined symbols, just as SetWindowText does.

It is not only the str***() functions that have TCHAR macros. There are also, for example, _stprintf (in place of sprintf() and swprintf()) and _tfopen (in place of fopen() and _wfopen()). The full list of macros is in MSDN under the title "Generic-Text Routine Mappings".

Strings and TCHAR typedefs

Because the Win32 API documentation lists functions by their common names (for example, "SetWindowText"), all strings are given in terms of TCHAR. (The exception is the Unicode-only APIs introduced in XP.) Here is a list of commonly used typedefs that you will see in MSDN:

Type      Meaning in MBCS builds                                          Meaning in Unicode builds
WCHAR     wchar_t                                                         wchar_t
LPSTR     zero-terminated string of char (char*)                          zero-terminated string of char (char*)
LPCSTR    constant zero-terminated string of char (const char*)           constant zero-terminated string of char (const char*)
LPWSTR    zero-terminated Unicode string (wchar_t*)                       zero-terminated Unicode string (wchar_t*)
LPCWSTR   constant zero-terminated Unicode string (const wchar_t*)        constant zero-terminated Unicode string (const wchar_t*)
TCHAR     char                                                            wchar_t
LPTSTR    zero-terminated string of TCHAR (TCHAR*)                        zero-terminated string of TCHAR (TCHAR*)
LPCTSTR   constant zero-terminated string of TCHAR (const TCHAR*)         constant zero-terminated string of TCHAR (const TCHAR*)

When to use TCHAR and Unicode

After all this, you are probably wondering why you would want to use Unicode when you have gotten by with plain chars for years. Using Unicode will benefit you in the following three cases:
1. Your program will run only on Windows NT.
2. Your program needs to handle file names longer than MAX_PATH characters.
3. Your program needs to use Unicode-only APIs that were introduced in XP.
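
As a recap of the TCHAR and _T() mechanism described earlier, here is a small portable sketch of the same pattern. Everything prefixed with DEMO_ or demo_ is hypothetical and stands in for the real UNICODE symbol, TCHAR, _T(), and a _tcs***() macro. The sketch only shows how a single switch selects the character type, the literal prefix, and the matching library function together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Define DEMO_UNICODE (a hypothetical switch standing in for UNICODE and
   _UNICODE) before this point to get the wide-character build. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_T(x)      L##x      /* paste an L prefix onto the literal */
#define demo_tcslen    wcslen
#else
typedef char DEMO_TCHAR;
#define DEMO_T(x)      x         /* leave the literal unchanged */
#define demo_tcslen    strlen
#endif

/* Written once against DEMO_TCHAR; compiles in either build. */
size_t DemoTitleLength(const DEMO_TCHAR *title)
{
    assert(title != NULL);
    return demo_tcslen(title);
}
```

A call such as DemoTitleLength(DEMO_T("We love Bob!")) compiles unchanged whether or not DEMO_UNICODE is defined, which is the property that TCHAR and _T() give real Win32 code.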