The Complete Guide to C++ Strings, Part I - Win32 Character Encodings

zhaozj2021-02-16  92

Original: Michael Dunn

Translation: chengjie sun

Original source: CodeProject: The Complete Guide to C++ Strings, Part I

Introduction

You have undoubtedly seen a variety of string types such as TCHAR, std::string, and BSTR, along with all those macros that start with _tcs, and you may have been left staring at the screen in confusion. This guide outlines the purpose of each string type, shows some simple usage, and describes how to convert between string types when necessary.

In Part I, I cover the three types of character encodings. It is crucial to understand how the encoding schemes work. Even if you already know that a string is an array of characters, read this part. Once you understand it, the relationships among the various string types will be clear.

In Part II, I will describe the string classes themselves, when to use each one, and how to convert among them.

Character basics - ASCII, DBCS, Unicode

All string classes eventually boil down to a C-style string, and a C-style string is an array of characters, so I will cover the character types first. There are three encoding schemes and three corresponding character types. The first scheme is the single-byte character set, or SBCS. In this scheme, every character is exactly one byte long. ASCII is an example of an SBCS. A single zero byte marks the end of an SBCS string.

The second scheme is the multi-byte character set, or MBCS. An MBCS encoding contains some characters that are one byte long and others that are more than one byte long. The MBCS schemes used in Windows contain two character types: single-byte characters and double-byte characters. Since the largest multi-byte character used in Windows is two bytes long, the term double-byte character set, or DBCS, is commonly used in place of MBCS.

In a DBCS encoding, certain values are reserved to indicate that they are part of a double-byte character. For example, in the Shift-JIS encoding (a commonly used Japanese scheme), the values 0x81-0x9F and 0xE0-0xFC mean "this is a double-byte character, and the next byte is part of this character." Such values are called "lead bytes," and they are always greater than 0x7F. The byte that follows a lead byte is called the "trail byte." In DBCS, the trail byte can be any non-zero value. Just as in SBCS, the end of a DBCS string is marked by a single zero byte.

The third scheme is Unicode. As Windows uses it, Unicode is an encoding in which every character is two bytes long. Unicode characters are sometimes called wide characters because they are wider (they use more storage) than single-byte characters. Note that Unicode is not considered an MBCS: the distinguishing feature of an MBCS encoding is that its characters have different lengths. A Unicode string is terminated by two zero bytes (the value 0 stored in a wide character).

Single-byte characters cover the Latin alphabet, accented characters, and the graphics characters defined by the ASCII standard and the DOS operating system. Double-byte characters are used for East Asian and Middle Eastern languages. Unicode is used in COM and internally in Windows NT.

You are certainly already familiar with single-byte characters. When you use char, you are dealing with single-byte characters. Double-byte characters are also manipulated using the char type (the first of many oddities we will encounter with double-byte characters). Unicode characters are represented by the wchar_t type. Unicode character and string literals are written with the prefix L. For example:

    wchar_t wch = L'1';             // 2 bytes, 0x0031
    const wchar_t* wsz = L"Hello";  // 12 bytes, 6 wide characters
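The byte counts in those comments assume Windows, where wchar_t is two bytes wide. As a rough, portable sketch, the same arithmetic can be checked with char16_t, which is a 16-bit unit on mainstream platforms. The use of char16_t and the names below are assumptions made for illustration; they are not part of the original article.

```cpp
#include <cassert>
#include <cstddef>

// char16_t stands in for the 2-byte wchar_t that Windows uses (an assumption
// of this sketch; wchar_t itself is 4 bytes on many other platforms).
constexpr char16_t kWideOne = u'1';          // one 16-bit unit, value 0x0031
constexpr char16_t kWideHello[] = u"Hello";  // 5 characters plus the terminator

// Number of 16-bit units in the array, terminator included.
constexpr std::size_t wide_hello_units() {
    return sizeof(kWideHello) / sizeof(kWideHello[0]);
}

// Total storage in bytes, terminator included.
constexpr std::size_t wide_hello_bytes() {
    return sizeof(kWideHello);
}
```

Here wide_hello_units() gives the "6 wide characters" and wide_hello_bytes() the "12 bytes" quoted in the comment.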

How characters are stored in memory

Single-byte strings are stored one character after the next, with a single zero byte marking the end of the string. For example, "Bob" is stored as:

    42   6F   62   00
    B    o    b    EOS

The Unicode version, L"Bob", is stored as:

    42 00   6F 00   62 00   00 00
    B       o       b       EOS

Here two zero bytes (the value 0 as a wide character) mark the end.

At first glance, DBCS strings look like SBCS strings, but we will see later that they have subtleties that matter when you use string functions or traverse a string with a pointer. The string "日本語" ("nihongo") is stored as follows (LB and TB indicate lead bytes and trail bytes):

    93 FA   96 7B   8C EA   00
    LB TB   LB TB   LB TB   EOS
    日      本      語      EOS

Keep in mind that the value for "ni" is not interpreted as the WORD value 0xFA93. The two values 93 and FA, in that order, together encode the character "ni".

Using string handling functions

We have all seen the C string functions such as strcpy(), sprintf(), and atol(). These functions must be used only with single-byte strings. The C runtime library also provides versions for Unicode strings, such as wcscpy(), swprintf(), and _wtol().

Microsoft also added versions to its CRT (C runtime library) that operate on DBCS strings. The str***() functions have corresponding DBCS versions named _mbs***(). If you ever expect to encounter DBCS strings (and you will if your software is installed on Chinese, Japanese, or other systems that use DBCS), you should always use the _mbs***() functions, because they also accept SBCS strings. (A DBCS string may contain only single-byte characters, which is why the _mbs***() functions work with SBCS strings too.)

Let us look at a typical string to see why the different versions of the string functions are needed. Going back to our Unicode string L"Bob":

    42 00   6F 00   62 00   00 00
    B       o       b       EOS

Because x86 CPUs are little-endian, the value 0x0042 is stored in memory as 42 00. Can you see the problem if this string is passed to strlen()? It sees the first byte 42, then 00, which to it means "end of string," so it returns 1. The converse, passing "Bob" to wcslen(), is even worse. wcslen() first sees 0x6F42, then 0x0062, and then keeps reading past the end of your buffer until it happens to find a 00 00 terminator or causes a GPF.
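The strlen() half of that argument can be reproduced on any machine by spelling out the bytes by hand. The sketch below is only an illustration under that assumption; the array and function names are made up, and the second function is a hand-written stand-in for wcslen() over 2-byte units rather than the real library call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// L"Bob" as laid out in memory on Windows (2-byte units, little-endian).
const unsigned char kWideBobBytes[] = {
    0x42, 0x00,   // B
    0x6F, 0x00,   // o
    0x62, 0x00,   // b
    0x00, 0x00    // EOS
};

// What a byte-oriented function such as strlen() makes of that buffer:
// it stops at the first zero byte, which is the high byte of 'B'.
std::size_t byte_length_of_wide_bob() {
    return std::strlen(reinterpret_cast<const char*>(kWideBobBytes));
}

// What a routine that walks 16-bit units sees (a stand-in for wcslen()).
std::size_t unit_length_of_wide_bob() {
    std::size_t n = 0;
    // Each unit is two bytes; stop only at the 00 00 terminator.
    while (kWideBobBytes[2 * n] != 0 || kWideBobBytes[2 * n + 1] != 0) {
        ++n;
    }
    return n;
}
```

The first function reports a length of 1 and the second a length of 3 for the same eight bytes.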

So far, we have covered the use of str***() and wcs***() and the differences between them. But what about the difference between str***() and _mbs***()? Understanding it is essential for traversing DBCS strings correctly. I will cover traversal next, then return to the subject of str***() versus _mbs***().

Traversing and indexing strings correctly

Since most of us grew up with SBCS strings, we are used to applying the ++ and -- operators to a pointer to traverse a string. We also use array notation to access the characters in a string. Both methods work perfectly well with SBCS and Unicode strings, because all characters have the same width and the compiler can correctly return the character we ask for.

However, you must break those habits once you encounter DBCS strings. There are two rules for traversing a DBCS string with a pointer. Break them and your program will have DBCS-related bugs.

1. Do not traverse forwards with ++ unless you check for lead bytes along the way.

2. Never traverse backwards with --.
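Rule 1 can be illustrated with a small hand-rolled walker. The sketch below assumes Shift-JIS and uses the lead-byte ranges given earlier (0x81-0x9F and 0xE0-0xFC); sjis_next() is a hypothetical stand-in for what CharNext() does on Windows, not the real API, and all names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>

// Shift-JIS lead-byte ranges as given earlier: 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical stand-in for CharNext(): step over one whole character.
const unsigned char* sjis_next(const unsigned char* p) {
    assert(p != nullptr);
    if (*p == 0) return p;                                  // never step past the terminator
    if (is_sjis_lead_byte(*p) && p[1] != 0) return p + 2;   // lead byte + trail byte
    return p + 1;                                           // single-byte character
}

// Count characters (not bytes) by always advancing a whole character.
std::size_t sjis_char_count(const unsigned char* s) {
    std::size_t n = 0;
    for (const unsigned char* p = s; *p != 0; p = sjis_next(p)) {
        ++n;
    }
    return n;
}

// "日本語" in Shift-JIS: three characters stored in six bytes.
const unsigned char kNihongo[] = {0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA, 0x00};
```

A plain ++ loop over kNihongo would visit six positions, three of them in the middle of a character; sjis_char_count(kNihongo) visits only the three character starts.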

Let us look at rule 2 first, since it is easy to find real-life code that breaks it. Suppose your program stores a config file in its own directory, and you keep the install directory in the registry. At run time, you read the install directory from the registry, append the config file name, and then read the file. If the install directory is C:\Program Files\MyCoolApp, the file name you build is C:\Program Files\MyCoolApp\config.bin, and it works perfectly when you test it.

Now, imagine that this is the code that builds the file name:

    #include <windows.h>   // MAX_PATH
    #include <string.h>

    bool GetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it isn't already there.
        // First, get a pointer to the terminating zero.
        char* pLastChar = strchr ( szConfigFilename, '\0' );

        // Now move it back one character.
        pLastChar--;

        if ( *pLastChar != '\\' )
            strcat ( szConfigFilename, "\\" );

        // Add on the name of the config file.
        strcat ( szConfigFilename, "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;

        strcpy ( pszName, szConfigFilename );
        return true;
    }

This is very defensive code, yet it breaks when it meets certain DBCS characters. To see why, suppose a Japanese user installs your program in C:\ヨウコソ. Here is that directory name as stored in memory:

    43   3A   5C   83 88   83 45   83 52   83 5C   00
                   LB TB   LB TB   LB TB   LB TB
    C    :    \    ヨ      ウ      コ      ソ      EOS

When GetConfigFileName() checks for the trailing backslash, it looks at the last non-zero byte of the install directory, sees that it equals '\\', and so does not append another backslash. As a result, the code returns the wrong file name.

So what went wrong? Look at the last two bytes before the terminator, 83 5C. The value of the backslash character is 0x5C. The value of 'ソ' is 83 5C. The code above mistakenly read a trail byte and treated it as a character in its own right.
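The misfire is easy to reproduce without Windows. The following sketch assumes the Shift-JIS bytes 43 3A 5C 83 88 83 45 83 52 83 5C for the install directory and applies the same last-byte test; the constant and function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The install directory as Shift-JIS bytes: "C:\" followed by four
// double-byte characters, the last of which is 83 5C.
const char kInstallDir[] = "\x43\x3A\x5C\x83\x88\x83\x45\x83\x52\x83\x5C";

// The byte-oriented test used by GetConfigFileName(): look at the last byte only.
bool ends_with_backslash_naive(const char* path) {
    std::size_t len = std::strlen(path);
    return len > 0 && path[len - 1] == '\\';
}
```

ends_with_backslash_naive(kInstallDir) returns true even though the directory name does not end in a backslash, so no separator would be added before config.bin.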

The correct way to traverse backwards is to use a function that understands DBCS characters and moves the pointer by the right number of bytes. Here is the corrected code; the pointer movement is the call to CharPrev():

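    #include <windows.h>    // MAX_PATH, CharPrev
    #include <mbstring.h>   // _mbs***()
    #include <string.h>

    bool FixedGetConfigFileName ( char* pszName, size_t nBuffSize )
    {
        char szConfigFilename[MAX_PATH];
        // The _mbs***() functions are declared with unsigned char*.
        unsigned char* pszFilename = (unsigned char*) szConfigFilename;

        // Read install dir from registry... we'll assume it succeeds.

        // Add on a backslash if it isn't already there.
        // First, get a pointer to the terminating zero.
        char* pLastChar = (char*) _mbschr ( pszFilename, '\0' );

        // Now move it back one character, which may be a double-byte character.
        pLastChar = CharPrev ( szConfigFilename, pLastChar );

        if ( *pLastChar != '\\' )
            _mbscat ( pszFilename, (const unsigned char*) "\\" );

        // Add on the name of the config file.
        _mbscat ( pszFilename, (const unsigned char*) "config.bin" );

        // If the caller's buffer is big enough, return the filename.
        // Buffer sizes are in bytes, so compare against the byte count.
        if ( strlen ( szConfigFilename ) >= nBuffSize )
            return false;

        _mbscpy ( (unsigned char*) pszName, pszFilename );
        return true;
    }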

The fixed function uses the CharPrev() API to move pLastChar back by one character, which may be two bytes long if the string ends in a double-byte character. In this version the if condition works correctly, because a lead byte can never equal 0x5C.
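To see why stepping back needs more than a decrement, here is a minimal portable sketch of the idea. It assumes Shift-JIS, and sjis_prev() is a hypothetical stand-in for CharPrev(), not the Win32 implementation: it finds the previous character by walking forward from the start of the string, since a byte looked at in isolation cannot reveal whether it is a trail byte.

```cpp
#include <cassert>
#include <cstring>

// Shift-JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical stand-in for CharPrev(): returns the start of the character
// that precedes `cur`, or `start` itself when cur == start.
const char* sjis_prev(const char* start, const char* cur) {
    assert(start != nullptr && cur != nullptr && start <= cur);
    const char* prev = start;
    const char* p = start;
    while (p < cur && *p != '\0') {
        prev = p;
        bool two_bytes = is_sjis_lead_byte(static_cast<unsigned char>(*p)) && p[1] != '\0';
        p += two_bytes ? 2 : 1;   // advance by one whole character
    }
    return prev;
}

// DBCS-aware version of the trailing-backslash test.
bool ends_with_backslash_dbcs(const char* path) {
    const char* end = path + std::strlen(path);
    if (end == path) return false;
    return *sjis_prev(path, end) == '\\';
}

// "C:\" followed by four double-byte characters, the last being 83 5C.
const char kJapaneseDir[] = "\x43\x3A\x5C\x83\x88\x83\x45\x83\x52\x83\x5C";
```

For kJapaneseDir the previous character starts at the lead byte 0x83, so the comparison with '\\' fails as it should, while a path that really ends in a backslash still passes.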

You can probably imagine a way to break rule 1. For example, you might validate a file name entered by the user by checking whether the character ':' appears more than once. If you traverse the string with ++ instead of CharNext(), you may report an error incorrectly when a trail byte happens to have the same value as ':'.

Related to rule 2 is this rule about string indexes:

2a. Never calculate an index into a string using subtraction.

Code that breaks this rule looks very similar to code that breaks rule 2. For example:

    char* pLastChar = &szConfigFilename [strlen(szConfigFilename) - 1];

This has exactly the same effect as moving a pointer backwards by one byte.

Back to str***() versus _mbs***()

By now it should be clear why the _mbs***() functions are necessary. The str***() functions know nothing about DBCS characters, while the _mbs***() functions do. If you call strrchr("C:\\ヨウコソ", '\\'), the return value is wrong, whereas _mbsrchr() recognizes the double-byte character at the end and returns a pointer to the last actual backslash.

A final point about string lengths: the str***() functions measure length in chars, that is, in bytes, so strlen() returns 6 for a string that contains three double-byte characters, whereas _mbslen() counts characters and returns 3 for the same string. The Unicode functions return lengths in wchar_ts. For example, wcslen(L"Bob") returns 3.

MBCS and Unicode in the Win32 API

The two sets of APIs

Although you may never have noticed, every Win32 API and message that deals with strings has two versions. One version accepts MBCS strings and the other accepts Unicode strings. For example, there is no API called SetWindowText() at all; instead, there are SetWindowTextA() and SetWindowTextW(). The suffix A indicates the MBCS function, and the suffix W indicates the Unicode version.

When you build a Windows program, you can choose to use either the MBCS or the Unicode APIs. If you have used the VC wizards and never changed the preprocessor settings, you have been using the MBCS versions. So, since there is no SetWindowText() API, how can we call it? The winuser.h header file contains some #defines, like this:

    BOOL WINAPI SetWindowTextA ( HWND hWnd, LPCSTR lpString );
    BOOL WINAPI SetWindowTextW ( HWND hWnd, LPCWSTR lpString );

    #ifdef UNICODE
    #define SetWindowText  SetWindowTextW
    #else
    #define SetWindowText  SetWindowTextA
    #endif

When you build with the MBCS APIs, UNICODE is not defined, so the preprocessor sees:

    #define SetWindowText  SetWindowTextA

This macro turns every call to SetWindowText into a call to the real API function, SetWindowTextA. (Of course, you can call SetWindowTextA() or SetWindowTextW() directly, although you will rarely need to.)
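The same macro trick can be tried outside the Windows headers. In the sketch below, DescribeTextA, DescribeTextW and the DEMO_UNICODE switch are all made-up names that only mimic the pattern; they are not Win32 APIs or real preprocessor symbols.

```cpp
#include <cassert>
#include <string>

// Hypothetical pair of functions mimicking the A/W naming convention.
std::string DescribeTextA(const char*)    { return "A"; }
std::string DescribeTextW(const wchar_t*) { return "W"; }

// Same pattern as in winuser.h; DEMO_UNICODE is an invented switch.
#ifdef DEMO_UNICODE
#define DescribeText DescribeTextW
#else
#define DescribeText DescribeTextA
#endif

// DEMO_UNICODE is not defined here, so this compiles as DescribeTextA("Bob").
std::string which_version() {
    return DescribeText("Bob");
}
```

Defining DEMO_UNICODE before the #ifdef would redirect the call to DescribeTextW, and the narrow literal "Bob" would then fail to compile, which is the kind of mismatch the next paragraphs discuss.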

So, if you want to switch to the Unicode APIs by default, go to the preprocessor settings, remove _MBCS from the list of predefined symbols, and add UNICODE and _UNICODE. (You need both, because different headers use different symbols.) However, you will run into a snag if you have been using plain char for your strings. Consider the following code:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "We love Bob!";

    SetWindowText ( hwnd, szNewText );

After the preprocessor replaces SetWindowText with SetWindowTextW, the code becomes:

    HWND hwnd = GetSomeWindowHandle();
    char szNewText[] = "We love Bob!";

    SetWindowTextW ( hwnd, szNewText );

Do you see the problem? We are passing a single-byte string to a function that takes a Unicode string. The first solution to this problem is to put #ifdefs around the definition of the string variable:

    HWND hwnd = GetSomeWindowHandle();
    #ifdef UNICODE
    wchar_t szNewText[] = L"We love Bob!";
    #else
    char szNewText[] = "We love Bob!";
    #endif

    SetWindowText ( hwnd, szNewText );

You can probably imagine the headache of doing that around every string in your code. The proper solution is to use TCHAR.

TCHAR

TCHAR is a character type that lets you use the same code for both MBCS and Unicode builds, without scattering messy #defines all over your code. TCHAR is defined as follows:

    #ifdef UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

So TCHAR is char in MBCS builds and wchar_t in Unicode builds. There is also a macro, _T(), that takes care of the L prefix needed for Unicode string literals:

    #ifdef UNICODE
    #define _T(x) L ## x
    #else
    #define _T(x) x
    #endif

## is the preprocessor operator that pastes two tokens together. Whenever you have a string literal in your code, wrap it in the _T macro. In a Unicode build, the macro adds the L prefix to the literal.

    TCHAR szNewText[] = _T("We love Bob!");
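The typedef and the literal macro can be sketched together in portable form. The names DEMO_UNICODE, demo_tchar, DEMO_T and demo_tcslen below are invented stand-ins for UNICODE, TCHAR, _T and _tcslen, used so that the example compiles without tchar.h.

```cpp
#include <cassert>
#include <cstddef>

// Invented names mirroring UNICODE, TCHAR and _T from the text above.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L ## x
#else
typedef char demo_tchar;
#define DEMO_T(x) x
#endif

// Works unchanged in either build because it is written in terms of demo_tchar.
std::size_t demo_tcslen(const demo_tchar* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// The literal picks up the L prefix only when DEMO_UNICODE is defined.
const demo_tchar kNewText[] = DEMO_T("We love Bob!");
```

Compiling the same source with and without DEMO_UNICODE changes the element type of kNewText, but demo_tcslen(kNewText) reports 12 characters either way.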

Just as SetWindowText hides the A/W detail, there are macros that you can use in place of the str***() and _mbs***() functions. For example, you can use the _tcsrchr macro in place of strrchr(), _mbsrchr(), and wcsrchr(). _tcsrchr expands to the right function according to your predefined symbols, just as SetWindowText does.

It is not only the str***() functions that have TCHAR macros. There are also, for example, _stprintf (which replaces sprintf() and swprintf()) and _tfopen (which replaces fopen() and _wfopen()). The complete list of macros is in MSDN under "Generic-Text Routine Mappings."

String and TCHAR typedefs

Because the Win32 API documentation lists functions by their common names (for example, "SetWindowText"), all strings are given in terms of TCHAR. (The exception is the Unicode-only APIs introduced in XP.) Here are the commonly used typedefs that you will see in MSDN:

    Type      Meaning in MBCS builds                                      Meaning in Unicode builds
    WCHAR     wchar_t                                                     wchar_t
    LPSTR     zero-terminated string of char (char*)                      zero-terminated string of char (char*)
    LPCSTR    constant zero-terminated string of char (const char*)       constant zero-terminated string of char (const char*)
    LPWSTR    zero-terminated Unicode string (wchar_t*)                   zero-terminated Unicode string (wchar_t*)
    LPCWSTR   constant zero-terminated Unicode string (const wchar_t*)    constant zero-terminated Unicode string (const wchar_t*)
    TCHAR     char                                                        wchar_t
    LPTSTR    zero-terminated string of TCHAR (TCHAR*)                    zero-terminated string of TCHAR (TCHAR*)
    LPCTSTR   constant zero-terminated string of TCHAR (const TCHAR*)     constant zero-terminated string of TCHAR (const TCHAR*)

When to use TCHAR and Unicode

By now you may be asking, "Why would I use Unicode? I have gotten by with plain char all this time." A Unicode build is beneficial in the following three cases:

1. Your program will run only on Windows NT.

2. Your program needs to handle file names longer than MAX_PATH characters.

3. Your program needs to use Unicode-only APIs introduced in XP.

Most Unicode APIs are not implemented on Windows 9x, so if your program must run on Windows 9x, you have to use the MBCS APIs. However, since NT uses Unicode internally, using the Unicode APIs will make your program faster. Each time you pass a string to an MBCS API, the operating system converts the string to Unicode and then calls the corresponding Unicode API. If a string is returned, the operating system converts it back. Although this conversion is highly optimized, the speed penalty cannot be avoided entirely.
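The convert-then-forward pattern described above can be sketched as follows. Everything here is hypothetical: SetTitleA and SetTitleW are invented names, and widen_ascii() handles only 7-bit ASCII, whereas the operating system converts according to the current code page (application code on Windows would typically use MultiByteToWideChar for that).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical "native" wide implementation; it just reports the text length.
std::size_t SetTitleW(const std::wstring& text) {
    return text.size();
}

// Stand-in for the code-page conversion; handles 7-bit ASCII only.
std::wstring widen_ascii(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        assert(c < 0x80);   // this sketch does not handle DBCS input
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}

// The narrow entry point: convert first, then forward to the wide version.
std::size_t SetTitleA(const std::string& text) {
    return SetTitleW(widen_ascii(text));
}
```

Every call through SetTitleA pays for an extra allocation and copy before the wide function runs, which is the overhead a Unicode build avoids by calling the wide version directly.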

As long as you use the Unicode APIs, NT allows very long file names (beyond the MAX_PATH limit of 260 characters). Another advantage of the Unicode APIs is that your program automatically handles whatever language the user types in. A user can enter English, Chinese, or Japanese, and you do not need to write any extra code to handle it. Finally, with the fading of the Windows 9x product line, Microsoft seems to be abandoning the MBCS APIs. For example, the SetWindowTheme() API, which takes two string parameters, has only a Unicode version. Using a Unicode build simplifies your string handling, since you do not have to convert back and forth between MBCS and Unicode.

Even if you do not use a Unicode build now, you should still use TCHAR and its related macros. Not only does this help your code handle DBCS correctly, but if you later want a Unicode build, you only need to change the preprocessor settings.

About the Author

Michael Dunn lives in sunny Los Angeles, and likes the weather there so much that he plans to stay for life. He started programming in 4th grade on an Apple //e. He received a bachelor's degree in mathematics from UCLA in 1995, then worked as a QA engineer at Symantec on Norton AntiVirus. He taught himself Windows and MFC programming, and in 1999-2000 he designed and implemented a new interface for Norton AntiVirus.

Michael now works as a developer at Napster (a company providing an online subscription music service). He also developed UltraBar, an IE toolbar plugin that makes web searching easier and rivals the Google toolbar; he developed the CodeProject SearchBar as well, and founded Zabersoft, which has offices in Los Angeles and Odense, Denmark.

His hobbies include pinball, bike riding, and the occasional PS, Dreamcast, and MAME game. He is sad that he has forgotten the languages he once studied: French, Chinese, and Japanese.


When reposting, please cite the original address: https://www.9cbs.com/read-12686.html
