About Unicode

1. Introduction

1.1 Single-byte and double-byte character sets

1.2 Unicode: the wide-character set

1.3 Use in programming

1.3.1 Compiling for ANSI or Unicode

1.3.2 strcpy, _tcscpy, wcscpy

1.3.3 Data type conversion macros

1.3.4 Data types

2. Reference

About Unicode and related topics

Version: 1.0

Author: Soundboy

Date: 2004-7-21

Remarks: This article covers some basics of character set encoding.

1. Introduction

Software localization has to deal with different character sets. For years, a string has been encoded as a series of single-byte characters followed by a terminating '\0'. However, some writing systems (such as Chinese) have far too many characters, and the 256 values a single byte provides cannot meet the need.

1.1 Single-byte and double-byte character sets

Take Japanese kanji as an example: if the first byte of a character is between 0x81 and 0x9F, or between 0xE0 and 0xFC, the next byte must also be examined to make up the complete character. MS VC++ provides many functions, such as _mbslen(), for working with multi-byte character set (DBCS) strings.
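
As a rough illustration of the lead-byte rule above, the following C++ sketch counts characters rather than bytes in a Shift-JIS string. The names is_sjis_lead_byte, sjis_char_count, and sjis_demo are made up for this example and are not part of MS VC++. It is a portable approximation of what a function such as _mbslen() does, not its actual implementation, and it does not validate trail bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if b starts a two-byte character,
// using the lead-byte ranges quoted in the text.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: counts characters (not bytes) in a
// NUL-terminated DBCS string. Trail bytes are not validated.
std::size_t sjis_char_count(const char* s) {
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte consumes the following byte too, unless the
        // string is truncated right after it.
        s += (is_sjis_lead_byte(b) && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}

// Usage: two double-byte characters (0x93FA, 0x967B) followed by 'A'.
void sjis_demo() {
    const char text[] = "\x93\xFA\x96\x7B" "A";
    assert(std::strlen(text) == 5);      // strlen() sees five bytes
    assert(sjis_char_count(text) == 3);  // but only three characters
}
```

The point of the sketch is that byte-oriented functions such as strlen() report the wrong length for DBCS text, which is why a separate family of multi-byte functions exists.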

1.2 Unicode: the wide-character set

The Unicode standard is maintained by a dedicated organization, the Unicode Consortium. As used on Windows, each Unicode character is a 16-bit value, which gives about 65,000 characters. Each script is assigned its own range of values within Unicode, called code points. For example, the phonetic symbols (the IPA Extensions block) occupy code points 0250 to 02AF.
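
To make the idea of a code point range concrete, here is a minimal C++ sketch that tests whether a 16-bit value falls in the range 0250 to 02AF quoted above (the Unicode standard names this range IPA Extensions). The alias CodeUnit and the function names are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical alias: one 16-bit Unicode value, as described in the text.
using CodeUnit = std::uint16_t;

// True if c lies in the inclusive range [first, last].
constexpr bool in_range(CodeUnit c, CodeUnit first, CodeUnit last) {
    return c >= first && c <= last;
}

// True if c lies in the range 0250..02AF quoted in the text.
constexpr bool in_phonetic_block(CodeUnit c) {
    return in_range(c, 0x0250, 0x02AF);
}

// Usage: 0x0259 is inside the range, 0x4E2D (a CJK ideograph) is not.
void range_demo() {
    assert(in_phonetic_block(0x0259));
    assert(!in_phonetic_block(0x4E2D));
}
```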

1.3 Use in programming

To build for Unicode, just define two macros (UNICODE and _UNICODE). The standard C header file string.h also gained a new data type, wchar_t:

typedef unsigned short wchar_t;

The original C string functions operate only on ANSI strings and cannot be used on Unicode strings. Each of them has a Unicode counterpart whose name replaces the str prefix with wcs. For example:

strcmp() -> wcscmp()
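
The str/wcs pairing can be sketched in portable C++ as follows. The wrapper names equal_ansi, equal_wide, and compare_demo are invented for this example; strcmp() and wcscmp() themselves are standard library functions.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Compares two narrow (char) strings with the str family.
bool equal_ansi(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Compares two wide (wchar_t) strings with the wcs family.
bool equal_wide(const wchar_t* a, const wchar_t* b) {
    return std::wcscmp(a, b) == 0;
}

// Usage: the same comparison written once per character type.
void compare_demo() {
    assert(equal_ansi("error", "error"));
    assert(equal_wide(L"error", L"error"));
    assert(!equal_wide(L"error", L"errors"));
}
```

Having to write every call twice like this is what the tchar.h macros are meant to avoid.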

1.3.1 Compiling for ANSI or Unicode

You can also write source files that build for either ANSI or Unicode. To do so, include "tchar.h" instead of "string.h". The tchar.h file contains a set of macros: if _UNICODE is defined, they resolve to the wcs family of functions; otherwise they resolve to the str family.

1.3.2 strcpy, _tcscpy, wcscpy

tchar.h provides a _tcscpy macro. If _UNICODE is not defined at compile time, it expands to strcpy(); if _UNICODE is defined, it expands to the Unicode function wcscpy(). In this way the same program can be compiled for either ANSI or Unicode.
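
The mechanism can be imitated in portable C++. In the sketch below, DEMO_UNICODE, demo_tchar, demo_tcscpy, and demo_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _tcscpy, and _tcslen. This is an assumption-laden imitation for illustration, not the contents of the real tchar.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <vector>

// Hypothetical stand-ins for the tchar.h mappings. Compile with
// -DDEMO_UNICODE to select the wide-character versions.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define demo_tcscpy std::wcscpy
#define demo_tcslen std::wcslen
#else
typedef char demo_tchar;
#define demo_tcscpy std::strcpy
#define demo_tcslen std::strlen
#endif

// A two-character sample string that is valid for both char and wchar_t.
const demo_tchar kSample[] = {'o', 'k', '\0'};

// Copies src into a buffer of the right size and returns the copied
// length in characters. The source text is identical in both builds.
std::size_t copied_length(const demo_tchar* src) {
    std::vector<demo_tchar> buffer(demo_tcslen(src) + 1);  // +1 for the terminator
    demo_tcscpy(buffer.data(), src);
    return demo_tcslen(buffer.data());
}

// Usage: the same call works whether or not DEMO_UNICODE is defined.
void tcscpy_demo() {
    assert(copied_length(kSample) == 2);
}
```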

1.3.3 Data type conversion macros

For the TCHAR type, if _UNICODE is defined, then

typedef wchar_t TCHAR;

If it is not defined, then

typedef char TCHAR;

So in an ANSI build you can write TCHAR *szError = "error";

But if _UNICODE is defined, this line produces an error, and you must write szError = L"error";

The L prefix tells the compiler to store the literal as a wide string, so in the data segment each of these characters is followed by a zero byte. But the L form does not compile for ANSI. The _TEXT() macro solves this. For example:

TCHAR *szError = _TEXT("error"); compiles correctly for both Unicode and ANSI.
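
A simplified imitation of this macro can be written in portable C++. DEMO_UNICODE, demo_tchar, and DEMO_TEXT below are hypothetical names standing in for _UNICODE, TCHAR, and _TEXT; the sketch only shows the token-pasting idea and makes no claim to match the real header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for TCHAR and _TEXT(). Compile with
// -DDEMO_UNICODE to get wide literals.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_TEXT(s) L##s  // pastes the L prefix onto the literal
#else
typedef char demo_tchar;
#define DEMO_TEXT(s) s     // leaves the literal unchanged
#endif

// This declaration compiles in both builds.
const demo_tchar* const kErrorText = DEMO_TEXT("error");

// Storage used by the literal: five characters plus the terminator,
// each taking sizeof(demo_tchar) bytes.
constexpr std::size_t kErrorBytes = sizeof(DEMO_TEXT("error"));

// Usage: the byte count scales with the character type.
void text_demo() {
    assert(kErrorBytes == 6 * sizeof(demo_tchar));
    assert(kErrorText[0] == 'e');
}
```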

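1.3.4 Data types

WCHAR: a Unicode character

PWSTR: a pointer to a Unicode string

PCWSTR: a pointer to a constant Unicode string

PTSTR and PCTSTR point to either a Unicode string or an ANSI string, depending on whether the UNICODE macro is defined (the Windows headers test UNICODE, while the C runtime header tchar.h tests _UNICODE, which is why both macros are defined together).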

2. Reference

Jeffrey Richter, "Windows Core Programming"

The following is an article by Li Zhiyong, reproduced here.

On character sets

03-8-8 18:02 by leezy_2000

Perhaps because their authors are American, I find that several well-known Windows books (such as "Windows Programming" and Jeffrey Richter's "Windows Core Programming") do not treat character sets very soundly. Here I try to clear up some commonly confused points and to point out a few mistakes that are easy to make (I have made them myself).

First, a few concepts:

Character set: according to how characters are encoded, character sets fall into three categories.

- Single-byte character set (SBCS): every character is encoded in one byte, such as ANSI.

- Multi-byte character set (MBCS): a character is encoded in either one byte or several bytes, such as DBCS, GB2312, and the like.

- Wide-character set: every character is encoded in two bytes, such as Unicode.

Code page: Unicode and DBCS contain a very large number of codes, so the codes need to be organized to be convenient to use. This is done by placing the codes of different countries on different code pages.

Relationship between character set and code page: as the above shows, for Unicode and DBCS a code page is a subset of the character set. An SBCS character set (such as ANSI), or an MBCS character set other than DBCS (such as GB2312), corresponds to exactly one code page.

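Now look at the following program:

void ConvertAndOutputString(HDC hdc, LPWSTR wstr, int length, int x, int y)
{
    int nRet;
    // Worst case: every wide character becomes two bytes.
    int sizeBuffer = 2 * length;
    char *lpBuffer = new char[sizeBuffer];
    // CP_ACP hard-codes the system default ANSI code page.
    nRet = WideCharToMultiByte(CP_ACP, 0, wstr, length,
                               lpBuffer, sizeBuffer, NULL, NULL);
    // lpBuffer is a char string, so the ANSI version of TextOut is called.
    TextOutA(hdc, x, y, lpBuffer, nRet);
    delete[] lpBuffer;
}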

This program is very simple: it converts a wide string to a DBCS string and outputs it at the specified coordinates. Jeffrey Richter uses almost the same method on page 26 of "Windows Core Programming". But this program actually has a problem. The code page should not be hard-coded when converting the string; it should be obtained dynamically from the current font. Otherwise, in some cases the Unicode characters in wstr are not converted to the correct codes. If you use the code above to output Chinese, you will have the "good fortune" of seeing many question marks added to your string.

The solution is also very simple, but first you need to be familiar with the following two API functions:

int GetTextCharset(HDC hdc); // returns the character set of the currently selected font

BOOL TranslateCharsetInfo(
    DWORD *pSrc,          // source information
    LPCHARSETINFO lpCs,   // character set information (output)
    DWORD dwFlags         // translation option
);

This function converts between character sets, code pages, and font signatures. The result of the conversion is placed in lpCs. dwFlags indicates which kind of conversion to perform, for example from a character set to a code page.

Pay special attention to the pSrc parameter. When converting from a character set to a code page, the character set value itself is passed, cast to a pointer type; it is not a pointer to the value. So for the string output function above, adding just a few lines ensures that the string is not garbled during the conversion.

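void ConvertAndOutputString(HDC hdc, LPWSTR wstr, int length, int x, int y)
{
    int nRet;
    // Worst case: every wide character becomes two bytes.
    int sizeBuffer = 2 * length;
    char *lpBuffer = new char[sizeBuffer];
    // Get the character set of the font currently selected into hdc.
    int charset = GetTextCharset(hdc);
    // Zero-initialized, so ciACP stays 0 (CP_ACP) if the lookup fails.
    CHARSETINFO csInfo = {0};
    // With TCI_SRCCHARSET the charset value itself is passed, cast to a pointer.
    TranslateCharsetInfo((DWORD *)(DWORD_PTR)charset, &csInfo, TCI_SRCCHARSET);
    nRet = WideCharToMultiByte(csInfo.ciACP, 0, wstr, length,
                               lpBuffer, sizeBuffer, NULL, NULL);
    // lpBuffer is a char string, so the ANSI version of TextOut is called.
    TextOutA(hdc, x, y, lpBuffer, nRet);
    delete[] lpBuffer;
}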

In short, the point of this article is: determine the code page dynamically when converting between character sets.

When reposting, please credit the original address: https://www.9cbs.com/read-14155.html
