Unicode problems caused by operating system differences in cross-platform C++ development, and how to solve them

xiaoxiao, 2021-03-06

1. The problem

Character encodings have always been one of the biggest headaches for software developers, and Unicode brought hope of a single unified encoding. Even Unicode is not perfect, however. It only assigns code points to the characters of the various languages; when it comes to what operating systems actually support, it splits into several encoding forms: UTF-8, UTF-16 and UTF-32. For example, the Unicode encoding that Windows supports natively is UTF-16, in which each character is represented by 2 bytes (apart from a mechanism called surrogates, discussed later). Linux uses UTF-32 for wide characters, with each character represented by 4 bytes.
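To make the "2 bytes per character, except for surrogates" rule concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The function name encode_utf16 is hypothetical and is not taken from any library mentioned in this article; it uses char16_t and char32_t so that the result does not depend on the platform's wchar_t size.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Code points below U+10000 take one 16-bit unit (2 bytes);
// code points from U+10000 to U+10FFFF take a surrogate pair (4 bytes).
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a valid code point that is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10 + 10
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For ordinary text such as 'H' (U+0048) or a common Chinese character such as U+4E2D, the result has one unit; only characters outside the Basic Multilingual Plane, such as U+1F600, need two.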

2. An example

Suppose you have developed a program on Windows that saves some text data to a file in Unicode (UTF-16). Now, for business reasons, the program has to run on Linux. What happens? Typically, your code stores strings in wchar_t arrays from the C/C++ standard library or in std::wstring, manipulates them with wcscpy, wcslen or the std::wstring class, and reads and writes them with fwprintf or std::wfstream. This code looks unproblematic: it uses only standard C/C++ functions, so nothing in the program should need to change. Indeed, the code is lexically and syntactically fine, and it compiles. But what happens when the program runs?

As discussed above, wchar_t is 2 bytes per character on Windows and 4 bytes per character on Linux. If one user saves the string "Hello" to a file on Windows and another user opens that file on Linux, what will the second user get?

When "Hello" is saved on Windows as raw wchar_t data, the bytes actually written, in hexadecimal, are:

H    e    l    l    o
4800 6500 6C00 6C00 6F00

That is, the byte sequence 480065006C006C006F00 is written to the file. On Linux, the program reads this sequence as UTF-32 and gets two characters, 0x00650048 and 0x006C006C (the last 2 bytes, 6F00, are discarded). The string length is 2, which is obviously not the "Hello" we hoped for.
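The misreading described above can be reproduced on any platform without touching wchar_t at all. The following is a sketch under stated assumptions: the helper names to_utf16le_bytes and read_as_4byte_units are hypothetical, the file is modeled as a byte vector, and both machines are assumed to be little-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: the bytes a little-endian Windows program would
// produce when it writes 2-byte wide characters to a file unchanged.
std::vector<std::uint8_t> to_utf16le_bytes(const std::u16string& s) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : s) {
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));  // low byte first
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));    // then high byte
    }
    return bytes;
}

// Hypothetical helper: what a little-endian Linux program sees when it
// reads the same bytes as 4-byte wide characters. A trailing group of
// fewer than 4 bytes is dropped.
std::u32string read_as_4byte_units(const std::vector<std::uint8_t>& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
        std::uint32_t u = static_cast<std::uint32_t>(bytes[i])
                        | static_cast<std::uint32_t>(bytes[i + 1]) << 8
                        | static_cast<std::uint32_t>(bytes[i + 2]) << 16
                        | static_cast<std::uint32_t>(bytes[i + 3]) << 24;
        out.push_back(static_cast<char32_t>(u));
    }
    return out;
}
```

Passing u"Hello" through to_utf16le_bytes gives 10 bytes; feeding those to read_as_4byte_units gives a string of length 2 whose elements are 0x00650048 and 0x006C006C, matching the values in the example.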

3. Solutions

In my experience, there are two main solutions:

1. The incompatibility arises mainly because the platforms use different Unicode encoding forms. The most obvious solution is therefore to pick the encoding of one operating system and, on the other platforms, use your own string storage and manipulation functions for it. UTF-16 is generally recommended; see a comparison of UTF-16 and UTF-32 for the specific reasons. Once the encoding is chosen, you also need to define the corresponding manipulation functions, such as string copy and comparison functions. A lazy approach is to copy the relevant code from the VC CRT: wcscpy, wcslen, and everything else you need. If the program then uses only these self-defined functions wherever strings are processed and stored, the platform differences are avoided. Another important reason to choose this solution is that a large number of open source libraries can be used with it directly, such as ICU (International Components for Unicode) and libiconv.

2. Avoid UTF-16 and UTF-32 and process strings directly in UTF-8, because every platform handles UTF-8 identically. UTF-8, UTF-16 and UTF-32 encode exactly the same characters and differ only in their in-memory representation, and well-established algorithms convert directly between the three. The official conversion code is available at http://www.unicode.org/public/programs/cvtutf/. This approach has its own disadvantages, because UTF-8 is a variable-length encoding: ordinary English text uses 1 byte per character, while Chinese generally uses 3 bytes. Variable-length encoding makes some operations inconvenient. To take a substring, for example, you must traverse from the beginning of the string, counting character by character until the required length is reached. With UTF-16 or UTF-32, a simple index is enough (ignoring the UTF-16 surrogate mechanism, which common text does not use).
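As an illustration of option 1, the following sketch defines a fixed 16-bit string type together with its own copy, length and comparison functions, so that the code no longer depends on the platform's wchar_t. The names uchar16, u16_strlen, u16_strcpy and u16_strcmp are hypothetical; they are not the VC CRT functions, only stand-ins written in the same spirit. The sketch assumes C++11 or later, where char16_t is available.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical alias: a 16-bit code unit on every platform,
// unlike wchar_t, whose size differs between Windows and Linux.
using uchar16 = char16_t;

// Counterpart of wcslen: number of code units before the terminator.
std::size_t u16_strlen(const uchar16* s) {
    assert(s != nullptr);
    const uchar16* p = s;
    while (*p != 0) ++p;
    return static_cast<std::size_t>(p - s);
}

// Counterpart of wcscpy: copies src, including the terminator, into dst.
// The caller must make sure dst is large enough.
uchar16* u16_strcpy(uchar16* dst, const uchar16* src) {
    assert(dst != nullptr && src != nullptr);
    uchar16* d = dst;
    while ((*d++ = *src++) != 0) {}
    return dst;
}

// Counterpart of wcscmp: compares code unit by code unit and
// returns a negative, zero or positive value.
int u16_strcmp(const uchar16* a, const uchar16* b) {
    assert(a != nullptr && b != nullptr);
    while (*a != 0 && *a == *b) { ++a; ++b; }
    return (*a > *b) - (*a < *b);
}
```

A production program would more likely rely on a library such as ICU than on hand-written helpers; the point here is only that every function touching strings is defined in terms of one fixed-width unit. Byte order still has to be fixed separately when such strings are written to a file.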
For a detailed discussion of UTF-8, UTF-16 and UTF-32, see www.unicode.org, or the introductory articles in the encoding category of this blog.
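The substring cost mentioned for option 2 can be sketched as follows. The function names utf8_length, utf8_offset and utf8_substr are hypothetical, and the code assumes its input is already well-formed UTF-8; it performs no validation. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so any other byte starts a new character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte starts a character (it is not a 10xxxxxx continuation byte).
bool utf8_is_lead(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Number of characters (code points) in a well-formed UTF-8 string.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (utf8_is_lead(c)) ++n;
    }
    return n;
}

// Byte offset of the character at position index, or s.size() if the
// string has fewer characters. This is a linear scan from the start,
// which is the inconvenience of a variable-length encoding.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (utf8_is_lead(s[i])) {
            if (seen == index) return i;
            ++seen;
        }
    }
    return s.size();
}

// Substring of count characters starting at character position first.
std::string utf8_substr(const std::string& s, std::size_t first, std::size_t count) {
    std::size_t begin = utf8_offset(s, first);
    std::size_t end = utf8_offset(s, first + count);
    return s.substr(begin, end - begin);
}
```

For the 5-byte string made of 'a', the 3-byte character U+4E2D and 'b', utf8_length reports 3 characters, and extracting the middle character requires scanning past 'a' first. With a fixed-width encoding the same operation is a direct index.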

4. References

www.unicode.org: all the information mentioned in this article can be found there.
http://oss.software.ibm.com/icu/: ICU, IBM's open source Unicode implementation, which can fully replace the encoding and internationalization functionality provided by the operating system.
http://www.gnu.org/software/libiconv/: libiconv, GNU's encoding conversion library. It is integrated into Linux, but the library itself is cross-platform.

