Character, byte, and encoding

xiaoxiao2021-03-19  183

[Original article. If you reproduce it, please retain or cite this source: http://www.regexlab.com/zh/encoding.htm]

Level: beginner

Abstract: This article describes how characters and encodings developed and explains the related concepts. It illustrates how encodings are handled in practice with several examples. It then describes common misunderstandings about characters and encodings, how those misunderstandings produce garbled text, and how to eliminate it. The article covers both the "Chinese problem" and the "garbled text problem".

Introduction

"Characters and encodings" is a frequently discussed topic, yet garbled text still troubles everyone. We have many ways to get rid of garbled text, but we do not necessarily understand the principles behind them. Some garbled text is actually caused by defects in underlying code. So it is not only application developers who are unclear about character encodings; some developers of underlying code also lack an accurate understanding of them.

1. The origin of the encoding problem and the related concepts

1.1 The development of characters and encodings

In terms of how computers support multiple languages, the history can be roughly divided into three stages:

Stage 1: ASCII
    Description: Early computers supported only English. Other languages could not be stored or displayed on the computer.
    Systems: English DOS

Stage 2: ANSI encodings (localization)
    Description: To let computers support more languages, two bytes in the range 0x80 to 0xFF are usually used to represent one character. For example, on a Chinese operating system the character '中' is stored as the two bytes [0xD6, 0xD0]. Different countries and regions developed their own standards, producing encodings such as GB2312, BIG5, and JIS. These extended encodings, which use 2 bytes to represent a character, are called ANSI encodings. On a Simplified Chinese system "ANSI encoding" means GB2312; on a Japanese system it means JIS. Different ANSI encodings are incompatible with one another, so when information is exchanged internationally, text in two languages cannot be stored in the same piece of ANSI-encoded text.
    Systems: Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98

Stage 3: Unicode (internationalization)
    Description: To make international information exchange easier, an international organization created the Unicode character set, which assigns a single unique number to every character of every language, to meet the needs of cross-language, cross-platform text conversion and processing.
    Systems: Windows NT/2000/XP, Linux, Java

How strings are stored in memory:

In the ASCII stage, a single-byte string (SBCS) uses one byte to store each character. For example, "Bob123" is stored in memory as:

42  6F  62  31  32  33  00
B   o   b   1   2   3   \0

In the stage where ANSI encodings are used to support multiple languages, each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, "中文123" occupies 7 bytes in memory under Chinese Windows 95: each Chinese character takes 2 bytes, and each English letter or digit takes 1 byte:

D6D0  CEC4  31  32  33  00
中    文    1   2   3   \0

Once Unicode is adopted, the computer stores a string as the sequence of numbers that its characters have in the Unicode character set. Computers currently generally use 2 bytes (16 bits) to store one such number (DBCS), so characters stored this way are also called wide characters. For example, under Windows 2000 the string "中文123" is actually stored in memory as 5 numbers:

2D4E  8765  3100  3200  3300  0000      ← on x86 CPUs, the low byte comes first
中    文    1     2     3     \0

That is 10 bytes in total, not counting the terminator.
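
The three storage layouts above can be reproduced with a short Java sketch. It is only an illustration: the class name StorageDemo and the hex helper are made up for this example, the Chinese characters are written as \u escapes so the source file encoding does not matter, and it assumes the JRE provides the GB2312 charset. Java strings have no trailing NUL, so the terminator bytes shown above do not appear.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StorageDemo {
    // Formats bytes as uppercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = "Bob123";
        // U+4E2D U+6587 followed by "123"
        String mixed = "\u4e2d\u6587123";

        // Assumption: this JRE ships the GB2312 charset (an extended charset).
        Charset gb2312 = Charset.forName("GB2312");

        // SBCS: one byte per character.
        System.out.println(hex(ascii.getBytes(StandardCharsets.US_ASCII)));
        // prints 42 6F 62 31 32 33

        // MBCS: 2 bytes per Chinese character, 1 byte per digit (7 bytes).
        System.out.println(hex(mixed.getBytes(gb2312)));
        // prints D6 D0 CE C4 31 32 33

        // 16-bit units, low byte first (10 bytes).
        System.out.println(hex(mixed.getBytes(StandardCharsets.UTF_16LE)));
        // prints 2D 4E 87 65 31 00 32 00 33 00
    }
}
```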

1.2 Characters, bytes, and strings

The key to understanding encodings is to understand precisely what a character is and what a byte is. The two concepts are easily confused, so we distinguish them here:

Character
    Description: A mark that people use; a symbol in the abstract sense.
    Examples: '1', ',', 'a', '$', '¥', ...

Byte
    Description: The unit in which a computer stores data; an 8-bit binary number; a very concrete piece of storage.
    Examples: 0x01, 0x45, 0xFA, ...

ANSI string
    Description: In memory, if the characters exist in some ANSI encoding, with each character using one or more bytes, we call the string an ANSI string or multi-byte string.
    Example: "中文123" (7 bytes)

Unicode string
    Description: In memory, if the characters exist as their numbers in Unicode, we call the string a Unicode string or wide string.
    Example: L"中文123" (10 bytes)

Because different ANSI encodings define different rules, for a given multi-byte string we must know which encoding it uses before we can know which characters it contains. For a Unicode string, the characters it represents are the same in every environment.
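
A small Java sketch can make this concrete: the same two bytes are one Chinese character under GB2312 but two half-width katakana under Shift_JIS. The class name SameBytesDemo and its decode helper are invented for the illustration, and the sketch assumes the JRE provides both charsets.

```java
import java.nio.charset.Charset;

class SameBytesDemo {
    // Decodes the given bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xD6, (byte) 0xD0 };

        // Under GB2312 the two bytes form one character, U+4E2D.
        String asChinese = decode(bytes, "GB2312");
        // Under Shift_JIS each byte is a half-width katakana: U+FF96 U+FF90.
        String asJapanese = decode(bytes, "Shift_JIS");

        System.out.println(asChinese.length());            // prints 1
        System.out.println(asJapanese.length());           // prints 2
        System.out.println(asChinese.equals(asJapanese));  // prints false
    }
}
```

Without knowing which encoding produced the bytes, there is no way to tell which of the two readings was intended.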

1.3 Character sets and encodings

The ANSI encoding standards of different countries and regions each specify only the characters needed for their own languages. For example, the Chinese character standard (GB2312) does not specify how Korean characters are stored. The content of these ANSI encoding standards has two layers of meaning:

1. Which characters are used, that is, which Chinese characters, letters, and symbols are included in the standard. The collection of included characters is called the "character set".
2. Whether each character is stored in one byte or in several bytes, and which bytes are used to store it. This rule is called the "encoding".

When countries and regions drew up their standards, they generally defined the character set and the encoding together. So what we usually call a "character set", such as GB2312, GBK, or JIS, means not only the collection of characters but also the encoding.

The Unicode character set contains all the characters used in every language. There are many encodings for the Unicode character set, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.

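As a rough illustration of one character set having several encodings, the following Java sketch encodes the same two characters, 中文 (U+4E2D U+6587), with three Unicode encodings. The class name UnicodeEncodingsDemo and the hex helper are hypothetical; the charsets themselves are standard in every Java runtime.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEncodingsDemo {
    // Formats bytes as uppercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D U+6587, written as escapes so the source encoding is irrelevant.
        String s = "\u4e2d\u6587";

        // Same characters, three different byte sequences.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
        // prints E4 B8 AD E6 96 87
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));
        // prints 4E 2D 65 87
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // prints 2D 4E 87 65
    }
}
```
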
2. Characters and encodings in programs

2.1 Characters and bytes in programs

In C++ and Java, the data types used to represent characters and bytes, and the ways to convert between them, are:

Type or operation        C++                                      Java
Character                wchar_t                                  char (see note 1)
Byte                     char                                     byte
ANSI string              char[]                                   byte[]
Unicode string           wchar_t[]                                String
Byte string → string     mbstowcs(), MultiByteToWideChar() (2)    string = new String(bytes, "encoding")
String → byte string     wcstombs(), WideCharToMultiByte() (2)    bytes = string.getBytes("encoding")

Notes:

1. char in Java represents a Unicode character (a wide character), whereas char in C++ represents a byte.
2. MultiByteToWideChar() and WideCharToMultiByte() are Windows API functions.

2.2 Related implementation methods in C++

Declaring string constants:

    // ANSI string, content length 7 bytes
    char sz[20] = "中文123";
    // Unicode string, content length 5 wchar_t (10 bytes with a 16-bit wchar_t, as on Windows)
    wchar_t wsz[20] = L"\x4E2D\x6587\x0031\x0032\x0033";

I/O on Unicode strings, and conversion between characters and bytes:

    // Needs <locale.h>, <stdio.h>, <stdlib.h>, and <wchar.h>; fp is an open FILE*
    // Set the current ANSI encoding at run time, VC format
    setlocale(LC_ALL, ".936");
    // GCC format
    setlocale(LC_ALL, "zh_CN.GBK");
    // Visual C++ uses lowercase %s and writes to the file in the encoding specified by setlocale
    // GCC uses uppercase %S
    fwprintf(fp, L"%s\n", wsz);
    // Convert the Unicode string to a byte string in the encoding specified by setlocale
    wcstombs(sz, wsz, 20);
    // Convert the byte string to a Unicode string in the encoding specified by setlocale
    mbstowcs(wsz, sz, 20);

In Visual C++, Unicode string constants can be written more simply. If the encoding of the source file differs from the current default ANSI encoding, use #pragma setlocale to tell the compiler which encoding the source file uses:

    // Needed if the source file's encoding differs from the current default ANSI encoding;
    // tells the compiler which encoding the source file uses
    #pragma setlocale(".936")
    // Unicode string, content length 10 bytes
    wchar_t wsz[20] = L"中文123";

Note that #pragma setlocale and setlocale(LC_ALL, "") serve different purposes: #pragma setlocale takes effect at compile time, and setlocale() takes effect at run time.

2.3 Related implementation methods in Java

The content of the String class is a Unicode string:

    // Java code, with the Chinese characters written directly
    String string = "中文123";
    // The length is 5, because the string has 5 characters
    System.out.println(string.length());

String I/O, and conversion between characters and bytes. In the Java package java.io.*, classes whose names end in "Stream" generally operate on byte strings, and classes whose names end in "Reader" or "Writer" generally operate on strings.

    // Converting between strings and byte strings
    // Get the bytes according to GB2312 (a multi-byte string)
    byte[] bytes = string.getBytes("GB2312");
    // Get a Unicode string back from the bytes
    string = new String(bytes, "GB2312");
    // To write a string to a text file in a given encoding, there are two ways:
    // First way: use a Stream class to write the byte string already converted to the given encoding
    OutputStream os = new FileOutputStream("1.txt");
    os.write(bytes);
    os.close();
    // Second way: construct a Writer with the given encoding and write the string
    Writer ow = new OutputStreamWriter(new FileOutputStream("2.txt"), "GB2312");
    ow.write(string);
    ow.close();
    /* In the end, 1.txt and 2.txt are both 7 bytes */
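
The Writer half of the fragment above has a Reader counterpart for reading the file back. The following self-contained sketch writes through a Writer and reads through a Reader using a temporary file; the class FileRoundTripDemo and its helper methods are hypothetical names, and it assumes the JRE provides the GB2312 charset.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class FileRoundTripDemo {
    // Creates a temporary file that is deleted when the JVM exits.
    static Path tempFile() {
        try {
            Path p = Files.createTempFile("encoding-demo", ".txt");
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes characters through a Writer and returns the file size in bytes.
    static long write(Path file, String text, Charset charset) {
        try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file through a Reader, decoding bytes into characters.
    static String read(Path file, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(Files.newInputStream(file), charset)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        String text = "\u4e2d\u6587123";            // U+4E2D U+6587 then "123"
        Path file = tempFile();

        System.out.println(write(file, text, gb2312));            // prints 7
        System.out.println(read(file, gb2312).equals(text));      // prints true
    }
}
```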

If the encoding of a Java source file differs from the current default ANSI encoding, specify the source encoding when compiling. For example:

    E:\> javac -encoding BIG5 Hello.java

Note the difference between the encoding of the source file and the encoding used for I/O: the former takes effect at compile time, and the latter takes effect at run time.

3. Several misunderstandings, and the causes of and solutions to garbled text

3.1 Common misunderstandings

Misunderstandings about encodings:

Misunderstanding 1
    When converting a "byte string" to a "Unicode string", for example when reading a text file or receiving text over the network, it is easy to treat the byte string simply as a single-byte string and convert it with the rule "one byte is one character". In fact, in a non-English environment the byte string should be treated as an ANSI string and converted to a Unicode string with the appropriate encoding, and several bytes may make up one character. Programmers who have always developed in an English environment are prone to this misunderstanding.

Misunderstanding 2
    In non-Unicode environments such as DOS and Windows 98, strings exist as bytes in some ANSI encoding. To use such a byte-based string correctly, we must know which encoding it uses. This has created a habit of thinking in terms of "the encoding of a string". Once Unicode is supported, a Java String is stored as the numbers of its characters, not as bytes in some encoding, so there is no longer such a thing as "the encoding of a string". The concept of encoding arises only when a string is converted to or from a byte string, or when a byte string is treated as an ANSI string. Many people have this misunderstanding.

The first misunderstanding is often what causes garbled text. The second often makes a garbled-text problem that would have been easy to fix more complicated.

3.2 Introduction to common encodings

This section briefly introduces the commonly used encoding rules, in preparation for the sections that follow. We divide all encodings into three categories according to the characteristics of their rules:

Category: Single-byte character encoding
    Encoding standards: ISO-8859-1
    Description: The simplest encoding rule: each byte is used directly as one Unicode character. For example, converting the two bytes [0xD6, 0xD0] to a string through ISO-8859-1 directly yields the two Unicode characters [0x00D6, 0x00D0], that is, "ÖÐ". Conversely, when a Unicode string is converted to a byte string through ISO-8859-1, only characters 0 to 255 can be converted correctly.

Category: ANSI encodings
    Encoding standards: GB2312, BIG5, Shift_JIS, ISO-8859-2, ...
    Description: When a Unicode string is converted to a byte string, one Unicode character may become one byte or several bytes, depending on the rules of the encoding. Conversely, when a byte string is converted to a string, several bytes may become one character. For example, converting the two bytes [0xD6, 0xD0] to a string through GB2312 yields the single character [0x4E2D], that is, '中'.
    Features of ANSI encodings:
    1. Each ANSI encoding standard can handle only the Unicode characters within its own language range.
    2. The mapping between Unicode characters and the converted bytes is assigned by people.

Category: Unicode encodings
    Encoding standards: UTF-8, UTF-16, UnicodeBig, ...
    Description: As with ANSI encodings, when a string is converted to a byte string, one Unicode character may become one byte or several bytes.
    Differences from ANSI encodings:
    1. These Unicode encodings can handle all Unicode characters.
    2. The mapping between Unicode characters and the converted bytes can be computed.

Here we can see that Misunderstanding 1 from the previous section, converting with the rule "one byte is one character", is in effect the same as converting with ISO-8859-1. That is why we often use bytes = string.getBytes("ISO-8859-1") to reverse the operation and get back the original byte string, and then use the correct ANSI encoding, for example string = new String(bytes, "GB2312"), to get the correct Unicode string.
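
The repair described above can be written as a small Java sketch. The class name MojibakeRepairDemo and the repair helper are invented for illustration, and the sketch assumes the bytes really were GB2312 and that the JRE provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {
    // Undoes a wrong ISO-8859-1 decode: recovers the original bytes,
    // then decodes them with the charset that was actually used.
    static String repair(String wronglyDecoded, Charset actual) {
        byte[] original = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        byte[] received = { (byte) 0xD6, (byte) 0xD0 };

        // The mistake: one byte is treated as one character.
        String wrong = new String(received, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("\u00D6\u00D0")); // prints true

        // The repair: back to bytes, then decode as GB2312 (assumed available).
        String fixed = repair(wrong, Charset.forName("GB2312"));
        System.out.println(fixed.equals("\u4e2d"));       // prints true
    }
}
```

The repair works only because the wrong decode used ISO-8859-1, which maps every byte to a distinct character; a wrong decode through most other encodings may already have discarded information.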

3.3 Moving non-Unicode programs between different locales

Strings in a non-Unicode program exist in some ANSI encoding. If the locale at run time differs from the locale at development time, the program's ANSI strings will fail to display correctly.

For example, a non-Unicode program with a Japanese interface, developed and released in a Japanese environment, shows garbled text in its interface when run in a Chinese environment. If the program were changed to store its Japanese interface strings in Unicode, it would display Japanese correctly when run in a Chinese environment.

For practical reasons we sometimes have to run non-Unicode Japanese software on a Chinese operating system. In that case we can use tools such as 南极星 (NJStar) or AppLocale to temporarily simulate a different locale.

3.4 Submitting strings from a web page

When a page submits a string through a form, it first converts the string to a byte string according to the encoding of the current page, then converts each byte to the "%XX" format and submits it to the web server. For example, when a page encoded in GB2312 submits the string "中", the content sent to the server is "%D6%D0". On the server side, the web server converts the received "%D6%D0" into the two bytes [0xD6, 0xD0] and then obtains the character "中" according to the GB2312 encoding rule.

On a Tomcat server, when request.getParameter() returns garbled text, it is often caused by Misunderstanding 1 described earlier. By default, when "%D6%D0" is submitted to Tomcat, request.getParameter() returns the two Unicode characters [0x00D6, 0x00D0] rather than the single character "中". So we need bytes = string.getBytes("ISO-8859-1") to recover the original byte string, and then string = new String(bytes, "GB2312") to get the correct string "中".
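
The round trip can be imitated outside a servlet container with java.net.URLEncoder and URLDecoder. This is only a stand-in for what the browser and the server do, not Tomcat's own code; the class FormEncodingDemo and its helpers are hypothetical, and the JRE is assumed to provide the GB2312 charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Percent-encodes a value using the page's charset, as a browser would.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decodes a value, interpreting the bytes in the given charset.
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The page is GB2312 and submits the single character U+4E2D.
        String submitted = encode("\u4e2d", "GB2312");
        System.out.println(submitted); // prints %D6%D0

        // Decoding with the wrong charset yields two characters, U+00D6 U+00D0.
        String wrong = decode(submitted, "ISO-8859-1");
        System.out.println(wrong.length()); // prints 2

        // Decoding with the page's charset yields the original character.
        String right = decode(submitted, "GB2312");
        System.out.println(right.equals("\u4e2d")); // prints true
    }
}
```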

3.5 Reading strings from a database

When a program reads a string from a database server through a database client (such as ODBC or JDBC), the client needs to learn from the server which ANSI encoding is in use. When the database server sends a byte stream to the client, the client is responsible for converting the byte stream to a Unicode string with the correct encoding.

If you get garbled text when reading a string from the database even though the data stored in the database is correct, it is usually again caused by Misunderstanding 1 described earlier. The fix is the same: string = new String(string.getBytes("ISO-8859-1"), "GB2312"), which re-converts the bytes to a string with the correct encoding.

3.6 Strings in email

When a piece of text or HTML is sent by email, the content is first converted to a byte string through a specified character encoding, and that byte string is then converted to another byte string through a specified transfer encoding (Content-Transfer-Encoding). For example, if you open the source of an email, you will see something like this:

    Content-Type: text/plain;
            charset="gb2312"
    Content-Transfer-Encoding: base64

    SBG QcrQUQo17cf4yee74bgjz9w7 b3wudza7dbq0mqncg0kvpkzxqo6uqo17cnnsapw0ndedqoncg ==

The two most common Content-Transfer-Encoding values are Base64 and quoted-printable. For binary files or Chinese text, Base64 yields a shorter byte string than quoted-printable. For English text, quoted-printable yields a shorter byte string than Base64.
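
The size trade-off can be estimated with a short Java sketch. The Base64 length comes from java.util.Base64; the quoted-printable length is only an approximation written for this example (every byte outside printable ASCII, and the "=" sign itself, costs three characters, while soft line breaks and trailing-space rules are ignored). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class TransferSizeDemo {
    // Length of the Base64 form of the bytes, without line breaks.
    static int base64Length(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes).length();
    }

    // Approximate quoted-printable length: printable ASCII (except '=')
    // and spaces stay as one character; every other byte becomes "=XX".
    static int quotedPrintableLength(byte[] bytes) {
        int n = 0;
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean literal = (v >= 33 && v <= 126 && v != '=') || v == ' ';
            n += literal ? 1 : 3;
        }
        return n;
    }

    public static void main(String[] args) {
        // 12 bytes of English text.
        byte[] english = "Hello, world".getBytes(StandardCharsets.US_ASCII);
        // 4 bytes of Chinese text (U+4E2D U+6587) in GB2312, assumed available.
        byte[] chinese = "\u4e2d\u6587".getBytes(Charset.forName("GB2312"));

        System.out.println(base64Length(english));          // prints 16
        System.out.println(quotedPrintableLength(english)); // prints 12
        System.out.println(base64Length(chinese));          // prints 8
        System.out.println(quotedPrintableLength(chinese)); // prints 12
    }
}
```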

The subject of an email uses a shorter format to label the "character encoding" and the "transfer encoding". For example, if the subject is "中", the email source shows:

    // Correct subject format
    Subject: =?GB2312?B?1tA=?=

The text between the first "=?" and the following "?" specifies the character encoding, GB2312 in this example. The "B" between the next pair of "?" means Base64; a "Q" there would mean quoted-printable. The text between the last "?" and "?=" is the subject itself, converted to a byte string through GB2312 and then encoded with Base64.

If the transfer encoding is changed to quoted-printable, the same subject "中" becomes:

    // Correct subject format
    Subject: =?GB2312?Q?=D6=D0?=
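
Both subject forms can be generated with a minimal Java sketch. The class EncodedSubjectDemo and its two helpers are invented for this illustration; the "Q" helper is deliberately simplified and writes every byte as =XX, which is valid but longer than necessary for ASCII letters and digits, and neither helper enforces the length limit on encoded words. The GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.util.Base64;

class EncodedSubjectDemo {
    // Builds a "B" (Base64) encoded word: =?charset?B?...?=
    static String encodeB(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        return "=?" + charsetName + "?B?"
                + Base64.getEncoder().encodeToString(bytes) + "?=";
    }

    // Builds a "Q" encoded word. Simplification: every byte is hex-escaped.
    static String encodeQ(String text, String charsetName) {
        StringBuilder sb = new StringBuilder("=?" + charsetName + "?Q?");
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("=%02X", b & 0xFF));
        }
        return sb.append("?=").toString();
    }

    public static void main(String[] args) {
        String subject = "\u4e2d"; // the single character U+4E2D

        System.out.println("Subject: " + encodeB(subject, "GB2312"));
        // prints Subject: =?GB2312?B?1tA=?=
        System.out.println("Subject: " + encodeQ(subject, "GB2312"));
        // prints Subject: =?GB2312?Q?=D6=D0?=
    }
}
```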

If an email shows garbled text when you read it, it is generally because the "character encoding" or the "transfer encoding" was specified incorrectly or not specified at all. For example, some email components send mail with a subject like this:

    // Incorrect subject format
    Subject: =?ISO-8859-1?Q?=D6=D0?=

This in effect states explicitly that the subject is [0x00D6, 0x00D0], that is, "ÖÐ", not "中".

4. Corrections to some misunderstandings

Misunderstanding: "Is ISO-8859-1 an international encoding?"

No. ISO-8859-1 is just the simplest single-byte character set, that is, the encoding rule in which the byte value equals the Unicode character number. When we need to convert a byte string to a string but do not know which ANSI encoding it uses, temporarily converting each byte as one character loses no information. Afterwards, bytes = string.getBytes("ISO-8859-1") recovers the original byte string.
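
The "no information loss" claim can be checked for every possible byte value with a short Java sketch. The class Latin1RoundTripDemo and its helper are hypothetical names; the comparison with UTF-8 is added here only as a contrast and is not part of the original discussion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTripDemo {
    // Returns true if decoding the bytes with the charset and encoding the
    // result again gives back exactly the same bytes.
    static boolean survivesRoundTrip(byte[] bytes, Charset charset) {
        String asChars = new String(bytes, charset);
        return Arrays.equals(bytes, asChars.getBytes(charset));
    }

    public static void main(String[] args) {
        // Every byte value from 0x00 to 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;

        // ISO-8859-1 maps each byte to a distinct character, so nothing is lost.
        System.out.println(survivesRoundTrip(all, StandardCharsets.ISO_8859_1));
        // prints true

        // GB2312 bytes are not valid UTF-8; the decoder substitutes U+FFFD,
        // so the original bytes cannot be recovered.
        byte[] gbBytes = { (byte) 0xD6, (byte) 0xD0 };
        System.out.println(survivesRoundTrip(gbBytes, StandardCharsets.UTF_8));
        // prints false
    }
}
```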

Misunderstanding: "How do I know the internal encoding of a string?"

In Java, the string class java.lang.String holds a Unicode string, not an ANSI string. We only need to treat a string as a sequence of abstract symbols, so the question of a string's internal encoding does not arise.
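
One way to see this is that two strings decoded from different byte encodings compare equal once they are String objects. The sketch below is illustrative only; the class name NoStringEncodingDemo and the decode helper are made up, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NoStringEncodingDemo {
    // Decodes bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The same two characters (U+4E2D U+6587) as GB2312 bytes and as UTF-8 bytes.
        byte[] gb = { (byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4 };
        byte[] utf8 = { (byte) 0xE4, (byte) 0xB8, (byte) 0xAD,
                        (byte) 0xE6, (byte) 0x96, (byte) 0x87 };

        String fromGb = decode(gb, Charset.forName("GB2312"));
        String fromUtf8 = decode(utf8, StandardCharsets.UTF_8);

        // After decoding, nothing in the String records where it came from.
        System.out.println(fromGb.equals(fromUtf8)); // prints true
        System.out.println(fromGb.length());         // prints 2
    }
}
```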
