Character Sets Are Everywhere


Lu_YI_MING (_AT_) Sina.com 2004.120.2 (This article may contain many errors and is for reference only; corrections are welcome.)

First, an example of character sets in use: web browsing

We start with a user browsing an HTML page in IE. Assume it is a "user information registration" page where the user enters a name, age, and other information to register.

After starting IE, the user enters a URL in the address bar (for the character set processing related to the keyboard, see "the user enters a name" below). IE stores the URL as UTF-16 in a buffer, converts it to UTF-8, wraps it in an HTTP packet (which involves the HTTP protocol's own character set), and hands it to the TCP/IP socket to be sent to the web server (DNS resolution of the server happens before this, with its own character set issues, including those of Chinese domain names).

The web server (for the server's file system handling, see "loading the JSP file" below) returns an HTTP packet containing the HTML file (encoded in UTF-8 to support internationalization) to IE through the TCP/IP socket. IE first looks for the content's character set in the HTTP protocol headers; if none is found, it looks for a charset declaration string inside the HTML (such as a <meta> charset tag); if that is not found either, it defaults to ISO-8859-1. In short, IE settles on one character set for the current HTML file (we assume UTF-8) and then displays the file in its window according to that character set (for text display, see "the user enters a name" below).
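The lookup order described above (HTTP header, then a declaration inside the HTML, then a default) can be sketched in Java as follows. This is only an illustration, not IE's actual logic: the class name CharsetResolver, its resolve method, and the regular expression are all hypothetical, and real browsers apply more elaborate rules.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not IE's real implementation.
final class CharsetResolver {
    // Matches "charset=NAME" in a Content-Type header value or in a
    // charset declaration inside the HTML text.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
            Pattern.CASE_INSENSITIVE);

    // Order: HTTP header first, then the HTML itself, then ISO-8859-1.
    static Charset resolve(String contentTypeHeader, String htmlHead) {
        Charset fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        Charset fromHtml = find(htmlHead);
        if (fromHtml != null) {
            return fromHtml;
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Returns the first supported charset named in the text, or null.
    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }
}
```

For example, CharsetResolver.resolve("text/html", "<html>") falls through both lookups and returns ISO-8859-1, while a header value of "text/html; charset=UTF-8" is decided by the header alone.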

IE displays the HTML file using the correct character set, and the user begins to enter a name in an input field. Each time a key is pressed, Windows puts the virtual key code in a WM_KEYDOWN message and adds it to IE's message queue. After GetMessage() retrieves the message, it is passed to TranslateMessage(), which hands the virtual key code to the input method window. The Chinese character the user selects in the input method window is held temporarily as UTF-16. Then, depending on the character set type with which the IE window was registered (UTF-16 / GBK), the character's code is placed in a WM_CHAR message in IE's message queue (one message if IE receives UTF-16, two messages if GBK). IE stores the character's UTF-16 code in a memory buffer, then calls TextOut() with the code, font, display position, and so on. TextOut() looks up the UTF-16 code in the font file (performing a character set conversion first if the font file is not UTF-16 encoded internally) and draws the glyph (vector or bitmap) at the given position in the IE window.

The user presses the "Submit" button.

All the text entered on the page is held in memory as UTF-16. After the user presses "Submit", IE converts the buffered data to the character set determined earlier for the HTML page (UTF-8, as assumed above) and wraps it in an HTTP packet (the form's URL is still converted to UTF-8), then hands it to the TCP/IP socket to send (character set handling is the same as before). We assume the request is processed by a JSP file on a J2EE server (Tomcat) running on Linux.

The TCP/IP socket of Tomcat/JVM receives the HTTP packet. Because the JVM can only process UTF-16 characters, Tomcat converts the HTTP packet to UTF-16 before parsing it, and learns that a JSP needs to be loaded. This JSP has not been loaded yet, so Tomcat/JVM converts the file name (including the path) to UTF-8 (because glibc / the Linux kernel use UTF-8) and calls open() in glibc / the Linux kernel to open the file (the character set conversion when the Linux kernel accesses the file system is discussed elsewhere in this article). Tomcat/JVM then calls read() to read the file's binary data into memory, converts the beginning of the file to UTF-16, and looks for the string <%@ page contentType="text/html; charset=utf-8" %> to determine the file's character set (this file is saved in UTF-8 to support internationalization). Tomcat/JVM converts the whole JSP file to UTF-16, translates it into Java source, converts that source back to UTF-8, and saves the Java file in the working directory (file system character set conversion as above). Tomcat calls javac to compile the Java file into a class file (character set conversion must also take place during compilation, since otherwise the C library or operating system could not support it, but the details are unknown; strings in the code are converted to UTF-16), then loads it into memory (hereinafter "the JSP code").

Tomcat hands the raw binary data of the user's HTTP packet to the JSP code. The JSP code first converts the binary data from UTF-8 to UTF-16, the JVM's internal encoding, then reads the text of the form fields, performs the appropriate data type conversions, and starts processing.
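A minimal Java sketch of the round trip described above: the browser side percent-encodes a form field in the page's character set, and the server side decodes it back into a Java (UTF-16) string. FormEncoding and its two methods are hypothetical names for illustration; java.net.URLEncoder and URLDecoder stand in for what IE and Tomcat do internally, which may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
final class FormEncoding {
    // Browser side: percent-encode a field value in the page's charset.
    static String encodeField(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + pageCharset, e);
        }
    }

    // Server side: decode the bytes back into a Java (UTF-16) String.
    static String decodeField(String encoded, String pageCharset) {
        try {
            return URLDecoder.decode(encoded, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + pageCharset, e);
        }
    }
}
```

With the single Chinese character "\u4e2d", encodeField(..., "UTF-8") gives %E4%B8%AD (three bytes) and encodeField(..., "GBK") gives %D6%D0 (two bytes). Decoding only recovers the original text if the server uses the same character set the browser used, which is why the page's charset declaration matters.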

We assume that the JSP code wants to store the user's data in a database (MySQL).

The JSP code loads MySQL's JDBC driver and connects to a database (named DB) on a MySQL server (TCP/IP-related character set conversion is omitted). Once connected to DB, the JDBC driver immediately obtains DB's default character set from the server (set to UTF-8 to support internationalization). The JSP code assembles the data from the user's form into a SQL statement (in UTF-16, of course), INSERT INTO TABLE1 (NAME, COMPANY) VALUES ('name', 'unit'), where each value stands for a two-character Chinese string, and hands it to JDBC. JDBC first queries the server for the character sets of the fields (the NAME field is UTF-8, the COMPANY field is GBK), then converts the INSERT statement to UTF-8, except that 'unit' is converted to GBK, and sends this string data to the server over TCP/IP. The server writes it to its files without changing the character set: 'name' in UTF-8 takes 6 bytes, and 'unit' in GBK takes 4 bytes. Suppose next that the JSP code needs to fetch some information from the database and then send a confirmation email to the user's mailbox so that the user can confirm.
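The byte counts mentioned for the INSERT above can be checked with a small Java sketch. The class ByteLengths and the sample strings are assumptions made for illustration; the only library calls used are String.getBytes(Charset) and Charset.forName, and the GBK charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ByteLengths {
    // GBK is not in StandardCharsets; it is assumed to ship with the JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Number of bytes a Java (UTF-16) string occupies once encoded.
    static int length(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String twoHanzi = "\u59d3\u540d"; // a two-character Chinese word
        System.out.println(length(twoHanzi, StandardCharsets.UTF_8));    // 6
        System.out.println(length(twoHanzi, GBK));                       // 4
        System.out.println(length(twoHanzi, StandardCharsets.UTF_16LE)); // 4
        System.out.println(length("A", StandardCharsets.UTF_8));         // 1
    }
}
```

The same two characters take 6 bytes in a UTF-8 field and 4 bytes in a GBK field, matching the figures given for the two columns.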

The JSP code prepares a SQL statement, SELECT ..., and hands it to JDBC, asking for a result set. JDBC converts the SELECT statement and sends it to the MySQL database DB (the character set conversion is similar to the above); the server returns a result set, which JDBC receives. When the JSP code calls getString(), JDBC returns a UTF-16 string from the result set. The JSP code builds the message strings from the data obtained and calls JavaMail's MimeMessage.setText() and related methods to set up the message; the characters here are all UTF-16. Once the message is ready, the JSP code sends it. Before handing the mail data to the socket, Transport.send() converts it to the character set defined in the Linux environment variables. The character set handling of the message itself and of the SMTP protocol is not covered here.

After the mail has been sent successfully, the JSP code shows the user some prompt information to conclude the new user registration.

To return an HTML page to the user, the JSP code first specifies the character set in the HTTP protocol, again returning a UTF-8 page with <%@ page contentType="text/html; charset=UTF-8" %>, then outputs the HTML to be displayed with out.println(). Tomcat converts everything in the output buffer from UTF-16 to UTF-8 and returns it to IE through the TCP/IP socket. IE converts it according to the character set as described above and displays it to the user.

Second, character sets by category

1. Basic definitions

◇ GBK is a superset of GB2312. In GBK an English letter takes 1 byte and a Chinese character takes 2 bytes. Chinese characters are sorted by pinyin, and traditional Chinese characters are included.

◇ Unicode here usually means 16-bit Unicode (UTF-16 / UCS-2). There is also a 32-bit form, which is not commonly used. One character takes two bytes, with the high byte last (little endian). Chinese characters are sorted by radical and stroke, and traditional Chinese characters are included in the character set.

◇ UTF-8 is a variable-length encoding of Unicode whose characters correspond one-to-one with those of 16-bit Unicode, so the sorting rules are the same. An English letter takes 1 byte, a Central European character takes 2 bytes, and a Chinese character takes 3 bytes. String comparison, sorting, and similar operations are roughly 30% slower (for reference only).

2. Windows (2000) system

◇ Almost every piece of text we see on Windows is drawn by TextOut, which internally uses UTF-16 / UCS-2 little endian (according to MSDN). TextOut draws on screen the glyph of a character taken from a font file, and a font file holds a glyph (vector or bitmap) for each character. A font file is, of course, authored in some particular character set. So if the font file's character set is not UTF-16 / UCS-2 little endian, TextOut must convert first in order to find the right glyph to display.

◇ Windows ships with many character sets (for each one, a mapping table between that character set and UTF-16) to support applications.

◇ Windows also has a default local character set (the end user's character set) to support localization. The local character set of Chinese Windows is GBK.

◇ Windows provides two sets of APIs (...W and ...A): ...W for Unicode and ...A for the local character set (GBK).

◇ When you type a Chinese character in a Windows application (through an input method), the TranslateMessage function translates it into a GBK or Unicode character (depending on how the window was registered with RegisterClass) and delivers it to your application in the wParam of a WM_CHAR message.

◇ Windows comes with notepad.exe. Chinese typed on the keyboard is translated into Unicode characters and stored in a memory buffer. When saving a file you can choose the encoding: if you select ANSI, it is saved as GB2312; if you select Unicode, it is saved directly as Unicode; if you select UTF-8, it is translated and saved as UTF-8. Text files in any of these character sets can be opened.

◇ File names stored on a FAT/FAT32 partition use the ANSI/OEM character set (that is, GBK / GB2312); file names stored on an NTFS partition use Unicode.

◇ The Windows DOS window can display text files in GBK / GB2312 and UTF-16.

◇ To view a file's binary data, open it in the VC editor.

3. Linux system (Red Hat FC2 / kernel 2.6.6)

◇ The Linux kernel's character set is UTF-8.

◇ The character set of the Linux user interface can be set through an environment variable (in /etc/profile). The default is en_US.UTF-8; it can be changed with export LC_CTYPE="zh_CN.GB2312".

◇ To display content in other character sets, use the iconv command, for example cat ... | iconv -f gb2312 -t utf-8

◇ In KWrite, you can choose the character set when opening a file, and the character set is translated when saving.

◇ To display a file's binary data, use hexdump -c

◇ Mounting partitions in fstab:
/dev/hdb1 /mnt/winc vfat defaults,codepage=936,iocharset=cp936 0 0
/dev/hda5 /mnt/wind vfat defaults,codepage=936,iocharset=cp936 0 0

◇ Mapping a shared directory on another Windows machine to a subdirectory:
mount -t smbfs -o iocharset=gb2312,codepage=cp936,username=xxx //pcname/share /mnt/dir1
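What the iconv command above does (decode bytes from one character set, re-encode them in another) can be sketched in Java as follows. Transcode and convert are hypothetical names for illustration, and the GB2312 charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; roughly "iconv -f FROM -t TO".
final class Transcode {
    static byte[] convert(byte[] input, Charset from, Charset to) {
        // Step 1: decode the source bytes into a Java (UTF-16) String.
        String text = new String(input, from);
        // Step 2: encode that String into the target character set.
        return text.getBytes(to);
    }
}
```

For example, the two GB2312 bytes D6 D0 passed through convert(..., Charset.forName("GB2312"), StandardCharsets.UTF_8) come out as the three UTF-8 bytes E4 B8 AD. Going through UTF-16 as the intermediate form mirrors how the JVM handles text elsewhere in this article.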

4. Web browsing

◇ The character set of the HTTP protocol part is UTF-8. The character set of the data part is specified either in the protocol or in the HTML file.

◇ The character set of an HTML file is declared inside the file itself, in a charset declaration tag such as <meta>. This is rather messy: the browser does not know the character set of the data block it has received, yet it has to search inside that data for such a complicated string before it knows what the character set is. Try this test: use FrontPage to type a few lines of Chinese, set the page's character set to Unicode (page properties -> text -> encoding), and save it to a file. Then open the file in binary mode, delete the two bytes "FF FE" at the start that mark it as UTF-16, and save it again. When the file is reopened, IE / FrontPage / InterDev no longer display it correctly, while Firefox handles it fine. It seems that IE decides entirely by the file header, and the HTML charset tag sometimes has no effect.

◇ On Windows, IE and Firefox both translate typed characters into Unicode for storage, including those in the address bar and in forms. Address bar content is converted to the HTTP protocol's character set before being sent to the server. Form content is converted to the character set specified in the HTML (or in the browser) before being sent to the server (the server-side application).
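The "FF FE" file header deleted in the FrontPage test above is the byte order mark of UTF-16 little endian. A minimal Java sketch of detecting a character set from such a header follows. BomSniffer and detect are hypothetical names, and this is not how IE is actually implemented; it only illustrates deciding "by the file header".

```java
// Hypothetical helper for illustration only.
final class BomSniffer {
    // Returns the charset name implied by a byte order mark at the start
    // of the data, or null when no known mark is present.
    // (UTF-32 marks are deliberately not handled in this sketch.)
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }
}
```

Once the two leading bytes are removed, detect returns null, and a program relying only on the header has to guess or fall back to a declaration inside the file.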

5. Java / J2EE

◇ The character set inside the JVM is Unicode.

◇ A Java source file can be in any character set; it is usually decided by the editor when the file is saved (by default, the character set of the system user interface). javac lets you specify the character set of the source file (with -encoding); when it is not specified, javac uses the default from the environment variables or the system. The character set of .class files is Unicode.

◇ The code for transforming the character set of a Java string (String) (from GB2312 to UTF-8) is: String str2 = new String(str1.getBytes("GB2312"), "UTF-8"); Note that a String is always UTF-16 internally, so this idiom re-decodes bytes rather than re-encoding the string; to obtain the UTF-8 bytes of a string, str1.getBytes("UTF-8") is sufficient.

◇ A JSP file can be in any character set; when it is loaded, the character set is looked up from the <%@ ... %> directive in the file.

◇ On Linux, mail sent by JavaMail is converted to the character set of the local system, which by default is the one defined in the environment variables.

6. VC / .NET

◇

7. Database

7.1 MySQL (4.1.8)

◇ The MySQL server can be given a character set. The startup character set can be written in the configuration file; a simpler way is to set it in the graphical management tool MySQL Administrator.

◇ MySQL databases (DB), tables, and fields can each be set to a different character set. To support internationalization (for example, a Name field that must hold names in Chinese, English, Japanese, Korean, Arabic, and so on), it is better to use UTF-8 throughout.

◇ MySQL offers several sort orders for each character set, but UTF-8 has no pinyin sort order for Chinese characters (in UTF-8, Chinese characters are sorted by radical and first stroke). A workaround for sorting Chinese in a UTF-8 field is to add a field in the GBK character set and insert the same data into both fields every time. Chinese can then be sorted on the GBK field.

◇ MySQL's JDBC driver is very "smart": it automatically converts between the Unicode character set in Java and the character set of the database.

◇ MySQL's graphical client, MySQL Query Browser, is also very smart about converting character sets automatically.

◇ MySQL's command-line client, mysql, handles character sets very poorly.
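Why a parallel GBK field helps with sorting, as in the workaround above, can be illustrated outside the database with a small Java sketch. It compares strings by their GBK bytes, which for common characters follows pinyin order, while the natural String order follows Unicode code order. GbkOrder and sortByGbk are hypothetical names, the three sample characters are assumptions chosen for illustration, and this is not how MySQL implements its sort orders.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only.
final class GbkOrder {
    private static final Charset GBK = Charset.forName("GBK");

    // Compares two strings by their GBK bytes, treated as unsigned values.
    static final Comparator<String> BY_GBK_BYTES = (a, b) -> {
        byte[] x = a.getBytes(GBK);
        byte[] y = b.getBytes(GBK);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    };

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sortByGbk(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(BY_GBK_BYTES);
        return copy;
    }
}
```

Given the characters read shang (\u4e0a), bei (\u5317), and guang (\u5e7f), sortByGbk returns them in the order bei, guang, shang, whereas sorting by Unicode code order gives shang, bei, guang.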

8. Email

◇ Omitted in this version; to be covered in detail in the next version.

9. SSH

◇

Every character belongs to some character set, and character sets are everywhere.
