Character internal code
Each country (or region) specifies a character coding set for computer information exchange, such as the extended ASCII code in the United States, GB2312-80 in China, and JIS in Japan. These character sets serve as the foundation of information processing in their country or region and play an important unifying role. Because the code ranges of these local character sets overlap, exchanging information between them is difficult, and localized versions of software are costly to maintain. It is therefore necessary to extract the parts common to all localized versions, make them consistent, and reduce the content that needs special localization to a minimum; this is called internationalization (I18N). The various kinds of language-specific information are treated as local information, while the underlying character set adopts Unicode, which contains all characters.
The character internal code (Character Code) is the code used internally to represent characters; it is what we use when entering and storing documents. Internal codes are divided into single-byte and double-byte codes. The full name of the single-byte internal code is Single-Byte Character Sets (SBCS), which can support 256 character codes; the full name of the double-byte internal code is Double-Byte Character Sets (DBCS), which can support about 65,000 character codes and is mainly used to encode East Asian scripts with large character repertoires.
A code page (CodePage) is a list of selected characters arranged in a particular order. For languages using early single-byte internal codes, the ordering of the internal codes in the code page allows the system to map keyboard input values to the corresponding internal codes in the list. For double-byte internal codes, the code page provides a table from the multibyte encoding to Unicode, through which characters can be converted to their Unicode form. Support for code pages was introduced mainly to access file names in multiple languages: file systems such as NTFS and FAT32/VFAT currently store file names in Unicode, so when reading these file names the system must dynamically convert them to the encoding of the corresponding language.
Readers who write JSP code will not be unfamiliar with ISO8859-1, one of the code pages we use most often; it belongs to the Western European group. GB2312-80 was developed during the initial phase of Chinese-character information processing in China; it contains the most commonly used first-level and second-level Chinese characters plus nine zones of symbols. This character set is supported by almost all Chinese systems and internationalized software, and it is also the most basic Chinese character set.
GBK is an extension of GB2312-80 and is backward compatible with it. It contains 20,902 Chinese characters in the code range 0x8140 to 0xFEFE (excluding the line where the low byte is 0x7F), and all of its characters can be mapped one-to-one to Unicode 2.0; in other words, Java actually provides support for the GBK character set.
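As a quick illustration of that support, the following minimal sketch (not part of the original article) round-trips a Chinese string through the JDK's GBK converter:

import java.io.UnsupportedEncodingException;

public class GbkDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u6c49\u5b57";                  // the two characters "汉字"
        byte[] gbkBytes = s.getBytes("GBK");        // Unicode -> GBK bytes (2 bytes per character)
        String back = new String(gbkBytes, "GBK");  // GBK bytes -> Unicode
        System.out.println(s.equals(back));         // prints "true" when the GBK converter is present
    }
}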
GB18030-2000 (GBK2K) further extends the Chinese character repertoire on the basis of GBK and adds the scripts of the Tibetan, Mongolian and other minority nationalities. GBK2K fundamentally solves the problems of insufficient characters and missing glyphs.
Different development platforms
1. Tomcat 4 Development Platform
Chinese-language problems appear with Tomcat 4 and above on Windows 98/2000 (there is no such problem under Linux or with Tomcat 3.x); the main symptom is that pages display garbled characters. Setting the character set in IE to GB2312 makes them display normally.
To solve this problem, you can add <%@ page language="java" contentType="text/html; charset=GB2312" %> to the page. However, this is not enough: although Chinese now displays, the fields read from the database turn out to be garbled. Analysis shows that the Chinese characters saved in the database are correct. The database stores data using the ISO8859-1 character set, and Java also uses a unified ISO8859-1 character set when processing characters (which itself reflects Java's internationalization design), so Java and the database both handle the data as ISO8859-1 when it is written, and nothing goes wrong. When reading the data back, however, a problem appears: the data is read out as ISO8859-1, while the JSP file header declares <%@ page language="java" contentType="text/html; charset=GB2312" %>, meaning the page is displayed with the GB2312 character set, which differs from the character set of the data just read. The characters read from the database therefore appear garbled on the page. The solution is to convert these characters from ISO8859-1 to GB2312, after which they display normally. This solution works on many platforms, and readers can apply it flexibly.
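A minimal sketch of that conversion (the method, variable and column names here are only illustrative, not from the original article):

public static String toGB2312(String raw) throws java.io.UnsupportedEncodingException {
    if (raw == null) {
        return null;
    }
    // re-decode the ISO8859-1 bytes read from the database as GB2312 for display
    return new String(raw.getBytes("ISO8859-1"), "GB2312");
}
// usage in a JSP scriptlet: out.print(toGB2312(rs.getString("title")));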
2. Tomcat 3.x, Resin and Linux platforms
Under Tomcat 3.x, Resin or Linux, the declaration <%@ page language="java" contentType="text/html; charset=GB2312" %> is not needed; the Chinese in the page displays normally without it. Conversely, if you do add <%@ page language="java" contentType="text/html; charset=GB2312" %>, the system reports an error, which shows that Tomcat 4 and above really does handle JSP differently.
In addition, the choice of character set matters for different databases such as SQL Server, Oracle, MySQL and Sybase. If a multilingual version is being considered, the database character set should be unified on ISO8859-1, and conversion between character sets can then be done wherever output is needed.
The following is a summary of different platforms:
(1) JSWDK is only suitable for ordinary development; its stability and other qualities may not match commercial software. Since JDK 1.3 performs better than JDK 1.2.2 and its support for Chinese is also good, it should be used whenever possible.
(2) As free commercial software, Resin is not only fast, stable and capable of automatic compilation, it can also point to the line where an error occurs, supports JavaScript on the server side, and has good support for Chinese.
(3) Tomcat is just a reference implementation of the JSP 1.1 and Servlet 2.2 standards, so we should not demand too much of this free software in terms of detail and performance. It mainly has English-speaking users in mind, which is why it does no special conversion and why passing Chinese characters by the URL method causes problems. Most IE browsers always send URLs in UTF-8, which seems to be a weak point of Tomcat; moreover, Tomcat assumes ISO8859 regardless of the encoding of the current operating system, which also seems inappropriate.
Chinese processing in JSP code
Chinese processing is often needed in JSP code, typically in the following situations:
1. Chinese contained in the URL. Chinese parameters passed here can usually be read directly, for example: <%= request.getParameter("ShowWord") %>.
2. Reading Chinese values from an HTML form in JSWDK. Here the value needs to be re-encoded; a fairly concise way to write it is: String name1 = new String(request.getParameter("User_ID").getBytes("ISO8859_1")); (a complete servlet sketch appears after this list).
In addition, with JDK 1.3 support it is not necessary to add <%@ page contentType="text/html; charset=GB2312" %>, while under JDK 1.2.2 and below even both of the above methods together are unstable. On the Resin platform the situation is better: as long as the first line of the page is <%@ page contentType="text/html; charset=GB2312" %>, Chinese is handled correctly and no extra conversion code is needed.
3. When the JSP page itself contains Chinese under JSWDK, values read from a form can be displayed correctly, but assigning a Chinese value to a variable directly does not work; the Resin platform handles this well.
4. Add the encoding option when compiling servlets and JSP pages. When compiling servlets, use javac -encoding ISO8859-1 MyServlet.java; in the JSP zone configuration file, modify the compilation parameter to compiler=builtin-javac -encoding ISO8859-1. After using this method, nothing else needs to be done for Chinese to display correctly.
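A minimal servlet sketch of the conversion described in item 2 (the class name is illustrative and not from the original article; the parameter name follows the example above):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ChineseParamServlet extends HttpServlet {
    public void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // tell the browser to interpret the response as GB2312
        response.setContentType("text/html; charset=GB2312");
        // the container decoded the form value as ISO8859-1, so re-decode its bytes
        String raw = request.getParameter("User_ID");
        String name1 = (raw == null) ? null : new String(raw.getBytes("ISO8859_1"));
        PrintWriter out = response.getWriter();
        out.println(name1);
    }
}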
In addition, popular relational database systems support a database encoding: a character set can be specified when a database is created, and the data is stored in that encoding. When an application accesses the data, encoding conversions take place on the way in and on the way out. For Chinese data, the database character encoding should be chosen so that the integrity of the data is guaranteed. GB2312, GBK and UTF-8 are all possible database encodings; ISO8859-1 (8-bit) can also be chosen, but since it increases the complexity of the programming, ISO8859-1 is not a recommended database encoding. When programming with JSP/Servlet, you can use the management functions provided by the database management system to check whether the data is stored correctly.
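If the database is nevertheless kept in ISO8859-1, the entry-side conversion the text mentions might look like the following sketch (the class, table and column names are hypothetical, not from the original article):

import java.io.UnsupportedEncodingException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ChineseInsert {
    // re-express a GB2312 string as ISO8859-1 before storing it in the database
    public static void insertTitle(Connection conn, String title)
            throws SQLException, UnsupportedEncodingException {
        String stored = new String(title.getBytes("GB2312"), "ISO8859-1");
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO articles (title) VALUES (?)");
        ps.setString(1, stored);
        ps.executeUpdate();
        ps.close();
    }
}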
Processing methods
Below are two concrete solutions to garbled Chinese; readers may find them useful after studying them carefully.
1. Common character conversion method
Values from a form, after being passed to the database and read back, all turn into "?". The form is submitted by POST, the code uses the statement String st = new String(request.getParameter("name").getBytes("ISO8859_1")), and charset=GB2312 is also declared, yet the problem remains.
To handle Chinese parameters passed in the form, add the following code to the JSP, defining a getStr method specifically to solve this problem, and then convert the received parameters:
String keyword1 = request.getParameter("keyword1");
keyword1 = getStr(keyword1);
This solves the problem; the complete code is as follows:
<%@ page contentType="text/html; charset=GB2312" %>
<%!
public String getStr(String str) {
    try {
        String temp_p = str;
        byte[] temp_t = temp_p.getBytes("ISO8859-1");
        String temp = new String(temp_t);
        return temp;
    } catch (Exception e) {
    }
    return "NULL";
}
%>
<%-- http://www.cndes.com test --%>
<%
String keyword = "The Chuanglian Network Technology Center welcomes you";
String keyword1 = request.getParameter("keyword1");
keyword1 = getStr(keyword1);
out.print(keyword);
out.print(keyword1);
%>
2. JDBC Driver Character Conversion
At present, most JDBC drivers use the local encoding format to transmit Chinese characters; for example, the Chinese character "0x4175" is transmitted as "0x41" and "0x75". Therefore, the characters returned by the JDBC driver, and the characters about to be sent to the JDBC driver, must be converted. When inserting data into the database through the JDBC driver, Unicode must be converted to the native code; when querying data from the database, the native code must be converted to Unicode. Implementations of these two conversions are given below:
String native2Unicode(String s) {
    if (s == null || s.length() == 0) {
        return null;
    }
    // each character may expand to two native bytes, so reserve twice the length
    byte[] buffer = new byte[s.length() * 2];
    int j = 0;
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) >= 0x100) {
            // already a double-byte character: take its two bytes in the native encoding
            char c = s.charAt(i);
            byte[] buf = ("" + c).getBytes();
            buffer[j++] = buf[0];
            buffer[j++] = buf[1];
        } else {
            // a single-byte character: copy it directly
            buffer[j++] = (byte) s.charAt(i);
        }
    }
    // decode the native byte sequence with the platform default character set
    return new String(buffer, 0, j);
}
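The text promises both directions but shows only one; a sketch of the reverse conversion (Unicode to native, for use when inserting data), under the same assumption that the platform default character set is the native one, might look like this:

String unicode2Native(String s) {
    if (s == null || s.length() == 0) {
        return null;
    }
    try {
        // encode with the platform default character set (e.g. GBK), then
        // re-interpret the bytes as ISO8859-1 so each native byte occupies one char
        return new String(s.getBytes(), "ISO8859-1");
    } catch (Exception e) {
        return null;
    }
}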
It should be noted that for some JDBC drivers, if the correct character set attribute has been set through the JDBC Driver Manager, the above methods are not needed. Refer to the relevant JDBC documentation for details.
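As one illustration (taking MySQL's JDBC driver as an example; the database name, user and password below are placeholders), the character set can be set through connection properties so that the driver performs the conversion itself:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class GbkConnectionDemo {
    public static Connection open() throws SQLException {
        // useUnicode/characterEncoding ask the MySQL driver to convert between
        // GBK and Unicode itself, making manual conversion code unnecessary;
        // with older JDKs the driver class (e.g. com.mysql.jdbc.Driver) must
        // first be loaded via Class.forName
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=true&characterEncoding=GBK";
        return DriverManager.getConnection(url, "user", "password");
    }
}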
Relevant information
1. Related standards organizations and standards
The international standards organization Unicode (http://www.unicode.org) provides the following conversion tables:
GB and Unicode conversion tables: ftp://ftp.unicode.org/public/mappings/eastasia/gb;
BIG5 and Unicode conversion tables: ftp://ftp.unicode.org/public/mappings/eastasia/other
JIS and Unicode conversion tables: ftp://ftp.unicode.org/public/mappings/eastasia/jis
KSC and Unicode conversion tables: ftp://ftp.unicode.org/public/mappings/eastasia/ksc
Since GBK is not a national standard, Unicode does not provide a GBK-to-Unicode conversion table; only the version from Microsoft's code page is available.