Source:
DW China Site
Published Date:
original:
http://www-900.ibm.com/developerWorks/cn/java/jsp_dbcsz/index.shtml
1. The origin of the problem
Each country (or region) specifies a character set for computer information exchange, such as the extended ASCII code of the United States, China's GB2312-80 and Japan's JIS. These character sets are the foundation of information processing in that country or region and play an important role in unifying encoding. By length, character sets fall into two categories: SBCS (single-byte character set) and DBCS (double-byte character set). Early software (especially operating systems) appeared in various localized versions (L10N) in order to handle local character information, and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining separate localized versions of software is costly. It is therefore necessary to extract what localization work has in common and handle it uniformly, reducing the special localized processing to a minimum. This is what is called internationalization (I18N). Language information is further standardized as locale information, and the underlying character set becomes Unicode, which covers nearly all glyphs.
Most software with internationalization features now does its core character processing in Unicode. At run time, the software determines the local character encoding from the current locale/codepage setting and processes local characters accordingly. Processing requires conversion between Unicode and the local character set, or even conversion between two different local character sets with Unicode as the intermediate form. The same applies in a network environment: character information at either end of a connection must be converted into acceptable content according to the character set setting.
The Java language uses Unicode internally to represent characters and complies with Unicode V2.0. Whether a Java program reads or writes files in the file system as character streams, writes HTML to a URL connection, or reads parameter values from a URL connection, character encoding conversion takes place. This increases programming complexity and is easy to get confused about, but it is in line with the idea of internationalization.
In theory, these conversions driven by character set settings should not cause many problems. In practice, because the actual operating environments of applications differ, because Unicode and the local character sets keep being supplemented and refined, and because system or application implementations are not always standard, transcoding problems frequently trouble programmers and users.
2. GB2312-80, GBK, GB18030-2000 Chinese character set and encoding
In fact, solving Chinese encoding problems in Java programs is often simple, but understanding the reasons behind them and locating the problems requires an understanding of the existing Chinese character encodings and of encoding conversion.
GB2312-80 was developed in the early phase of Chinese character information technology in China. It contains most of the commonly used level-1 and level-2 Chinese characters, plus 9 zones of symbols. It is supported by almost all Chinese systems and internationalized software, and is the most basic Chinese character set. Its encoding range is 0xA1-0xFE for the high byte and 0xA1-0xFE for the low byte; the Chinese characters start at 0xB0A1 and end at 0xF7FE.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters. Its encoding range is 0x8140-0xFEFE, excluding the code positions whose low byte is 0x7F. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. GBK is the default character set of Chinese Windows and some other Chinese operating systems, but not all internationalized software supports it, apparently because its status is not fully understood. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future. GB18030-2000 (GBK2K) extends GBK with more Chinese characters and adds glyphs for some ethnic minority scripts. GBK2K fundamentally solves the problem of too few code positions and missing glyphs. It has several characteristics:
It does not define all glyphs; it only specifies the encoding range, to be filled in later.
The encoding is variable-length. The two-byte part is compatible with GBK; the four-byte part holds the extended glyphs and code positions, with the first byte in 0x81-0xFE, the second byte in 0x30-0x39, the third byte in 0x81-0xFE and the fourth byte in 0x30-0x39.
Its rollout is phased. The first requirement is to implement all glyphs that can be fully mapped to the Unicode 3.0 standard.
It is a national standard and is mandatory.
At the time of writing, no operating system or software implements GBK2K; this is the work of current and future Chinese localization.
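The byte ranges quoted above can be turned into a small classifier. The following is only a sketch: the class and method names are made up for illustration, it looks at byte ranges alone (it does not check whether a code position is actually assigned), and it assumes that 0x7F is not used as a GBK trail byte.

```java
class GbRanges {
    /** GB2312-80 double-byte range: high 0xA1-0xFE, low 0xA1-0xFE. */
    static boolean inGb2312Range(int hi, int lo) {
        return hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE;
    }

    /** GBK range 0x8140-0xFEFE; assumption: trail byte 0x7F is unused. */
    static boolean inGbkRange(int hi, int lo) {
        return hi >= 0x81 && hi <= 0xFE && lo >= 0x40 && lo <= 0xFE && lo != 0x7F;
    }

    /** GB18030 four-byte pattern: 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39. */
    static boolean isGb18030FourByte(int b1, int b2, int b3, int b4) {
        return b1 >= 0x81 && b1 <= 0xFE && b2 >= 0x30 && b2 <= 0x39
            && b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39;
    }

    /** Names the narrowest range that a two-byte code falls into. */
    static String classify(int hi, int lo) {
        if (inGb2312Range(hi, lo)) return "GB2312";
        if (inGbkRange(hi, lo)) return "GBK only";
        return "neither";
    }

    public static void main(String[] args) {
        System.out.println(classify(0xB0, 0xA1)); // first GB2312 Chinese character
        System.out.println(classify(0x81, 0x40)); // first GBK code position
        System.out.println(isGb18030FourByte(0x81, 0x30, 0x81, 0x30));
    }
}
```

Because the second byte of a four-byte code (0x30-0x39) lies outside the two-byte trail range (0x40-0xFE), a decoder can tell the two forms apart after reading two bytes.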
An introduction to Unicode ... is omitted here.
Encodings supported by Java that are relevant to Chinese programming (several are not listed in the JDK documentation):
ASCII: 7-bit, same as ascii7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, Latin1 ...
GB2312-80: same as GB2312, GB2312-1980, EUC_CN, EUCCN, 1381, CP1381, 1383, CP1383, ISO2022CN, ISO2022CN_GB ...
GBK: (note the case), same as MS936
UTF8: UTF-8
GB18030: (currently supported only by IBM JDK1.3.?), same as CP1392, 1392
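Since the set of names and aliases differs between JDK vendors and versions, it can be checked at run time rather than taken from a table. The sketch below uses the standard java.nio.charset.Charset API; the class name EncodingProbe and the list of candidate names are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class EncodingProbe {
    /** Returns those candidate names that this JVM accepts as charset names. */
    static List<String> supported(String... names) {
        List<String> ok = new ArrayList<>();
        for (String name : names) {
            try {
                if (Charset.isSupported(name)) {
                    ok.add(name);
                }
            } catch (IllegalArgumentException e) {
                // The name is syntactically illegal: treat it as unsupported.
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        String[] candidates = {"GB2312", "EUC_CN", "Cp1381", "GBK", "MS936", "UTF-8", "GB18030"};
        for (String name : supported(candidates)) {
            Charset cs = Charset.forName(name);
            // Prints the canonical name and every alias the JVM knows for it.
            System.out.println(name + " -> " + cs.name() + " " + cs.aliases());
        }
    }
}
```

Which of the candidates are reported depends entirely on the JVM in use, so no particular output is implied here.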
The Java language processes characters in Unicode. Seen from another angle, however, a Java program can also use non-Unicode transcoding; what matters is that the Chinese character information at the program's entry and exit is not corrupted. For example, handling Chinese characters entirely as ISO-8859-1 can also produce correct results, and many solutions circulating on the web are of this type. To avoid confusion, this article does not discuss that approach.
3. Chinese characters turning into '?' or garbled text during transcoding
Conversion in either direction can produce wrong results:
Unicode -> byte: if the character does not exist in the target character set, the result is 0x3F.
For example:
The result of "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") is "?ìéF?ù", and the hex value is 3FA8ACA8A6463FA8B4.
Take a closer look: \u00ec is converted to 0xA8AC, \u00e9 is converted to 0xA8A6 ... the number of significant bits has grown! This is because some symbols in the GB2312 symbol area are mapped to common symbol code points. These symbols also appear in ISO-8859-1 or other SBCS character sets, so they sit near the start of Unicode, some with only 8 significant bits, and their values overlap with the byte values of Chinese characters. (This is only a code mapping; on close inspection the glyphs differ. The symbols in Unicode are single-byte width, while those in the Chinese character set are double-byte width.) There are 20 such symbols between \u00a0 and \u00ff. Understanding this is very important! It explains why, in Java programming, wrong results of Chinese encoding often contain garbled characters (which are actually symbol characters) rather than '?' characters, as in the example above.
Byte -> Unicode: if the byte sequence does not exist in the source character set, the result is 0xFFFD.
For example:
byte ba[] = {(byte)0x81, (byte)0x40, (byte)0xb0, (byte)0xa1}; new String(ba, "gb2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value for it, so \ufffd is used. (Note: when this Unicode value is displayed, there is no corresponding local character, so the previous case applies as well and it is shown as "?".)
In actual programming, the wrong Chinese character information that a JSP/Servlet program ends up with is often the superposition of these two processes, and sometimes the result of the two processes being applied repeatedly.
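Both directions can be observed with a few lines of plain Java. This is a minimal sketch that assumes the JVM ships the GBK and GB2312 charsets; the class and method names are illustrative. The exact characters produced when decoding invalid bytes can vary between JDK versions, so the sketch only prints them.

```java
import java.nio.charset.Charset;

class TranscodeDemo {
    /** Upper-case hex dump, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Unicode -> bytes: characters missing from the target set become 0x3F. */
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    /** Bytes -> Unicode: byte sequences unknown to the source set become U+FFFD. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The string from the text: two of its six characters are not in GBK.
        System.out.println(encodeHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9", "GBK"));

        // 0x8140 is valid in GBK but not in GB2312; 0xB0A1 is valid in both.
        byte[] ba = {(byte) 0x81, (byte) 0x40, (byte) 0xB0, (byte) 0xA1};
        String decoded = decode(ba, "GB2312");
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("%04x ", (int) decoded.charAt(i));
        }
        System.out.println();
    }
}
```

Printing code values instead of the characters themselves keeps the console's own encoding from adding a third conversion to the picture.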
4. JSP/Servlet Chinese encoding problems and their solutions in WAS
4.1 Common symptoms of encoding problems
The JSP/Servlet encoding problems often reported on the web generally show up in the browser or on the application side, for example:
Why do the Chinese characters in a JSP/Servlet page seen in the browser all become '?'?
Why do the Chinese characters in a Servlet page seen in the browser become garbled?
Why do the Chinese characters in a Java application's interface become squares?
A JSP/Servlet page cannot display GBK Chinese characters.
Chinese characters embedded in the Java code inside the <%...%> and <%=...%> tags of a JSP page are garbled, but the other Chinese characters in the page are correct.
A JSP/Servlet cannot receive Chinese characters submitted by a form.
A JSP/Servlet cannot read or write the correct content from or to a database.
Behind these problems lie various kinds of wrong character conversion and processing (except the third, which is caused by a wrong Java font setting). To solve such encoding problems, you need to understand how a JSP/Servlet runs and check each point where problems can occur.
4.2 JSP/Servlet web programming encoding problems
A JSP/Servlet running on a Java application server provides HTML content to the browser, as shown below:
[Figure: a JSP/Servlet on the Java application server delivering HTML content to the browser]
Character encoding conversion takes place at the following points:
a. JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it to generate a Java source file, and writes that file back to the file system according to the file.encoding value. If the current system language supports GBK, no encoding problem appears at this point. On an English system, such as Linux, AIX or Solaris with the locale en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting file.encoding to GBK resolves the potential garbling of GBK characters.

b. Java source must be compiled into .class files before it can run in the JVM, and this step has the same file.encoding problem as a. From this point on, servlets and JSPs behave alike, except that servlet compilation is not automatic. For JSP programs, the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). So if a problem appears at this step, also check the encoding and the OS language environment, or convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or keep static text output out of the Java code. For servlets, the -encoding parameter can be specified by hand when compiling with javac.

c. The servlet needs to convert the HTML page content into an encoding the browser accepts and send it out. Depending on the Java application server implementation, some servers query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in some other way, and some do not. Using a fixed encoding is therefore probably the best solution. For Chinese pages, set contentType="text/html; charset=GB2312" in the JSP or servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, test this setting.

Because the high byte of a 16-bit Java char is discarded in network transmission, and in order to make sure that the Chinese characters in a servlet page (both embedded ones and those obtained while the servlet runs) have the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in contentType (so contentType must be specified before that!). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String). For JSP, the Java application server should ensure that embedded Chinese characters are transmitted correctly at this stage.

d. This is the URL character encoding problem. If the parameter values returned from the browser through GET/POST contain Chinese characters, the servlet will not get the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting when parsing parameters; it takes the values byte by byte. This is the encoding problem discussed most often on the web.
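The effect of byte-by-byte parsing can be reproduced outside a servlet container, together with the usual byte-level repair. This is a sketch under stated assumptions: the browser is assumed to submit GBK bytes, turning each raw byte into one char is modeled as an ISO-8859-1 decode, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecovery {
    /** Models a byte-by-byte parser: every raw byte becomes one char. */
    static String parsedByteWise(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Turns the chars back into the original bytes and decodes them properly. */
    static String recover(String mangled, Charset actual) {
        return new String(mangled.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587";         // two Chinese characters
        byte[] sent = original.getBytes(gbk);     // what the browser is assumed to send
        String mangled = parsedByteWise(sent);    // one char per byte
        String fixed = recover(mangled, gbk);
        System.out.println(original.length() + " chars arrive as " + mangled.length());
        System.out.println("recovered: " + fixed.equals(original));
    }
}
```

The repair works because ISO-8859-1 maps every byte value to exactly one char and back, so no information is lost in the mangled string. It fails if any step in between has already replaced bytes with '?'.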
Because this is a design defect, the only remedies are to re-parse the resulting string in binary form, or to hack the HttpUtils class. Reference article 2 describes this, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems arise as soon as GBK characters are encountered. Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, to specify the encoding the application expects before request.getParameter("param_name") is called, which will help solve this problem completely.
4.3 Solutions in IBM WebSphere Application Server
WebSphere Application Server extends the standard Servlet API 2.x and provides better multilingual support. When it runs on a Chinese operating system, it handles Chinese characters well without any settings. The following notes apply only when WAS runs on an English system, or when GBK support is needed.
For cases c and d above, WAS has to query the browser's language setting. By default, zh is mapped to the Java encoding CP1381 (note: CP1381 is only a codepage equivalent to GB2312, with no GBK support). I think this is because it cannot be confirmed whether the browser's operating system supports GB2312 or GBK, so the smaller set is chosen. Real application systems, however, do need GBK characters in their pages; the best-known example is "镕" (rong2, 0xE946, \u9555) in Premier Zhu's name. So sometimes the encoding/charset still has to be specified as GBK. Changing the default encoding in WAS is not that troublesome: for a and b, see reference article 5 and specify -Dfile.encoding=GBK in the command line parameters of the application server; for d, specify -Ddefault.client.encoding=GBK in the command line parameters of the application server. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in c.
For the problem listed above concerning the <%...%> and <%=...%> tags, the solution in WAS is, in addition to setting the correct file.encoding, to set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale setting.
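To confirm which of these settings a JVM has actually picked up, the relevant system properties can be printed from inside the process. This is a generic sketch using only standard JDK calls; the class name is illustrative, default.client.encoding is a WAS setting and is not read here, and how a particular JVM version honors -Dfile.encoding or -Duser.region is an assumption to verify on the target platform.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class JvmEncodingInfo {
    /** One line summarizing the encoding- and locale-related JVM settings. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name()
            + ", user.language=" + System.getProperty("user.language")
            + ", user.region=" + System.getProperty("user.region")
            + ", locale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without the -D options shows whether the options reach the JVM at all, which is a common source of confusion when they are set through a server's configuration tool.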
4.4 Database read and write encoding problems
Another place where encoding problems arise in JSP/Servlet programs is reading and writing data in a database.
Popular relational database systems support a database encoding: a character set can be specified when the database is created, and the data is stored in that encoding. When an application accesses the data, encoding conversion takes place on the way in and on the way out. For Chinese data, the database character encoding should be chosen so that the integrity of the data is guaranteed. GB2312, GBK, UTF-8 and so on are all possible database encodings. ISO8859-1 (8-bit) can also be chosen, but then the application has to split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing, merge the two bytes again after reading, and also tell the SBCS characters apart. This makes no use of the database encoding and increases programming complexity, so ISO8859-1 is not a recommended database encoding. When programming a JSP/Servlet, first use the management tools provided by the database management system to check whether the Chinese data in it is correct. Then pay attention to the encoding of the data that is read out: what the Java program obtains is generally Unicode. The reverse applies when writing data.
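The extra work that an ISO8859-1 database encoding pushes onto the application can be illustrated without a database. The sketch below assumes GBK as the byte encoding and uses hypothetical helper names; a real program would place these calls around its JDBC reads and writes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Column {
    private static final Charset GBK = Charset.forName("GBK");

    /** Before writing: each GBK byte becomes one 8-bit character. */
    static String toStored(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    /** After reading: merge the 8-bit characters into GBK bytes and decode. */
    static String fromStored(String stored) {
        return new String(stored.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String text = "a\u4e2d";                  // one SBCS letter, one Chinese character
        String stored = toStored(text);
        System.out.println(text.length() + " chars are stored as " + stored.length());
        System.out.println("round trip ok: " + fromStored(stored).equals(text));
    }
}
```

The stored form is longer than the text, so column sizes, substring operations and sorting in the database all work on halves of characters, which is part of why this encoding is not recommended.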
4.5 Tips for locating problems
Locating Chinese encoding problems usually relies on the clumsiest but most effective method: print the internal code of the string after each step of the program that you suspect. By printing the internal code of the string, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character becomes two Unicode characters, when a Chinese string turns into a string of question marks, when the high byte of a Chinese string is cut off ...
Using a suitable sample string also helps to tell the problems apart, for example "aa啊aa丂aa", a string that alternates Chinese and English and contains both a GB character and a GBK-only character. In general, English characters are not distorted no matter how they are converted or processed (if they are, try increasing the number of consecutive English letters).
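Printing the internal code takes only a small helper. This is a minimal sketch with illustrative names; the sample string in main mixes ASCII letters with one character that exists in GB2312 (0xB0A1) and one that exists only in GBK (0x8140), and is an assumption chosen for demonstration.

```java
import java.nio.charset.Charset;

class CodeDump {
    /** Renders every char of s as a backslash-u escape with four hex digits. */
    static String unicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Renders the bytes of s in the given charset as space-separated hex. */
    static String hexBytes(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "aa\u554aaa\u4e02aa";
        System.out.println(unicodeEscapes(sample));
        System.out.println("GBK:    " + hexBytes(sample, "GBK"));
        System.out.println("GB2312: " + hexBytes(sample, "GB2312"));
    }
}
```

Calling unicodeEscapes on a string right after each suspect step (after getParameter, after a database read, before writing the response) shows at which step the code values stop being the expected ones.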
5. Conclusion
In fact, JSP/Servlet Chinese encoding is not that complex. There is no fixed routine for locating and solving the problems, and operating environments vary widely, but the underlying principles are the same. Knowledge of character sets is the basis for solving character problems. However, as the Chinese character sets change, not only Java programming but Chinese information processing in general will continue to have problems for some time.
6. Reference articles
Character Problem Review
Java Programming Technology Analysis and Solution
GB18030
Setting Language Encoding in Web Applications: WebSphere Application Server