Many excellent articles and discussions already cover DBCS character encoding problems in JSP/servlets. This article summarizes them and adds the solutions available in IBM WebSphere Application Server 3.5 (WAS). I hope it is not redundant.
1. The origin of the problem
Each country (or region) specifies character encoding sets for computer information exchange, such as ASCII, China's GB2312-80, and Japan's JIS. These serve as the foundation of information processing in that country and play the role of a unified, authoritative encoding. Character encoding sets fall into two categories: SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software (especially operating systems) was localized separately for each character set. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and the cost of maintaining separate localized versions is high. It is therefore necessary to extract what the localization work has in common and handle it consistently, reducing the locale-specific processing to a minimum. This is what is called internationalization (I18N).
Language-specific information is further standardized as locale information, and the underlying character set used for processing becomes Unicode, which is intended to contain all glyphs. The core character handling of most internationalized software is based on Unicode. When the software runs, it determines the local character encoding from the current locale/codepage setting and processes local characters accordingly. Processing requires conversion between Unicode and the local character set, or even conversion between two different local character sets with Unicode as the intermediate form. The same approach extends to a network environment: character data at either end of a network connection must also be converted into an acceptable form according to the character set settings.
Java represents characters internally in Unicode, complying with Unicode V2.0. Whether a Java program reads or writes a file as a character stream through the file system, writes HTML to a URL connection, or reads parameter values from a URL connection, a character encoding conversion takes place.
Although this increases the complexity of programming and is easy to get confused by, it is in line with the idea of internationalization. In theory, conversions driven by character set settings should not cause many problems. In practice, because actual runtime environments differ, because Unicode and the local character sets complement each other and are both still being completed, and because systems and applications are not always implemented to the standards, transcoding problems frequently plague programmers and users.
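The conversion between Unicode and local character sets described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (CharsetRoundTrip, toLocal, toUnicode, transcode) are invented for the example, and the JVM is assumed to provide the GBK charset, which full JDK distributions normally do.

```java
import java.nio.charset.Charset;

class CharsetRoundTrip {

    // Unicode (Java String) -> bytes in a local character set.
    static byte[] toLocal(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Bytes in a local character set -> Unicode (Java String).
    static String toUnicode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Conversion between two local character sets, with Unicode in the middle.
    static byte[] transcode(byte[] bytes, String from, String to) {
        return toLocal(toUnicode(bytes, from), to);
    }

    public static void main(String[] args) {
        // Written with \u escapes so the source file's own encoding does not matter.
        String text = "\u4e2d\u6587";
        byte[] gbk = toLocal(text, "GBK");
        byte[] utf8 = transcode(gbk, "GBK", "UTF-8");
        System.out.println("GBK bytes: " + gbk.length);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("same text after round trip: "
                + toUnicode(utf8, "UTF-8").equals(text));
    }
}
```

With this two-character sample, the GBK form is 4 bytes and the UTF-8 form is 6 bytes, and decoding either one gives back the same Unicode string. Problems start when the charset name used for decoding does not match the one used for encoding.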
2. GB2312-80, GBK, and GB18030-2000 Chinese character sets
In fact, the fix for a Chinese character encoding problem in a Java program is often simple, but understanding the cause behind it and locating the problem requires knowledge of the existing Chinese character encodings and of encoding conversion.
GB2312-80 was developed in the early days of Chinese character information technology in China. It contains the most commonly used level-1 and level-2 Chinese characters, plus 9 rows of symbols. It is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set. Its high byte ranges over 0xA1-0xFE and its low byte also over 0xA1-0xFE; the Chinese characters start at 0xB0A1 and end at 0xF7FE.
GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters in the code range 0x8140-0xFEFE, with the positions whose high byte is 0x80 excluded. All of its characters can be mapped one-to-one to Unicode 2.0, which means that Java effectively provides support for the GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it; my impression is that its authors do not fully understand what GBK is. It is worth noting that GBK is not a national standard, only a specification. With the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.
GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds glyphs for ethnic minority scripts. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several features:
● It does not define all glyphs. It only specifies the code ranges and leaves room for later expansion.
● The encoding is variable-length. Its two-byte part is compatible with GBK. Its four-byte part holds the extended glyphs and code positions, with the first byte in 0x81-0xFE, the second byte in 0x30-0x39, the third byte in 0x81-0xFE, and the fourth byte in 0x30-0x39.
● Its rollout is phased. The first phase requires implementing all glyphs that can be fully mapped to the Unicode 3.0 standard.
● It is a national standard and is mandatory.
No operating system or software implements GBK2K yet. This is the work ahead for Chinese localization, now and for some time to come.
3. JSP/servlet Chinese character encoding problems and solutions in WAS
3.1 Symptoms of common encoding problems
The JSP/servlet encoding problems commonly reported online typically show up in the browser or in the application, for example:
● Why do the Chinese characters in a JSP/servlet page appear as '?' in the browser?
● Why do the Chinese characters in a servlet page appear garbled in the browser?
● Why do the Chinese characters in a Java application's interface appear as squares?
● The JSP/servlet page cannot display GBK Chinese characters.
● The JSP/servlet cannot receive Chinese characters submitted by a form.
● The JSP/servlet cannot read or write correct content in the database.
Behind these symptoms lie various incorrect character conversions and handling (except for the third, which is caused by an incorrect Java font setting). To solve encoding problems like these, you need to know how a JSP/servlet runs and check each point where a problem may occur.
3.2 Encoding problems in JSP/servlet web programming
A JSP/servlet runs on a Java application server and provides HTML content to the browser. The process is shown below:
Character encoding conversion takes place at the following points:
A. JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, converts it to the internal character encoding for JSP compilation, generates a Java source file, and writes it back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem appears at this point. On an English system, such as Linux, AIX, or Solaris with the en_US locale, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding to GBK, which avoids potential garbling of GBK characters.
B. Java compilation. The Java source must be compiled into a .class file to run in the JVM. This step has the same file.encoding problem as A. From this point on, a servlet and a JSP run in much the same way, except that servlet compilation is not automatic.
C. Output to the browser. The servlet must convert the HTML page content into an encoding the browser accepts. Depending on the implementation, some Java application servers query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in some other way, while others do not bother at all. A fixed encoding is therefore probably the best solution. For Chinese web pages, set contentType="text/html; charset=GB2312" in the JSP or servlet. If the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, test this setting before relying on it. Because the high 8 bits of a 16-bit Java char are discarded during network transmission, and to ensure that the Chinese characters in the servlet page (both those embedded in the page and those obtained while the servlet runs) have the expected internal codes, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in contentType (so contentType must be set before this point!). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String). For a JSP, the Java application server should ensure that the embedded Chinese characters are transmitted correctly at this stage.
D. Input from the browser. This is the URL character encoding problem. If the parameter values returned from the browser through GET/POST contain Chinese characters, the servlet cannot get the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting when parsing parameters. It parses the values byte by byte. This is the encoding problem most often discussed online. Because it is a design defect, the only options are to re-parse the resulting string as binary data, or to hack the HttpUtils class. References 2 and 3 describe both approaches, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK. Otherwise there will still be a problem when GBK characters are encountered.
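The conversions at points C and D can be reproduced without a servlet container, which makes them easier to study. The sketch below is an illustration, not the code from references 2 or 3. All names in it (ServletEncodingSketch, writeBody, parseByteWise, recover) are invented, a ByteArrayOutputStream stands in for the servlet output stream, URL decoding is skipped, and byte-by-byte parsing is assumed to behave like ISO-8859-1 decoding, where each byte becomes one char.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ServletEncodingSketch {

    // Point C: a charset-aware writer turns Java chars into bytes of the page charset.
    // The in-memory stream is a stand-in for the real servlet output stream.
    static byte[] writeBody(String html, String charsetName) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(wire, Charset.forName(charsetName))) {
            out.write(html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return wire.toByteArray();
    }

    // Point D, the defect: every submitted byte becomes one char.
    // Assumption: this is equivalent to decoding the bytes as ISO-8859-1.
    static String parseByteWise(byte[] submitted) {
        return new String(submitted, StandardCharsets.ISO_8859_1);
    }

    // Point D, the workaround: get the original bytes back, then decode them
    // with the charset the browser actually used.
    static String recover(String garbled, String charsetName) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "\u9555"; // a character that GBK has and GB2312 lacks
        byte[] sent = writeBody(text, "GBK");
        String garbled = parseByteWise(sent);
        String fixed = recover(garbled, "GBK");
        System.out.println("bytes on the wire: " + sent.length);
        System.out.println("chars after byte-wise parsing: " + garbled.length());
        System.out.println("recovered correctly: " + fixed.equals(text));
    }
}
```

The recovery step works because ISO-8859-1 maps every byte value to exactly one char and back, so no information is lost in the garbled string. In a real servlet, the equivalent expression would be applied to the value returned by getParameter.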
Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, for specifying the encoding you expect before calling request.getParameter("param_name"). This will help solve the problem completely.
WebSphere Application Server extends the standard Servlet API 2.x and provides better multilingual support. In cases C and D above, WAS queries the browser's language setting, which by default is mapped to the Java encoding CP1381 (note: CP1381 is only a codepage equivalent to GB2312, without GBK support). I think this is because the server cannot confirm whether the operating system the browser runs on supports GB2312 or GBK, so it takes the smaller of the two. Real applications nevertheless need GBK characters in their pages. The most famous example is the character "镕" (rong2, 0xE946, \u9555) in Premier Zhu's name. So sometimes you still need to specify the encoding/charset as GBK. Changing the default encoding in WAS is not as troublesome as described above. For A and B, following reference 5, specify -Dfile.encoding=GBK in the command line parameters of the application server. For D, specify -Ddefault.client.encoding=GBK in the command line parameters of the application server. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in case C.
3.3 Encoding problems when reading and writing a database
Another place where encoding problems often occur in JSP/servlet programming is reading and writing data in a database. Popular relational database systems support a database encoding: a character set can be specified when the database is created, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database encoding should be chosen so that the integrity of the data is guaranteed. GB2312, GBK, UTF-8, and so on are all possible database encodings. If ISO8859-1 (an 8-bit SBCS) is chosen, the application must split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing data, and after reading data it must merge the two bytes again while also telling them apart from SBCS characters. This makes no use of the database encoding and only adds programming complexity, so ISO8859-1 is not a recommended database encoding. When programming a JSP/servlet, first use the management tools of the database management system to check whether the Chinese data is correct. Then pay attention to the encoding of the data that is read out: what a Java program obtains is generally Unicode. The reverse applies when writing data.
3.4 Tips for locating problems
The most stupid and most effective way to locate a Chinese encoding problem is to print the internal codes of the string after each processing step you suspect. By printing the internal codes, you can find out when the Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character became two Unicode characters, when the Chinese string turned into a string of question marks, and when the high bits of the Chinese string were cut off. An appropriate sample string also helps to tell the types of problem apart, for example a string that interleaves ASCII letters with characters characteristic of GB2312 and of GBK (such as "aa?aa?aa", where each "?" stands for one such character).
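Printing internal codes as suggested here needs only a small helper. The following is a minimal sketch, with invented names (InternalCodeDump, unicodeHex, bytesHex) and with the assumption that the JVM provides the GBK charset. It prints a string both as Unicode code units and as bytes in a chosen charset, then reproduces two of the symptoms listed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InternalCodeDump {

    // The string as Java sees it: one 4-digit hex value per 16-bit char.
    static String unicodeHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // The string as it would be stored or sent in the given charset.
    static String bytesHex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII letters around U+554A, the first Chinese character of GB2312 (0xB0A1).
        String sample = "a\u554aa";
        System.out.println(unicodeHex(sample));      // 0061 554A 0061
        System.out.println(bytesHex(sample, "GBK")); // 61 B0 A1 61

        // Symptom: one Chinese character became two Unicode characters.
        String split = new String(sample.getBytes(Charset.forName("GBK")),
                StandardCharsets.ISO_8859_1);
        System.out.println(unicodeHex(split));       // 0061 00B0 00A1 0061

        // Symptom: the Chinese character turned into a question mark (0x3F).
        System.out.println(bytesHex(sample, "ISO-8859-1")); // 61 3F 61
    }
}
```

Calling these two helpers before and after each suspect step shows at which step the codes change, which is usually enough to identify the wrong conversion.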
In general, English characters are not distorted no matter how they are converted or handled (if you do run into such a case, try increasing the number of consecutive English letters).
4. Conclusion
In fact, Chinese encoding in JSP/servlets is not as complex as it seems. Although there is no fixed recipe for locating and solving problems, and runtime environments all differ, the underlying principles are the same. Knowledge of character sets is the basis for solving character problems. However, as Chinese character sets continue to change, problems will persist for some time, not only in Java programming but in Chinese information processing in general.
5. References
1) Character Problem Review
2) Analysis and Solution of Chinese Characters in Java Programming Technology
3) NLS Characters in WebSphere: SBCS/DBCS Display on Same Page
4) GB18030
5) Setting Language Encoding in Web Applications: WebSphere Application Server
About the Author
Zhang Jianfang is a software engineer who graduated from Beijing Institute of Technology in computer applications and has many years of Chinese localization experience. You can contact him at jfzhang@usa.net.