Chinese character encoding problems in JSP/Servlet


Author: eclipse    Source: http://www.chinaunix.net    Time: 2002-08-18 10:34:30

1. The origin of the problem

Each country (or region) specifies a character encoding set for computer information interchange, such as extended ASCII in the US, GB2312-80 in China, and JIS in Japan. As the foundation of information processing in that country or region, such a set plays an important role in unifying encodings. Character encoding sets fall into two categories: SBCS (single-byte character set) and DBCS (double-byte character set). Early software (especially operating systems) handled local characters through separate localized versions. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining the localized versions of a piece of software is costly. It is therefore necessary to extract what the localization work has in common and handle it consistently, reducing the locale-specific processing to a minimum. This is what is called internationalization (I18N). Language-specific information is further standardized as locale information, and the underlying character set for processing became Unicode, which contains almost all glyphs.

Most software with internationalization features now processes characters in Unicode at its core. When the software runs, it determines the local character encoding from the current locale / code page setting and processes local characters accordingly. Processing therefore requires conversion between Unicode and the local character set, or even conversion between two different local character sets with Unicode as the intermediate. The same approach extends to a network environment: character information at either end of a network connection must also be converted into acceptable content according to the character set settings.

The Java language represents characters internally in Unicode and complies with Unicode V2.0. Whether a Java program reads or writes files on the file system as character streams, writes HTML to a URL connection, or reads parameter values from a URL connection, character encoding conversion takes place.
Although this increases the complexity of programming and is easy to get confused by, it is in line with the idea of internationalization. In theory, these conversions driven by character set settings should not cause many problems. In practice, because applications run in different environments, because local character sets keep being supplemented and revised, and because systems and applications are implemented irregularly, transcoding problems occur from time to time and trouble both programmers and users.

2. The GB2312-80, GBK, and GB18030-2000 Chinese character sets and encodings

Solving Chinese character problems in Java programs is often simple, but understanding the reasons behind them and locating a problem requires knowledge of the existing Chinese character encodings and of code conversion.

GB2312-80 was developed in the early phase of Chinese character information technology in China. It contains most of the commonly used level-1 and level-2 Chinese characters, plus symbols in 9 rows. It is supported by almost all Chinese systems and internationalized software, and is the most basic Chinese character set. Its encoding range is 0xA1-0xFE for the high byte and 0xA1-0xFE for the low byte; Chinese characters start at 0xB0A1 and end at 0xF7FE.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters, and its encoding range is 0x8140-0xFEFE, excluding code positions with a high byte of 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set.
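As a rough illustration of the byte ranges given above, the following Java sketch classifies a two-byte code as falling inside the GB2312-80 range, its Chinese character area, or the GBK range. The class and method names are made up for this example, and the exclusion of low byte 0x7F in GBK comes from the GBK specification rather than from this article.

```java
final class GbRanges {
    /** GB2312-80 double-byte range: high byte 0xA1-0xFE, low byte 0xA1-0xFE. */
    static boolean inGb2312(int hi, int lo) {
        return hi >= 0xA1 && hi <= 0xFE && lo >= 0xA1 && lo <= 0xFE;
    }

    /** GB2312-80 Chinese character area: 0xB0A1 through 0xF7FE. */
    static boolean inGb2312Hanzi(int hi, int lo) {
        return hi >= 0xB0 && hi <= 0xF7 && lo >= 0xA1 && lo <= 0xFE;
    }

    /** GBK range 0x8140-0xFEFE; low byte 0x7F is unused (per the GBK spec, not this article). */
    static boolean inGbk(int hi, int lo) {
        return hi >= 0x81 && hi <= 0xFE && lo >= 0x40 && lo <= 0xFE && lo != 0x7F;
    }

    public static void main(String[] args) {
        // Sample codes: first GB2312 hanzi, first GBK code, a GB2312 symbol-area code.
        int[][] samples = { {0xB0, 0xA1}, {0x81, 0x40}, {0xA8, 0xAC} };
        for (int[] s : samples) {
            System.out.printf("0x%02X%02X gb2312=%b hanzi=%b gbk=%b%n",
                    s[0], s[1],
                    inGb2312(s[0], s[1]), inGb2312Hanzi(s[0], s[1]), inGbk(s[0], s[1]));
        }
    }
}
```

A check like this only tests the numeric range; it does not tell whether a code position is actually assigned a character.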

GBK is the default character set of Windows and some other Chinese operating systems, but not all internationalized software supports it, and it seems that some vendors do not fully understand what GBK is. Note that GBK is not a national standard, only a specification. With the release of the national standard GB18030-2000, it will complete its historical mission in the near future.

GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds glyphs for the scripts of ethnic minorities. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several characteristics:

It does not define all glyphs; it only specifies the encoding range, to be extended later.
The encoding is variable length. Its two-byte part is compatible with GBK. The four-byte part holds the extended glyphs and code positions, with a first byte of 0x81-0xFE, a second byte of 0x30-0x39, a third byte of 0x81-0xFE, and a fourth byte of 0x30-0x39.
Its rollout is phased. The first requirement is to implement all glyphs that can be fully mapped to the Unicode 3.0 standard.
It is a national standard and is mandatory.

No operating system or software implements GBK2K support yet; this is the work of Chinese localization now and in the future.

An introduction to Unicode is omitted here.

Encodings supported by Java that are relevant to Chinese programming (several are not listed in the JDK documentation):

ASCII: 7-bit, same as ASCII7
ISO8859-1: 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, Latin1 ...
GB2312-80: same as GB2312, GB2312-1980, EUC_CN, EUCCN, 1381, CP1381, 1383, CP1383, ISO2022CN, ISO2022CN_GB ...
GBK (mind the case): same as MS936
UTF8: UTF-8
GB18030 (currently supported only by IBM JDK1.3.?): same as CP1392, 1392

The Java language processes characters in Unicode. From another perspective, however, non-Unicode transcoding can also be used in Java programs; what matters is that the Chinese character information is not distorted at the program's entry and exit points.
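Which of the encoding names listed above a particular JVM accepts, and which canonical name each one maps to, can be checked at run time. The sketch below uses java.nio.charset.Charset, which arrived in JDK 1.4 and is therefore newer than the JDKs discussed in this article; the class name and the list of names tried are illustrative only, and the output depends on the JVM.

```java
import java.nio.charset.Charset;

final class CharsetNames {
    /** Returns the canonical charset name for an alias, or null if this JVM does not support it. */
    static String canonicalName(String alias) {
        try {
            return Charset.isSupported(alias) ? Charset.forName(alias).name() : null;
        } catch (IllegalArgumentException e) {
            return null; // the alias is not a syntactically legal charset name
        }
    }

    public static void main(String[] args) {
        // A few of the names from the list above; results vary by JVM.
        String[] names = {"ASCII", "8859_1", "GB2312", "EUC_CN", "GBK", "MS936", "UTF8", "GB18030"};
        for (String name : names) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```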
Processing Chinese characters as ISO-8859-1 throughout is one such non-Unicode approach that can produce correct results, and many of the solutions circulating online are of this type. To avoid confusion, this article does not discuss that approach.

3. The origin of '?' and garbled characters after Chinese transcoding

Two conversions can produce wrong results.

Unicode -> Byte: if the target code set has no corresponding character, the result is 0x3F. For example, "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") yields "?ìéF?ù", with the hex value 3FA8ACA8A6463FA8B4. Look closely and you will find that \u00ec is converted to 0xA8AC, \u00e9 is converted to 0xA8A6, and so on: the number of significant bits actually grows! This is because some symbols in the GB2312 symbol area are mapped to common symbol code points. These symbols also appear in ISO-8859-1 or other SBCS character sets, so their Unicode code points come early, some with only 8 significant bits, and they overlap with the encoding of Chinese characters. (This is only a mapping of codes; the symbols do not look the same when displayed. The Unicode symbols are single-byte width, while the symbols in the Chinese character set are double-byte width.) There are 20 such symbols between Unicode \u00a0 and \u00ff.
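The Unicode -> Byte case can be reproduced with a few lines of Java. This is a minimal sketch, assuming a JVM that ships the GBK charset; the class and helper names are invented for the example.

```java
import java.nio.charset.Charset;

final class EncodeToGbk {
    /** Formats bytes as upper-case hex, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text in the named charset; unmappable characters become the charset's replacement byte. */
    static String encodeHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // Input from the article: U+00D6 U+00EC U+00E9 U+0046 U+00BB U+00F9.
        // The article reports 3FA8ACA8A6463FA8B4 for this input.
        System.out.println(encodeHex("\u00d6\u00ec\u00e9\u0046\u00bb\u00f9", "GBK"));
    }
}
```

Every 3F in the output marks a character that GBK could not represent, while the A8xx pairs are the symbol-area codes the text describes.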

Understanding this feature is very important! It explains why, in Java programming, the wrong results of Chinese character encoding often contain garbled characters (actually symbol characters) rather than '?' characters, as in the example above.

Byte -> Unicode: if the byte sequence does not exist in the source code set, the result is 0xFFFD. For example, given byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xB0, (byte) 0xA1}, new String(ba, "GB2312") yields "?啊", with the hex value "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value, so \ufffd is used. (Note: when this Unicode value is displayed, there is no corresponding local character, so the previous case applies as well and it is displayed as "?".)

In actual programming, the wrong Chinese character information that a JSP/Servlet program gets is often the superposition of these two processes, and sometimes the two are superimposed repeatedly.

4. JSP/Servlet Chinese character encoding problems and solutions in WAS

4.1 Common encoding problems

JSP/Servlet encoding problems on the web usually show up in the browser or on the application side, for example:

Why do the Chinese characters in a JSP/Servlet page seen in the browser become '?'?
Why do the Chinese characters in a Servlet page seen in the browser become garbled?
Why do the Chinese characters in a Java application's interface become squares?
A JSP/Servlet page cannot display GBK Chinese characters.
Chinese text in the Java code embedded in tags in a JSP page is garbled, but the other Chinese characters on the page are correct.
A JSP/Servlet cannot receive the Chinese characters submitted by a form.
JSP/Servlet database reads and writes do not get the correct content.

Hidden behind these problems are various incorrect character conversions and processing steps (except for the third one, which is caused by wrong Java font settings).
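When diagnosing problems like those listed above, it can help to dump the code points of a suspect string and look for U+FFFD (a failed Byte -> Unicode step) or U+003F (a failed Unicode -> Byte step). The sketch below decodes the byte array from section 3 as GB2312 and as GBK. The names are illustrative, and the exact number of replacement characters produced for the invalid bytes may vary between JDK versions.

```java
import java.nio.charset.Charset;

final class DecodeCheck {
    /** The byte array from section 3: GBK character 0x8140 followed by GB2312 character 0xB0A1. */
    static final byte[] SAMPLE = {(byte) 0x81, (byte) 0x40, (byte) 0xB0, (byte) 0xA1};

    /** Decodes bytes in the named charset; invalid input is replaced with U+FFFD. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Lists each UTF-16 code unit of s as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("GB2312: " + codePoints(decode(SAMPLE, "GB2312")));
        System.out.println("GBK:    " + codePoints(decode(SAMPLE, "GBK")));
    }
}
```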
Solving character encoding problems of this kind requires understanding how a JSP/Servlet runs and checking each point where a problem may arise.

4.2 Encoding in JSP/Servlet web programming

A JSP/Servlet running on a Java application server provides HTML content to the browser. Character encoding conversion takes place at the following points.

The first is JSP compilation. The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it into a Java source file, and writes that file back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem appears at this stage. On an English system, such as Linux, AIX, or Solaris with the language set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting file.encoding to GBK resolves potential garbling of GBK characters.

The second is compiling the Java source into a .class file so that it can execute in the JVM. This step has the same file.encoding problem. From here on servlets and JSPs run in much the same way, except that servlet compilation is not automatic.
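The effect of reading a source file with the wrong encoding can be simulated without an application server. The sketch below, with hypothetical names, writes GBK bytes to a temporary file and reads them back with a matching and a mismatching charset; it also prints the JVM's current file.encoding value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class SourceEncodingDemo {
    /** Writes text to a temp file in writeCharset, then reads it back as readCharset. */
    static String roundTrip(String text, String writeCharset, String readCharset) {
        try {
            Path tmp = Files.createTempFile("jsp-source", ".txt");
            try {
                Files.write(tmp, text.getBytes(Charset.forName(writeCharset)));
                return new String(Files.readAllBytes(tmp), Charset.forName(readCharset));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two characters of the word "Chinese"
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("matching charset intact:    "
                + text.equals(roundTrip(text, "GBK", "GBK")));
        System.out.println("mismatching charset intact: "
                + text.equals(roundTrip(text, "GBK", "ISO-8859-1")));
    }
}
```

On JVMs of this era, file.encoding is typically supplied at startup, for example as -Dfile.encoding=GBK; how an application server passes JVM options is server-specific.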

For JSP programs, the generated Java intermediate file is compiled automatically (the server calls sun.tools.javac.Main directly). If a problem occurs at this step, check the encoding and the OS language environment as well. Alternatively, convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or keep static text output out of the Java code. For servlets, specify the -encoding parameter by hand when compiling with javac.

The third point is output: the servlet needs to convert the HTML page content into an encoding the browser accepts. Depending on the implementation of each Java application server, some query the browser's Accept-Charset and Accept-Language parameters or determine the encoding by some other guess, and others do not bother at all. A fixed encoding is therefore perhaps the best solution. For Chinese web pages, set contentType="text/html; charset=GB2312" in the JSP or servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, test the result after making this setting.

Because the high 8 bits of the 16-bit Java char are discarded during network transmission, and to make sure that the Chinese characters in a servlet page (both those embedded in it and those produced at run time) arrive in the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in contentType (so contentType must be set beforehand!). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String). For JSP, the Java application server should ensure that embedded Chinese characters are transmitted correctly at this stage.

The remaining conversion point is the interpretation of the character encoding of URLs.
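The difference between a charset-aware writer and a byte-oriented write on the output side can be seen outside a servlet container. In this sketch a ByteArrayOutputStream stands in for ServletOutputStream (an assumption made so that the example runs with the JDK alone), and writeLowBytesOnly is a hypothetical helper that mimics discarding the high 8 bits of each char.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class ResponseEncodingDemo {
    /** Encodes text through an OutputStreamWriter and returns the bytes that would be sent. */
    static byte[] writeWithCharset(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the response stream
        try (Writer out = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // the writer is closed and flushed by this point
    }

    /** Mimics a byte-oriented write that keeps only the low 8 bits of each char. */
    static byte[] writeLowBytesOnly(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u554a"; // the character encoded as 0xB0A1 in GB2312
        System.out.println("writer bytes:   " + writeWithCharset(text, "GBK").length);
        System.out.println("low-byte bytes: " + writeLowBytesOnly(text).length);
    }
}
```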
If the parameter values returned from the browser through GET/POST contain Chinese characters, the servlet will not get the correct values. In Sun's J2SDK, HttpUtils.parseName does not take the browser's language settings into account when parsing parameters; it parses the values byte by byte. This is the encoding problem most discussed online. Because it is a design defect, the only remedies are to re-parse the resulting string at the byte level, or to hack the HttpUtils class. Reference 2 describes this, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems will occur when GBK Chinese characters are encountered. Servlet API 2.3 provides a new function, HttpServletRequest.setCharacterEncoding, which specifies the desired encoding before request.getParameter("param_name") is called. This will help solve the problem completely.

As for the remark that "Servlet API 2.3 provides a new function HttpServletRequest.setCharacterEncoding": I tried it on Tomcat 4.0.1 and it is very easy to use.
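For containers without Servlet API 2.3, the byte-level repair mentioned above is commonly written as a re-decoding step. The sketch below assumes that the container turned each request byte into one char (equivalent to decoding as ISO-8859-1) and that the browser actually sent GBK; both are assumptions, and the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ParamRepair {
    /** Simulates a container that decodes request bytes one char per byte (ISO-8859-1). */
    static String misdecode(String original, Charset actual) {
        return new String(original.getBytes(actual), StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes from the mis-decoded value and decodes them correctly. */
    static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String sent = "\u4e2d\u6587";            // what the browser submitted
        String received = misdecode(sent, gbk);   // what getParameter would hand back
        System.out.println("received length: " + received.length());
        System.out.println("repaired intact: " + sent.equals(repair(received, gbk)));
    }
}
```

The repair only works when the first decoding lost no information, which holds for ISO-8859-1 because it maps every byte value to exactly one char.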

The setCharacterEncoding approach is applied by configuring a filter that sets the encoding on every request. The filter is as follows:

[code]
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.UnavailableException;

/**
 * <p>Title: Chinese problems</p>
 * <p>Description: Chinese problem</p>
 * <p>Company: </p>
 * @author Writeonce
 * @version 1.0
 */
public class EncodingFilter implements Filter {
    // Encoding name read from the "encoding" init parameter in web.xml
    protected String encoding = null;
    protected FilterConfig filterConfig = null;

    public void destroy() {
        this.encoding = null;
        this.filterConfig = null;
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Select and set (if needed) the character encoding to be used
        String encoding = selectEncoding(request);
        if (encoding != null) {
            request.setCharacterEncoding(encoding);
        }
        // Pass control on to the next filter
        chain.doFilter(request, response);
    }

    public void init(FilterConfig filterConfig) throws ServletException {
        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
    }

    protected String selectEncoding(ServletRequest request) {
        return (this.encoding);
    }
}
[/code]

At the same time, add the following configuration to web.xml:

[code]
<filter>
    <filter-name>Set Character Encoding</filter-name>
    <filter-class>EncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>GBK</param-value>
    </init-param>
</filter>
[/code]
