1. The origin of the problem

Each country or region specifies character sets for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. These serve as the basis for information processing in that country or region and play an important unifying role. Character sets fall into two categories: SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software, operating systems in particular, handled local character information by shipping separate localized (L10N) versions, and concepts such as LANG and codepage were introduced to tell them apart. Because the code ranges of the local character sets overlap, however, exchanging information between them is difficult, and maintaining each localized version of a piece of software is costly. It is therefore necessary to extract what the localization work has in common and handle it uniformly, reducing the locale-specific processing to a minimum. This is what is called internationalization (I18N). The language-specific information is formalized as locale information, and the underlying character set becomes Unicode, which is meant to contain all glyphs.

Most software with internationalization features bases its core character handling on Unicode. At run time the software determines the local character encoding from the current locale/codepage setting and processes local characters accordingly. This requires converting between Unicode and the local character set during processing, or even converting between two different local character sets with Unicode as the intermediate form. The same approach extends to a network environment: character information at either end of a connection must be converted into an acceptable form according to the character set settings.

The Java language uses Unicode internally to represent characters and complies with Unicode V2.0.
Whether a Java program reads or writes characters through the file system or through a URL connection, the data is converted between the external encoding and Unicode on the way in and on the way out. This adds programming complexity and is easy to get confused about, but it is in line with the idea of internationalization.

In theory, these conversions driven by character set settings should not cause many problems. In practice, because applications run in different environments, because local character sets keep being supplemented and refined, and because systems and applications are not always implemented according to the standards, transcoding problems arise from time to time and trouble programmers and users alike.

2. GB2312-80, GBK, and GB18030-2000: Chinese character sets and encodings

The fix for a Chinese character problem in a Java program is often simple, but understanding the cause and locating the problem requires knowledge of the existing Chinese character encodings and the conversions between them.

GB2312-80 was developed in the early days of Chinese character information technology in China. It contains the most commonly used level-1 and level-2 Chinese characters plus 9 rows of symbols. Almost all Chinese systems and internationalized software support it, and it is the most basic Chinese character set. Its high byte ranges over 0xA1-0xFE and its low byte also over 0xA1-0xFE; the Chinese characters start at 0xB0A1 and end at 0xF7FE.

GBK is an extension of GB2312-80 and is upward compatible with it. It contains 20902 Chinese characters in the code range 0x8140-0xFEFE, excluding the code positions with a high byte of 0x80. All of its characters map one-to-one to Unicode 2.0, which means that Java in effect supports the GBK character set. GBK is the default character set of Chinese Windows and some other Chinese operating systems, but not all internationalized software supports it.
One gets the impression that the authors of such software do not yet fully understand what GBK is.
It is worth noting that GBK is not a national standard but only a specification. With the release of the GB18030-2000 national standard, it will complete its historical mission in the near future.

GB18030-2000 (GBK2K) further extends the set of Chinese characters on the basis of GBK and adds glyphs for ethnic minority scripts. GBK2K fundamentally solves the problem of insufficient code positions and missing glyphs. It has several characteristics. It does not define all glyphs; it only specifies the code ranges, leaving room for later expansion. Its encoding is variable-length: the two-byte part is compatible with GBK, while the four-byte part holds the extended glyphs and code positions, with the first byte in 0x81-0xFE, the second byte in 0x30-0x39, the third byte in 0x81-0xFE, and the fourth byte in 0x30-0x39. It is being promoted in phases, and the first phase requires implementing all glyphs that can be fully mapped to the Unicode 3.0 standard. It is a national standard and is mandatory.

No operating system or software implements GBK2K yet; this is the task of current and future Chinese localization work.

An introduction to Unicode is omitted here.

Encodings supported by Java that are relevant to Chinese programming (several are not listed in the JDK documentation):

ASCII        7-bit; same as ASCII7
ISO8859-1    8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, Latin1 ...
GB2312-80    same as GB2312, GB2312-1980, EUC_CN, EUCCN, 1381, CP1381, 1383, CP1383, ISO2022CN, ISO2022CN_GB ...
GBK          (mind the case); same as MS936
UTF8         UTF-8
GB18030      (currently supported only by IBM JDK1.3.?); same as CP1392, 1392

The Java language processes characters as Unicode. From another perspective, however, non-Unicode transcoding can also be used in Java programs; what matters is that the Chinese character information is not distorted at the program's entry and exit points. For example, handling Chinese characters entirely as ISO-8859-1 can also produce correct results. Many of the solutions circulating on the net are of this type. To avoid confusion, this article does not discuss that approach.

3. How Chinese characters turn into '?', and where garbled text comes from

The following conversions can produce wrong results.

Unicode -> byte: if the character does not exist in the target code set, the result is 0x3F. For example, "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9".getBytes("GBK") yields "?ìéF?ù", whose hex value is 3FA8ACA8A6463FA8B4.

Look closely at this result: \u00ec is converted to 0xA8AC, \u00e9 is converted to 0xA8A6, and so on. The number of significant bits has actually grown! This is because some symbols in the GB2312 symbol area are mapped to common symbol code points. Since these symbols also appear in ISO-8859-1 or other SBCS character sets, they sit near the front of Unicode, and some have only 8 significant bits, which overlaps with the encoding of Chinese characters. (In fact this is only a mapping between codes; the displayed glyphs are not the same. The symbol in Unicode is single-byte width, while the symbol in the Chinese character set is double-byte width.) There are 20 such symbols between Unicode \u00a0 and \u00ff. Understanding this is very important! It explains why, in Java programming, the wrong results of Chinese character encoding often contain garbled characters (actually symbol characters) rather than '?', as in the example above.
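The Unicode -> byte case can be reproduced in a few lines. The following is a minimal sketch, assuming a JDK in which the GBK charset is available; the class name UnmappableEncodeDemo and its helper methods are made up for illustration. It uses String.getBytes(Charset), which substitutes the charset's replacement byte for characters it cannot encode.

```java
import java.nio.charset.Charset;

class UnmappableEncodeDemo {
    /** Render a byte array as upper-case hex without separators. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encode the string in the named charset and return the bytes as hex. */
    static String encodeHex(String s, String charsetName) {
        return toHex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // U+00D6 and U+00BB have no GBK code and come out as 3F ('?').
        // U+00EC, U+00E9 and U+00F9 map into the two-byte symbol area.
        String s = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        System.out.println(encodeHex(s, "GBK")); // prints 3FA8ACA8A6463FA8B4
    }
}
```

Of the six input characters, only two become 0x3F; the three accented letters silently turn into two-byte symbols, which is why a damaged string often shows odd symbols instead of question marks.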
Byte -> Unicode: if the byte sequence does not exist in the source code set, the result is 0xFFFD. For example, byte ba[] = {(byte)0x81, (byte)0x40, (byte)0xB0, (byte)0xA1}; new String(ba, "GB2312"); yields "?啊", whose hex value is "\ufffd\u554a". 0x8140 is a GBK character with no corresponding value in the GB2312 conversion table, so \ufffd is used. (Note: when this Unicode character is displayed there is no corresponding local character, so the previous case applies and it is shown as "?".)

In actual programming, the wrong Chinese character information a JSP/servlet program ends up with is often the superposition of these two processes, and sometimes even the result of the two processes being applied repeatedly.

4. JSP/servlet Chinese character encoding problems and their solutions in WAS

4.1 Common encoding problems

Encoding problems in web-based JSP/servlet applications generally show up in the browser or on the application side, for example:

Why do Chinese characters in a JSP/servlet page appear as '?' in the browser?
Why do Chinese characters in a servlet page appear garbled in the browser?
Why do Chinese characters in a Java application's interface appear as squares?
The JSP/servlet page cannot display GBK Chinese characters.
Chinese characters in Java code embedded in tags such as <% ... %> and <%= ... %> in a JSP page are garbled, while the other Chinese characters on the page are correct.
JSP/servlets cannot receive Chinese characters submitted by a form.
JSP/servlet database reads and writes do not return the correct content.

Behind these problems lie various incorrect character conversions and processing steps (except the third, which is caused by a wrong Java font setting). To solve such encoding problems, you need to understand how a JSP/servlet runs and check every point where a problem may occur.

4.2 Encoding issues in JSP/servlet web programming

JSP/servlets run in a Java application server and provide HTML content to the browser; the process is shown in the figure below. Character encoding conversion takes place at the following points:

A. JSP compilation.
The Java application server reads the JSP source file according to the JVM's file.encoding value, compiles it into a Java source file, and writes that file back to the file system, again according to file.encoding. If the current system language supports GBK, no encoding problem arises at this point. On an English system, such as Linux, AIX, or Solaris with LANG set to en_US, set the JVM's file.encoding value to GBK. If the system language is GB2312, decide as needed whether to set file.encoding; setting it to GBK resolves potential garbling of GBK characters.

B. Java source must be compiled into .class files before it can execute in the JVM, and this step has the same file.encoding problem as A. From this point on, servlets and JSPs run in much the same way, except that servlet compilation is not automatic. For JSP programs, the generated intermediate Java file is compiled automatically (the program calls the sun.tools.javac.Main class directly). So if a problem occurs at this step, check the encoding and the OS language environment here as well, or convert the static Chinese characters embedded in the JSP's Java code to Unicode escapes, or keep static text output out of the Java code.
For servlets, it is enough to specify the -encoding parameter by hand when compiling with javac.

C. The servlet must convert the HTML page content into an encoding the browser accepts before sending it. Depending on the implementation of each Java application server, some query the browser's Accept-Charset and Accept-Language parameters or guess the encoding in some other way, while others do nothing at all. Using a fixed encoding is therefore perhaps the best solution. For Chinese web pages, set contentType="text/html; charset=GB2312" in the JSP or servlet; if the page contains GBK characters, set contentType="text/html; charset=GBK". Because IE and Netscape differ in their support for GBK, test this setting before relying on it.

Because the high 8 bits of a 16-bit Java char are discarded during network transmission, and to make sure the Chinese characters in a servlet page (both the embedded ones and those obtained while the servlet runs) are in the expected internal code, use PrintWriter out = res.getWriter() instead of ServletOutputStream out = res.getOutputStream(). The PrintWriter converts according to the charset specified in contentType (so contentType must be set before this!). Alternatively, wrap the ServletOutputStream in an OutputStreamWriter and output Chinese strings with write(String).

For JSPs, the Java application server should ensure that the embedded Chinese characters are transmitted correctly at this stage.

D. This is the URL character encoding problem. If the parameter values returned from the browser through GET/POST contain Chinese characters, the servlet will not get the correct values. In Sun's J2SDK, HttpUtils.parseName does not consider the browser's language setting when parsing the parameters; it parses the values byte by byte. This is the most discussed encoding problem on the net.
Because this is a design defect, the only remedies are to re-parse the resulting string at the byte level or to hack the HttpUtils class. Reference article 2 describes this, but it is best to change the Chinese encodings GB2312 and CP1381 used there to GBK; otherwise problems remain when GBK Chinese characters are encountered.

Servlet API 2.3 provides a new method, HttpServletRequest.setCharacterEncoding, for specifying the encoding the application expects before request.getParameter("param_name") is called, which will help solve this problem completely.

4.3 Solutions in IBM WebSphere Application Server

WebSphere Application Server extends the standard Servlet API 2.x and provides better multilingual support. On a Chinese operating system it handles Chinese characters well without any configuration. The following notes apply only when WAS runs on an English system or when GBK support is needed.
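Returning briefly to item D of section 4.2, the byte-level re-parsing mentioned there can be sketched as follows. This is an illustration under stated assumptions, not the code from the referenced articles: the class ParamRedecode and the method redecode are hypothetical names, and the container's byte-by-byte parsing is simulated by decoding the raw bytes as ISO-8859-1, which maps every byte to the char with the same value so that no information is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRedecode {
    /** Recover text from a string in which every char holds one raw byte. */
    static String redecode(String byteWise, String charsetName) {
        byte[] raw = byteWise.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Simulate a parameter parsed byte by byte: the browser sent the GBK
        // bytes B0 A1 (the character U+554A), but each byte became one char.
        byte[] sent = {(byte) 0xB0, (byte) 0xA1};
        String byteWise = new String(sent, StandardCharsets.ISO_8859_1);

        String fixed = redecode(byteWise, "GBK");
        System.out.println(fixed.equals("\u554a")); // prints true
    }
}
```

Passing "GBK" rather than "GB2312" follows the advice above: a GB2312 decoder would fail on GBK-only characters. On a container that implements Servlet API 2.3, calling setCharacterEncoding before the first getParameter call makes this workaround unnecessary.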
For cases C and D above, WAS first queries the browser's language setting. By default, zh is mapped to the encoding CP1381 (note: CP1381 is only a codepage equivalent to GB2312 and has no GBK support). I think this is done because it cannot be confirmed whether the operating system the browser runs on supports GB2312 or GBK, so the smaller of the two is chosen. But real application systems do require GBK Chinese characters in their pages; the most famous is the "镕" (rong2, 0xE946, \u9555) in Premier Zhu's name, so sometimes the encoding/charset still has to be specified as GBK. Changing the default encoding in WAS is, of course, not as troublesome as described above. For A and B, following reference article 5, specify -Dfile.encoding=GBK in the command-line parameters of the application server; for D, specify -Ddefault.client.encoding=GBK in the command-line parameters of the application server. If -Ddefault.client.encoding=GBK is specified, the charset no longer needs to be specified in case C.

The problems listed above also include garbled Chinese inside the tags <% ... %> and <%= ... %>. The solution in WAS is, in addition to setting the correct file.encoding, to set -Duser.language=zh -Duser.region=CN in the same way. This is related to the Java locale setting.

4.4 Encoding issues in database reads and writes

Another place where encoding problems frequently appear in JSP/servlet programming is reading and writing data in a database.

The popular relational database systems all support a database encoding: when a database is created, its own character set can be specified, and the data is stored in that encoding. When an application accesses the data, an encoding conversion takes place on the way in and on the way out. For Chinese data, the database character encoding should be chosen so that data integrity is preserved. GB2312, GBK, and UTF-8 are all viable database encodings. ISO8859-1 (8-bit) can also be chosen, but then the application must split each 16-bit Chinese character or Unicode character into two 8-bit characters before writing the data.
After reading the data, it must merge the two bytes again and also tell SBCS characters apart. This makes no use of the database encoding and only adds programming complexity, so ISO8859-1 is not a recommended database encoding. When programming JSP/servlets, you can first use the administration tools provided by the database management system to check whether the Chinese data stored in it is correct.

Next, pay attention to the encoding of the data that is read: what a Java program obtains is generally Unicode. Writing data works the other way around.

4.5 Tips for locating problems

The usual way to locate a Chinese encoding problem is the dumbest and also the most effective one: print the internal code of the string after each step you suspect. By printing the internal code of the string, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character becomes two Unicode characters, when a Chinese string turns into a string of question marks, when the high bits of a Chinese string are cut off, and so on.

Choosing a suitable sample string also helps to tell the types of problem apart, for example a string such as "aa啊aa丂aa" that mixes Chinese and English and contains characters characteristic of both GB and GBK.
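The "print the internal code" technique can be sketched with a small helper. This is only one possible shape for it: the class EncodingProbe and its method names are invented for this example, and the sample string is a shortened variant containing an ASCII letter, a GB2312 character (U+554A), and a GBK-only character (U+4E02, GBK code 0x8140).

```java
import java.nio.charset.Charset;

class EncodingProbe {
    /** Code units of the string as it exists inside the JVM. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Bytes the string turns into in the named charset, as upper-case hex. */
    static String bytesHex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "a\u554a\u4e02";
        System.out.println(codeUnits(sample));          // U+0061 U+554A U+4E02
        System.out.println(bytesHex(sample, "GBK"));    // 61B0A18140
        System.out.println(bytesHex(sample, "GB2312")); // 61B0A13F
    }
}
```

Calling codeUnits after each suspect step shows whether a Chinese character is still one code unit or has been split into two; comparing bytesHex for GBK and GB2312 shows at once whether a character is being lost to 3F because the narrower character set is in use.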