Java Chinese Problems in Detail
Author: yuking

Preliminaries:

1. Bytes and Unicode

The Java core is Unicode, and so are class files, but many media, including files and streams, store data as byte streams, so Java has to convert those byte streams. A char is Unicode; a byte is a raw byte. The functions that convert between byte and char live in the sun.io package. The ByteToCharConverter class is the dispatcher, and you can use it to find out which converter you are using. Two of its commonly used static functions are:

    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);

If you do not specify a converter, the system automatically uses the current platform encoding: GBK on a GB (Chinese) platform, 8859_1 on an EN (English) platform.

Let's take a simple example: the GB code of "你" ("you") is 0xC4E3 and its Unicode value is 0x4F60. You use:
CODE:
    String encoding = "GB2312";
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++) {
        System.out.println(Integer.toHexString(c[i]));
    }

This prints 4f60, that is, 0x4F60. If you use the 8859_1 encoding instead, it prints 0x00C4, 0x00E3.
(Example 1)

Conversely:

CODE:
    String encoding = "GB2312";
    char[] c = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++) {
        // mask to 0..255 so negative bytes do not print as ffffffc4
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }

This prints 0xC4, 0xE3.
(Example 2)

If you use 8859_1 the result is 0x3F, the "?" character, which means the character cannot be converted.

Many Chinese problems derive from these two very simple classes. Yet many classes do not accept an encoding argument directly, which is inconvenient. Many programs rarely pass an encoding at all and simply use the default encoding, which makes porting much harder.

2. UTF-8

UTF-8 maps one-to-one onto Unicode, and its implementation is simple:

     7-bit Unicode: 0 _ _ _ _ _ _ _
    11-bit Unicode: 1 1 0 _ _ _ _ _   1 0 _ _ _ _ _ _
    16-bit Unicode: 1 1 1 0 _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _
    21-bit Unicode: 1 1 1 1 0 _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _

In most cases only Unicode values of 16 bits or fewer are used. The GB code of "你" is 0xC4E3 and its Unicode value is 0x4F60, so we reuse the examples above.

Example 1: 0xC4E3 in binary:
    1 1 0 0 0 1 0 0   1 1 1 0 0 0 1 1
There are only two bytes, so we try to read them with the two-byte pattern. This fails: the second byte must begin with 1 0, but its seventh bit is 1, not 0. The result is therefore "?".

Example 2: 0x4F60 in binary:
    0 1 0 0 1 1 1 1   0 1 1 0 0 0 0 0
We fill these bits into the 16-bit UTF-8 pattern:
    11100100 10111101 10100000
    E4       BD       A0
so the result is 0xE4, 0xBD, 0xA0.

3. String and byte[]

A String is at its core a char[], so converting bytes to a String always goes through an encoding. String.length() is simply the length of the char array. If a different encoding is used, the bytes are likely to be split incorrectly, producing stray characters and garbled text.

Example:

CODE:
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);

If encoding is 8859_1 the string has two characters, but with GB2312 it has only one.
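The conversions in sections 1 to 3 can be reproduced with the public java.nio.charset API. This is only a sketch and not the sun.io code shown above: the sun.io converters are internal classes that are missing from current JDKs, so String and Charset are swapped in here. The class name CharsetDemo and its helper methods are made up for illustration, and the sketch assumes the runtime ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    /** Formats bytes as space-separated two-digit hex, e.g. "C4 E3". */
    public static String hexBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Formats each char of a String as four-digit hex, e.g. "00C4 00E3". */
    public static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    /**
     * Packs one 16-bit value (0x0800..0xFFFF) into the three-byte UTF-8
     * pattern by hand. Simplification: surrogates are not handled.
     */
    public static byte[] utf8ThreeBytes(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),          // 1110xxxx: top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (c & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be installed
        byte[] b = {(byte) 0xC4, (byte) 0xE3};

        // Example 1: the same two bytes decoded with two encodings
        System.out.println(hexChars(new String(b, gb2312)));                      // 4F60
        System.out.println(hexChars(new String(b, StandardCharsets.ISO_8859_1))); // 00C4 00E3

        // Example 2: U+4F60 encoded with different encodings
        String you = "\u4F60";
        System.out.println(hexBytes(you.getBytes(gb2312)));                      // C4 E3
        System.out.println(hexBytes(you.getBytes(StandardCharsets.ISO_8859_1))); // 3F
        System.out.println(hexBytes(you.getBytes(StandardCharsets.UTF_8)));      // E4 BD A0

        // Section 2: the hand-packed bytes match the library result
        System.out.println(hexBytes(utf8ThreeBytes('\u4F60')));                  // E4 BD A0
    }
}
```

Passing the Charset explicitly everywhere is the point of the sketch: none of these results depend on the platform default encoding.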
The split-character problem above often shows up when dividing text into pages.

4. Reader, Writer / InputStream, OutputStream

Reader and Writer work on chars at their core; InputStream and OutputStream work on bytes. The main purpose of Reader and Writer is to read and write chars from and to an InputStream / OutputStream.

A Reader example: the file test.txt contains just the single character "你", stored as the bytes 0xC4, 0xE3.

CODE:
    String encoding = /* the encoding under test */;
    InputStreamReader reader = new InputStreamReader(new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++) {
        System.out.println(c[i]);
    }

As in section 3, GB2312 yields one character and 8859_1 yields two.

The next example depends on the encoding the compiler uses to read the source file (the -encoding parameter of javac; if it is omitted, the platform default is used).

CODE:
    public void test() throws IOException {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }
(Example 3)

If you compile with GB2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the two source bytes are read as the characters 00C4 00E3, in binary:
    00000000 11000100   00000000 11100011
Each character needs more than 7 bits, so the 11-bit pattern is used:
    11000011 10000100   11000011 10100011
    C3       84         C3       A3
and you will find C3 84 C3 A3 in the class file.

But we tend to ignore this parameter, which often causes cross-platform problems:

- Example 3 compiled on a Chinese platform produces ZhClass.
- Example 3 compiled on an English platform produces EnClass.

1. ZhClass runs correctly on the Chinese platform, but not on the English platform.
2. EnClass runs correctly on the English platform, but not on the Chinese platform.

The reasons:

1. After compiling on the Chinese platform, str at run time is the char[] 0x4F60. When it runs on the Chinese platform, the default encoding of FileWriter is GB2312, so CharToByteConverter automatically calls the GB2312 converter, converts str to bytes and passes them to the FileOutputStream, and 0xC4, 0xE3 ends up in the file. On the English platform, however, the default of CharToByteConverter is 8859_1. FileWriter automatically uses 8859_1 to convert str, cannot represent it, and outputs "?".

2. After compiling on the English platform, str at run time is the char[] 0x00C4 0x00E3. When it runs on the Chinese platform, these characters cannot be converted, so "??" appears.
On the English platform, 0x00C4 -> 0xC4 and 0x00E3 -> 0xE3, so 0xC4, 0xE3 is written to the file.

1. Interpretation of the JSP body:
Tomcat first checks whether your page contains a "<%@ page" directive. If it does, Tomcat sets response.setContentType(..) at the same place and reads the file using that encoding. If it does not, Tomcat reads the file as 8859_1. It then writes the .java file in UTF-8, uses sun.tools.Main to read this file (in UTF-8, of course), and compiles it into a class file. setContentType changes the properties of out; the default encoding of the out variable is 8859_1.

2. Interpretation of parameters:
Unfortunately, parameters are only ever interpreted as ISO8859_1. This can be seen in the server's implementation code.

3. Interpretation of include:
Unfortunately, the person who wrote org.apache.jasper.compiler.Parser forgot to add one parameter, encoding, to the array JspUtil.ValidAttribute[], so specifying an encoding there is not supported. You can compile the source code yourself and add support for encoding.

If you are under NT, the easiest way is to fool Java and not add any encoding settings:

    <html>
    Hello <%= request.getParameter("value") %>
    </html>

    http://localhost/test/test.jsp?value=你

Result: Hello 你

But this method is quite limited. For example, it is bound to fail when splitting uploaded articles into sections. The best solution is this one:

CODE:
    <%@ page contentType="text/html; charset=GB2312" %>
    <html>
    Hello <%= new String(request.getParameter("value").getBytes("8859_1"), "GB2312") %>
    </html>

Definitely worth reading, but I can't recommend the solution
------------------------------------------------------------
1. Passing web page parameters with the GET method is not recommended, and the user can configure whether they are sent as UTF-8.
2. It is best not to use the page contentType directive in JSP. In fact, Chinese displays correctly without that line. I never add it. At the very least, don't write code like this.
I think the following configuration makes Chinese display correctly:

a. Compile all JavaBeans on an English platform.
b. Edit (write) the JSP files on an English platform.

In Tomcat the two points above are enough. For other JSP servers they may not be, so add the following:

c. Set the operating system language on the server to English (a Linux system without the BluePoint Chinese system installed is usually English).

If anything here is wrong, please point it out....

Re: Definitely worth reading, but I can't recommend the solution
------------------------------------------------------------
On the Tomcat parameter problem: parameters are decoded as 8859_1 whether the method is GET or POST. This can be seen in the source code of Tomcat's servlet implementation:

a) For the POST method, the parsePostData method of javax.servlet.http.HttpUtils (for POST form data):

    String postedBody = new String(postedBytes, 0, len, "8859_1");

This in itself is not a problem, because Chinese characters are all escaped with %. But the parseName function does not reassemble the Chinese characters. It simply appends each decoded byte as a char, so it is effectively using the 8859_1 encoding rules:

    sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
    i += 2;

b) For the GET method:

CODE:
    org.apache.tomcat.service.http.HttpRequestAdapter

    line = new String(buf, 0, count, Constants.CharacterEncoding.Default);

    Constants.CharacterEncoding.Default = 8859_1
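
To see why the getBytes("8859_1") trick from the original post undoes this damage, here is a small sketch. The decoder is a simplified stand-in modelled on the parseName line quoted above, not Tomcat's actual code. The names ParamRecode, decodeBytewise and recover are made up for illustration. The sketch assumes the browser sent the parameter as GB2312 percent-escapes and that the runtime ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecode {
    /**
     * Simplified percent-decoder: every %XX escape becomes one char with
     * the same numeric value, which amounts to decoding as 8859_1.
     * Malformed escapes make Integer.parseInt throw NumberFormatException.
     */
    public static String decodeBytewise(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Turns the 8859_1-decoded chars back into the original bytes, then
     * decodes those bytes with the encoding the browser really used.
     */
    public static String recover(String raw, Charset real) {
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), real);
    }

    public static void main(String[] args) {
        // "你" sent by a GB2312 page arrives as the escapes %C4%E3
        String raw = decodeBytewise("%C4%E3");
        System.out.println(raw.length());                       // 2 (chars U+00C4 U+00E3)

        String fixed = recover(raw, Charset.forName("GB2312")); // assumed to be installed
        System.out.println(fixed.length());                     // 1
        System.out.println(Integer.toHexString(fixed.charAt(0))); // 4f60
    }
}
```

The round trip is lossless only because 8859_1 maps every byte value 0x00..0xFF to the char with the same number. The same trick with a different intermediate encoding could lose bytes.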