A detailed explanation of Chinese character issues in Java

xiaoxiao2021-03-06 154

Preparatory knowledge:

1. bytes and Unicode

Java's core is Unicode, and even class files are too, but many media, including files and streams, store data as bytes. Java therefore has to convert these byte streams: a char is Unicode, while a byte is just a byte.

The byte/char conversion functions in Java live in the sun.io package. The ByteToCharConverter class acts as the dispatcher there and can tell you which converter you are using. Two of its most commonly used static functions are

    public static ByteToCharConverter getDefault();

    public static ByteToCharConverter getConverter(String encoding);

If you do not specify a converter, the system automatically uses the current default encoding: GBK on GB platforms and 8859_1 on English platforms.

Let's look at a simple example:

The GB code of "你" ("you") is 0xc4e3, and its Unicode value is 0x4f60.

If you use:

    String encoding = "GB2312";
    byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }

the output is 0x4f60. But if you use the 8859_1 encoding, the output is 0x00c4, 0x00e3.

---- Example 1

Conversely:

    String encoding = "GB2312";
    char c[] = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // mask with 0xff so that a negative byte is not sign-extended
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }

The output is 0xc4, 0xe3.

---- Example 2

With 8859_1 the result is 0x3f ("?"), which means the character cannot be converted.

Many Chinese character problems stem from these two very simple classes. On top of that, many classes do not accept an encoding argument directly, which is inconvenient. Many programs rarely specify an encoding and simply rely on the default, which makes porting them much harder.
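The sun.io converters are internal classes, so the two conversions above can also be sketched with the public java.nio.charset API. This is only a minimal sketch: it assumes the runtime ships the GB2312 charset, and the class and helper names (CharsetRoundTrip, charsToHex, bytesToHex) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Hex dump of a string's UTF-16 code units, space separated.
    static String charsToHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // Hex dump of a byte array; "& 0xff" keeps negative bytes from sign-extending.
    static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: this charset is installed
        Charset latin1 = StandardCharsets.ISO_8859_1;
        byte[] b = {(byte) 0xc4, (byte) 0xe3};

        // bytes -> chars (Example 1)
        System.out.println(charsToHex(new String(b, gb2312))); // 4f60
        System.out.println(charsToHex(new String(b, latin1))); // c4 e3

        // chars -> bytes (Example 2)
        System.out.println(bytesToHex("\u4f60".getBytes(gb2312))); // c4 e3
        System.out.println(bytesToHex("\u4f60".getBytes(latin1))); // 3f
    }
}
```

Passing the charset explicitly on every conversion is what keeps the result independent of the platform default.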

2. UTF-8

UTF-8 corresponds to Unicode in a very simple way:

    7-bit Unicode:   0 _ _ _ _ _ _ _

    11-bit Unicode:  1 1 0 _ _ _ _ _  1 0 _ _ _ _ _ _

    16-bit Unicode:  1 1 1 0 _ _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

    21-bit Unicode:  1 1 1 1 0 _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _

In most cases only Unicode values of 16 bits or fewer are used.

The GB code of "你" ("you") is 0xc4e3 and its Unicode value is 0x4f60, so we keep using the example above.

Case 1: 0xC4E3:

    1 1 0 0 0 1 0 0  1 1 1 0 0 0 1 1

Because the first two bits are 1, we try to lay the bytes out as a two-byte sequence, but that does not work: the second byte would have to begin with 1 0, and its second-highest bit is not 0, so "?" is returned.

Case 2: 0x4F60:

    0 1 0 0 1 1 1 1  0 1 1 0 0 0 0 0

Filled into the 16-bit UTF-8 pattern, this becomes:

    11100100 10111101 10100000

    E4 - BD - A0

so 0xE4, 0xBD, 0xA0 is returned.
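The bit patterns above can be checked with a small hand-written encoder. This is a sketch for illustration only: the class name Utf8ByHand and its encode method are hypothetical, and it does not validate its input (for example, it does not reject surrogate values).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one Unicode value (up to 21 bits) with the patterns listed above.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 7 bits: 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {             // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 21 bits: 11110xxx + three 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0x4F60);
        byte[] byLibrary = "\u4f60".getBytes(StandardCharsets.UTF_8);
        // Both should be E4 BD A0.
        System.out.println(Arrays.equals(byHand, byLibrary)); // true
    }
}
```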

3. String and byte[]

A String is really a char[]; to convert a byte[] into a String, however, the bytes must be decoded with some encoding.

String.length() is really the length of the char array, so under a different encoding a character may be split apart, producing broken and garbled text.

Example:

    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);

If encoding = 8859_1 there are two characters, but with encoding = GB2312 there is only one.

This problem often shows up when handling pagination.
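To make the length difference concrete, here is a minimal sketch that decodes the same two bytes both ways. It assumes the GB2312 charset is available; DecodedLength and lengthAs are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodedLength {
    // Number of chars the bytes turn into under the given charset.
    static int lengthAs(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(lengthAs(b, StandardCharsets.ISO_8859_1)); // 2
        System.out.println(lengthAs(b, Charset.forName("GB2312")));   // 1
    }
}
```

Code that paginates by counting characters therefore gets different page boundaries depending on which decoding was used.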

4. Reader, Writer / InputStream, OutputStream

Reader and Writer are built around char; InputStream and OutputStream are built around byte.

The main purpose of Reader and Writer, however, is to read and write chars through an InputStream/OutputStream.

An example with a Reader:

The file test.txt contains only the single character "你" ("you"): 0xc4, 0xe3.

    String encoding = ...; // "GB2312" or "8859_1"
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);

If encoding is GB2312 there is only one character; if encoding = 8859_1 there are two characters.
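The same Reader behaviour can be tried without a file on disk. The sketch below swaps the FileInputStream for a ByteArrayInputStream holding the two bytes; it assumes GB2312 is available, and ReaderCharCount and countChars are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

class ReaderCharCount {
    // Decodes the bytes with the named encoding and counts the chars read.
    static int countChars(byte[] data, String encoding) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), encoding)) {
            char[] c = new char[10];
            int total = 0;
            int n;
            // read() may return fewer chars than requested, so loop until EOF.
            while ((n = reader.read(c)) != -1) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fileContent = {(byte) 0xc4, (byte) 0xe3}; // stands in for test.txt
        System.out.println(countChars(fileContent, "GB2312"));     // 1
        System.out.println(countChars(fileContent, "ISO-8859-1")); // 2
    }
}
```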

------------

----

2. What we need to know about the Java compiler:

javac -encoding

We often leave out the -encoding parameter. In fact, it is important for cross-platform work.

If you do not specify an encoding, the system default is used: GB2312 on GB platforms and ISO8859_1 on English platforms.

The Java compiler actually calls the sun.tools.javac.Main class to compile files.

The compile function of that class has an encoding variable, and the value of -encoding is passed to it. The compiler reads the Java source file according to this variable and then writes the class file in UTF-8 form.

An example:

    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }

---- Example 3

If you compile with GB2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the compiler sees 00c4 00e3, which in binary is:

    00000000 11000100 00000000 11100011

Because each character needs more than 7 bits, the 11-bit encoding is used:

    11000011 10000100 11000011 10100011

    C3 - 84 - C3 - A3

You will find C3 84 C3 A3 in the class file.
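The effect of compiling with the wrong -encoding can be imitated at run time: read the GB2312 source bytes as 8859_1, then store the resulting chars as UTF-8. This is only a simulation of what the compiler does, under the assumption that the class file's string encoding matches plain UTF-8 for these characters; WrongSourceEncoding and its methods are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class WrongSourceEncoding {
    // Two-digit hex dump, space separated.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Misreads the source bytes as ISO-8859-1 (one char per byte),
    // then encodes those chars as UTF-8, as they would be stored.
    static byte[] storedBytes(byte[] sourceBytes) {
        String misread = new String(sourceBytes, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gbSource = {(byte) 0xc4, (byte) 0xe3}; // the character as GB2312 bytes
        System.out.println(hex(storedBytes(gbSource))); // c3 84 c3 a3
    }
}
```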

But we tend to ignore this parameter, which often leads to cross-platform problems:

Example 3 compiled on a Chinese platform produces ZhClass.

Example 3 compiled on an English platform produces EnClass.

1. ZhClass runs correctly on a Chinese platform, but not on an English platform.

2. EnClass runs correctly on an English platform, but not on a Chinese platform.

The reason:

1. After compiling on the Chinese platform, str at run time is the char[] 0x4f60.

Running on a Chinese platform, the default encoding of FileWriter is GB2312, so CharToByteConverter automatically converts str with the GB2312 converter and passes the result to the FileOutputStream, and 0xc4, 0xe3 goes into the file.

On an English platform, however, the default CharToByteConverter is 8859_1. FileWriter automatically uses 8859_1 to convert str, but 8859_1 cannot represent the character, so "?" is written.

2. After compiling on the English platform, str at run time is the char[] 0x00c4 0x00e3.

Running on a Chinese platform, the Chinese converter does not recognize these characters, so "??" is written.

On an English platform, 0x00c4 -> 0xc4 and 0x00e3 -> 0xe3, so 0xc4, 0xe3 goes into the file.
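The four combinations can be reproduced in one program by treating the platform default as an explicit charset. This sketch approximates what FileWriter writes with String.getBytes, and assumes GB2312 is available; PlatformMatrix and written are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PlatformMatrix {
    // Two-digit hex dump, space separated.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Bytes that would reach the file if `platform` were the default encoding.
    static String written(String str, Charset platform) {
        return hex(str.getBytes(platform));
    }

    public static void main(String[] args) {
        Charset zh = Charset.forName("GB2312"); // assumption: charset is installed
        Charset en = StandardCharsets.ISO_8859_1;
        String zhCompiled = "\u4f60";       // str when compiled on the Chinese platform
        String enCompiled = "\u00c4\u00e3"; // str when compiled on the English platform

        System.out.println(written(zhCompiled, zh)); // c4 e3  (correct)
        System.out.println(written(zhCompiled, en)); // 3f     ("?")
        System.out.println(written(enCompiled, zh)); // 3f 3f  ("??")
        System.out.println(written(enCompiled, en)); // c4 e3  (correct)
    }
}
```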

----

1. How the text of a JSP file is interpreted:

Tomcat first checks whether your page contains a "<%@ page" or include directive. If it does, Tomcat sets response.setContentType(..) at the same place and reads the file with that encoding; if not, it reads the file as 8859_1. It then writes the result out as a file in UTF-8, reads that file back with sun.tools.javac.Main (using UTF-8, of course), and compiles it into a class file.

setContentType changes the properties of out; the default encoding of the out variable is 8859_1.

2. How parameters are interpreted

Unfortunately, parameters are only interpreted as ISO8859_1; this can be seen in the servlet implementation code.

3. How include is interpreted

… format, but unfortunately the author of "org.apache.jasper.compiler.Parser" forgot to add one parameter, encoding, to the array JspUtil.ValidAttribute[], so this form is not supported. You can recompile the source code to add support for encoding.

To sum up:

If you are on NT, the easiest way is to fool Java by not adding any encoding settings at all:

    Hello, <%= request.getParameter("value") %>

    http://localhost/test/test.jsp?value=你

    Result: Hello 你

However, this approach is quite limited. For tasks such as splitting an uploaded article into pages, it is a dead end. The best solution is this one:

    <%@ page contentType="text/html; charset=GB2312" %>

    Hello <%= new String(request.getParameter("value").getBytes("8859_1"), "GB2312") %>
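Outside a JSP, the re-decoding trick in the last line can be sketched as a plain method. The sketch assumes the container handed back the raw GB2312 bytes decoded as 8859_1, and that the GB2312 charset is available; ParameterRecovery and redecode are hypothetical names.

```java
import java.io.UnsupportedEncodingException;

class ParameterRecovery {
    // Undoes a wrong 8859_1 decode: go back to the raw bytes, then decode
    // them again with the encoding the browser really used.
    static String redecode(String wronglyDecoded, String realEncoding)
            throws UnsupportedEncodingException {
        return new String(wronglyDecoded.getBytes("ISO-8859-1"), realEncoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Stand-in for what getParameter("value") returns in this scenario:
        // bytes c4 e3 turned into the two chars 0x00c4 0x00e3.
        String fromContainer = "\u00c4\u00e3";
        String fixed = redecode(fromContainer, "GB2312");
        System.out.println(fixed.equals("\u4f60")); // true
    }
}
```

This works only because 8859_1 maps every byte to exactly one char, so the wrong decode loses no information.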

A must-read, but I cannot praise the solution

-------------------------------------------------- ------------------------------

1. Passing parameters from a web page with the GET method is not recommended; the user can also configure whether URLs are sent as UTF-8.

2. It is best not to use the charset=GB2312 page directive in JSP. In fact, Chinese displays normally without that line. Leaving it out is more convenient, or at least means less code to write. I believe the following is enough for Chinese to display normally:

a. Compile all JavaBeans with ISO8859-1.

b. Do not write the charset=GB2312 declaration shown above in the JSP file (writing it is what causes trouble).

These two points are enough for Tomcat. For other JSP servers, where they may not be enough, add the following:

c. Set the operating system language on the server to English (Linux, even a Chinese distribution such as BluePoint, is generally English at the system level).

That's all.

If anything here is wrong, please point it out ....

Re: A must-read, but I cannot praise the solution

-------------------------------------------------- ------------------------------

Tomcat decodes parameters with 8859_1 whether the method is GET or POST. This can be seen in the Tomcat servlet source code:

a) For the POST method

javax.servlet.http.HttpUtils, parsePostData method (for POST form data):

    String postedBody = new String(postedBytes, 0, len, "8859_1");

This by itself is not a problem, because Chinese characters are escaped with %. But the parseName function does not put the bytes of a Chinese character back together. It simply appends each one separately, so we can conclude that it follows the 8859_1 encoding rule:

    sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
    i += 2;

b) For the GET method

org.apache.tomcat.service.http.HttpRequestAdapter

    line = new String(buf, 0, count, Constants.CharacterEncoding.Default);

where Constants.CharacterEncoding.Default = 8859_1.

This code is hard to trace, so don't be fooled by appearances. HttpRequestAdapter is derived from RequestImpl. However, the server on port 8080 does not actually use RequestImpl directly; it uses HttpRequestAdapter to obtain the query string.
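The behaviour attributed to parseName can be imitated with a tiny decoder that appends each %XX escape as its own char. This is a simplified sketch and not Tomcat's actual code: NaivePercentDecoder is a made-up name, malformed escapes are not handled, and the URLDecoder comparison assumes the GB2312 charset is available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class NaivePercentDecoder {
    // Appends every %XX escape as a single char, which amounts to
    // decoding the escaped bytes as 8859_1.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '+') {
                sb.append(' ');
            } else if (ch == '%' && i + 3 <= s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String escaped = "%C4%E3";
        // Byte-per-char decoding: two chars, 0x00c4 and 0x00e3.
        System.out.println(decode(escaped).length()); // 2
        // Charset-aware decoding: the two bytes form one GB2312 character.
        System.out.println(URLDecoder.decode(escaped, "GB2312").length()); // 1
    }
}
```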

As for the extra re-encoding step, I stand by my opinion: if you want to solve the pagination problem for uploaded files, you have to use it. Re-encoding also ensures that properties are passed correctly in some beans.

It seems I have to explain myself here

-------------------------------------------------- ------------------------------

Tomcat is just an implementation of the JSP 1.1 and Servlet 2.2 standards. We should not demand perfection in detail and performance from free software. It mainly considers English users, which is why it does no special conversion for us. Chinese characters sent through URLs are a known problem, and in the advanced settings of IE the option to always send URLs as UTF-8 is selected by default. Whatever the language of the current operating system, everything seems to be interpreted as ISO8859. I think that is a little careless, but in any case, implementations of new standards and popular software will always consider English first.

So why do I say my approach is better?

1. As I said, software from English-speaking countries always considers English first. The Java virtual machine specification requires a virtual machine to implement ISO8859, Unicode and UTF-8; other encodings are not required. The JDK virtual machine we use is like this, not to mention embedded ones. In other words, other encodings may well not be supported directly by the Java virtual machine. Our Chinese encodings are naturally not on that list and need an external package for conversion, which in the JDK should be i18n.jar. Using ISO8859 is fast: no extra calls or conversions, and no I/O to load anything.

2. At the very least there is less code to write and no extra operations. Who doesn't like a simple style?

3. JSP pages written this way are international. I wrote a chat room program with JSP and JavaBeans (I am not used to servlets; JSP is really very good). With the same program, Americans who come in with their browsers see an English interface and Chinese users see a Chinese interface. If an encoding were hard-coded, that would be a lot of trouble.

4. Specifying GB2312 is limiting. What if the user wants to use GBK? It is better not to add it: whatever the character set, as long as my browser is set to it, I can display it.
