3. Java files and encoding
Java is used so widely that a Java source file may be stored in almost any character encoding. If you do not know which encoding a Java file uses, a plain javac MyClass.java can lead to embarrassment. Every file is stored as bytes; characters themselves are never written to disk directly.
Figure 2-5 JVM output
Instead, characters are first encoded into bytes, and those bytes are then stored on disk. When a file (especially a text file) is read, it is likewise read byte by byte to form a byte sequence, so reading the file may involve decoding bytes into characters. Suppose javac does not decode the Java file correctly, for example by decoding a GB2312 byte stream as UTF-8. If the Java source contains string constants (String str = "I am Chinese";, written in the original as the five Chinese characters 我是中国人) and these constants are not made of English characters, javac will decode them incorrectly and produce garbled text, or even report syntax errors in the Java file. Since java.exe has no option for encoding or decoding, we can be sure that in the class file generated by javac MyClass.java the string constants were encoded at compile time according to one fixed encoding standard. According to the JVM (Java Virtual Machine) specification, this fixed encoding is UTF-8 (strictly speaking, the specification's modified UTF-8): the class file uses the CONSTANT_Utf8_info structure to represent a constant string value. Figure 2-6 uses the DOS command debug.exe to view a class file (debug.exe does not directly support file extensions longer than three ASCII characters, so copy show.class to show.txt); the red line marks the UTF-8 encoding of the string "I am Chinese".
Figure 2-6 View class file with Debug
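To make the byte-level view concrete, here is a minimal sketch that prints the bytes the five-character string takes in UTF-8 and in GBK. It is only an illustration: the names EncodeBytes and hex are invented for this example, the string is written with Unicode escapes so that the source file itself is pure ASCII, and the GBK line assumes that the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;

// Illustrative sketch; class and method names are not from the article.
class EncodeBytes {
    // "I am Chinese" in Chinese, written with Unicode escapes so that this
    // source file is pure ASCII and compiles the same under any -encoding.
    static final String TEXT = "\u6211\u662f\u4e2d\u56fd\u4eba";

    // Encode the text with the named charset and format the bytes as hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three bytes per character for this string.
        System.out.println("UTF-8: " + hex(TEXT, "UTF-8"));
        // Two bytes per character for this string (assumes GBK is available).
        System.out.println("GBK:   " + hex(TEXT, "GBK"));
    }
}
```

The same five characters become fifteen bytes in UTF-8 and ten bytes in GBK, which is why a reader that guesses the wrong charset cannot recover them.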
Let's do some experiments.
public class show
{
    static
    {
        String str = "我是中国人"; // "I am Chinese"
        System.out.println(str);
    }
    public static void main(String args[]) {}
}
The default charset on my system is GBK, so my show.java is also stored as GBK. javac's default decoding is the same as the system default, which means I do not have to add the javac option "-encoding GBK"; javac decodes show.java correctly without it. But if I add "-encoding UTF-8" or "-encoding ISO8859-1", you can probably guess the result, which is shown in Figure 2-7, and perhaps you have already noticed something there. I do not want you to equate this garbled output with the garbled output in Figure 2-5, though. You may also look at Figure 2-1, but please do not compare two garbled strings with each other: both are incorrect strings, and comparing incorrect things is meaningless. You cannot confirm what they are, even if they look like the same string.
Figure 2-7 Using different decodings with javac show.java
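What javac sees under each -encoding value can be imitated outside the compiler. The following sketch is an illustration only, not javac's actual code path: it takes the ten GBK bytes of the string constant and decodes them with three charsets, printing the resulting code points so that the output does not depend on the console. WrongDecoding, decode and codePoints are names invented here, and the GBK line assumes the JDK provides that charset.

```java
import java.nio.charset.Charset;

// Illustrative sketch: the same bytes decoded three ways. This imitates the
// effect of javac's -encoding option; it is not javac's implementation.
class WrongDecoding {
    // The GBK bytes of the five-character string constant in show.java.
    static final byte[] GBK_BYTES = {
        (byte) 0xCE, (byte) 0xD2, (byte) 0xCA, (byte) 0xC7, (byte) 0xD6,
        (byte) 0xD0, (byte) 0xB9, (byte) 0xFA, (byte) 0xC8, (byte) 0xCB
    };

    // Decode bytes with the named charset; bad input becomes U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Format every char of the string as a four-digit hex code.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            System.out.println(name + ": " + codePoints(decode(GBK_BYTES, name)));
        }
    }
}
```

Only the GBK line yields the five intended characters. The UTF-8 line contains replacement characters (FFFD) because the bytes are not valid UTF-8, and the ISO-8859-1 line yields ten characters, one per byte.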
Don't worry about that for now. Think about what I have said, then do the experiments below.
Use Notepad to open show.java, then save it under the same name, but this time select "UTF-8" as the encoding (possible on Windows 2000 and later versions of Windows). As you already know, we should now use the command javac -encoding UTF-8 show.java, but we run into embarrassment once again, as shown in Figure 2-8.
Figure 2-8 Compile error, once again
javac reports a syntax error, which is hard to accept, because I cannot find anything wrong. And indeed you cannot find it either: whether you open show.java in Notepad or in VJ from Microsoft's InterDev, you will find no error. Everything looks fine, and the two characters named in the first error line are simply not there. Fortunately, nothing escapes the eyes of debug.exe: the first three bytes of the show.java file are 0xEF, 0xBB and 0xBF. I do not know exactly what they mean, but I can understand why they are there. Before it opens a file, Windows does not know which encoding should be used to process it. This is unlike the Internet, where we can learn the encoding of a piece of content from the metadata that accompanies it; Windows cannot (perhaps it should). Probably in order to mark the file as UTF-8 encoded, the system simply adds these three bytes at the front of the file. (We cannot reliably determine an encoding from a few bytes; isn't that why XML states its encoding explicitly?) javac does not go along with this (perhaps the approach is just Microsoft's own idea): it treats the first three bytes as file content rather than as a special marker, and therefore reports an error.
Use debug to delete the three leading bytes (see Figure 2-11). If we then decode with GBK or GB2312, javac reports an error, as it should; with ISO8859-1 the file compiles, but the output is garbled, which is also as it should be. See Figure 2-9.
Figure 2-9 show.java encoded with UTF-8
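The three leading bytes 0xEF 0xBB 0xBF are the UTF-8 encoding of U+FEFF, the byte order mark (BOM) that Notepad writes in front of UTF-8 files. Instead of deleting them by hand with debug, they can be detected and removed programmatically. The sketch below shows one way to do it; BomStripper and its methods are hypothetical names, and the main method rewrites the file in place, so try it on a copy first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Illustrative sketch; names are not from the article.
class BomStripper {
    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Return the data without a leading BOM; data without a BOM is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java BomStripper file");
            return;
        }
        Path path = Paths.get(args[0]);
        byte[] data = Files.readAllBytes(path);
        if (hasUtf8Bom(data)) {
            Files.write(path, stripUtf8Bom(data)); // overwrites the file in place
            System.out.println("BOM removed");
        } else {
            System.out.println("no BOM found");
        }
    }
}
```

Working on raw bytes rather than on decoded text keeps the rest of the file byte-for-byte identical, which matters when the goal is to hand the file to a tool that does its own decoding.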
We can also draw a conclusion: if a Java file contains no string constants with non-English characters, we have no reason to care about javac's -encoding option, regardless of whether the Java class will process non-English characters at run time. We have already generated a completely correct class file, so why not leave the rest of the encoding and decoding to the program itself? Having said this, we need to do an experiment.
4. Files and encoding
This experiment reads a Chinese text file from a small Java program. Before we start, you had better look at the Java documentation for java.io.FileInputStream, java.io.InputStreamReader and java.io.BufferedReader. I consulted Sun's "Java(TM) 2 SDK, Standard Edition Documentation Version 1.4.0"; the following is excerpted from that documentation:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted. Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
...
public int read()
throws IOException
Read a single character.
When we read a file or another input stream through an InputStreamReader, we no longer read bytes directly: the chars we get are the result of decoding the byte stream with whatever decoding the reader believes to be correct. If we construct an InputStreamReader without the charset parameter that specifies the encoding of the byte stream, it decodes with the platform default encoding. OutputStreamWriter uses the same mechanism to perform the opposite action. Source code of UTF_8File.java:
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.UnsupportedEncodingException;
public class UTF_8File
{
    public static void main(String args[])
        throws FileNotFoundException, IOException, UnsupportedEncodingException
    {
        if (args.length < 3) {
            System.out.println("cmd encoding decoding file");
            return;
        }
        // First pass: let InputStreamReader decode the file using args[0].
        FileInputStream fis = new FileInputStream(args[2]);
        InputStreamReader isr = new InputStreamReader(fis, args[0]);
        BufferedReader br = new BufferedReader(isr);
        String str;
        System.out.println("File Content:");
        while ((str = br.readLine()) != null) System.out.println(str);
        br.close();
        // Second pass: read the raw bytes, then decode them using args[1].
        File f = new File(args[2]);
        fis = new FileInputStream(args[2]);
        byte bs[] = new byte[(int) f.length()];
        int b, index = 0;
        while ((b = fis.read()) != -1)
            bs[index++] = (byte) b;
        fis.close();
        System.out.println("File Content:");
        System.out.print(new String(bs, args[1]));
    }
}
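The bridge role of InputStreamReader, and the opposite role of OutputStreamWriter, can also be tried without any file on disk. The following is a minimal sketch using in-memory streams; the class ReaderWriterDemo and its helper names are invented for illustration, and the string is written with Unicode escapes to keep the source ASCII-only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Illustrative sketch; names are not from the article.
class ReaderWriterDemo {
    static final String TEXT = "\u6211\u662f\u4e2d\u56fd\u4eba";

    // Characters to bytes: OutputStreamWriter encodes with the given charset.
    static byte[] encode(String text, String charsetName) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName));
            writer.write(text);
            writer.close(); // flushes the encoder into the byte stream
            return out.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory streams
        }
    }

    // Bytes to characters: InputStreamReader decodes with the given charset.
    static String decode(byte[] bytes, String charsetName) {
        try {
            Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), Charset.forName(charsetName));
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            reader.close();
            return sb.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory streams
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = encode(TEXT, "UTF-8");
        // Matching charsets on both sides give the original text back.
        System.out.println(decode(utf8, "UTF-8").equals(TEXT));
        // A mismatched charset gives a different number of characters.
        System.out.println(decode(utf8, "ISO-8859-1").length());
    }
}
```

Passing the charset explicitly on both sides, as here, removes the dependence on the platform default that the article warns about.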
Use Notepad to create a show.txt file that contains only five Chinese characters, 我是中国人 ("I am Chinese"), and choose UTF-8 encoding when saving. The results are shown in Figure 2-10; please pay attention to each of my command lines.
Yes, the first time I read show.txt the output was a mess, all because of those three hateful rogue bytes at the start of the file. We use debug to delete the three leading bytes, as shown in Figure 2-11, and the second time everything is fine. We also seem to have found garbled output elsewhere, and you will notice that under the second "File Content:" of "java UTF_8File UTF-8 GB2312 show.txt" we get nothing at all: decoding the UTF-8 encoded byte stream with GB2312 has failed (I am not saying that decoding other UTF-8 encoded byte streams with GB2312 can never yield any characters).
Figure 2-10 Java program reading show.txt
Figure 2-11 Goodbye, you little rogues
If you are observant, you will have noticed the MalformedInputException in Figure 2-10. It is explained by what was said above about the second "File Content:" of "java UTF_8File UTF-8 GB2312 show.txt": we did not get any characters there. In other cases we might get characters, but they would be malformed characters.
Figure 2-12 We have received this Exception
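A decoding failure of this kind can be made explicit with the java.nio.charset API, which may not be the mechanism that produced the exception in the figure. The sketch below is an assumption-based illustration: StrictDecode and decodeOrNull are invented names, the decoder is configured to report bad input instead of silently replacing it, and it catches CharacterCodingException (the common superclass of MalformedInputException and UnmappableCharacterException) because which of the two is raised can depend on the charset and the JDK. The GB2312 call assumes the JDK provides that charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Illustrative sketch; names are not from the article.
class StrictDecode {
    // The UTF-8 bytes of the five-character string, without a byte order mark.
    static final byte[] UTF8_BYTES = {
        (byte) 0xE6, (byte) 0x88, (byte) 0x91, (byte) 0xE6, (byte) 0x98,
        (byte) 0xAF, (byte) 0xE4, (byte) 0xB8, (byte) 0xAD, (byte) 0xE5,
        (byte) 0x9B, (byte) 0xBD, (byte) 0xE4, (byte) 0xBA, (byte) 0xBA
    };

    // Decode strictly; return null when the bytes are not valid in the charset.
    static String decodeOrNull(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            System.out.println(charsetName + " rejected the bytes: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrNull(UTF8_BYTES, "UTF-8") != null);  // valid UTF-8
        System.out.println(decodeOrNull(UTF8_BYTES, "GB2312") != null); // not valid GB2312
    }
}
```

Strict decoding is useful when checking which encoding a file was saved in: a charset that rejects the bytes is certainly wrong, while one that accepts them is merely a candidate.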
Now you can also understand why System.out and System.in just work, and why we never had to think about any of this: the JVM does the job for us with the default encoding and decoding. Well, why not try again with a show.txt encoded in GB2312?
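The default encoding that the JVM falls back on can be inspected directly. This is a minimal sketch with an invented class name, DefaultCharsetInfo; what it prints depends entirely on the platform, the locale and the JDK version, so no particular output should be expected.

```java
import java.nio.charset.Charset;

// Illustrative sketch; the class name is not from the article.
class DefaultCharsetInfo {
    // Name of the charset used when no charset is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        // getBytes() without an argument encodes with the default charset,
        // so the byte count of one Chinese character varies by platform.
        System.out.println("bytes for U+6211: " + "\u6211".getBytes().length);
    }
}
```

Running it on machines with different locales is a quick way to see why the same program can behave differently from one system to another.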
Now we can figure out the embarrassment of Figure 2-1.
5. JSP files and encoding
Yes, my JSP file is encoded in GB2312. If Tomcat compiles this JSP file without decoding it as GB2312, or if discomfiture$jsp.java is compiled without being decoded as GB2312, then constant strings containing non-English characters will not be compiled correctly and the output will be garbled, because System.out can then no longer output the correct characters. In fact it is not that complicated. After the JSP engine has translated a JSP file into a Java file, the encoding parameter it uses when javac compiles that Java file is UTF-8, which means that the JSP engine stores the generated discomfiture$jsp.java in UTF-8. All we have to care about is the encoding the JSP engine uses when it compiles the JSP file. Before we make the JSP engine compile discomfiture.jsp correctly, let us check whether discomfiture$jsp.java really is as described: Figure 2-13.a is a screenshot of discomfiture$jsp.java opened in Notepad, and Figures 2-13.b and 2-13.c are screenshots of discomfiture$jsp.java viewed with debug. Notepad and debug are lovely little tools that we can always trust.
Figure 2-13.a No garbled text here
Figure 2-13.b No rogue bytes here
Figure 2-13.c I am Chinese
Having the generated $jsp.java file always use UTF-8 makes sense: we can easily control the encoding in which the JSP file is stored and easily tell the JSP engine which encoding that is, and UTF-8 takes care of both English and non-English characters. UTF-8 may be more complicated, but the generated $jsp.java file is only a temporary file, so we should tolerate it.
Yes, we can easily tell the JSP engine the encoding of the JSP file, so that constant strings containing non-English characters are no longer compiled incorrectly when the JSP file is compiled.
Figure 2-14 Haina Baichuan
The pageEncoding attribute of the page directive tells the JSP engine the encoding of the JSP file, for example:
<%@ page pageEncoding="GB2312" %>
All attributes of the page directive are optional, and the default value of pageEncoding is the charset value given in contentType, so if we use:
<%@ page contentType="text/html; charset=GB2312" %>
without specifying pageEncoding explicitly, the value of pageEncoding is also GB2312. That is why we often do not care about this attribute, and why constant strings of non-English characters in the JSP file are still compiled correctly. If neither attribute is set, things go badly: the default value of contentType is text/html; charset=ISO8859-1, so the encoding of our JSP file also becomes ISO8859-1, and the result is Figure 2-1. But why does the browser still appear to display the string correctly? Let's follow the journey of "I am Chinese" in Figure 2-15.
Figure 2-15 Constant string flow chart
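The fallback rule for the page encoding described above (pageEncoding first, then the charset in contentType, then ISO8859-1) can be written down as a small function. This is a hypothetical helper for illustration only, not code from any JSP container; PageEncodingRule, charsetOf and effectivePageEncoding are invented names, and the default is spelled with its canonical name ISO-8859-1.

```java
// Illustrative sketch of the rule as the article states it; not container code.
class PageEncodingRule {
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Extract the charset parameter from a contentType value, or null if absent.
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive prefix match that does not depend on the locale.
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                return p.substring(key.length()).trim();
            }
        }
        return null;
    }

    // pageEncoding wins; otherwise the contentType charset; otherwise the default.
    static String effectivePageEncoding(String pageEncoding, String contentType) {
        if (pageEncoding != null) return pageEncoding;
        String charset = charsetOf(contentType);
        return charset != null ? charset : DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        System.out.println(effectivePageEncoding("GB2312", null));
        System.out.println(effectivePageEncoding(null, "text/html; charset=GB2312"));
        System.out.println(effectivePageEncoding(null, null));
    }
}
```

The third call is the situation of discomfiture.jsp: with neither attribute set, the engine ends up reading a GBK file as ISO-8859-1.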
How the JSP engine processes a JSP file whose encoding it knows is transparent to us, and we fully trust it to do so correctly. Our discomfiture.jsp file is saved as a GBK byte stream, in which the constant string "I am Chinese" is encoded as:
0xCE 0xD2 0xCA 0xC7 0xD6 0xD0 0xB9 0xFA 0xC8 0xCB
When the JSP engine reads the byte stream of this file, the JSP file encoding is taken to be ISO8859-1, because we specified neither pageEncoding nor the charset in contentType, so the JSP engine decodes these bytes as ISO8859-1. Under the ISO8859-1 standard, each byte is decoded into the character with the same numeric value, so the JSP engine believes this constant string (note that Java uses the Unicode character set) to be:
\u00CE \u00D2 \u00CA \u00C7 \u00D6 \u00D0 \u00B9 \u00FA \u00C8 \u00CB
This remains the case until the servlet discomfiture$jsp is executed. When the servlet writes this string to System.out on the server, the string is encoded with the default GBK encoding and the resulting GBK byte stream is passed to the underlying operating system, which produces garbled output. We can use the following JSP code to verify this reasoning:
<%-- discomfiture2.jsp --%>
<%
String str = "我是中国人"; // "I am Chinese"
System.out.println(str);
char chs[] = str.toCharArray();
for (int i = 0; i < chs.length; i++)
{
    System.out.print(Integer.toHexString((int) chs[i]));
    System.out.print(" ");
}
out.println(str);
%>
The server-side output is shown in Figure 2-16.
Figure 2-16 Unicode code for each character in the string
So if, in discomfiture1.jsp, we assign a value to each character of the string, the servlet holds the correct string "I am Chinese", and that string is also output to System.out. When the page directive of the JSP sets no contentType, the servlet writes the byte stream to the network using the default encoding ISO8859-1. This encoding simply discards the high byte of a Unicode character when that byte is zero and writes the low byte to the output stream (when the high byte is not 0, the character '\u003f' is written to the byte stream instead, which is how we end up with the character '?'). So we once again obtain the byte stream:
0xCE 0xD2 0xCA 0xC7 0xD6 0xD0 0xB9 0xFA 0xC8 0xCB
When the browser displays the string represented by these bytes (the browser of course displays characters, not bytes), it decodes them with the system default encoding, and we get the string "I am Chinese". In fact this is only a coincidence produced by two errors. Now it is not hard to understand the other kind of garbled output either: five '?'.
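The whole journey of the constant string can be replayed in a few lines. The sketch below is an illustration under stated assumptions, not the JSP engine's or the browser's real code: RoundTrip and its method names are invented, the GBK bytes are the ones listed above, and the final GBK decoding step assumes the JDK provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch; names are not from the article.
class RoundTrip {
    static final byte[] GBK_BYTES = {
        (byte) 0xCE, (byte) 0xD2, (byte) 0xCA, (byte) 0xC7, (byte) 0xD6,
        (byte) 0xD0, (byte) 0xB9, (byte) 0xFA, (byte) 0xC8, (byte) 0xCB
    };
    static final String TEXT = "\u6211\u662f\u4e2d\u56fd\u4eba";

    // The JSP engine assuming ISO-8859-1: one char per byte, same value.
    static String misreadAsLatin1(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.ISO_8859_1);
    }

    // The servlet writing its response in ISO-8859-1: chars above 0xFF become '?'.
    static byte[] writeAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Case 1: wrong string in the servlet, yet the original bytes reach the browser.
        byte[] onWire = writeAsLatin1(misreadAsLatin1(GBK_BYTES));
        System.out.println(Arrays.equals(onWire, GBK_BYTES));
        // A browser decoding with GBK then shows the intended text (assumes GBK exists).
        System.out.println(new String(onWire, Charset.forName("GBK")).equals(TEXT));
        // Case 2: correct string in the servlet, but ISO-8859-1 cannot represent it.
        System.out.println(new String(writeAsLatin1(TEXT), StandardCharsets.US_ASCII));
    }
}
```

The last line prints five question marks, matching the five '?' described above, while the first case only looks right because the wrong decoding and the wrong encoding happen to undo each other.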