Analysis and Solution of Chinese Character Problems in Java Programming

xiaoxiao2021-03-05  34

In Java-based programming we often run into problems with processing and displaying Chinese characters. A screen full of unintelligible garbage is certainly not what we want to see, so how do we get those Chinese characters to display correctly? The default encoding of the Java language is Unicode, while the files and databases that Chinese users normally work with are based on GB2312, BIG5 or similar encodings. How do we choose the right Chinese character encoding and handle the encoding of Chinese characters correctly? Starting from the basics of Chinese character encoding and working through Java programming examples, this article analyzes these two questions and proposes solutions.

The Java programming language is now widely used on the Internet. Sun considered support for non-English characters when it developed Java. The Java runtime environment published by Sun comes in an English edition and an international edition, and only the international edition supports non-English characters. In practice, however, Java's support for Chinese characters is not as complete as Java Soft's standard specification claims: there is more than one Chinese character set, and different operating systems support Chinese characters differently, so many problems related to Chinese character encoding plague our application development. Many answers to these problems exist, but they are piecemeal and do not satisfy those who want to settle the matter once and for all, and there is little systematic research on the Java Chinese problem. This article starts from the basics of Chinese character encoding and analyzes the Java Chinese problem, in the hope that it will help readers solve it.

Basics of Chinese character encoding

English characters are generally represented by one byte, and the most commonly used encoding is ASCII. One byte can distinguish only 256 characters, whereas there are thousands of Chinese characters, so Chinese characters are represented with two bytes. To tell them apart from English characters, the highest bit of each byte must be 1. That leaves 7 free bits per byte, so a double-byte code of this kind can represent at most 128 × 128 = 16,384 characters.

The encodings we meet most often are GB2312, BIG5 and Unicode. Readers who are interested in the details of a particular encoding can consult the relevant references; here I only say a little about GB2312 and Unicode, the two that concern us most. GB2312, the Chinese national standard code for the interchange of Chinese characters, was issued by the State Administration of Standards of the People's Republic of China for simplified Chinese characters. It is used in mainland China and Singapore and is usually called the GB code (national standard code). Of its two bytes, the first (high) byte is the area number plus 32 (20H) and the second (low) byte is the position number plus 32 (20H); the two values together encode one Chinese character. In files, each of the two bytes additionally has its highest bit set, which amounts to adding A0H instead of 20H.

Unicode is a fixed-width multi-byte encoding designed to solve the multilingual character problem (16 bits per character in the form Java uses). It stays compatible with English characters by putting a "0" byte in front of them: the ASCII code of "A" is 0x41, and its Unicode form is 0x00, 0x41. Special tools can convert between the various encodings.
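The byte arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names (EncodingBasics, hex, gbCode, fileBytes) are made up for illustration, the character U+554A is assumed to sit at area 16, position 1 of GB2312, and the "plus A0H" form is the high-bit variant commonly stored in files rather than something the text spells out. It also assumes the runtime ships a GB2312 converter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBasics {
    /** Formats bytes as space-separated upper-case hex, e.g. "B0 A1". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** GB (national standard) code: area and position numbers plus 0x20 each. */
    public static byte[] gbCode(int area, int position) {
        return new byte[] { (byte) (area + 0x20), (byte) (position + 0x20) };
    }

    /** High-bit form (assumption: as stored in files): area and position plus 0xA0. */
    public static byte[] fileBytes(int area, int position) {
        return new byte[] { (byte) (area + 0xA0), (byte) (position + 0xA0) };
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII))); // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(gbCode(16, 1)));                           // 30 21
        System.out.println(hex(fileBytes(16, 1)));                        // B0 A1
        // The same character encoded by the runtime's own GB2312 converter.
        System.out.println(hex("\u554A".getBytes(Charset.forName("GB2312"))));
    }
}
```

The last line lets you compare the hand-computed bytes with what the runtime produces for the same character.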

A first look at the Java Chinese problem

When we develop applications in the Java programming language, we inevitably have to handle Chinese. The default encoding of the Java language is Unicode, while the databases and files we normally use are based on GB2312, so we often run into situations like these: a JSP-based website shows garbled text in the browser, a file turns out to be garbled when it is opened, and database content modified through Java can no longer supply correct information when it is used elsewhere.

String sEnglish = "apple";

String sChinese = "苹果";

String s = "苹果apple";

Counted in bytes, sEnglish has a length of 5 and sChinese a length of 4 in GBK (String.length() counts Unicode characters and returns 2 for sChinese), while s occupies 14 bytes in Java's default Unicode form. For sEnglish, every Java class supports it well and it will certainly display correctly. For sChinese and s it is a different matter: Java Soft states that the basic Java classes take multilingual characters into account (Unicode by default), but the default encoding of the operating system is usually not Unicode; it is a national code such as GBK. From Java source code to a correct result on screen, the text passes through "Java source code -> Java bytecode -> virtual machine -> operating system -> display device". The encoding of Chinese characters must be handled correctly at every one of these steps for the final display to be correct.
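The difference between character count and byte count is easy to see in code. The sketch below uses Unicode escapes instead of literal Chinese so that it compiles under any source encoding; the class name LengthDemo and the helper byteLength are illustrative only, and the concatenated string is an assumed example, not necessarily the article's exact literal.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    /** Number of bytes the string occupies in the given charset. */
    public static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String sEnglish = "apple";
        String sChinese = "\u82F9\u679C";      // two Chinese characters ("apple")
        String s = sChinese + sEnglish;        // assumed example: 7 characters in total
        Charset gbk = Charset.forName("GBK");

        System.out.println(sEnglish.length());                         // 5 chars
        System.out.println(sChinese.length());                         // 2 chars
        System.out.println(byteLength(sEnglish, gbk));                 // 5 bytes
        System.out.println(byteLength(sChinese, gbk));                 // 4 bytes
        System.out.println(byteLength(s, StandardCharsets.UTF_16BE));  // 14 bytes
    }
}
```

String.length() never depends on the platform encoding; only the byte counts do, which is why the same string can appear to have different "lengths" on different systems.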

"Java source code -> Java bytecode": the standard Java compiler javac assumes the system default character set, for example GBK on Chinese Windows and ISO-8859-1 on Linux. This is why Chinese characters in source files compiled on Linux come out wrong. The solution is to pass the encoding parameter when compiling, which makes the result independent of the platform. Usage:

javac -encoding GBK
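A complementary approach, offered here only as a sketch and not as something the article prescribes, is to keep the source file pure ASCII by writing Chinese characters as Unicode escapes. The compiler then produces the same bytecode whatever encoding it assumes. The class name EscapedSource is hypothetical.

```java
public class EscapedSource {
    // The two characters of the word "Chinese (language)", written with ASCII-only escapes.
    public static final String CHINESE = "\u4E2D\u6587";

    public static void main(String[] args) {
        System.out.println(CHINESE.length());                        // 2
        System.out.println(Integer.toHexString(CHINESE.charAt(0)));  // 4e2d
    }
}
```

The trade-off is readability: escaped strings are safe to compile anywhere but hard to read, so the encoding parameter remains the more practical choice for files with a lot of Chinese text.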

"Java bytecode -> virtual machine -> operating system": the Java Runtime Environment (JRE) comes in an English edition and an international edition, and only the international edition supports non-English characters. The Java Development Kit (JDK) certainly supports multilingual characters, but not every computer user has the JDK installed. To support Java better, many operating systems and applications embed the international edition of the JRE, which makes multilingual character support convenient.

"Operating system -> display device": the operating system must be able to support and display Chinese characters. An English operating system without special supporting software certainly cannot display Chinese.

There is one more issue: Chinese characters must be encoded correctly within the Java program itself. For example, when you output a Chinese string to a web page, whether you use

out.println(string); // string is a string containing Chinese

or a JSP expression such as

<%= string %>

the string must be converted from Unicode to GBK, either manually or automatically. In JSP 1.0 you can define the output character set so that the internal code is converted automatically. This is done with the contentType attribute of the page directive, for example

<%@ page contentType="text/html; charset=gb2312" %>

Some JSP versions, however, provide no support for an output character set (JSP 0.92, for example), and then the output has to be encoded manually. There are many ways to do this; the most common is

String s1 = request.getParameter("keyword");

String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");

The getBytes call turns the characters of s1 back into a byte array using the "ISO-8859-1" encoding, and "GBK" is the target encoding used to decode those bytes. Whenever a Chinese string s1 has been read in as ISO-8859-1, for example from a database encoded that way, this conversion yields a string s2 that displays correctly on operating systems and applications that support the GBK character set.
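Why the ISO-8859-1 round trip works can be shown end to end. The sketch below simulates the situation instead of using a real request object: it encodes a two-character Chinese string as GBK, decodes the bytes wrongly as ISO-8859-1 (as an old container would), and then applies the same conversion as above. The class name Latin1RoundTrip and the method fix are illustrative, and the code uses Charset objects rather than charset names to avoid checked exceptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    private static final Charset GBK = Charset.forName("GBK");

    /** Re-decodes a string whose GBK bytes were wrongly decoded as ISO-8859-1. */
    public static String fix(String garbled) {
        // ISO-8859-1 maps every char 0..255 to exactly one byte, so this recovers the raw bytes.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode the recovered bytes with the charset they were really written in.
        return new String(raw, GBK);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587";                                 // two Chinese characters
        byte[] wire = original.getBytes(GBK);                             // 4 bytes sent by the client
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);   // one char per byte
        System.out.println(garbled.length());                             // 4
        System.out.println(fix(garbled).equals(original));                // true
    }
}
```

The trick depends on ISO-8859-1 being lossless for all 256 byte values. If the wrong decoding had used a charset that drops or replaces bytes, the original text could not be recovered this way.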

Surface-level analysis and handling of the Java Chinese problem

Background

Development environment: JDK1.15, Vcafe2.0, JPadPro

Server: NT IIS, Sybase System, JConnect (JDBC)

Client: IE5.0, PWIN98

The .class files are stored on the server and the applet is run by the client's browser. The applet only serves to load the main program, such as the Frame class. The interface includes TextField, TextArea, List, Choice and so on.

I. Reading Chinese

After executing a SELECT statement through JDBC and reading data (Chinese) from the server, adding the data to a TextArea (TA) with the append method does not display it correctly. When it is added to a List, however, most Chinese characters display correctly.

Converting the data to a byte array using the "ISO-8859-1" encoding and then converting that array to a String using the system default encoding makes it display correctly in both the TA and the List. The code fragment is as follows:

dbstr2 = results.getString(1);
// After reading the result from the DB server, convert it to a String.
dbbyte1 = dbstr2.getBytes("ISO-8859-1");
dbstr1 = new String(dbbyte1);

If "GBK" or "GB2312" is specified directly when converting to a String, instead of relying on the system default encoding, data can be read from the database without problems in both case A and case B (the two cases are defined under "Problem" below).

II. Writing Chinese to the database

The handling is the reverse of "Reading Chinese": first convert the SQL statement to a byte array using the system default encoding, then convert that array to a String using the "ISO-8859-1" encoding, and finally send it for execution. The Chinese information is then written to the database correctly. The code fragment is as follows:

sqlstmt = tf_input.getText();
// Before sending the statement to the DB server, convert it to an SQL statement.
dbbyte1 = sqlstmt.getBytes();
sqlstmt = new String(dbbyte1, "ISO-8859-1");
_stmt = _con.createStatement();
_stmt.executeUpdate(sqlstmt);
...

Problem: the program code above runs correctly if the client has a CLASSPATH pointing to the JDK's classes.zip (call this case A). If the client has only a browser, with no JDK and no CLASSPATH (call this case B), the Chinese characters cannot be converted correctly.

Our analysis:

1. Testing shows that in case A the system default encoding is GBK or GB2312. In case B, when the program starts, the Java console displays "Can't find resource for sun.awt.windows.awtLocalization_zh_CN", and the system default encoding is "8859-1".

2. If "GBK" or "GB2312" is specified directly when converting the string, instead of the system default encoding, the program still runs normally in case A, but in case B the system reports an error: UnsupportedEncodingException.

3. On the client, we extracted the JDK's classes.zip into another directory and set the CLASSPATH to contain only that directory. We then deleted the .class files in that directory step by step while running the test program. In the end we found that, of more than a thousand class files, only one is indispensable: sun.io.CharToByteDoubleByte.class. Copying this file to the server alongside the other classes and importing it at the start of the program still does not make the program work properly in case B.
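The UnsupportedEncodingException seen in case B can be anticipated in code. The following sketch checks whether the runtime has a converter for a charset and falls back to ISO-8859-1 when it does not. The class name SafeDecode and the bogus charset name are hypothetical, and Charset.isSupported belongs to the java.nio.charset API, which is newer than the JDK used in the article's tests.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SafeDecode {
    /** Decodes raw bytes with the preferred charset, falling back to ISO-8859-1 if it is missing. */
    public static String decode(byte[] raw, String preferred) {
        try {
            return new String(raw, preferred);
        } catch (UnsupportedEncodingException e) {
            // The runtime has no converter for this charset (the situation of case B).
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("GBK supported:   " + Charset.isSupported("GBK"));

        byte[] raw = { (byte) 0xD6, (byte) 0xD0 };   // GBK bytes of U+4E2D
        // With a GBK converter present the two bytes decode to one character.
        System.out.println(decode(raw, "GBK").length());
        // With an unknown charset the fallback yields one char per byte.
        System.out.println(decode(raw, "X-NO-SUCH-CHARSET").length());   // 2
    }
}
```

Falling back to ISO-8859-1 does not make the text readable, but it keeps the raw bytes intact so that a later conversion on a better-equipped system remains possible.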

4. In case A, if sun.io.CharToByteDoubleByte.class is removed from the CLASSPATH, the program reports the default encoding as "8859-1" when it runs; otherwise it reports "GBK" or "GB2312".

If the JDK version is 1.2 or later, the problem encountered in case B is solved well. The test steps are the same as above, and interested readers can try it.

Root-cause analysis and solution of the Java Chinese problem

Under Simplified Chinese MS Windows 98 with JDK 1.3, System.getProperties() returns some basic properties of the Java runtime environment, and the class PoorChinese helps us list them. Source code of class PoorChinese:

public class PoorChinese {
    public static void main(String[] args) {
        // List every system property of the running JVM on standard output.
        System.getProperties().list(System.out);
    }
}

After running java PoorChinese we get: the system variable file.encoding has the value GBK, user.language has the value zh, and user.region has the value CN. The values of these system variables determine that the system default encoding is GBK.
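Listing every property produces a lot of output. A narrower sketch that prints only the encoding-related properties mentioned above might look like this. The class name EncodingProps and the "(not set)" placeholder are illustrative, and on newer JVMs user.region may be undefined, which is why the helper supplies a default.

```java
import java.nio.charset.Charset;

public class EncodingProps {
    /** Returns the named system property, or "(not set)" when the JVM does not define it. */
    public static String prop(String key) {
        return System.getProperty(key, "(not set)");
    }

    public static void main(String[] args) {
        for (String key : new String[] { "file.encoding", "user.language", "user.region" }) {
            System.out.println(key + " = " + prop(key));
        }
        // The charset the JVM actually uses when no encoding is named explicitly.
        System.out.println("default charset = " + Charset.defaultCharset().name());
    }
}
```

Printing these values on both the development machine and the deployment machine is a quick way to spot the kind of mismatch described in cases A and B.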

On the system above, the following code converts a GB2312 file into a Big5 file. It helps us understand how Chinese character encodings are transformed in Java:

import java.io.*;
import java.util.*;

public class Gb2Big5 {
    static int iCharNum = 0;

    public static void main(String[] args) {
        System.out.println("Input GB2312 file, output Big5 file.");
        if (args.length != 2) {
            System.err.println("Usage: jview Gb2Big5 gbfile big5file");
            System.exit(1);
        }
        String inputString = readInput(args[0]);
        writeOutput(inputString, args[1]);
        System.out.println("Number of characters in file: " + iCharNum + ".");
    }

    // Encode the Unicode string as Big5 and write it to the output file.
    static void writeOutput(String str, String strOutFile) {
        try {
            FileOutputStream fos = new FileOutputStream(strOutFile);
            Writer out = new OutputStreamWriter(fos, "Big5");
            out.write(str);
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Read the input file, decoding its bytes from GB2312 into Unicode characters.
    static String readInput(String strInFile) {
        StringBuffer buffer = new StringBuffer();
        try {
            FileInputStream fis = new FileInputStream(strInFile);
            InputStreamReader isr = new InputStreamReader(fis, "GB2312");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                iCharNum += 1;
                buffer.append((char) ch);
            }
            in.close();
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}

The encoding transformation proceeds as follows:

          ByteToCharGB2312            CharToByteBig5
GB2312 ------------------> Unicode ------------------> Big5

Run java Gb2Big5 gb.txt big5.txt. If gb.txt contains "今天星期三" (Today is Wednesday), the characters in big5.txt display correctly. If gb.txt contains "情人节快乐" (Happy Valentine's Day), the characters in big5.txt that correspond to "节" and "乐" are both the symbol "?" (0x3F). Evidently sun.io.ByteToCharGB2312 and sun.io.CharToByteBig5 do not handle every character well. As this example shows, Java's basic classes may have problems of their own.
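One plausible reading of the "?" result is that the simplified forms of those two characters have no code point in Big5, so no converter could map them. A sketch with the java.nio.charset API (newer than the sun.io converters named above) lets you check which characters a target charset can encode before converting. The class name Big5Coverage and both helper methods are hypothetical, the sample strings are written as Unicode escapes, and the code assumes the runtime provides a Big5 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Big5Coverage {
    /** Returns the characters of text that the named charset cannot encode. */
    public static String unmappable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) missing.append(c);
        }
        return missing.toString();
    }

    /** Formats each character as U+XXXX so the output is readable on any console. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wednesday = "\u4ECA\u5929\u661F\u671F\u4E09";   // "Today is Wednesday"
        String valentine = "\u60C5\u4EBA\u8282\u5FEB\u4E50";   // "Happy Valentine's Day"
        System.out.println("[" + describe(unmappable(wednesday, "Big5")) + "]");
        System.out.println("[" + describe(unmappable(valentine, "Big5")) + "]");
        // String.getBytes substitutes the charset's replacement byte for unmappable characters.
        System.out.println(describe(new String(
                "\u8282".getBytes(Charset.forName("Big5")), Charset.forName("Big5"))));
    }
}
```

If characters turn out to be unmappable, a plain charset conversion is not enough; converting simplified text to Big5 also needs a simplified-to-traditional character mapping, which is outside what these converters do.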

Because the internationalization work was not carried out in China, these basic classes were not rigorously tested before release, so their support for Chinese characters is not as complete as Java Soft claims.

Not long ago a technical friend of mine wrote to tell me that he had finally found the root of the Java Servlet Chinese problem. For two weeks he had been plagued by it, because every string containing Chinese characters had to be converted explicitly to get the correct result (this seemed to be the only solution). Eventually he was no longer willing to put up with it, because this sort of thing should not be a senior programmer's job, and he dug out the source code of the servlet decoding routine, which he suspected was the culprit. After four hours of struggle he found the root of the problem. His suspicion was correct: the servlet decoding code does not consider double-byte characters at all and treats each %XX directly as one character. (Even Java Soft makes low-level mistakes like this!)

If you are interested in this problem or have the same trouble, you can modify servlet.jar following his steps: in the source code of HttpUtils, find static private String parseName, copy sb (the StringBuffer) into byte bs[] before returning, and then return new String(bs, "GB2312"). After this modification you need to do the decoding yourself with Hashtable form = HttpUtils.parseQueryString(...) or form = HttpUtils.parsePostData(...). Don't forget to compile the change and put it back into servlet.jar.

V. Summary of the Java Chinese problem

The Java programming language grew up in the networked world, which requires Java to support multilingual characters well. Java met the needs of network computing, and this laid a solid foundation for its rapid growth there. Java Soft has taken Java's support for multilingual characters into account, but the current solution still has many defects and needs some compensating measures from us. The international standards organization is also working to unify all human scripts in a single encoding; one such scheme is ISO 10646, which uses four bytes to represent a character. Until such a scheme is adopted, of course, we hope that Java Soft will test its products rigorously and make life more convenient for users.

Appendix: a processing function for fixing garbled Chinese taken from a database or the network. Its input is the problematic string and its output is the corrected string.
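The function itself does not appear in this copy of the article. The sketch below shows what such a helper could look like; it is an assumption, not the author's code. The name fixGarbled, the class ChineseFix and the guard that leaves already-correct strings untouched are all hypothetical choices.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ChineseFix {
    /**
     * Hypothetical helper: takes a string whose bytes were wrongly decoded as
     * ISO-8859-1 and returns it re-decoded with the charset it was written in.
     */
    public static String fixGarbled(String in, Charset actual) {
        if (in == null) return null;
        for (int i = 0; i < in.length(); i++) {
            // A char above 0xFF cannot come from an ISO-8859-1 decode,
            // so the string is assumed to be correct already.
            if (in.charAt(i) > 0xFF) return in;
        }
        byte[] raw = in.getBytes(StandardCharsets.ISO_8859_1);   // recover the original bytes
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4E2D\u6587";
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(fixGarbled(garbled, gbk).equals(original));    // true
        System.out.println(fixGarbled(original, gbk).equals(original));   // true
    }
}
```

The guard makes the helper safe to call more than once on the same value. Without it, a second call on an already-fixed string would replace the Chinese characters with "?".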

