Analysis and Solution of Chinese Character Problems in Java Programming

zhaozj, 2021-02-08

In Java programming we often run into problems processing and displaying Chinese characters. A screen full of garbled text is certainly not what we want to see, so how can we get Chinese characters to display correctly? Java's default encoding is Unicode, whereas the files and databases we normally use are encoded in GB2312 or BIG5. How do we choose the right Chinese character encoding and handle it correctly? This article starts from the basics of Chinese character encoding and, using Java programming examples, analyzes these two questions and proposes solutions.

The Java programming language is now widely used on the Internet. When Sun developed Java, it already considered support for non-English characters. The Java Runtime Environment (JRE) published by Sun comes in an English edition and an international edition, and only the international edition supports non-English characters. In practice, however, Java's support for Chinese characters is not as perfect as the JavaSoft specifications claim, because there is more than one Chinese character set and different operating systems support Chinese characters differently. As a result, many problems related to Chinese character encoding trouble us during application development. There are many answers to these problems, but they are fragmentary and do not satisfy the urgent wish to solve the problem once and for all. Systematic studies of the Java Chinese problem are rare. This article starts from the basics of Chinese character encoding and analyzes the Java Chinese problem, in the hope of helping readers solve it.

Basics of Chinese character encoding

English characters are generally represented by one byte, and the most commonly used encoding is ASCII. One byte can distinguish at most 256 characters, while there are thousands of Chinese characters, so Chinese characters are represented with two bytes. To distinguish them from English characters, the highest bit of each byte must be 1. Two bytes can represent up to 64K characters. The encodings we meet most often are GB2312, BIG5, and Unicode. Readers interested in the details of a specific encoding can consult the relevant references. Here I will briefly discuss GB2312 and Unicode, the two most relevant to us.

GB2312, the national standard code for Chinese character information interchange, is an encoding for simplified Chinese characters issued by China's national standards bureau. It is used in mainland China and Singapore and is commonly called the national standard code. Of the two bytes, the value of the first byte (high byte) is the zone number plus 32 (20H), and the value of the second byte (low byte) is the position number plus 32 (20H). Together these two values represent the code of one Chinese character.

Unicode is a multi-byte, fixed-length encoding designed to solve the problem of multinational characters. It stays compatible with English characters by prefixing them with a zero byte. For example, the ASCII code of "A" is 0x41, and its Unicode encoding is 0x00, 0x41. With special tools, text can be converted between the various encodings.

A preliminary understanding of the Java Chinese problem

When developing applications in Java, we inevitably have to handle Chinese. Java's default encoding is Unicode, while the databases and files we normally use are based on GB2312. We often encounter situations like these: a website built on JSP shows garbled text in the browser, a file shows garbled text when opened, and database content modified through Java can no longer provide correct information when used elsewhere.
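To make the byte layouts from the encoding basics concrete, here is a minimal sketch that prints the bytes of "A" in ASCII and in big-endian UTF-16, and the GB2312 bytes of one Chinese character. The class name `EncodingBasics` and the helper `hex` are illustrative names, not part of the original article. It assumes a JDK that ships the GB2312 charset; Java's GB2312 charset produces the form with the high bit of each byte set, so each byte equals the zone or position number plus 0xA0 (the 0x20 offset described above plus 0x80).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBasics {
    // Format a byte array as space-separated uppercase hex, e.g. "00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A" is one byte in ASCII and two bytes in big-endian UTF-16.
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII))); // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41

        // U+554A is the first Chinese character of GB2312 (zone 16, position 1).
        // It is written as an escape so the source file stays pure ASCII.
        byte[] gb = "\u554A".getBytes(Charset.forName("GB2312"));
        System.out.println(hex(gb)); // B0 A1

        // Subtracting 0xA0 recovers the zone and position numbers.
        int zone = (gb[0] & 0xFF) - 0xA0;
        int position = (gb[1] & 0xFF) - 0xA0;
        System.out.println("zone=" + zone + " position=" + position); // zone=16 position=1
    }
}
```

Both GB2312 bytes have their highest bit set, which is how a decoder tells them apart from single-byte ASCII characters.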

    String sEnglish = "apple";
    String sChinese = "苹果";
    String s = "苹果 apple";

The length of sEnglish is 5, the length of sChinese is 4, and the default length of s is 14. For sEnglish, all Java classes provide very good support and it will certainly display correctly. For sChinese and s, however, although JavaSoft states that Java's basic classes take multinational characters into account (Unicode by default), problems can arise when the operating system's default encoding is not Unicode but, for example, the national standard code. Getting from Java source code to a correct result involves the chain "Java source code -> Java bytecode -> virtual machine -> operating system -> display device". At every step of this chain the encoding of Chinese characters must be handled correctly for the final display to be correct.

"Java source code -> Java bytecode": the standard Java compiler javac uses the system default character set, for example GBK on Chinese Windows and ISO-8859-1 on Linux. This is why Chinese characters in source files compiled on Linux come out wrong. The solution is to pass the encoding parameter when compiling, which makes compilation platform independent. The usage is:

    javac -encoding GBK

"Java bytecode -> virtual machine -> operating system": the Java Runtime Environment (JRE) comes in an English edition and an international edition, and only the international edition supports non-English characters. The Java Development Kit (JDK) certainly supports multinational characters, but not all computer users have the JDK installed. To support Java better, many operating systems and applications embed the international edition of the JRE, which makes it easier for them to support multinational characters.

"Operating system -> display device": for Chinese characters, the operating system must support them and be able to display them. An English operating system that is not paired with special application software definitely cannot display Chinese.

There is one more issue: during Java programming, Chinese characters must be converted between encodings correctly. For example, when you output a Chinese string to a web page, whether you use

    out.println(string); // string is a string containing Chinese

or <%=string%>, the string must be converted to GBK, either manually or automatically. In JSP 1.0 the output character set can be declared so that the conversion of the internal code is handled internally. The usage is:

    <%@ page contentType="text/html; charset=GB2312" %>

Some JSP versions (for example JSP 0.92) do not support declaring the output character set, so the output has to be encoded manually. There are many ways to do this. The most common one is:

    String s1 = request.getParameter("keyword");
    String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");

The getBytes method converts the string into a byte array using the "ISO-8859-1" encoding, and "GBK" is the target encoding.
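The requirement that output be converted to GBK can be illustrated with a short sketch that encodes a string through a Writer with an explicit charset and compares the results. The class `OutputEncodingDemo` and its `encode` helper are hypothetical names introduced here, and the sketch assumes the runtime supports GBK. Note that `String.length()` counts UTF-16 code units, not bytes, so a figure of 4 for the two-character Chinese string corresponds to its GBK byte count rather than to `length()`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class OutputEncodingDemo {
    // Encode text through a Writer with an explicit charset and return the raw bytes.
    static byte[] encode(String text, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charsetName)) {
            out.write(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String sEnglish = "apple";
        // Two Chinese characters, written as escapes so the source file is
        // independent of the compiler's -encoding setting.
        String sChinese = "\u82F9\u679C";

        System.out.println(sEnglish.length());                  // 5
        System.out.println(sChinese.length());                  // 2
        System.out.println(encode(sEnglish, "GBK").length);     // 5
        System.out.println(encode(sChinese, "GBK").length);     // 4
        System.out.println(encode(sChinese, "UTF-8").length);   // 6
    }
}
```

Naming the charset explicitly at the point of output, rather than relying on the platform default, is what keeps the result the same on Chinese Windows and on Linux.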

Having read the Chinese string s1 from a database encoded in ISO-8859-1, we obtain through the above conversion the string s2, which displays correctly in operating systems and applications that support the GBK character set.

Surface analysis and handling

Background

    Development environment: JDK1.15, VCafe2.0, JPadPro
    Server: NT, IIS, Sybase System, jConnect (JDBC)
    Client: IE5.0, PWin98

The .class files are stored on the server and the applet is run by the client's browser. The applet only serves to load main program classes such as the Frame class. The interface includes TextField, TextArea, List, Choice, and so on.

I. Reading Chinese

After a SELECT statement executed through JDBC reads (Chinese) data from the server, appending the data to a TextArea (TA) with the append method does not display it correctly. When the data is added to a List, however, most Chinese characters display correctly. If the data is first converted into a byte array using the "ISO-8859-1" encoding and then converted back to a String using the system default encoding, it displays correctly in both the TA and the List. The code fragment is:

    dbstr2 = results.getString(1);
    // After reading the result from the DB server, convert it to a string.
    dbbyte1 = dbstr2.getBytes("ISO-8859-1");
    dbstr1 = new String(dbbyte1);

If "GBK" or "GB2312" is used directly instead of the system default encoding when converting the string, reading data from the database works in both case A and case B (defined below).

II. Writing Chinese to the database

The handling is the reverse of reading Chinese. First convert the SQL statement into a byte array using the system default encoding, then convert it to a String using the "ISO-8859-1" encoding, and finally send it for execution. The Chinese information is then written to the database correctly. The code fragment is:

    sqlstmt = tf_input.getText();
    // Before sending the statement to the DB server, convert it to an SQL statement.
    dbbyte1 = sqlstmt.getBytes();
    sqlstmt = new String(dbbyte1, "ISO-8859-1");
    _stmt = _con.createStatement();
    _stmt.executeUpdate(sqlstmt);
    ...

Problem: the code above executes correctly if the client has a CLASSPATH pointing to the JDK's classes.zip (case A). If the client has only a browser, with no JDK and no CLASSPATH (case B), the Chinese characters cannot be converted correctly.

Our analysis:

1. Testing shows that in case A the system default encoding at run time is GBK or GB2312. In case B, the following message appears in the browser's Java console when the program starts: Can't find resource for sun.awt.windows.awtLocalization_zh_CN. The system default encoding is then "8859-1".

2. If "GBK" or "GB2312" is used directly instead of the system default encoding when converting the string, the program still runs normally in case A, but in case B the system reports an error: UnsupportedEncodingException.

3. On the client, extract the JDK's classes.zip into another directory and make CLASSPATH contain only that directory. Then delete the .class files in that directory step by step while running the test program. In the end, out of more than a thousand class files, only one turns out to be indispensable: sun.io.CharToByteDoubleByte.class. Copying this file to the server alongside the other classes and importing it at the start of the program still does not make the program work in case B.

4. In case A, if sun.io.CharToByteDoubleByte.class is removed from the CLASSPATH, the default encoding measured at run time is "8859-1"; otherwise it is "GBK" or "GB2312".

If the JDK version is 1.2 or later, the problem encountered in case B is well resolved. The test steps are the same as above, and interested readers can try them.

Root-cause analysis and solution

Under Simplified Chinese MS Windows 98 with JDK 1.3, System.getProperties() returns some basic properties of the Java runtime environment. The class PoorChinese helps us obtain them. Source code of class PoorChinese:

    public class PoorChinese {
        public static void main(String[] args) {
            System.getProperties().list(System.out);
        }
    }

After running java PoorChinese, we find that the system property file.encoding is GBK, user.language is zh, and user.region is CN. The values of these system properties determine that the system default encoding is GBK.
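The default-encoding checks and the UnsupportedEncodingException from case B can be approximated on newer runtimes with the java.nio.charset API, which did not exist in the JDK versions the analysis was done on. The following is a minimal sketch under that assumption. `DefaultEncodingCheck` and `chooseCharset` are hypothetical names: the helper returns the preferred charset when the runtime supports it and falls back to the platform default otherwise, instead of failing at conversion time.

```java
import java.nio.charset.Charset;

class DefaultEncodingCheck {
    // Return the preferred charset if this runtime supports it,
    // otherwise fall back to the platform default charset.
    static Charset chooseCharset(String preferred) {
        try {
            if (Charset.isSupported(preferred)) {
                return Charset.forName(preferred);
            }
        } catch (IllegalArgumentException e) {
            // The name is null or syntactically illegal; fall through to the default.
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // The same property that PoorChinese lists, read directly.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // Whether GBK is available depends on the runtime, as cases A and B show.
        System.out.println("GBK supported   = " + Charset.isSupported("GBK"));
        System.out.println("chosen charset  = " + chooseCharset("GBK"));
    }
}
```

Falling back silently to the default can still produce garbled text, so in real code it may be preferable to report the missing charset to the user.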

On the system above, the following code converts a GB2312 file into a Big5 file. It helps us understand how Chinese character encodings are converted in Java:

    import java.io.*;

    public class gb2big5 {
        // Number of characters read from the input file.
        static int iCharNum = 0;

        public static void main(String[] args) {
            System.out.println("Input GB2312 file, output Big5 file.");
            if (args.length != 2) {
                System.err.println("Usage: jview gb2big5 gbfile big5file");
                System.exit(1);
            }
            String inputString = readInput(args[0]);
            if (inputString == null) {
                System.exit(1); // reading failed; the stack trace was already printed
            }
            writeOutput(inputString, args[1]);
            System.out.println("Number of Characters in file: " + iCharNum + ".");
        }

        // Write the string to the output file, encoding it as Big5.
        static void writeOutput(String str, String strOutFile) {
            try {
                FileOutputStream fos = new FileOutputStream(strOutFile);
                Writer out = new OutputStreamWriter(fos, "Big5");
                out.write(str);
                out.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Read the input file, decoding it as GB2312, and count its characters.
        static String readInput(String strInFile) {
            StringBuffer buffer = new StringBuffer();
            try {
                FileInputStream fis = new FileInputStream(strInFile);
                InputStreamReader isr = new InputStreamReader(fis, "GB2312");
                Reader in = new BufferedReader(isr);
                int ch;
                while ((ch = in.read()) > -1) {
                    iCharNum += 1;
                    buffer.append((char) ch);
                }
                in.close();
                return buffer.toString();
            } catch (IOException e) {
                e.printStackTrace();
                return null;
            }
        }
    }

The encoding conversion proceeds as follows:

              ByteToCharGB2312             CharToByteBig5
    GB2312 ------------------> Unicode ------------------> Big5

Run java gb2big5 gb.txt big5.txt. If the content of gb.txt is "今天星期三" ("Today is Wednesday"), the characters in the resulting file big5.txt display correctly. If the content of gb.txt is "情人节快乐" ("Happy Valentine's Day"), the characters in big5.txt corresponding to "节" and "乐" are both the symbol "?" (0x3F). Evidently the two basic classes sun.io.ByteToCharGB2312 and sun.io.CharToByteBig5 have not been written well. As this example shows, Java's basic classes may have problems too.
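The "?" (0x3F) substitution in the conversion above can be detected before writing by asking the target charset's encoder whether it can represent each character. The sketch below uses `CharsetEncoder.canEncode` from java.nio.charset (available from JDK 1.4, so later than the JDK used in the article). `Big5Coverage` and `unmappable` are illustrative names. One possible reading of the result, offered as an assumption rather than the author's conclusion, is that Big5 is a Traditional Chinese character set and has no code for some simplified forms, so an encoder has nothing to map them to.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Big5Coverage {
    // Collect the characters of text that the named charset cannot encode.
    static String unmappable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    // Print each character of s as a code point such as U+8282.
    static void printCodePoints(String label, String s) {
        StringBuilder sb = new StringBuilder(label).append(":");
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format(" U+%04X", (int) s.charAt(i)));
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        // The two sample sentences from the text, written as escapes.
        String wednesday = "\u4ECA\u5929\u661F\u671F\u4E09"; // "Today is Wednesday"
        String valentine = "\u60C5\u4EBA\u8282\u5FEB\u4E50"; // "Happy Valentine's Day"

        printCodePoints("not in Big5 (wednesday)", unmappable(wednesday, "Big5"));
        printCodePoints("not in Big5 (valentine)", unmappable(valentine, "Big5"));
    }
}
```

A converter could use such a check to warn about, or substitute traditional forms for, characters that would otherwise be silently replaced.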

Because the internationalization work was not done in China, these basic classes were not strictly tested before release, so support for Chinese characters is not as perfect as JavaSoft claims. Not long ago, a technical friend wrote to tell me that he had finally found the root of the Java Servlet Chinese problem. For two weeks he had been troubled by it, because every string containing Chinese characters had to be forcibly converted to get the correct result (this seems to be the only solution everyone recognizes). Eventually he no longer wanted to put up with this, because such chores should not be the work of a senior programmer. He dug out the source code of the servlet's decoding routine, because he suspected that the problem lay there. After four hours of struggle he found the root of the problem. His suspicion was correct: the decoding part of the servlet does not consider double-byte characters at all and treats each %XX directly as one character. (So even JavaSoft makes such a low-level mistake!)

If you are interested in this problem or have the same trouble, you can modify servlet.jar by following his steps. Find static private String parseName in the HttpUtils source code. Before it returns, copy sb (the StringBuffer) into a byte array bs[], and then return new String(bs, "GB2312"). After this change you need to do the decoding yourself:

    Hashtable form = HttpUtils.parseQueryString(request.getQueryString());

or

    form = HttpUtils.parsePostData(...);

Don't forget to compile the class and put it back into servlet.jar.

Summary of the Java Chinese problem

The Java programming language grew up in the networked world, which requires Java to support multinational characters well. Java adapted to the needs of networked computing, and this laid a solid foundation for its rapid growth on the network. JavaSoft has taken support for multinational characters into account, but the current solution has many defects, and we have to apply some compensating measures ourselves. International standards bodies are also working to unify all human writing systems in a single encoding. One such scheme is ISO 10646, which uses four bytes to represent a character. Until such a scheme is adopted, we can only hope that JavaSoft tests its products rigorously and makes life easier for users.

Attached is a function for repairing garbled Chinese strings taken from a database or from the network. Its input is the problematic string and its output is the repaired string.
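As an illustration of what a helper of this kind might look like, here is a minimal sketch. It is not the author's function. It assumes the garbling came from GBK bytes being decoded as ISO-8859-1, which is the situation described in the database and JSP examples. `GarbledTextFixer` and `fix` are hypothetical names. The guard on characters above 0xFF is a design choice of this sketch: a string that already contains real Chinese characters cannot be the result of an ISO-8859-1 decode, so it is returned unchanged rather than damaged by a second conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GarbledTextFixer {
    // Hypothetical helper: re-decode a string that was wrongly decoded as ISO-8859-1.
    static String fix(String garbled, String actualCharset) {
        if (garbled == null) {
            return null;
        }
        // An ISO-8859-1 decode only ever yields chars in the range 0x00..0xFF.
        for (int i = 0; i < garbled.length(); i++) {
            if (garbled.charAt(i) > 0xFF) {
                return garbled; // already proper text, leave it alone
            }
        }
        // ISO-8859-1 maps each char back to the byte it came from, so this is lossless.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        String original = "\u82F9\u679C"; // two Chinese characters, as escapes
        // Simulate the garbling: GBK bytes wrongly decoded as ISO-8859-1.
        String garbled = new String(original.getBytes(Charset.forName("GBK")),
                StandardCharsets.ISO_8859_1);

        System.out.println(garbled.length());                     // 4
        System.out.println(fix(garbled, "GBK").equals(original)); // true
        System.out.println(fix(original, "GBK").equals(original)); // true
    }
}
```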

When reprinting, please cite the original address: https://www.9cbs.com/read-850.html
