Multibyte-Character Processing in J2EE
Develop J2EE Applications with Multibyte Characters
Summary
Most J2EE servers support multibyte-character languages (such as Chinese and Japanese) very well, but different J2EE servers and browsers support them differently. When developers port Chinese (or Japanese) localized applications from one server to another, they often face multibyte-character problems. In this article, Wang Yu analyzes the root causes of these problems.
4,500 words
April 19, 2004
WANG Yu
The Chinese language is one of the most complex and comprehensive languages in the world. Sometimes I feel lucky to be Chinese, specifically when I see some of my foreign friends struggle to learn the language, especially writing Chinese characters. However, I do not feel so lucky when developing localized Web applications using J2EE. This article explains why.
Though the Java platform and most J2EE servers support internationalization well, I am still confronted by multibyte-character problems when developing Chinese or Japanese language-based applications:
What is the difference between encoding and charset?
Why do multibyte-character applications display differently when ported from one operating system to another?
Why do multibyte-character applications display differently when ported from one application server to another?
Why do my multibyte-character applications display well on the Internet Explorer browser, but not on other browsers?
Why do applications on most J2EE servers display incorrectly when using UTF-16 (Universal Transformation Format) encoding?
If you are asking the same set of questions, this article helps you answer them.

Basic knowledge of characters
Characters existed long before computers. More than 3,000 years ago, special characters (named oracles) appeared in ancient China. These characters have special visual forms and special meanings, and most have names and pronunciations. All of these facets compose the character repertoire: a set of distinct characters defined by a particular language, with no relationship to the computer at all. Over thousands of years, many languages evolved and thousands of characters were created. Now we are trying to digitize all these characters into 1s and 0s, so computers can understand them.
When typing words with a keyboard, you deal with character input methods. For simple characters, there is a one-to-one mapping between a key and a character. For more complex languages, a character may need multiple keystrokes.
Before you can see characters on the screen, the operating system must store them in memory. In fact, the OS defines a one-to-one correspondence between the characters in a character repertoire and a set of nonnegative integers, which are stored in memory and used by the OS. These integers are called character codes.
Characters can be stored in a file or transmitted through a network. Software uses character encoding to define a method (algorithm) for mapping sequences of character codes into sequences of octets. Some character codes map into one byte, such as ASCII codes; other character codes, such as those for Chinese and Japanese, map into two or more bytes, depending on the character-encoding schema.
Different languages may use different character repertoires, and each character repertoire uses particular encodings. Sometimes, when you choose a language, you implicitly choose a character repertoire, which uses an implied character encoding. For example, when you choose the Chinese language, you may, by default, use the GBK Chinese character repertoire and a special encoding schema also named GBK. I avoid the term character set because it causes confusion. Strictly speaking, character set is a synonym of character repertoire. Character set is misused in the HTTP MIME (Multipurpose Internet Mail Extensions) header, where "charset" is used to mean "encoding."
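To make the distinction between a character code and its encodings concrete, here is a small sketch of my own (not part of the article's source code) that encodes the same character code under different encoding schemas; it assumes your JDK ships with the GBK converter:

    import java.io.UnsupportedEncodingException;

    public class EncodingCompare {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = "\u738B"; // one Chinese character, character code U+738B

            // The character code stays the same; only the byte sequences differ.
            System.out.println("GBK        : " + toHex(s.getBytes("GBK")));        // 2 bytes
            System.out.println("UTF-8      : " + toHex(s.getBytes("UTF-8")));      // 3 bytes
            System.out.println("ISO-8859-1 : " + toHex(s.getBytes("ISO-8859-1"))); // 1 byte ('?'): information lost
        }

        private static String toHex(byte[] bytes) {
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < bytes.length; i++) {
                sb.append(Integer.toHexString(bytes[i] & 0xFF)).append(' ');
            }
            return sb.toString();
        }
    }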
One of Java's many features is the 16-bit character. This feature supports Unicode, a standard way of representing many different kinds of characters in various languages. Unfortunately, this same feature also causes many problems when developing multibyte J2EE applications, which is what this article focuses on.
Development phases cause display problems
J2EE application development includes several phases (shown in Figure 1); each phase can cause multibyte-character display problems.
Figure 1. J2EE application development life cycle
Coding phase
When you code your J2EE applications, you most likely use an IDE like JBuilder or NetBeans, or an editor like UltraEdit or vi. Whatever you choose, if you have literal strings in your JSP (JavaServer Pages), Java, or HTML files, and these literal strings contain multibyte characters such as Chinese or Japanese, you will most likely encounter display problems if you are not careful.
A literal string is static information stored in files. Different encodings are used for different language characters. Most IDEs set their default encoding to ISO-8859-1, which is intended for ASCII characters and causes multibyte characters to lose information. For example, in the Chinese version of NetBeans, the default setting for file encoding is, unfortunately, ISO-8859-1. When I edit a JSP file with some Chinese characters (shown in Figure 2), everything seems correct. As mentioned above, all the characters shown on the screen are in memory and have no direct relationship with encoding. After saving the file, if you close the IDE and reopen it, these characters appear incomprehensible (shown in Figure 3), because ISO-8859-1 encoding loses some information when storing Chinese characters.

Figure 2. Chinese characters in NetBeans
Figure 3. Chinese characters in chaos
Character-encoding APIs
Several APIs in the servlet and JSP specifications handle the character-encoding process in J2EE applications. For a servlet request, setCharacterEncoding() sets the encoding schema for the current HTTP request's body. For a servlet response, setContentType() and setLocale() set the MIME header encoding for the output HTTP response.
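As a reference point, here is a minimal, hypothetical servlet sketch (not from the article's source code) that uses these APIs explicitly; the charset name GBK and the parameter name are just examples:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Locale;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ExplicitEncodingServlet extends HttpServlet {
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Request side: tell the container how to decode the request body
            // (must be called before the first parameter is read).
            request.setCharacterEncoding("GBK");
            String name = request.getParameter("username");

            // Response side: name the charset directly ...
            response.setContentType("text/html; charset=GBK");
            // ... or, alternatively, let the container map a locale to a charset:
            // response.setLocale(Locale.CHINA);

            PrintWriter out = response.getWriter();
            out.println("<html><body>Hello, " + name + "</body></html>");
        }
    }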
These APIs cause no problems themselves. On the contrary, problems arise when you forget to use them. For example, in some servers you can display multibyte characters properly without using any of the above APIs in your code, but when you run the application in other servers, the characters appear incomprehensible. The reason for this multibyte-character display problem lies in how the servers treat character encoding during HTTP requests and responses. The following rules apply to most servers when determining the character encoding in requests and responses:

When processing a servlet request, the server uses the following order of precedence, first to last, to determine the request character encoding:
Code-specific settings (for example, the setCharacterEncoding() method)
Vendor-specific settings
The default setting
When processing a servlet response, the server uses the following order of precedence, first to last, to determine the response character encoding:
Code-specific settings (for example, the setContentType() and setLocale() methods)
Vendor-specific settings
The default setting
According to the above rules, if you give instructions in code using these APIs, all servers will obey them when choosing the character-encoding schema. Otherwise, different servers behave differently. Some vendors use hidden fields in the HTTP form to determine the request encoding; others use specific configuration files. The default settings can differ too: most vendors use ISO-8859-1 as the default setting, while a few use the OS's locale settings. Thus, some multibyte character-based applications have display problems when ported to another vendor's J2EE server.
Compile phase
You can store multibyte literal strings, if correctly set, in source files during the edit phase. But these source files cannot execute directly. If you write servlet code, the Java files must be compiled to class files before being deployed to the application server. For JSP, the application server automatically compiles the JSP files to class files before executing them. During the compile phase, character-encoding problems are still possible. To see the following simple demo, download this article's source code.

Listing 1. EncodingTest.java
1  import java.io.ByteArrayOutputStream;
2  import java.io.OutputStreamWriter;
3
4  public class EncodingTest {
5      public static void main(String[] args) {
6          OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
7          System.out.println("Current encoding: " + out.getEncoding());
8          System.out.println("Literal output: 你好"); // you may not see the Chinese string
9      }
10 }
Some explanation about the source code:
We use the following code to determine the system's current encoding:

6          OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
7          System.out.println("Current encoding: " + out.getEncoding());

Line 8 includes a direct print-out of a Chinese-character literal string (you may not see the string correctly due to your OS language settings). Store this Java source file with GBK encoding.
Look at the execution result shown in Figure 4.
Figure 4. Sample output
From the result in Figure 4, we can conclude that:
The Java compiler (javac) uses the system's language environment as the default encoding setting, and so does the Java runtime environment.
Only the first result is correct; the other strings display incomprehensibly.
Multibyte literal strings display correctly only when the runtime encoding setting is the same as the one used to store the source file (alternatively, you must convert from one encoding schema to another; please see the "Runtime phase" section).

Server configuration phase
Before you run your J2EE application, you should configure it to meet your specific needs. In the previous section, we found that different language settings can cause literal-string display problems. Actually, several levels of configuration exist, and all of them can cause problems for multibyte characters.
OS level
Language support in the operating system is most important. The language support on the server side affects the JVM's default encoding settings, as described above. The language support on the client side, such as fonts, can also directly affect character display, but this article doesn't focus on that.
J2EE application server level
Most servers have a per-server setting to configure the default behavior of character-encoding processing. For example, Listing 2 is part of Tomcat's configuration file (located in $TOMCAT_HOME/conf/web.xml):
Listing 2. web.xml
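In Tomcat, for example, the relevant per-server knob is the javaEncoding init-param of the Jasper JspServlet declaration, which controls the encoding Jasper uses when it generates Java source from JSP pages. The fragment below is only a representative sketch, assuming a Tomcat 4.x/5.x distribution:

    <servlet>
        <servlet-name>jsp</servlet-name>
        <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
        <init-param>
            <!-- Encoding Jasper uses for the generated Java source files -->
            <param-name>javaEncoding</param-name>
            <param-value>UTF8</param-value>
        </init-param>
        <load-on-startup>3</load-on-startup>
    </servlet>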
JVM level
Most servers can run multiple instances simultaneously, and each server instance can have an individual JVM instance. Plus, you can have separate settings for each JVM instance. Most servers have locale settings for each instance to define the default language support.
Figure 5. Sun ONE Application Server Setting
As shown in Figure 5, the Sun ONE (Open Network Environment) Application Server has a per-instance setting for locale. This setting determines the default character-encoding behavior for the logging system and standard output.
On the other hand, different servers may use distinct JVM versions, and different JDK versions support different encoding standards. All these issues can cause porting problems. For example, Sun ONE Application Server and Tomcat support J2SE 1.4, while others support only J2SE 1.3. J2SE 1.4 supports Unicode 3.1, which has many new features previous versions lacked.

Per-application level
Every application deployed on the server can be configured with its own encoding settings before it runs within the server container. This feature allows multiple applications using different languages to run inside one server instance. For example, in some servers, you can give the following character-encoding settings for each deployed application to indicate which encoding schema your application should use:
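In the Sun ONE Application Server, for instance, such a per-application setting lives in the sun-web.xml deployment descriptor. The fragment below is a hedged sketch of that server's locale-charset-info element (which the next paragraph refers to); the locale and charset values are just examples:

    <sun-web-app>
        <!-- Maps a locale to the charset used when encoding the response -->
        <locale-charset-info default-locale="zh_CN">
            <locale-charset-map locale="zh_CN" charset="GBK"/>
        </locale-charset-info>
    </sun-web-app>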
The reason for all these configuration levels is flexibility and maintainability. Unfortunately, they can cause problems when porting from one server to another, because not all configurations adhere to standards. For example, if you develop your application on a server that supports the locale-charset-info setting, you may have difficulties if you want to port the application to another server that does not support this encoding setting.
Runtime phase
At runtime, your J2EE application most likely communicates with other external systems. Your applications may read and write files, or use databases to manage data. In other cases, an LDAP (lightweight directory access protocol) server stores identity information. In all these situations, data must be exchanged between the J2EE application and the external systems. If your data contains multibyte characters, such as Chinese characters, you may face some issues.

Most external systems have their own encoding settings. For example, an LDAP server most likely uses UTF-8 to encode characters, while the Oracle database system uses the environment variable NLS_LANG to indicate the encoding style. If you install Oracle on a Chinese OS, this variable defaults to ZHS16GBK, which uses GBK encoding to store Chinese characters. So if your J2EE application's encoding settings differ from the external system's, a conversion is needed. The following code is common in these situations:
byte[] defaultBytes = original.getBytes();
String newEncodingStr = new String(defaultBytes, OLD_ENCODING);
The above code shows how to convert a string from one encoding to another. For example, suppose you have stored a username (multibyte characters) in an LDAP server, which uses UTF-8 encoding, while your J2EE application uses GBK encoding. When your application gets usernames from LDAP, they may not be decoded correctly. To resolve this, you can use original.getBytes("GBK") to get the original bytes, then construct a new string using new String(defaultBytes, "UTF-8"), which displays correctly.
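A minimal, self-contained sketch of that round trip (my own illustration, assuming the garbled string was produced by decoding UTF-8 bytes as GBK):

    import java.io.UnsupportedEncodingException;

    public class EncodingFixer {

        // Re-decodes a string that was read with the wrong encoding.
        // wrongEncoding: the charset the application used by mistake (for example, "GBK").
        // rightEncoding: the charset the external system really used (for example, "UTF-8").
        public static String recode(String original, String wrongEncoding, String rightEncoding)
                throws UnsupportedEncodingException {
            byte[] rawBytes = original.getBytes(wrongEncoding); // recover the original byte sequence
            return new String(rawBytes, rightEncoding);         // decode it with the correct charset
        }
    }

    // Usage: String fixedName = EncodingFixer.recode(nameFromLdap, "GBK", "UTF-8");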
Client display phase
Most J2EE applications now use the browser-server architecture, which employs browsers as their clients. To display multibyte characters correctly in browsers, you should take note of the following:

Browser language support
To display multibyte characters correctly, the browser and the OS where the browser runs should have language-specific support, such as fonts and the character repertoire.
Browser encoding settings
The HTML header that the server returns, such as <meta http-equiv="Content-Type" content="text/html; charset=GBK">, gives the browser an instruction about which encoding the page uses. Otherwise, the browser uses its default encoding setting or tries to detect one automatically. Alternatively, users can set the page's encoding manually, as shown in Figure 6.
Figure 6. Netscape's Encoding-Setting Page
Thus, if a page lacks any instructions, the multibyte characters may display incorrectly. In such situations, users must manually set the current page's encoding.
HTTP POST encoding
The situation grows more complicated when you post data to the server using the form tag in HTML pages. Which encoding the browser uses depends on the encoding settings of the current page, which contains the form tag. That means that if you construct an ASCII-encoded HTML page using ISO-8859-1, a user cannot post Chinese characters from that page: all the posted data uses ISO-8859-1 encoding, which causes the Chinese characters to lose some bytes. That is the HTML standard, which all browsers abide by.
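On the server side, the practical consequence is that the request body must be decoded with the same charset the form page was served with. The following is a hedged sketch (not one of the article's demos); it assumes the page carrying the form was served as text/html; charset=GBK and that the form has a field named comment:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class GbkPostServlet extends HttpServlet {
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // The page carrying the <form> was served with charset=GBK, so the
            // browser posted the form data in GBK; decode the body the same way,
            // before the first call to getParameter().
            request.setCharacterEncoding("GBK");
            String comment = request.getParameter("comment");

            // Answer in the same encoding so the result displays correctly.
            response.setContentType("text/html; charset=GBK");
            response.getWriter().println("<html><body>" + comment + "</body></html>");
        }
    }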
HTTP GET encoding
Things become more troublesome when you add multibyte characters to URL links, like <a href="...?username=**">View detail information of this user</a> (** represents multibyte characters). Such scenarios are common; for example, you can put usernames or other information in links and transfer them to the next page. But when non-US-ASCII characters appear in a URL, its format is not clearly defined in RFC (request for comments) 2396. Different browsers use their own methods for encoding multibyte characters in URLs.

Take Mozilla, for example (shown in Figures 7, 8, 9, and 10); it always performs URL encoding before the HTTP request is sent. As we know, during the URL-encoding process, a multibyte character is first converted into two or more bytes using some encoding scheme (such as UTF-8 or GBK). Then each byte is represented by the three-character string %xy, where xy is the byte's two-digit hexadecimal representation. For more information, please consult the HTML specification.
I use the following gbk_test.jsp page as a demo:
Listing 3. gbk_test.jsp
<%@ page contentType="text/html; charset=GBK" %>
<html>
<body>
<h1>Test for GBK encoded URL</h1>
<!-- a link whose query string embeds the Chinese character \u738B goes here -->
</body>
</html>

The escape sequence \u738B denotes a Chinese character; that is my family name. This page displays as shown in Figure 7.
Figure 7. URL in Mozilla
When the mouse moves over the link, you can see the link's address in the status bar, which shows a Chinese character embedded inside the URL. When you click the link, you can see clearly in the address bar that the character is URL-encoded. The character \u738B encodes to %CD%F5, which is the result of URL encoding combined with GBK encoding. On the server side, I can get the query string using a simple method, request.getQueryString(). In the next line, I use another method, request.getParameter(String), to show this character as a comparison to the query string, shown in Figure 8.

Figure 8. URL encoding in Mozilla
When I change the current page's encoding from GBK to UTF-8, you can see the result: \u738B encodes to %E7%8E%8B, as shown in Figure 9, which is the result of URL encoding combined with UTF-8 encoding.
Figure 9. URL encoding in Mozilla
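You can reproduce both results offline with java.net.URLEncoder (a small sketch of my own, not part of the article's demo code); the two-argument encode() method that takes a charset name requires J2SE 1.4 or later:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class UrlEncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String name = "\u738B"; // the Chinese character discussed above

            System.out.println(URLEncoder.encode(name, "GBK"));   // GBK bytes CD F5      -> %CD%F5
            System.out.println(URLEncoder.encode(name, "UTF-8")); // UTF-8 bytes E7 8E 8B -> %E7%8E%8B
        }
    }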
But Microsoft Internet Explorer treats multibyte URLs differently. IE never performs URL encoding before the HTTP request is sent; the encoding scheme used for the multibyte characters in the URL depends on the current page's encoding scheme, as shown in Figure 10.
Figure 10. No URL encoding in IE
IE also has an advanced option setting that forces the browser to always send URL requests with UTF-8 encoding, as shown in Figure 11.
Figure 11. Advanced option setting in IE
Given the above behavior, you will face a problem: if your application pages have multibyte characters embedded in URL links and work in Mozilla with GBK encoding, the application will encounter problems when users employ IE with the setting that forces the browser to always send URL requests with UTF-8 encoding.
Solution to multibyte-character problems
Writing J2EE applications that can run on any server and be viewed correctly in any browser is a challenge. Some solutions for multibyte-character problems in J2EE applications follow.

General principle: never assume any default settings on either the client side (browser) or the server side.
In the edit phase, never assume that your IDE's default encoding settings are what you want; set them manually. If your IDE does not support a specific language, you can use the \uXXXX escape sequence in your Java code and the &#XXXX; escape sequence in your HTML pages, or use the native2ascii tool shipped with the JDK to convert native literal strings to Unicode escape sequences. That helps you avoid most of these problems.

In the coding phase, never assume your server's default encoding-processing settings are correct. Use the following methods to give specific instructions:
Request: setCharacterEncoding()
Response: setContentType(), setLocale(), <%@ page contentType="text/html; charset=encoding" %>

When developing multilanguage applications, choose a UTF-8 encoding scheme or use the \uXXXX escape sequence for all language characters. When compiling a Java class, ensure the current language environment variables and encoding scheme are correctly set. In the configuration phase, use standard settings as much as possible. For example, in the Servlet 2.4 specification, a standard is available for configuring every application's character-encoding scheme:
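That standard element is the locale-encoding-mapping-list entry in web.xml. The fragment below is a sketch of it (the zh_CN/GBK mapping is only an example):

    <locale-encoding-mapping-list>
        <locale-encoding-mapping>
            <locale>zh_CN</locale>
            <encoding>GBK</encoding>
        </locale-encoding-mapping>
    </locale-encoding-mapping-list>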
Always give obvious instructions to browsers in HTML pages, such as a <meta> tag that declares the page's charset, and do not assume that the browsers' default settings are correct.

Do not embed multibyte characters into links. For example, do not take usernames as query strings; take the user's ID instead. If your links must embed multibyte characters, encode the URL manually, either through server-side Java programming or client-side programming, such as JavaScript or VBScript. On the server side, you can convert each character into its hexadecimal representation with helpers such as:

    static public String charToHex(char c) {
        // Returns the hex string representation of char c
        byte hi = (byte) (c >>> 8);
        byte lo = (byte) (c & 0xFF);
        return byteToHex(hi) + byteToHex(lo);
    }

    static public String byteToHex(byte b) {
        // Companion helper assumed by charToHex: two-digit hex for one byte
        int i = b & 0xFF;
        return (i < 0x10 ? "0" : "") + Integer.toHexString(i).toUpperCase();
    }
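Alternatively (a hedged suggestion, not from the original demos), on J2SE 1.4 or later you can let java.net.URLEncoder do the percent-encoding with an explicit charset when the link is built, so the result no longer depends on any browser setting; the parameter name and charset here are illustrative:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class LinkBuilder {
        // Builds a query string whose value is percent-encoded with a fixed charset.
        public static String buildQuery(String value) throws UnsupportedEncodingException {
            return "name=" + URLEncoder.encode(value, "UTF-8");
        }
    }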
A harder problem to solve: UTF-16
Using the above knowledge, let's analyze a real problem in one of my ISV's (independent software vendor) projects: UTF-16 in J2EE.

The current Chinese character standard (GB18030) defines and supports 27,484 Chinese characters. Though this number seems large, it is not enough to satisfy all Chinese people. Today, the Chinese language has more than 60,000 characters, and the number increases rapidly every year. This situation greatly hinders the Chinese government in its effort to achieve information digitalization. For example, my sister's given name is not in the standard character set, so bank or mail-system computers cannot print it.

My ISV wants to build a complete Chinese character system to satisfy all people, so it defines its own character repertoire. Two options exist for defining these character codes: use the GB18030 standard, which can extend to more than 160,000 characters, or use Unicode 3.1, which can support 1,112,064 characters. The GB18030 standard defines encoding rules, also called GB18030; it is simple to use, and the current JDK supports it. However, if we use Unicode 3.1, we can choose from three encoding schemes: UTF-8, UTF-16, or UTF-32. My ISV wants to use UTF-16 encoding to handle its Unicode extension for Chinese characters. The most important feature of UTF-16 encoding is that all the ASCII characters are encoded as 16-bit units, which causes problems in all phases. After trying several servers, the ISV found that J2EE applications cannot support UTF-16 encoding at all. Is this true? Let's analyze every development phase to find the problems.
Edit phase
If we have multibyte literal strings in our Java, JSP, or HTML source files, we need the IDE's support. I use NetBeans, which can easily support UTF-16 encoding; just set the text-encoding attribute to UTF-16. Figure 12 shows a simple UTF-16-encoded JSP page containing only the static literal string "hello world!" This page executes in Tomcat and displays in Mozilla.

Figure 12. UTF-16 page in Mozilla
Compile phase
Since we have UTF-16-encoded characters in our Java or JSP source files, we need compiler support. We can use javac -encoding UTF-16 to compile the Java source files. With NetBeans, setting the compiler attribute through the GUI is easy. By running some simple code, we find that we can use UTF-16-encoded characters in servlet files and execute them with no problems.

Compiling JSP files dynamically at runtime proves trickier. Fortunately, most servers can be configured to set the Java encoding for their JSP pages. But, unfortunately, when testing in Tomcat and Sun ONE Application Server, I found that the Jasper tool, which converts JSP files to servlet Java source files, fails to recognize JSP tags, such as <%@ page ... %>, encoded with UTF-16; all these tags are treated as literal strings! I think the root cause may lie in Jasper, which most application servers use as a JSP compiler, because it uses byte units to detect JSP special tokens and tags.
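A quick way to see why a byte-oriented scanner misses the tags (my own sketch, not from the article's code): encode the tag prefix with UTF-16 and inspect the raw bytes; the ASCII byte sequence 3C 25 40 for "<%@" never appears contiguously.

    import java.io.UnsupportedEncodingException;

    public class Utf16TagBytes {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Sun's JDK writes UTF-16 as big-endian with a byte-order mark.
            byte[] bytes = "<%@".getBytes("UTF-16");
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < bytes.length; i++) {
                sb.append(Integer.toHexString(bytes[i] & 0xFF)).append(' ');
            }
            // Typically prints: fe ff 0 3c 0 25 0 40 -- no contiguous 3c 25 40
            System.out.println(sb.toString());
        }
    }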
Browser test
Presently, we find that JSP cannot support literal UTF-16-encoded characters because of the failure to detect UTF-16-encoded JSP tags, but servlets work with no problems. Hold on! To make the test more meaningful, let's add a POST function to our test code to let users post some UTF-16-encoded characters through the HTML form tag. Download the following demos from this article's source code: servlet PostForm.java and servlet ByteTest.java.

Servlet PostForm.java is used to output a UTF-16-encoded page with a form ready to post data to the server. In ByteTest.java, I do not use request.getParameter() to show the posted data from the browser, because I am unsure whether the server is configured for UTF-16 encoding. Instead, I use request.getInputStream() to retrieve the raw data from the request and print every byte of whatever we get from the browser.

Listing 5. PostForm.java

public class PostForm extends HttpServlet {
    ....
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html; charset=UTF-16");
        PrintWriter out = response.getWriter();
        out.println("<html>");
        out.println("<head>");
        out.println("</head>");
        out.println("