JDBC and character set summary Danci.z (small thanks), 2003.11.16
The character set problems encountered when accessing a database through JDBC come down to the following factors:
- Character set handling in the JVM
Internally the JVM works entirely in Unicode: a Java string is held in memory as UTF-16 code units. When the Java compiler reads a .java source file, it converts the text from the file's encoding. For example, if you compile .java files on Chinese Windows, you may have noticed that the bytes of a string in the .java file differ from the bytes of the same string in the .class file. This is because the .java file itself is encoded in GB2312, while the .class file stores string constants in a Unicode encoding (a modified form of UTF-8). If your editor supports it, you can write .java source directly in UTF-8 and have the Java compiler decode it as UTF-8, for example with javac's -encoding option.
A similar conversion happens on output. In the case above, a UTF-16 string in memory is usually converted to GB2312 when it is printed to the console.
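The conversions described above can be observed from Java itself. The following is a minimal sketch and not part of the original article: the class name EncodingDemo and the sample character are illustrative assumptions. It encodes one in-memory string with several charsets and prints the resulting bytes in hex. GB2312 is only used if the running JDK provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Render a byte array as space-separated uppercase hex, e.g. "E4 B8 AD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the same in-memory string with the given charset and show the bytes.
    static String encodeHex(String s, Charset cs) {
        return hex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        // U+4E2D, written as an escape so the encoding of this source file does not matter.
        String s = "\u4E2D";
        System.out.println("UTF-8    : " + encodeHex(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE : " + encodeHex(s, StandardCharsets.UTF_16LE));
        System.out.println("UTF-16BE : " + encodeHex(s, StandardCharsets.UTF_16BE));
        // GB2312 is an optional charset; only use it when the runtime supports it.
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312   : " + encodeHex(s, Charset.forName("GB2312")));
        }
        // The charset used when no encoding is given explicitly (console output, getBytes()).
        System.out.println("default  : " + Charset.defaultCharset());
    }
}
```

The string in memory is the same in every call; only the bytes produced at the boundary differ, which is the source of the mismatch between a .java file, a .class file and the console.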
- Character sets used by a JSP page
A JSP page is always preprocessed into a .java program and compiled into a .class file, so in the end a JSP is always a servlet. Two character sets are therefore involved: the one used by the .jsp file itself, and the one declared in the Content-Type of the servlet's output. Try to keep the two consistent, for example by using UTF-8 for both. The response implementation converts the UTF-16 strings in the JVM to the encoding declared in the <%@ page ... %> directive.
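What the response does with the declared charset can be imitated without a servlet container. The sketch below is a stand-in and not servlet code: ResponseEncodingSketch and render are hypothetical names, and a ByteArrayOutputStream plays the role of the network stream. It shows that the same string yields different bytes depending on the declared charset, and that characters the charset cannot represent are replaced on the way out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingSketch {
    // Write the body through a Writer bound to the declared charset,
    // the way a response writer turns in-memory strings into output bytes.
    static byte[] render(String body, Charset declared) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, declared)) {
            w.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String body = "\u4E2D\u6587"; // two Chinese characters, written as escapes

        byte[] utf8 = render(body, StandardCharsets.UTF_8);
        byte[] latin1 = render(body, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 length      : " + utf8.length);
        System.out.println("ISO-8859-1 length : " + latin1.length);
        // Decoding the UTF-8 bytes as UTF-8 restores the text.
        System.out.println("UTF-8 round trip  : "
                + new String(utf8, StandardCharsets.UTF_8).equals(body));
        // ISO-8859-1 cannot represent these characters, so the writer substitutes them.
        System.out.println("ISO-8859-1 output : "
                + new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```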
- Character set used by the Connection
The connection character set limits the characters that can appear in an SQL statement. This is most obvious with UTF-16. Most connections use a Latin-1 style character set, and if the connection does not use UTF-16, most UTF-16 encoded SQL statements become invalid: a SELECT statement encoded as UTF-16LE becomes "S\0E\0L\0E\0C\0T\0...", and the server's SQL parser treats the statement as finished when it meets the first '\0'. It is still possible to send a UTF-16LE encoded string over a Latin-1 connection: the SQL statement itself stays in Latin-1 and only the string literal inside it is UTF-16LE. In that case the UTF-16LE string must not contain any character from the Latin-1 range of Unicode (Latin letters, digits, English punctuation), otherwise the SQL parser reports a "string ended" error. (Why?)
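One way to see where the zero bytes come from is to look at the encoded bytes directly. In UTF-16, every character in the Latin-1 range has a zero high byte, so any such character puts a '\0' into the byte stream. The sketch below is illustrative only; NulByteCheck and its methods are hypothetical names. It also shows that a non-Latin character whose low byte happens to be zero, such as U+4E00, trips the same check.

```java
import java.nio.charset.StandardCharsets;

public class NulByteCheck {
    // True if the byte array contains a zero byte anywhere.
    static boolean containsNul(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) return true;
        }
        return false;
    }

    // True if the UTF-16LE encoding of s contains a zero byte,
    // which a parser that treats '\0' as a terminator would stop at.
    static boolean utf16leHasNul(String s) {
        return containsNul(s.getBytes(StandardCharsets.UTF_16LE));
    }

    public static void main(String[] args) {
        // Latin letters encode as (letter, 0x00) pairs.
        System.out.println("SELECT        : " + utf16leHasNul("SELECT"));
        // U+4E2D U+534E encode as 2D 4E 4E 53: no zero byte.
        System.out.println("U+4E2D U+534E : " + utf16leHasNul("\u4E2D\u534E"));
        // A single Latin letter mixed in brings a zero byte back.
        System.out.println("U+4E2D + 'A'  : " + utf16leHasNul("\u4E2DA"));
        // U+4E00 encodes as 00 4E, so it also contains a zero byte.
        System.out.println("U+4E00        : " + utf16leHasNul("\u4E00"));
    }
}
```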
- Database systems
Not all databases support Unicode, so you may need a character set conversion to store some character data. If the database supports only the Latin-1 character set (there are quite a few such systems), Chinese text can still be stored: encode the string as GB2312, then decode the resulting bytes as Latin-1. Confusing? If you are (or were) a C++ programmer, encoding here is similar to dynamic_cast and decoding is equivalent to reinterpret_cast.
sql_str = new String(java_str.getBytes("GB2312"), "ISO-8859-1");
Reading the data back is just the opposite:
java_str = new String(sql_str.getBytes("ISO-8859-1"), "GB2312");
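A round trip through a Latin-1 carrier string can be sketched as follows. This is an illustration under stated assumptions, not the article's test code: Latin1Tunnel and its two methods are hypothetical names, and the real charset is passed as a parameter so the example also runs on a JDK without GB2312. On the way in, the text is encoded with the real charset and the bytes are reinterpreted as ISO-8859-1 characters; on the way out, the steps are reversed. It works because ISO-8859-1 maps every byte value to exactly one character and back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Tunnel {
    // Before writing to a Latin-1-only database: encode with the real charset,
    // then reinterpret each byte as one ISO-8859-1 character.
    static String toCarrier(String javaStr, Charset real) {
        return new String(javaStr.getBytes(real), StandardCharsets.ISO_8859_1);
    }

    // After reading: turn the carrier characters back into bytes,
    // then decode those bytes with the real charset.
    static String fromCarrier(String sqlStr, Charset real) {
        return new String(sqlStr.getBytes(StandardCharsets.ISO_8859_1), real);
    }

    public static void main(String[] args) {
        // Prefer GB2312 as in the article; fall back to UTF-8 if the JDK lacks it.
        Charset real = Charset.isSupported("GB2312")
                ? Charset.forName("GB2312")
                : StandardCharsets.UTF_8;

        String original = "\u4E2D\u534E"; // two Chinese characters, written as escapes
        String carrier = toCarrier(original, real);
        String restored = fromCarrier(carrier, real);

        System.out.println("charset        : " + real);
        System.out.println("carrier length : " + carrier.length());
        System.out.println("round trip ok  : " + original.equals(restored));
    }
}
```

The carrier string is not meaningful text; it is only a container for bytes, so anything that reads the column without applying the reverse step will see mojibake.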
If the database system supports Unicode, use Unicode whenever you can. Some manuals suggest deciding case by case, because Unicode takes more storage space and, with UTF-8, sorting is said to be "30% slower" (MySQL). Do not let such statements worry you; in most cases none of this is a problem. For SQL Server 2000, this article is worth reading:
http://www.microsoft.com/china/msdn/library/techart/intlfeatures/techart/intlfeatures/techart/intlfeatures/techart/intlfeatures/techart/intlFeatures
The most important point is that a Unicode string literal must be prefixed with the character N (the N must be uppercase), for example:
INSERT INTO TABLE (Name_en, Name_native) VALUES ('Yokohama', N'Yokohama')
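When such a statement is assembled in Java, the N prefix and the quoting can be kept in one place. The helper below is a hypothetical sketch, not an API of any driver: NationalLiteral and nLiteral are invented names, and the table and column names are placeholders. It doubles embedded single quotes and adds the N prefix. In real code a PreparedStatement with setString is usually preferable, because the driver then handles quoting.

```java
public class NationalLiteral {
    // Build an N-prefixed SQL string literal, doubling embedded single quotes.
    static String nLiteral(String value) {
        return "N'" + value.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        // some_table, name_en and name_native are placeholder identifiers.
        String sql = "INSERT INTO some_table (name_en, name_native) VALUES ("
                + "'Yokohama', " + nLiteral("\u6A2A\u6D5C") + ")";
        System.out.println(sql);
        System.out.println(nLiteral("O'Neil"));
    }
}
```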
Sybase (Sybase 11.5, Sybase 12) does not support UTF-16 but does support UTF-8. To use Unicode, you may need a connection string like the following:
jdbc:sybase:Tds:127.0.1:4000/database?charset=utf8&jconnect_version=0
As with SQL Server, prefix string literals in SQL statements with N so that the SQL parser treats them as Unicode.
MySQL supports character set settings at four levels: connection, database, table and column. Chapter 9 of the MySQL reference manual discusses this in detail. Note that version 4.1.0 or later is required, and that Windows (NT, 2K, XP) users should be aware that 4.1.0 has a bug: you must use 4.1.1 to use Unicode correctly.
SQL Server and Sybase both have column types whose names begin with N, designed for storing internationalized text. In SQL Server, for example, ntext is actually stored as Unicode.
SQL-99 specifies a uniform U prefix for Unicode strings, such as u"コ コ ピュ ピュ", but I have not yet seen a database system that supports this syntax.
Appendix: tests of how several character sets are supported (the test source code can be requested from jljljl@yahoo.com)
Declarations:
Connection c;
Statement s;
Generating data:
String lit1 = "text: The People's Republic of China]";
String[] encs = new String[] { "(default)", "ISO-8859-1", "CP850", "GB2312", "GBK", "BIG5", "UTF-16LE", "UTF-16BE", "UTF-8", };
String javaSrc = "[This is the default code " + lit1;
byte[] rawdata;
s.executeUpdate("delete from stringtable");
for (int i = 0; i
Typical test results:
SQL-Server, Type = ntext
(default): [[this is the text of the People's Republic] -> [[[this is the text of the People]]
ISO-8859-1: [[?? ISO-8859-1 ????????????] -> [[?? ISO-8859-1 ???????????] ]
CP850: [[?? cp850 ????????????] -> [[?? cp850 ???????????]]
GB2312: [[this is GB2312 Text: The People's Republic of China]] -> [[This is the text of GB2312: the People's Republic of China]]
GBK: [[This is the text of GBK: the People's Republic of China]] -> [[this is GBK text: China People's Republic of China]]
BIG5: [[? 琌 BIG5 ゅセ ? チ ㎝?]] -> [? Is Big5 text: middle? People's Republic?]]
UTF-8: [[杩欐槸 UTF-8 镄勬 枃 chain 涓 崕 崕 皯鍏 拰 锲絔 锲絔] -> [[This is the text of UTF-8: People's Republic of China]]]
SQL-Server, Type = text
(default): [[This is the text of the People's Republic] -> [[[[[[this is the text of the People]
ISO-8859-1: [?? ISO-8859 -1 ????????????] -> [[?? ISO-8859-1 ???????????]
CP850: [[?? CP850??? ?????????????????????????????????????????????????]
GB2312: [[This is the text of GB2312: People's Republic of China]] -> [ [This is the text of GB2312: People's Republic of China]]
GBK: [[This is the text of GBK: People's Republic of China]] -> [[This is GBK text: People's Republic of China]
BIG5: [[? 琌 BIG5 ゅセ い? チ ㎝?]] -> [? Is the text of BIG5: middle? Sanmin 囝 and?]]
UTF-8: [[杩欐 槸 UTF-8 镄勬 枃 ] 崕 崕 皯鍏 皯鍏 拰 锲絔] -> [[This is the text of UTF-8: the People's Republic of China]]
Sybase, Type = char
(default): [[?? (default) ???????????] -> [[?? (default) ??????????? ]]
ISO-8859-1: [?? ISO-8859-1 ????????????] -> [?? ISO-8859-1 ???????? ???]]
CP850: [[?? cp850 ????????????] -> [[?? cp850 ???????????]]
GB2312: ?? GB2312 ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ?????]] -> [?? GBK ????????????]
BIG5: [?? big5 ???????????] -> [?? big5 ???????????]
UTF-16LE: -> [[? ??????????? People?]
UTF-16BE: -> [[?? 唀吀 ????????????????]
UTF-8: [[??? uTF-8 ?????????????????] -> [[??? UTF-8 ????????? ????????]
Sybase, Type = nchar
(default): [[?? (default) ???????????] -> [[?? (default) ??????????? ]]
ISO-8859-1: [?? ISO-8859-1 ????????????] -> [?? ISO-8859-1 ???????? ???]]
CP850: [[?? cp850 ????????????] -> [[?? cp850 ???????????]]
GB2312: ?? GB2312 ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ?????]] -> [?? GBK ????????????]
BIG5: [?? big5 ???????????] -> [?? big5 ???????????]
UTF-16LE: -> [[? ??????????? People?]
UTF-16BE: -> [[?? 唀吀 ??????????? 吿 ??]
UTF-8: [[??? uTF-8 ??????????? ??????] -> [[??? uTF-8 ?????????????????]
Sybase, Type = char, charset = utf8
(default): [[this is the text) of the text: the People's Republic of China] -> [[this is the text of the "DEFAULT): People's Republic of China]
ISO-8859-1: [?? ISO-8859-1 ????????????] -> [[?? ISO-8859-1 ???????????]]
CP850: [?? cp850 ?????????????????????] -> [[?? cp850 ???????????]
GB2312: [[This is the text of GB2312: People's Republic of China] -> [[This is the text of GB2312: People's Republic of China]
GBK: [[This is the text of GBK: People's Republic of China]] -> [[This is the text of GBK: People's Republic of China ]]
BIG5: [[? 琌 BIG5 ゅセ ? チ ㎝?]] -> [? is BIG5 text: middle? People's Republic?]]
UTF-16LE: [[This is UTF- 16LE text? People's Republic of China]]
UTF-16BE: [[This is the text of UTF-16BE? Chinese People's Republic]]
UTF-8: [[杩欐 槸 UTF-8 镄勬 枃 涓 崕 皯鍏 拰 拰 锲絔] -> [[[This is the text of UTF-8: People's Republic of China]]
Sybase, Type = nchar, charset = utf8
(default): [[This is the text of the People's Republic] -> [[[[[[this is the text of the People]
ISO-8859-1: [?? ISO-8859 -1 ????????????] -> [[?? ISO-8859-1 ???????????]
CP850: [[?? CP850??? ?????????????????????????????????????????????????]
GB2312: [[This is the text of GB2312: People's Republic of China]] -> [ [This is the text of GB2312: People's Republic of China]]
GBK: [[This is the text of GBK: People's Republic of China]] -> [[This is GBK text: People's Republic of China]
BIG5: [[? 琌 BIG5 ゅセ い? チ ㎝?]] -> [? Is the text of BIG5: middle? People Republic?]]
UTF-16LE: -> [[Is this a UTF-16LE text? Chinese people Republic]]
UTF-16BE: -> [[? This is the text of UTF-16BE? The Chinese people's republican]]
UTF-8: [[杩欐 槸 UTF-8 镄勬 枃 涓 崕 崕 皯鍏 拰 拰 锲絔] -> [[[This is the text of UTF-8]]
Sybase, Type = char, charset = cp936
(default): [[this is the text) of the text: People's Republic of China]] -> [[[This is the text of the "DEFAULT]]
ISO-8859-1: [?? ISO-8859-1 ????????????] -> [ [?? ISO-8859-1 ????????????]
CP850: [[?? cp850 ???????????]] -> [?? CP850? ?????????????????????]
GB2312: [[This is the text of GB2312: the People's Republic of China]] -> [[This is the text of GB2312: People's Republic of China]]
GBK: [[This is GBK Text: The People's Republic of China] -> [[[This is the text of GBK: the People's Republic of China]]
BIG5: [[? 琌 BIG5 ゅセ ? チ ㎝?]] -> [? Is BIG5 text: middle? People's Republic?]
UTF-16LE: -> [[This is the text of UTF-16LE? The People's Republic of China]]
UTF-16BE: -> [[] This is the text of UTF-16BE ? Chinese people's republic]]
UTF-8: [杩欐 槸 UTF-8 镄勬 枃 涓 崕 皯鍏 皯鍏 拰 锲絔 锲絔 锲絔 锲絔 锲絔 锲絔 锲絔 文 文 文 文 文 文 文 文 文 文 文 文 文 文 文 文 文
Sybase, Type = char, charset = eucgb
(default): [[This is the text of the People]] -> [[this is the text) of the text: People's Republic of China]
ISO-8859-1: [?? ISO-8859-1 ????????????] -> [[?? ISO-8859-1 ???????????]]
CP850: [?? cp850 ?????????????????????] -> [[?? cp850 ???????????]
GB2312: [[This is the text of GB2312: People's Republic of China] -> [[This is the text of GB2312: People's Republic of China]
GBK: [[This is the text of GBK: People's Republic of China]] -> [[This is the text of GBK: People's Republic of China ]]
BIG5: [[?? big5? ゅセ? ?? チ ???] -> [?? big5? text ??? people ???]]
UTF-16LE: -> [ [? 吀 ??????????? People?]]
UTF-16BE: -> [?? 唀吀 ??????????? ]
UTF-8: [[杩 ?? uTF-8 ?????? ?? 浜? ????] -> [[??? UTF-8 ???????? ? People ????????]
Sybase, Type = nchar, charset = eucgb
(default): [[This is the text of the People's Republic] -> [[[[[[this is the text of the People]
ISO-8859-1: [?? ISO-8859 -1 ????????????] -> [[?? ISO-8859-1 ???????????]
CP850: [[?? CP850??? ?????????????????????????????????????????????????]
GB2312: [[This is the text of GB2312: People's Republic of China]] -> [ [This is the text of GB2312: People's Republic of China]]
GBK: [[This is the text of GBK: the People's Republic of China]] -> [[This is GBK text: People's Republic of China]
BIG5: [?? Big5 ? ゅセ? ?? チ ???] -> [?? big5? Text? Medium ????]]
UTF-16LE: -> [[? ????? ????????????]]
UTF-16BE: -> [?? 唀吀 ??????????? 共 ??]
UTF-8: [[杩? UTF-8 ?????? ?? 浜 烘 ?? ???] -> [[??? uTF-8 ???????? people ??????? ?]
(As the tests above show, my .java file is encoded in GB2312.)