Oracle Database Character Set Problem Analysis ZZ

xiaoxiao 2021-03-06

I often see people asking about Oracle character sets, so I want to introduce the topic in several iterations.

First iteration: master the basic concepts of character sets

Some readers may think this step is unnecessary, but many problems and questions actually stem from an unclear grasp of the basic concepts.

The concept of a character set. Electronic computers were originally built for scientific calculation (hence the name "computer"), but as the technology developed they were put to many other uses. That requires computers to handle not only numbers but also text, special symbols, and other information. Since the computer itself can directly process only numeric values, text and symbols must be encoded as numbers. The earliest character set is the familiar ASCII, which uses seven bits to represent 128 characters. Later, to meet the needs of different countries and organizations, many more character sets appeared, such as the ISO 8859 series for Western European characters and GB2312-80 and GBK for Chinese characters. In essence, a character set assigns a distinct numeric code to each symbol in a particular collection so that the computer can process it.

Conversion between character sets. With so many character sets, a problem arises: the same character may be encoded as one value in one character set and as a different value in another. Suppose I create two character sets, demo_charset1 and demo_charset2. In demo_charset1 I define codes for three symbols: A (0001), B (0010), ? (1111). In demo_charset2 I also define codes for three symbols: A (1001), C (1011), ? (1111). I am then given a task: write a program that converts between demo_charset1 and demo_charset2.

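
A minimal sketch of such a conversion program, in Python, might look like the following. The two dictionaries simply restate the demo tables; the function name `convert` and the use of `?` as the fallback for symbols the target lacks are assumptions made for illustration.

```python
# Code tables for the two demo character sets (symbol -> 4-bit code).
DEMO_CHARSET1 = {"A": 0b0001, "B": 0b0010, "?": 0b1111}
DEMO_CHARSET2 = {"A": 0b1001, "C": 0b1011, "?": 0b1111}

REPLACEMENT = "?"  # assumed fallback symbol for characters the target lacks


def convert(codes, source, target):
    """Convert a list of codes from the source charset to the target charset."""
    decode = {code: symbol for symbol, code in source.items()}
    result = []
    for code in codes:
        symbol = decode[code]
        # A symbol missing from the target is replaced, which loses information.
        result.append(target.get(symbol, target[REPLACEMENT]))
    return result


converted = convert([0b0001, 0b0010, 0b1111], DEMO_CHARSET1, DEMO_CHARSET2)
print([format(code, "04b") for code in converted])  # ['1001', '1111', '1111']
```

In this run A changes its code, ? keeps its code, and B can only be mapped to the fallback, so the original B cannot be recovered from the result.
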
Since the encoding rules of both character sets are known, 0001 in demo_charset1 becomes 1001 when converted to demo_charset2, while 1111 in demo_charset1 keeps the same value in demo_charset2. The code 0010 in demo_charset1 corresponds to B, but demo_charset2 has no such character, so in theory it cannot be converted. In practice, all characters of this kind are uniformly converted to one special character in the target character set (called the "replacement character"). Here we can use ? as the replacement character, so B is converted to ? and information is lost. By the same principle, information is lost when the character C of demo_charset2 is converted to demo_charset1. In short, during character set conversion, any character of the source character set that is not defined in the target character set loses its information.

Database character set selection. When creating a database, one question to consider is which character set and national character set to choose (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE).

To answer this question, we must be clear about what data the database needs to store. If it only needs to store English, US7ASCII can be used as the character set; if it needs to store Chinese, we need a character set that supports Chinese (such as ZHS16GBK); if it needs to store text in multiple languages, choose UTF8. Choosing the database character set effectively defines the set of characters the database can handle, and changing it after the database has been created is subject to many restrictions, so think it through before creating the database.

Many people do not think this through when creating a database and simply pick a default such as WE8ISO8859P1 or US7ASCII. Neither of these character sets defines codes for Chinese characters, so storing Chinese in such a database is wrong in principle. It may appear to work, but it brings a series of problems for the use and maintenance of the database, which we analyze in the second iteration.

The client character set. Readers with some Oracle experience mostly know that the client is configured through NLS_LANG, which consists of the following parts: NLS_LANG = language_territory.charset. The third part is intended to state the default character set used by the client operating system. In normal usage, therefore, NLS_LANG should be configured according to the actual situation of the client machine, especially the character set part, so that Oracle can automatically convert between the database character set and the client character set as far as possible (if conversion is needed at all).

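
To make the three-part layout of NLS_LANG concrete, here is a small Python sketch that splits a value into its parts. The helper name `parse_nls_lang` is hypothetical, and the sketch does no validation; it only assumes the language_territory.charset layout described above.

```python
def parse_nls_lang(value):
    """Split an NLS_LANG value into (language, territory, charset)."""
    # Everything after the dot is the client character set.
    locale_part, _, charset = value.partition(".")
    # The first underscore separates language from territory.
    language, _, territory = locale_part.partition("_")
    return language, territory, charset


language, territory, charset = parse_nls_lang("AMERICAN_AMERICA.ZHS16GBK")
print(language, territory, charset)  # AMERICAN AMERICA ZHS16GBK
```

Only the third part affects character set conversion; the first two control language and regional conventions.
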
Summary of the first iteration:

Character set: encodes a specific set of symbols as numeric values that the computer can process.
Character set conversion: symbols that exist in both the source and the target character set convert, in theory, without loss of information; symbols that exist in the source character set but not in the target lose information when converted.
Database character set: choose a character set that contains all the symbols that will need to be stored.
Client character set setting: states the default character set of the client operating system.

Second iteration: deepen the understanding of the basic concepts through examples

Below I quote the post "Character Set Research and Inspections" published on ITPub. Its author listed the relevant experiments and raised some questions about the results; I will analyze his experimental results and answer his questions.

Experimental results analysis

Quote:

Originally posted by Tellin

Set the client character set to US7ASCII

D:\> set NLS_LANG=AMERICAN_AMERICA.US7ASCII

Check that the server character set is US7ASCII

SQL> SELECT * FROM nls_database_parameters;

PARAMETER            VALUE
-------------------- ----------
NLS_CHARACTERSET     US7ASCII

Create the test table

SQL> CREATE TABLE test (r1 VARCHAR2(10));

Table created.

Insert data (the Chinese word 东北, "Northeast")

SQL> INSERT INTO test VALUES ('东北');

1 row created.

SQL> SELECT * FROM test;

R1
----------
东北

SQL> EXIT

The data in this part of the experiment is stored and displayed correctly, so there seems to be no problem, but it actually conceals several hazards.

First, it is inappropriate to store Chinese characters in a database whose character set is US7ASCII: US7ASCII defines only 128 symbols and does not support Chinese. Second, since Chinese can be typed into SQL*Plus, the operating system default character set must support Chinese, yet the character set in NLS_LANG is set to US7ASCII. That is clearly incorrect, because it does not reflect the actual situation of the client.

The display is nevertheless correct, mainly because Oracle checks the database and client character set settings, finds them identical, and therefore performs no conversion as data passes between the client and the database. Specifically, when "东北" (Northeast) is entered on the client, the code of "东" is 182 (10110110), 171 (10101011) and the code of "北" is 177 (10110001), 177 (10110001); these bytes are stored in the database unchanged. As a result, the character set the database is labeled with no longer matches the content it actually holds, which is itself an inconsistency and an error. During SELECT, Oracle again finds that the database and client character set settings are the same, so it passes the stored content to the client unchanged, and the client operating system recognizes the bytes as Chinese character codes and displays them correctly.

In this example the settings are wrong, but the errors cancel out ("two negatives make a positive"), so from the application's point of view nothing seems wrong. The hidden danger is real, however: string functions such as LENGTH or SUBSTR may return unexpected results, and import/export (Import/Export) runs into even greater trouble.

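
The byte values quoted above, and the reason a length function can surprise, can be checked with a short Python sketch. It assumes Python's `gbk` codec as a stand-in for ZHS16GBK; it illustrates the byte arithmetic only and does not call Oracle.

```python
text = "东北"
stored = text.encode("gbk")  # the bytes a Chinese client actually sends

print(list(stored))   # [182, 171, 177, 177]
print(len(text))      # 2 characters as the user sees them
print(len(stored))    # 4 units if every byte is counted as one character

# Passed back unchanged and decoded by a GBK-aware client, the text survives.
print(stored.decode("gbk"))
```

A database that believes its content is single-byte has no way to know that these four bytes form two characters, which is why character-based string functions may not return what the application expects.
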
Some people have done many tests in this area, for example with "source database character set US7ASCII, export file character set US7ASCII or ZHS16GBK, target database character set ZHS16GBK". One tester concluded: "In Oracle92, no matter how this situation is handled, the export file cannot be imported into the Oracle9i database" and "In this case, we can use Oracle8i's export tool with the export character set set to US7ASCII, then modify the second and third bytes of the export file, changing 0001 to 0354, so that the US7ASCII data can be imported into the ZHS16GBK database."

I think these conclusions are better understood this way: since ZHS16GBK is a superset of US7ASCII, the conversion itself should pose no problem. The real issue is that a US7ASCII database, which should store only English characters, has been used unconventionally to store Chinese. It is then no surprise that conversion produces errors or garbled text; it would be stranger if it did not.

To avoid this, choose an appropriate character set when creating the database, and do not let the label (the database's character set setting) diverge from reality (the information actually stored in the database).

Experimental results analysis

Quote:

Change the client character set to ZHS16GBK

D:\> set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
D:\> sqlplus "/ as sysdba"

The data cannot be displayed

SQL> SELECT * FROM test;

R1
----------
6+11

Question 1: ZHS16GBK is a superset of US7ASCII, so why is the data not displayed properly in the ZHS16GBK environment?

This is mainly because Oracle finds that the database character set setting differs from the client's, so it converts the data between the two character sets. The data actually stored in the database is 182 (10110110), 171 (10101011), 177 (10110001), 177 (10110001). Because the database character set is US7ASCII, a 7-bit character set, Oracle ignores the highest bit of each stored 8-bit byte. So 182 (10110110) becomes 54 (0110110), which represents the digit "6" in ZHS16GBK (and in other character sets as well). The same happens to the other three bytes, so "东北" (Northeast) becomes "6+11".

Experimental results analysis

Quote:

Originally posted by Tellin

Insert data with ZHS16GBK

SQL> INSERT INTO test VALUES ('东北');

1 row created.

SQL> SELECT * FROM test;

R1
----------
6+11
??

SQL> EXIT

When the client character set is set to ZHS16GBK and "东北" (Northeast) is inserted, Oracle finds that the database character set setting differs from the client's and converts the data. The ZHS16GBK characters "东" and "北" have no counterpart in US7ASCII, so each is stored in the database as the uniform replacement character, here "?", encoded as 63 (00111111). At this point the input information has really been lost: whatever the character set settings are afterwards (as in the quoted results below), the second row returned by SELECT is always "??" (note that there are 2 question marks, not 4).
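
The "2, not 4" detail follows from replacement being applied per character rather than per byte. Python's `errors="replace"` handler behaves analogously and is used here only as an illustration; it is not Oracle's conversion routine.

```python
text = "东北"

# Encoding into a charset that lacks the characters yields one "?" per character.
stored = text.encode("ascii", errors="replace")
print(stored)        # b'??'
print(list(stored))  # [63, 63]

# The substitution cannot be undone: any later decoding still yields "??".
print(stored.decode("gbk"))  # ??
```

Two Chinese characters occupy four bytes in GBK, but since the conversion works on characters, only two replacement bytes reach the database.
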

Quote:

Change the client character set to US7ASCII

D:\> set NLS_LANG=AMERICAN_AMERICA.US7ASCII
D:\> sqlplus "/ as sysdba"

The data inserted with ZHS16GBK cannot be displayed, but the data inserted with US7ASCII can

SQL> SELECT * FROM test;

------------
东北

Change the server character set to ZHS16GBK

SQL> UPDATE props$ SET value$ = 'ZHS16GBK' WHERE name = 'NLS_CHARACTERSET';

1 row updated.

SQL> COMMIT;

Change the client character set to ZHS16GBK

D:\> set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
D:\> sqlplus "/ as sysdba"

The data inserted earlier with US7ASCII can be displayed, but the data inserted with ZHS16GBK cannot, which shows that the data inserted with ZHS16GBK is garbled.

SQL> SELECT * FROM test;

--------------------
东北

It should be pointed out that modifying the database character set via "UPDATE props$ SET value$ = 'ZHS16GBK' WHERE name = 'NLS_CHARACTERSET';" is an unconventional approach that is likely to cause problems; it appears here only because it is part of the quoted experiment.

Experimental results analysis

Quote:

SQL> INSERT INTO test VALUES ('东北');

1 row created.

SQL> SELECT * FROM test;

R1
--------------------
东北
??
东北

SQL> EXIT

Since the database and the client are now both set to ZHS16GBK, no character set conversion takes place. The first and third rows are displayed correctly, while the second row is displayed as "??" because the data actually stored for it is 63 (00111111), 63 (00111111).
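
The state of the three rows at this point can be modeled in a few lines of Python. The byte contents of each row are taken from the analysis in this article, not read from a real database, and the `gbk` codec again stands in for ZHS16GBK.

```python
# Assumed contents of the three rows, as bytes, per the analysis above.
rows = [
    bytes([182, 171, 177, 177]),  # row 1: passed through unconverted
    bytes([63, 63]),              # row 2: replacement characters
    "东北".encode("gbk"),          # row 3: inserted with matching settings
]

# With no conversion, the client simply interprets the stored bytes as GBK.
for number, raw in enumerate(rows, start=1):
    print(number, list(raw), raw.decode("gbk"))
```

Rows 1 and 3 hold identical bytes even though they were inserted under different settings, while row 2 holds genuine question marks.
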

Quote:

Change the client character set to US7ASCII

D:\> set NLS_LANG=AMERICAN_AMERICA.US7ASCII
D:\> sqlplus "/ as sysdba"

The data cannot be displayed

SQL> SELECT * FROM test;

------------

Question 2: The first row was inserted in the US7ASCII environment, so why is it not displayed properly?

After the client character set is changed to US7ASCII, Oracle finds that the database character set is ZHS16GBK, so the data must be converted. The Chinese characters "东" and "北" in the first and third rows have no counterpart in the client character set US7ASCII, so each is converted to the replacement character "?", while the second row already holds two "?" characters. So although all three rows appear on the client as "??", the content stored in the database differs.

Experimental results analysis

Quote:

SQL> INSERT INTO test VALUES ('东北');

1 row created.

SQL> EXIT

Change the client character set to ZHS16GBK

D:\> set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
D:\> sqlplus "/ as sysdba"

The data inserted with US7ASCII cannot be displayed, but the data inserted with ZHS16GBK can

SQL> SELECT * FROM test;

R1
--------------------

Question 3: US7ASCII is a subset of ZHS16GBK, so why can the data inserted in the US7ASCII environment not be displayed?

When the client character set is US7ASCII and "东北" (Northeast) is inserted into the ZHS16GBK database, character conversion is required. The ZHS16GBK code of "东北" is 182 (10110110), 171 (10101011) and 177 (10110001), 177 (10110001). Because US7ASCII is a 7-bit encoding, Oracle treats the two Chinese characters as four characters and ignores the highest bit of each byte, so the codes stored in the database become 54 (00110110), 43 (00101011) and 49 (00110001), 49 (00110001), that is, "6+11": the original information has been changed. When the client character set is then set to ZHS16GBK and the data is queried, no conversion is needed between the database and the client. The first and third rows display correctly because the stored information was never changed; the second and fourth rows cannot show the original text because the stored information was changed on insert.

The analysis has been long, but the conclusion is simple. To have fewer errors and less trouble with character sets, stick to two basic principles:

On the database side: choose the character set you actually need (specified by the CHARACTER SET and NATIONAL CHARACTER SET clauses of CREATE DATABASE).
On the client side: set the character set actually used by the operating system (through the environment variable NLS_LANG).

For example: CHARACTER SET ZHS16GBK NATIONAL CHARACTER SET AL16UTF16
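
As a closing illustration of the byte arithmetic behind the "6+11" results in the experiments, the following Python sketch clears the highest bit of each GBK byte. The helper name `strip_high_bit` is hypothetical; it models the 7-bit truncation described in the analysis and is not Oracle's actual conversion code.

```python
def strip_high_bit(data):
    """Clear the top bit of every byte, as a 7-bit charset effectively does."""
    return bytes(b & 0x7F for b in data)


original = "东北".encode("gbk")   # bytes 182, 171, 177, 177
truncated = strip_high_bit(original)

# Show each byte before and after truncation, with the resulting ASCII symbol.
for before, after in zip(original, truncated):
    print(f"{before:3d} ({before:08b}) -> {after:2d} ({after:08b}) {chr(after)!r}")

print(truncated.decode("ascii"))  # 6+11
```

Once the high bits are gone, nothing in the stored bytes distinguishes them from the plain ASCII text "6+11", so the original Chinese cannot be recovered.
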

When reposting, please cite the original address: https://www.9cbs.com/read-94561.html
