Everyone has been discussing databases lately. I recently took part in developing a data warehouse project, and the notes below are practical experience from studying database optimization. I am sharing them here; corrections are welcome!
SQL statements:
They are the only way to operate on the database (its data);
They consume 70% to 90% of database resources; they are independent of program design logic, so compared with optimizing program source code, optimizing SQL is low in both time cost and risk;
The same query can be written in different ways; SQL is easy to learn but difficult to master.
SQL optimization:
Keep fixed SQL writing habits: write the same query the same way wherever possible. Stored procedures are more efficient.
Write statements in a consistent format, including letter case, punctuation, and line-break positions.
Oracle optimizer:
Whenever possible, the optimizer evaluates expressions and converts particular syntactic constructs into equivalent constructs, for one of two reasons:
Either the resulting expression can be evaluated faster than the source expression,
Or the source expression is merely a semantic variant of the resulting expression.
Different SQL constructs sometimes perform the same operation (for example: = ANY (subquery) and IN (subquery)); Oracle maps them to a single semantic construct.
1 Constant optimization:
Computation on constants is done once, when the statement is optimized, not at each execution. Below are three expressions that retrieve monthly salaries greater than 2000:
sal > 24000/12
sal > 2000
sal * 12 > 24000
If the SQL statement contains the first form, the optimizer simply converts it into the second.
The optimizer does not simplify expressions across comparison operators, as in the third form. For this reason, compare the column against a constant expression and do not place the column inside an expression. Otherwise the optimizer cannot help: if there is an index on sal, the first and second forms can use it, while the third is unlikely to.
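The difference between the three forms can be sketched with SQLite, used here only as a convenient stand-in (an assumption: its planner is not Oracle's, and the sample rows and the index name idx_emp_sal are made up for illustration). All three predicates return the same rows, but only the first two let the planner search the index on sal.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("SMITH", 800), ("KING", 5000), ("JONES", 2975), ("WARD", 1250)],
)
conn.execute("CREATE INDEX idx_emp_sal ON emp (sal)")  # hypothetical index name

predicates = ["sal > 24000/12", "sal > 2000", "sal * 12 > 24000"]
results = {}
plans = {}
for p in predicates:
    sql = "SELECT ename FROM emp WHERE " + p
    results[p] = sorted(r[0] for r in conn.execute(sql))
    plans[p] = query_plan(conn, sql)
    print(f"{p:20} -> {plans[p]}")
```

In the printed plans, a line starting with SEARCH means the index is used to locate rows, while SCAN means every row is visited.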
2 Operator optimization:
The optimizer converts a search condition that uses the LIKE operator with a pattern containing no wildcards into an "=" condition.
For example: the optimizer converts ename LIKE 'SMITH' to ename = 'SMITH'
The optimizer can only do this for variable-length data types; in the previous example, if the ename column were of type CHAR(10), the optimizer would not convert it.
Generally speaking, LIKE is difficult to optimize.
The other operators are handled as follows:
~~ IN operator optimization:
The optimizer replaces a search condition that uses the IN comparison operator with an equivalent condition that uses "=" and "OR".
For example, the optimizer replaces ename IN ('SMITH', 'KING', 'JONES') with
ename = 'SMITH' OR ename = 'KING' OR ename = 'JONES'
~~ ANY and SOME operator optimization:
The optimizer replaces an ANY or SOME condition followed by a list of values with an equivalent condition that uses comparison operators and "OR".
For example, the optimizer replaces the first condition below with the second:
sal > ANY (:first_sal, :second_sal)
sal > :first_sal OR sal > :second_sal
The optimizer converts an ANY or SOME condition followed by a subquery into a condition consisting of "EXISTS" and a corresponding subquery.
For example, the optimizer replaces the first condition below with the second:
x > ANY (SELECT sal FROM emp WHERE job = 'ANALYST')
EXISTS (SELECT sal FROM emp WHERE job = 'ANALYST' AND x > sal)
~~ ALL operator optimization:
The optimizer replaces an ALL condition followed by a list of values with an equivalent condition that uses comparison operators and "AND". For example:
sal > ALL (:first_sal, :second_sal) is replaced with:
sal > :first_sal AND sal > :second_sal
For an ALL condition followed by a subquery, the optimizer substitutes a condition that uses ANY and the complementary comparison operator. For example:
x > ALL (SELECT sal FROM emp WHERE deptno = 10) is replaced with:
NOT (x <= ANY (SELECT sal FROM emp WHERE deptno = 10))
Next, the optimizer applies the ANY conversion rule to the second expression, producing:
NOT EXISTS (SELECT sal FROM emp WHERE deptno = 10 AND x <= sal)
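The subquery rewrites for ANY and ALL can be checked on a small data set. The sketch below uses SQLite, which has no ANY/ALL operators, so Python's any() and all() stand in for them (an assumption made for illustration, as are the sample rows, which contain no NULL salaries). It is only a sanity check of the equivalence, not a picture of what Oracle does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, job TEXT, deptno INTEGER, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("SCOTT", "ANALYST", 20, 3000),
        ("FORD", "ANALYST", 20, 2800),
        ("CLARK", "MANAGER", 10, 2450),
        ("MILLER", "CLERK", 10, 1300),
    ],
)

analyst_sals = [r[0] for r in conn.execute("SELECT sal FROM emp WHERE job = 'ANALYST'")]
dept10_sals = [r[0] for r in conn.execute("SELECT sal FROM emp WHERE deptno = 10")]

def exists_form(x):
    """x > ANY (subquery), written in the EXISTS form."""
    sql = "SELECT EXISTS (SELECT sal FROM emp WHERE job = 'ANALYST' AND ? > sal)"
    return bool(conn.execute(sql, (x,)).fetchone()[0])

def not_exists_form(x):
    """x > ALL (subquery), written in the NOT EXISTS form."""
    sql = "SELECT NOT EXISTS (SELECT sal FROM emp WHERE deptno = 10 AND ? <= sal)"
    return bool(conn.execute(sql, (x,)).fetchone()[0])

for x in (1000, 2450, 2900, 3500):
    # Python's any()/all() play the role of the ANY/ALL operators here.
    assert exists_form(x) == any(x > s for s in analyst_sals)
    assert not_exists_form(x) == all(x > s for s in dept10_sals)
    print(x, exists_form(x), not_exists_form(x))
```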
~~ BETWEEN operator optimization:
The optimizer always replaces a BETWEEN condition with the equivalent ">=" and "<=" comparisons.
For example, the optimizer replaces sal BETWEEN 2000 AND 3000 with sal >= 2000 AND sal <= 3000.
~~ NOT operator optimization:
The optimizer always tries to simplify search conditions to eliminate the "NOT" logical operator, by removing "NOT" and substituting the opposite comparison operator.
For example, the optimizer replaces the first condition below with the second:
NOT deptno = (SELECT deptno FROM emp WHERE ename = 'TAYLOR')
deptno <> (SELECT deptno FROM emp WHERE ename = 'TAYLOR')
Often a condition containing NOT can be written in many different ways. The optimizer's principle is to make the clause following each "NOT" as simple as possible, even if the resulting expression contains more "NOT" operators.
For example, the optimizer rewrites the first condition below as the second, and then as the third:
NOT (sal < 1000 OR comm IS NULL)
NOT sal < 1000 AND comm IS NOT NULL
sal >= 1000 AND comm IS NOT NULL
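A quick way to convince yourself that these rewrites preserve results is to run both forms against the same rows. The sketch below does that in SQLite for the IN, BETWEEN, and NOT examples (the table contents are made up, and one row has a NULL salary on purpose so that three-valued logic is exercised). It checks result equivalence only and says nothing about how any particular optimizer performs the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal INTEGER, comm INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [
        ("SMITH", 800, None),
        ("KING", 5000, None),
        ("JONES", 2975, 0),
        ("WARD", 1250, 500),
        ("ALLEN", None, 300),  # NULL salary, to exercise three-valued logic
    ],
)

def names(where):
    """Return the sorted employee names matching a WHERE condition."""
    sql = "SELECT ename FROM emp WHERE " + where
    return sorted(r[0] for r in conn.execute(sql))

# Each pair holds a condition and the form the text says it is rewritten to.
rewrites = [
    ("ename IN ('SMITH', 'KING', 'JONES')",
     "ename = 'SMITH' OR ename = 'KING' OR ename = 'JONES'"),
    ("sal BETWEEN 2000 AND 3000",
     "sal >= 2000 AND sal <= 3000"),
    ("NOT (sal < 1000 OR comm IS NULL)",
     "sal >= 1000 AND comm IS NOT NULL"),
]

for original, rewritten in rewrites:
    assert names(original) == names(rewritten)
    print(f"{original}  ==  {rewritten}  ->  {names(original)}")
```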
How to write efficient SQL:
Besides taking the constant and operator optimizations above into account, you also need the following:
1 Reasonable index design:
Example: the table record has 620000 rows. Observe how the following SQL statements run under different indexes:
Statement A
SELECT count(*) FROM record
WHERE date > '19991201' AND date < '19991214' AND amount > 2000
Statement B
SELECT count(*) FROM record
WHERE date > '19990901' AND place IN ('BJ', 'SH')
Statement C
SELECT date, sum(amount) FROM record
GROUP BY date
1 A non-clustered index on date
A: (25 seconds)
B: (27 seconds)
C: (55 seconds)
Analysis:
The date column has many duplicate values. Under a non-clustered index the rows are stored on the data pages in no particular physical order, so a range search must perform a table scan to find all the rows within the range.
2 A clustered index on date
A: (14 seconds)
B: (14 seconds)
C: (28 seconds)
Analysis:
Under a clustered index the data is physically ordered on the data pages, with duplicate values stored together. A range search can therefore locate the start of the range and scan only the data pages within it, which avoids a large scan and improves query speed.
3 A composite index on place, date, amount
A: (26 seconds)
C: (27 seconds)
B: (<1 second)
Analysis:
This is an unreasonable composite index because its leading column is place. Statements A and C do not reference place, so they cannot use the index. Statement B does use place, and every column it references is contained in the composite index, so the index covers the query and it runs very fast.
4 A composite index on date, place, amount
A: (<1 second)
B: (<1 second)
C: (11 seconds)
Analysis:
This is a reasonable composite index. With date as the leading column, every statement can use the index, and the index covers the first and third statements, so performance is optimal.
Summary 1
The index created by default is a non-clustered index, but that is not always the best choice; reasonable index design must be based on analysis and prediction of the various queries. In general:
For columns that have many duplicate values and are often used in range queries (BETWEEN, >, <, >=, <=) or in ORDER BY and GROUP BY, consider a clustered index;
For multiple columns that are often accessed together, each containing duplicate values, consider a composite index; create indexes on columns frequently used in conditional expressions, and do not create indexes on columns with few distinct values. For example, the "gender" column of an employee table has only two distinct values, "male" and "female", so there is no need to index it. Such an index would not improve query efficiency, and it would seriously slow down updates.
A composite index should, as far as possible, cover the key queries, and its leading column must be the most frequently used column.
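The leading-column rule can be observed in a query plan. The sketch below plans statement A in SQLite against an empty record table, once per column order (SQLite is an assumption standing in for the original database, and the index name idx_record is hypothetical; no timings are reproduced, only whether the planner can search the index).

```python
import sqlite3

STATEMENT_A = (
    "SELECT count(*) FROM record "
    "WHERE date > '19991201' AND date < '19991214' AND amount > 2000"
)

def plan_with_index(index_columns):
    """Build an empty record table with one composite index and plan statement A."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE record (date TEXT, place TEXT, amount INTEGER)")
    conn.execute(f"CREATE INDEX idx_record ON record ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + STATEMENT_A).fetchall()
    conn.close()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text

plan_place_first = plan_with_index("place, date, amount")
plan_date_first = plan_with_index("date, place, amount")
print("place, date, amount:", plan_place_first)
print("date, place, amount:", plan_date_first)
```

With place in front, the planner can only scan; with date in front, it can search the index for the date range.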
2 Avoid using incompatible data types:
For example, FLOAT and INT, CHAR and VARCHAR, BINARY and VARBINARY are incompatible. A data type mismatch may prevent the optimizer from performing optimizations it could otherwise perform. For example:
SELECT name FROM employee WHERE salary > 60000
In this statement, if the salary column is of type MONEY, the optimizer has difficulty optimizing it because 60000 is an integer. We should convert the integer to the money type when writing the program, rather than leaving the conversion to run time.
3 IS NULL and IS NOT NULL:
NULL values are a problem for indexes: a row whose indexed column is NULL may not be stored in the index at all (in Oracle, a B-tree index entry is omitted when all of its key columns are NULL). In other words, if a column contains NULL values, an index on it does not help queries that look for them. A WHERE clause that uses IS NULL or IS NOT NULL generally prevents the optimizer from using the index.
4 IN and EXISTS:
EXISTS is usually far more efficient than IN; the difference comes down to full table scan versus range scan. Almost every IN subquery can be rewritten as a subquery that uses EXISTS.
Example:
Statement 1
SELECT dname, deptno FROM dept
WHERE deptno NOT IN
(SELECT deptno FROM emp);
Statement 2
SELECT dname, deptno FROM dept
WHERE NOT EXISTS
(SELECT deptno FROM emp
WHERE dept.deptno = emp.deptno);
Clearly, statement 2 performs better than statement 1.
Statement 1 performs a full table scan of emp, which is a time-consuming operation, and it does not use any index on emp
because its subquery has no WHERE clause. Statement 2 performs a range scan of emp.
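The two statements can be compared on a toy data set. The sketch below runs both in SQLite (the rows and the index name idx_emp_deptno are made up). It only confirms that the results match when emp.deptno contains no NULLs; if the subquery can return NULL, NOT IN yields no rows while NOT EXISTS still does, so the rewrite is not a blind substitution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT)")
conn.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER)")
conn.execute("CREATE INDEX idx_emp_deptno ON emp (deptno)")  # hypothetical index name
conn.executemany(
    "INSERT INTO dept VALUES (?, ?)",
    [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")],
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("CLARK", 10), ("SCOTT", 20), ("ALLEN", 30), ("WARD", 30)],
)

statement_1 = """
    SELECT dname, deptno FROM dept
    WHERE deptno NOT IN (SELECT deptno FROM emp)
"""
statement_2 = """
    SELECT dname, deptno FROM dept
    WHERE NOT EXISTS (SELECT deptno FROM emp WHERE dept.deptno = emp.deptno)
"""
rows_1 = conn.execute(statement_1).fetchall()
rows_2 = conn.execute(statement_2).fetchall()
print(rows_1)
print(rows_2)
```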
5 IN and OR clauses often use worktables, which defeats indexes:
If doing so does not produce a large number of duplicate values, consider splitting the clause apart. The split-out clauses should involve indexed columns.
6 Avoid or simplify sorting:
Repeated sorting of large tables should be simplified or avoided. When the optimizer can use an index to produce output in the proper order, it skips the sort step. The following factors prevent this:
The index does not include one or more of the columns to be sorted;
The order of the columns in the GROUP BY or ORDER BY clause differs from the order of the index;
The sort columns come from different tables.
To avoid unnecessary sorting, build indexes correctly and merge database tables where reasonable (this may sometimes compromise normalization, but the efficiency gain is worth it). If sorting is unavoidable, try to simplify it, for example by narrowing the range of columns being sorted.
7 Eliminate sequential access to large-table row data:
In nested queries, sequential access to a table can have a fatal impact on query efficiency. For example, with a sequential access strategy, a query nested three levels deep that scans 1000 rows at each level examines 1000 * 1000 * 1000 = 1 billion rows. The main way to avoid this is to index the join columns. For example, take two tables: a student table (student number, name, age, ...) and a course selection table (student number, course number, grade). If the two tables are to be joined, create an index on the join column "student number".
Unions can also be used to avoid sequential access. Even when all the checked columns are indexed, some forms of WHERE clause force the optimizer to use sequential access. The following query forces a sequential scan of the orders table:
SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008
Although indexes exist on customer_num and order_num, the optimizer still uses a sequential access path to scan the entire table for the statement above. Because this statement retrieves a set of separate rows, it should be changed to the following:
SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008
This way the query can be processed using the index path.
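That the OR form and the UNION form return the same rows can be checked directly. The sketch below does so in SQLite with made-up orders rows (an assumption for illustration; whether a given optimizer actually needs the rewrite depends on the product and version). Note that UNION removes duplicate rows, which is harmless here because order_num is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_num INTEGER PRIMARY KEY, customer_num INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_num)")  # hypothetical name
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1000, 104), (1002, 104), (1005, 101), (1008, 110), (1011, 104)],
)

or_form = """
    SELECT * FROM orders
    WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008
"""
union_form = """
    SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
    UNION
    SELECT * FROM orders WHERE order_num = 1008
"""
# Sort both result sets so the comparison ignores row order.
or_rows = sorted(conn.execute(or_form).fetchall())
union_rows = sorted(conn.execute(union_form).fetchall())
print(or_rows)
print(union_rows)
```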
8 Avoid correlated subqueries:
If a column appears both in the main query and in the WHERE clause of a subquery, the subquery most likely has to be re-executed every time the column value in the main query changes. The deeper the nesting, the lower the efficiency, so avoid subqueries where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
9 Avoid difficult regular expressions:
The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions. But this kind of matching is particularly time-consuming. For example: SELECT * FROM customer WHERE zipcode LIKE "98____"
Even if an index exists on the zipcode column, a sequential scan may still be used in this case. If the statement is changed to a range condition such as SELECT * FROM customer WHERE zipcode > "98000" (adding an upper bound if the rows must match the pattern exactly), the index is used when the query runs, which clearly improves speed considerably.
In addition, avoid non-leading substrings. For example, in the statement SELECT * FROM customer WHERE zipcode[2,3] > "80", a non-leading substring is used in the WHERE clause, so the statement does not use the index.
10 Insufficient join conditions:
Example: the table card has 7896 rows with a non-clustered index on card_no; the table account has 191122 rows with a non-clustered index on account_no. Observe how two SQL statements run under different join conditions:
SELECT sum(a.amount) FROM account a, card b WHERE a.card_no = b.card_no
(20 seconds)
Change the SQL to:
SELECT sum(a.amount) FROM account a, card b WHERE a.card_no = b.card_no AND a.account_no = b.account_no
(<1 second)
Analysis:
Under the first join condition, the best plan uses account as the outer table and card as the inner table, using the index on card. The number of I/Os can be estimated by the following formula:
22541 pages of the outer table account + (191122 rows of the outer table account * 3 pages of the inner table card looked up for each outer row) = 595907 I/Os
Under the second join condition, the best plan uses card as the outer table and account as the inner table, using the index on account. The number of I/Os can be estimated by the following formula:
1944 pages of the outer table card + (7896 rows of the outer table card * 4 pages of the inner table account looked up for each outer row) = 33528 I/Os
As can be seen, the truly best plan is executed only when the join conditions are complete.
Before a multi-table operation is actually executed, the query optimizer lists several possible join plans based on the join conditions and picks the one with the lowest system cost. Join conditions should take full account of which tables have indexes and which have many rows. The choice of inner and outer table can be determined by the formula: number of matching rows in the outer table * number of lookups per row in the inner table; the smallest product gives the best plan.
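The two I/O estimates above follow the same pattern: read the outer table once, then probe the inner table through its index once per outer row. A minimal sketch of that arithmetic, using the page and row counts quoted in the analysis (the function and parameter names are made up for illustration):

```python
def estimated_io(outer_pages, outer_rows, inner_pages_per_lookup):
    """Estimate nested-loop join I/O: scan the outer table once, then probe
    the inner table (through its index) once for every outer row."""
    return outer_pages + outer_rows * inner_pages_per_lookup

# account as the outer table, card as the inner table (first join condition)
account_outer = estimated_io(outer_pages=22541, outer_rows=191122, inner_pages_per_lookup=3)

# card as the outer table, account as the inner table (second join condition)
card_outer = estimated_io(outer_pages=1944, outer_rows=7896, inner_pages_per_lookup=4)

print(account_outer)  # 595907
print(card_outer)     # 33528
```

The smaller outer table wins because the per-row probe cost is multiplied by the outer row count, which dominates the total.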
Non-optimizable WHERE clauses
Example 1
The columns in the conditions of the following SQL statements all have appropriate indexes, yet the statements run very slowly:
SELECT * FROM record WHERE substring(card_no, 1, 4) = '5378'
(13 seconds)
SELECT * FROM record WHERE amount / 30 < 1000
(11 seconds)
SELECT * FROM record WHERE convert(char(10), date, 112) = '19991201'
(10 seconds)
Analysis:
Any operation applied to a column in the WHERE clause is computed row by row at run time, so the query has to do a table scan and cannot use the index on that column. If the result can be determined when the query is compiled, the SQL optimizer can optimize it and use the index, avoiding the table scan. So rewrite the SQL as follows:
SELECT * FROM record WHERE card_no LIKE '5378%'
(<1 second)
SELECT * FROM record WHERE amount < 1000 * 30
(<1 second)
SELECT * FROM record WHERE date = '1999/12/01'
(<1 second)
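The same effect shows up when a function wraps the indexed column. The sketch below uses SQLite's strftime() in place of convert() (an assumption: the original statements are for a different database, and the rows, the ISO date format, and the index name idx_record_date are made up). Both conditions select the same rows, but only the bare-column form lets the planner search the index on date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (card_no TEXT, date TEXT, amount INTEGER)")
conn.execute("CREATE INDEX idx_record_date ON record (date)")  # hypothetical index name
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?)",
    [
        ("53781234", "1999-12-01", 900),
        ("53789999", "1999-12-02", 45000),
        ("62250001", "1999-12-01", 12000),
    ],
)

def run(where):
    """Return (matching card numbers, first word of the query plan)."""
    sql = "SELECT card_no FROM record WHERE " + where
    cards = sorted(r[0] for r in conn.execute(sql))
    detail = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]
    return cards, detail.split()[0]

# Function applied to the column: it must be evaluated for every row.
wrapped = run("strftime('%Y%m%d', date) = '19991201'")
# Bare column compared with a constant: the index on date can be searched.
bare = run("date = '1999-12-01'")
print(wrapped)
print(bare)
```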
11 Using temporary tables in stored procedures to optimize queries:
Example
1. Read the data from the parven table in vendor_num order:
SELECT part_num, vendor_num, price FROM parven ORDER BY vendor_num
INTO TEMP pv_by_vn
This statement reads parven sequentially (50 pages), writes a temporary table (50 pages), and sorts. Assuming the cost of the sort is 200 pages, the total is 300 pages.
2. Join the temporary table with the vendor table, output the result to a temporary table, and sort by part_num:
SELECT pv_by_vn.*, vendor.vendor_num FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn
DROP TABLE pv_by_vn
This query reads pv_by_vn (50 pages) and accesses the vendor table through its index, but because the rows arrive in vendor_num order, the vendor table is in effect read sequentially through the index (40 + 2 = 42 pages). The output table holds about 95 rows per page, 160 pages in all. Writing and accessing these pages triggers 5 * 160 = 800 reads and writes, for a total of 50 + 42 + 800 = 892 pages read and written.
3. Join the output with part to get the final result:
SELECT pvvn_by_pn.*, part.part_desc FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num
DROP TABLE pvvn_by_pn
This way, the query reads pvvn_by_pn sequentially (160 pages) and reads the part table through its index 15,000 times. Because an index exists, this takes 1772 disk reads and writes in practice, an optimization ratio of 30:1. OK, done.
In fact, the principles of SQL optimization carry over across all databases.