Query optimization technology for databases (http://www.chinaunix.net): Blue keyboard, posted 2003-06-03 14:00:49
1. Use indexes rationally

An index is an important data structure in a database, and its fundamental purpose is to improve query efficiency. Most database products today use the ISAM index structure first proposed by IBM. Indexes must be used judiciously; the principles are as follows:

● Build indexes on columns that are frequently joined but are not declared as foreign keys; for columns that are rarely joined, the optimizer can generate indexes automatically.
● Build indexes on columns that are frequently sorted or grouped (i.e., used in GROUP BY or ORDER BY operations).
● Build indexes on columns with many distinct values that often appear in conditional expressions; do not index columns with few distinct values. For example, the "sex" column of an employee table has only two values, "male" and "female", so there is no point in indexing it. An index that does not improve query efficiency will still seriously slow down updates.
● If several columns are frequently sorted together, a composite index can be built on them.
● Use system tools. The Informix database, for example, provides the tbcheck tool for examining suspicious indexes. On some database servers an index may become corrupted, or frequent modification may reduce its read efficiency. If a query that should be using an index is inexplicably slow, you can verify the index's integrity with tbcheck and repair it if necessary. In addition, after a table has received a large volume of updates, dropping and rebuilding its indexes can improve query speed.

2. Avoid or simplify sorts

Repeated sorting of large tables should be simplified or avoided. When the optimizer can use an index to produce the output in the proper order, it skips the sorting step.
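These two points, a composite index serving both the filter and the sort, can be sketched with Python's sqlite3 module. The table and column names below are illustrative, not from the article:

```python
import sqlite3

# Hypothetical employee table; names are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, sex TEXT,
                       dept TEXT, hired TEXT);
-- Composite index on columns that are filtered and sorted together:
CREATE INDEX idx_emp_dept_hired ON employee (dept, hired);
""")
plan = [row[3] for row in con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name FROM employee WHERE dept = 'R&D' ORDER BY hired")]
print(plan)
# The plan searches idx_emp_dept_hired and contains no
# 'USE TEMP B-TREE FOR ORDER BY' step: the index already yields sorted rows.
```

A low-cardinality column like sex would be a poor candidate for its own index here, since each value selects roughly half the table.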
Several factors influence this. To avoid unnecessary sorting, indexes must be built correctly and database tables merged sensibly (although this can sometimes hurt the normalization of the tables, it is worth it for efficiency). If sorting is unavoidable, try to simplify it, for example by narrowing the range of the columns being sorted.

3. Eliminate sequential access to large-table row data

In nested queries, sequential access to tables can have a fatal impact on query efficiency. For example, with a sequential access strategy, a query nested three levels deep, each level scanning 1,000 rows, examines one billion rows in total. The main way to avoid this is to index the join columns. For example, given two tables, student (student_no, name, age, ...) and enrollment (student_no, course_no, grade), any join of the two should have an index on the student_no join field.

It is also possible to use a UNION to avoid sequential access. Even when indexes exist on all the queried columns, some forms of WHERE clause force the optimizer to use sequential access. The following query forces a sequential pass over the orders table:

SELECT * FROM orders
WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008

Although indexes exist on customer_num and order_num, the optimizer still scans the whole table along a sequential access path, because the statement retrieves a collection of disjoint sets of rows. It should be changed to:

SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008

so the query can be processed over the index path.
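A quick check that the UNION rewrite returns the same rows, sketched with sqlite3 (the orders data here is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_num INTEGER, customer_num INTEGER)")
con.execute("CREATE INDEX idx_orders_cust ON orders (customer_num, order_num)")
con.execute("CREATE INDEX idx_orders_num ON orders (order_num)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1000, 104), (1002, 104), (1005, 104), (1008, 77), (1003, 55)])

or_rows = con.execute(
    "SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) "
    "OR order_num = 1008 ORDER BY order_num").fetchall()
union_rows = con.execute(
    "SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001 "
    "UNION SELECT * FROM orders WHERE order_num = 1008 "
    "ORDER BY order_num").fetchall()
print(or_rows == union_rows)  # the rewrite returns the same result set
```

Each branch of the UNION has a single indexable predicate, so the optimizer can use an index path for both.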
4. Avoid correlated subqueries

If a column appears both in the outer query's select list and in the WHERE clause of a subquery, it is likely that the subquery must be re-executed every time the column value in the main query changes. The more levels of nesting, the lower the efficiency, so subqueries should be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.

5. Avoid difficult forms of regular expressions

The MATCHES and LIKE keywords support wildcard matching, which amounts to regular expressions, but such matching is particularly time-consuming. For example:

SELECT * FROM customer WHERE zipcode LIKE "98_ _ _"

Even if an index exists on zipcode, in this case a sequential scan may still be used. If the statement is changed to

SELECT * FROM customer WHERE zipcode > "98000"

then the index is used when the query is executed, obviously improving the speed greatly.

In addition, avoid non-leading substrings. For example, in the statement

SELECT * FROM customer WHERE zipcode[2,3] > "80"

the WHERE clause uses a non-leading substring, so this statement cannot use an index.

6. Use temporary tables to accelerate queries

Sorting a subset of a table into a temporary table can sometimes speed up queries. It helps avoid repeated sort operations and in other ways simplifies the optimizer's work. For example:

SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
  AND rcvbles.balance > 0
  AND cust.postcode > "98000"
ORDER BY cust.name

If this query is run more than once, all the customers with unpaid balances can be found once, placed in a temporary table, and sorted by customer name:

SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
  AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance

Subsequent queries then run against the temporary table:

SELECT * FROM cust_with_balance WHERE postcode > "98000"

The temporary table has fewer rows than the main table, and its physical order is the desired order, so disk I/O is reduced and the query workload falls sharply.

Note: a temporary table does not reflect modifications made to the main table after it is created. When the main table's data is modified frequently, be careful not to lose data.

7. Use sorting in place of non-sequential access

Non-sequential disk access is the slowest operation, showing up as back-and-forth movement of the disk access arm. SQL hides this, which makes it easy to write queries that access a large number of non-sequential pages. Sometimes the database's sorting ability can be exploited to replace non-sequential access and improve the query.

Let us use an example from a manufacturing company to show how to perform query optimization.
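Before the worked example, the contrast in section 5 between an anchored predicate and a non-leading substring can be observed in SQLite's query plans. The schema is illustrative, and SQLite's substr stands in for Informix's zipcode[2,3] syntax:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER, zipcode TEXT)")
con.execute("CREATE INDEX idx_customer_zip ON customer (zipcode)")

# Range predicate on the bare column: the index can be searched.
range_plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM customer WHERE zipcode > '98000'").fetchall()
# Non-leading substring: the index cannot help, forcing a full scan.
substr_plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM customer WHERE substr(zipcode, 2, 2) > '80'").fetchall()
print(range_plan[0][3])   # SEARCH ... USING INDEX idx_customer_zip
print(substr_plan[0][3])  # SCAN customer
```

The same contrast applies on any server: an expression computed over the column defeats the index, while a comparison against the column itself does not.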
The manufacturing company's database has three tables, with schemas as follows:

1. part table
   part_num    part_desc                other columns
   102,032     Seageat 30G disk         ...
   500,049     Novel 10M network card   ...

2. vendor table
   vendor_num  vendor_name              other columns
   910,257     Seageat Corp             ...
   523,045     ...                      ...

3. parven table
   part_num    vendor_num    part_amount
   102,032     910,257       3,450,000
   234,423     ...           ...

The following query runs against them regularly:

SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
  AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num

Without indexes, the cost of this query would be enormous, so we build indexes on the part number and the vendor number; the indexes avoid repeated scanning inside the nested loops. The statistics on the tables and indexes are as follows (row and data-page counts are partly reconstructed from the access counts quoted below):

   table     rows      data pages
   part      10,000    400
   vendor    1,000     40
   parven    15,000    50

   index     key size  keys/page  leaf pages
   part      4         500        20
   vendor    4         500        2
   parven    8         250        60

These look like three fairly simple tables, but the query's cost is very large. Inspection of the system tables shows that there are clustered indexes on part_num and vendor_num, so those indexes are stored in physical order, while the parven table has no particular storage order. These tables are large enough that the buffer-page hit rate for non-sequential access is low.
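From these statistics, the cost of the naive nested-loop plan quoted below can be tallied directly, assuming roughly two page reads per index probe (one leaf page, one data page):

```python
# Tally the disk-page cost of the naive nested-loop plan:
# scan part sequentially, then probe parven once per part row,
# then probe vendor once per parven row.
part_pages = 400              # sequential scan of part's data pages
parven_probes = 10_000 * 2    # one 2-page probe per part row
vendor_probes = 15_000 * 2    # one 2-page probe per parven row
total = part_pages + parven_probes + vendor_probes
print(total)  # 50400 disk pages
```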
The optimizer's query plan for this statement is: first read the 400 pages of part sequentially; then access parven non-sequentially 10,000 times, 2 pages each time (one index page, one data page), for 20,000 disk pages; finally access vendor 15,000 times, for 30,000 disk pages. In total, the query costs 50,400 disk page accesses.

In fact, we can improve query efficiency with three steps that use temporary tables:

1. Read the data from parven in vendor_num order:

SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn

This statement reads parven sequentially (50 pages), writes a temporary table (50 pages), and sorts it. Assuming the cost of the sort is 200 pages, the total is 300 pages.

2. Join the temporary table to vendor, and output the result to a temporary table sorted by part_num:

SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn

DROP TABLE pv_by_vn

This query reads pv_by_vn (50 pages) and accesses vendor through its index; but because the rows arrive in vendor_num order, it effectively reads vendor sequentially through the index (40 + 2 = 42 pages). The output table holds about 95 rows per page, 160 pages in all. Writing and accessing these pages triggers 5 * 160 = 800 reads and writes, so the step reads and writes 892 pages in total.

3. Join the output to part to produce the final result:

SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num

DROP TABLE pvvn_by_pn

In this way, the final query accesses part 15,000 times through its index, but thanks to the ordering only 1,772 actual disk reads and writes are performed, an optimization ratio of 30:1.
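The three-step plan can be sketched end to end with sqlite3, using SQLite temporary tables in place of Informix's INTO TEMP. Table and column names follow the article; any data values beyond those shown in the article (the second vendor's name and the parven amounts for the second part) are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE part   (part_num INTEGER PRIMARY KEY, part_desc TEXT);
CREATE TABLE vendor (vendor_num INTEGER PRIMARY KEY, vendor_name TEXT);
CREATE TABLE parven (part_num INTEGER, vendor_num INTEGER, part_amount INTEGER);
INSERT INTO part   VALUES (102032, 'Seageat 30G disk'),
                          (500049, 'Novel 10M network card');
INSERT INTO vendor VALUES (910257, 'Seageat Corp'),
                          (523045, 'Acme Corp');      -- hypothetical name
INSERT INTO parven VALUES (102032, 910257, 3450000),
                          (500049, 523045, 120000);   -- hypothetical amount
-- Step 1: read parven in vendor_num order into a temp table.
CREATE TEMP TABLE pv_by_vn AS
  SELECT part_num, vendor_num, part_amount FROM parven ORDER BY vendor_num;
-- Step 2: join to vendor, re-sort by part_num, drop the first temp table.
CREATE TEMP TABLE pvvn_by_pn AS
  SELECT pv_by_vn.*, vendor.vendor_name
  FROM pv_by_vn JOIN vendor ON pv_by_vn.vendor_num = vendor.vendor_num
  ORDER BY pv_by_vn.part_num;
DROP TABLE pv_by_vn;
""")
# Step 3: join the sorted temp table back to part for the final result.
rows = con.execute("""
  SELECT pvvn_by_pn.*, part.part_desc
  FROM pvvn_by_pn JOIN part ON pvvn_by_pn.part_num = part.part_num
  ORDER BY pvvn_by_pn.part_num""").fetchall()
print(rows[0])  # (102032, 910257, 3450000, 'Seageat Corp', 'Seageat 30G disk')
```

SQLite will not reproduce the Informix page counts, of course; the sketch only shows that the rewritten pipeline yields the same joined result as the original three-way join.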
The author repeated the same experiment on Informix Dynamic Server and found the optimization ratio in elapsed time to be 5:1 (with more data, the ratio would likely be larger).

20% of the code takes 80% of the time: this famous law of programming holds for database applications too. Our optimization should seize the key problem, and for database applications that is the execution efficiency of the SQL. The key link in query optimization is to make the database server read data from disk sequentially by page rather than non-sequentially.
Blue keyboard replied: 2003-06-03 14:06:51

When using SQL, people tend to fall into a trap: they focus on whether the result is correct and ignore the performance differences between different formulations. These differences can be particularly pronounced in large or complex database environments, such as online transaction processing (OLTP) or decision support systems (DSS). In working practice, the author has found that bad SQL usually comes from inappropriate index design, insufficient join conditions, and unoptimizable WHERE clauses. After they are properly optimized, running speed improves markedly! Below I summarize each of these three aspects.

---- To make the explanation more concrete, the SQL run time is measured in every example; anything under 1 second is shown as (< 1 second).

---- Test environment:
---- Host: HP LH II
---- CPU clock: 330 MHz
---- Memory: 128 MB
---- Operating system: OpenServer 5.0.4
---- Database: Sybase 11.0.3

1. Unreasonable index design

---- Example: table record has 620,000 rows. Try the following SQL under different indexes:

---- 1. A non-clustered index built on date:

SELECT count(*) FROM record
WHERE date > '19991201' AND date < '19991214' AND amount > 2000
(25 seconds)

SELECT date, sum(amount) FROM record GROUP BY date
(55 seconds)

SELECT count(*) FROM record
WHERE date > '19990901' AND place IN ('BJ', 'SH')
(27 seconds)

---- Analysis: date has a large number of duplicate values. Under a non-clustered index, the data is stored physically at random across the data pages, so a range search must scan the whole table to find all the rows within the range.

---- 2.
Build a clustered index on date:

SELECT count(*) FROM record
WHERE date > '19991201' AND date < '19991214' AND amount > 2000
(14 seconds)

SELECT date, sum(amount) FROM record GROUP BY date
(28 seconds)

SELECT count(*) FROM record
WHERE date > '19990901' AND place IN ('BJ', 'SH')
(14 seconds)

---- Analysis: under a clustered index, the data is physically ordered on the data pages and duplicate values are stored together. For a range lookup, the server can first find the start of the range and then scan only the data pages within it, avoiding a wide scan and improving query speed.
---- 3. A composite index on place, date, amount:

SELECT count(*) FROM record
WHERE date > '19991201' AND date < '19991214' AND amount > 2000
(26 seconds)

SELECT date, sum(amount) FROM record GROUP BY date
(27 seconds)

SELECT count(*) FROM record
WHERE date > '19990901' AND place IN ('BJ', 'SH')
(< 1 second)

---- Analysis: this is an unreasonable composite index. Its leading column is place, and the first and second SQL statements do not reference place, so they cannot use the index. The third statement does use place, and all the columns it references are contained in the composite index, forming index coverage, so it is very fast.

---- 4. A composite index on date, place, amount:

SELECT count(*) FROM record
WHERE date > '19991201' AND date < '19991214' AND amount > 2000
(< 1 second)

SELECT date, sum(amount) FROM record GROUP BY date
(11 seconds)

SELECT count(*) FROM record
WHERE date > '19990901' AND place IN ('BJ', 'SH')
(< 1 second)

---- Analysis: this is a reasonable composite index. With date as the leading column, every statement can use the index, and the first and third statements form index coverage, so performance is optimal.

---- 5. Summary:

The index built by default is non-clustered, but it is not always the best choice; reasonable index design must be based on analysis and prediction of the actual queries. In general:

---- 1. Columns with many duplicate values that frequently appear in range queries (BETWEEN, >, <, >=, <=) or in GROUP BY / ORDER BY are candidates for a clustered index.
---- 2. For multiple columns that are frequently accessed together, each containing duplicate values, consider a composite index.
---- 3. A composite index should try to cover the critical queries, and its leading column must be the most frequently used one.
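The index-coverage effect described above can be observed in SQLite as well, whose EXPLAIN QUERY PLAN explicitly reports a covering index when a query can be answered from index pages alone. The schema mirrors the article's record table; the plan text is SQLite's, not Sybase's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE record (date TEXT, place TEXT, amount INTEGER)")
# The "reasonable" composite index, with date as the leading column:
con.execute("CREATE INDEX idx_date_place_amount ON record (date, place, amount)")

covered = con.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM record "
    "WHERE date > '19990901' AND place IN ('BJ', 'SH')").fetchall()
print(covered[0][3])
# Every referenced column is in the index, so the plan reports a
# COVERING INDEX: the base table's data pages are never touched.
```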
2. Insufficient join conditions

---- Example: table card has 7,896 rows with a non-clustered index on card_no; table account has 191,122 rows with a non-clustered index on account_no. Try two SQL statements under different join conditions:

SELECT sum(a.amount) FROM account a, card b
WHERE a.card_no = b.card_no
(20 seconds)

---- Change the SQL to:

SELECT sum(a.amount) FROM account a, card b
WHERE a.card_no = b.card_no AND a.account_no = b.account_no
(< 1 second)

---- Analysis: under the first join condition, the best query plan is to make account the outer table and card the inner table, using the index on card. The number of I/Os can be estimated with the following formula:

22,541 pages of the outer table account + (191,122 rows of account * 3 pages to look up the matching row in the inner table card per outer row) = 595,907 I/Os

---- Under the second join condition, the best query plan is to make card the outer table and account the inner table, using the index on account. The number of I/Os can be estimated with:

1,944 pages of the outer table card + (7,896 rows of card * 4 pages to look up the matching rows in the inner table account per outer row) = 33,528 I/Os

---- As can be seen, only with a complete join condition will the truly best plan be executed.

---- Summary:

---- 1. Before actual execution, the query optimizer lists several possible join plans and, based on the join conditions, picks the one with the lowest system cost. The join conditions should take into account which tables have indexes and how many rows each table has; the choice of inner and outer table can be determined by the formula: matching rows in the outer table * lookup cost per row in the inner table, the smallest product being the best plan.

---- 2.
To see which plan was actually chosen, use SET SHOWPLAN ON to turn on the showplan option; it shows the join order and which indexes are used. To see still more detailed information, you must run DBCC(3604, 310, 302) with the sa role.
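The I/O estimates in the join-condition example above can be reproduced with a few lines of arithmetic, using the nested-loop cost formula from the summary:

```python
# Nested-loop join cost: pages of the outer table +
# (rows of the outer table * pages probed in the inner table per row).
def nested_loop_io(outer_pages, outer_rows, inner_pages_per_probe):
    return outer_pages + outer_rows * inner_pages_per_probe

# First join condition: account outer (22,541 pages, 191,122 rows),
# 3 pages per probe into card.
io_account_outer = nested_loop_io(22541, 191122, 3)
# Second join condition: card outer (1,944 pages, 7,896 rows),
# 4 pages per probe into account.
io_card_outer = nested_loop_io(1944, 7896, 4)
print(io_account_outer, io_card_outer)  # 595907 33528
```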
3. Unoptimizable WHERE clauses

---- 1. Example: the columns in the following SQL condition clauses all have proper indexes, yet the execution speed is very slow:

SELECT * FROM record WHERE substring(card_no, 1, 4) = '5378'
(13 seconds)

SELECT * FROM record WHERE amount / 30 < 1000
(11 seconds)

SELECT * FROM record WHERE convert(char(10), date, 112) = '19991201'
(10 seconds)

---- Analysis: any expression computed over a column in the WHERE clause must be evaluated by SQL at run time, so the server has to do a table scan and cannot use the index on the column. If the expressions can be evaluated when the query is compiled, the SQL optimizer can use the indexes and avoid the table scan. Rewrite the SQL as follows:

SELECT * FROM record WHERE card_no LIKE '5378%'
(< 1 second)

SELECT * FROM record WHERE amount < 1000 * 30
(< 1 second)

SELECT * FROM record WHERE date = '1999/12/01'
(< 1 second)

---- You will find the SQL noticeably faster!

---- 2. Example: table stuff has 200,000 rows with a non-clustered index on id_no. Look at this SQL:

SELECT count(*) FROM stuff WHERE id_no IN ('0', '1')
(23 seconds)

---- Analysis: IN in a WHERE condition is logically equivalent to OR, so the parser converts IN ('0', '1') into id_no = '0' OR id_no = '1' for execution. We expect it to search for each OR clause separately and add up the results, which could use the index on id_no. In fact (according to showplan), it adopts the "OR strategy": it fetches the rows satisfying each OR clause into a worktable in tempdb, builds a unique index on them to remove duplicates, and finally computes the result from this temporary table. So the actual access path does not use the index on id_no, and the completion time is also affected by the performance of the tempdb database.
---- Practice proves that the more rows the table has, the worse the worktable performs: when stuff reached 620,000 rows, the execution time hit 220 seconds! It is better to split the OR clause:

SELECT count(*) FROM stuff WHERE id_no = '0'
SELECT count(*) FROM stuff WHERE id_no = '1'

---- Get the two results, then add them. Because each statement uses the index, the execution time is only 3 seconds; even at 620,000 rows, the time is only 4 seconds.
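A quick sanity check that splitting the IN gives the same answer, sketched with sqlite3 (the stuff data is a tiny made-up sample; the timing behavior is Sybase-specific and not reproduced here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stuff (id_no TEXT)")
con.execute("CREATE INDEX idx_stuff_id ON stuff (id_no)")
con.executemany("INSERT INTO stuff VALUES (?)",
                [('0',), ('1',), ('1',), ('2',)])

total_in = con.execute(
    "SELECT count(*) FROM stuff WHERE id_no IN ('0', '1')").fetchone()[0]
count0 = con.execute(
    "SELECT count(*) FROM stuff WHERE id_no = '0'").fetchone()[0]
count1 = con.execute(
    "SELECT count(*) FROM stuff WHERE id_no = '1'").fetchone()[0]
print(total_in == count0 + count1)  # True: the split queries sum to the IN result
```

Because id_no is a single column with no NULL-handling subtleties in this predicate, the two equality searches partition exactly the rows the IN matches, so the sum is always equal.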