Database query optimization techniques
The database system is the core of a management information system, and database-based online transaction processing (OLTP) and online analytical processing (OLAP) are among the most important computer applications in banks, enterprises, government departments and the like. In most real systems, queries account for the largest share of all database operations, and the SELECT statement, on which queries are based, is the most heavily used SQL statement. Once the data grows large enough, for example when a bank's account table accumulates hundreds of thousands or even millions of records, a full table scan can easily take ten minutes or even hours. A better query strategy than a full table scan can often cut the query time down to a few minutes, which shows how important query optimization is. The author has found in application projects that many programmers who build database applications with front-end development tools (such as PowerBuilder or Delphi) pay attention only to polished user interfaces and ignore the efficiency of their query statements, so the resulting applications run inefficiently and waste resources. Designing efficient, well-reasoned query statements is therefore essential. Starting from application examples and combining them with database theory, this article introduces how query optimization techniques are applied in real systems.

Analyzing the problem

Many programmers believe that query optimization is the job of the DBMS (database management system) and that the SQL they write has little influence on it. This is wrong: a good query plan can often improve program performance by dozens of times. The user's query plan is the set of SQL statements the user submits; the system's query plan is the set of statements produced after optimization. The DBMS handles a query roughly as follows: after the query statement has been lexically analyzed and checked for syntax errors, it is handed to the DBMS's query optimizer; after algebraic optimization and access-path optimization, the precompilation module processes the statement and generates the query plan, which is submitted to the system for execution at a suitable time, and the results are finally returned to the user. The higher versions of real database products such as Oracle and Sybase all use cost-based optimization, which estimates the cost of alternative query plans from information in the system dictionary tables and then chooses a better one. Although database products keep getting better at query optimization, the SQL submitted by the user remains the basis of system optimization: it is hard to imagine a badly written query plan becoming efficient after system optimization, so the quality of the statements the user writes is critical. System-level optimization is not discussed here; the rest of this article concentrates on improving user query plans.

Solving the problem

The following uses the relational database system Informix as an example to introduce ways of improving user query plans.
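Before tuning individual statements, it is worth looking at the plan the optimizer actually chooses for them. On Informix this can be done with the SET EXPLAIN statement, which writes the chosen plan to a file (commonly sqexplain.out). The sketch below is only illustrative; the orders table and the customer_num value are borrowed from the examples later in this article.

-- Ask the server to record its query plans
SET EXPLAIN ON;

-- Run the statement under study; the recorded plan shows whether an
-- index path or a sequential scan was chosen and in which join order
SELECT * FROM orders WHERE customer_num = 104;

-- Stop recording plans
SET EXPLAIN OFF;

Checking the plan in this way makes it easy to verify whether a rewrite along the lines suggested below actually changes the access path.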
1. Make rational use of indexes

An index is an important data structure in a database, and its fundamental purpose is to improve query efficiency. Most database products today use the ISAM index structure first proposed by IBM. The key is to use indexes appropriately; the principles are as follows (a short sketch of typical index definitions follows the list):

● Build indexes on columns that are frequently joined but are not declared as foreign keys; fields that are rarely joined are left for the optimizer to index automatically.
● Build indexes on columns that are frequently sorted or grouped (that is, used in GROUP BY or ORDER BY operations).
● Build indexes on columns with many distinct values that often appear in conditional expressions; do not build indexes on columns with few distinct values. For example, the "sex" column of an employee table holds only the two values "male" and "female", so there is no point in indexing it: the index would not improve query efficiency, but it would seriously slow down updates.
● If several columns are to be sorted together, a composite index can be built on those columns.
● Use system tools. The Informix database, for example, provides the tbcheck tool, which can be used to check suspect indexes.
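As a concrete illustration of these principles, the statements below sketch typical index definitions. The syntax is the ordinary CREATE INDEX form supported by Informix and most other relational databases; the table, column and index names (course_selection, orders and so on) are assumed for the example rather than taken from a real schema.

-- Index on a column that is frequently used in joins
-- but is not declared as a foreign key
CREATE INDEX idx_cs_student ON course_selection (student_num);

-- Composite index on columns that are often sorted or grouped together
CREATE INDEX idx_orders_cust_ord ON orders (customer_num, order_num);

-- No index on a column such as sex: with only two distinct values it
-- would not speed up queries but would slow down every update.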
On some database servers an index can become damaged, or its read efficiency can degrade after frequent operations. If a query that uses an index is unexpectedly slow, try checking the integrity of the index with the tbcheck tool and repair it if necessary. In addition, after a large volume of data in a table has been updated, dropping the indexes and rebuilding them can improve query speed.

2. Avoid or simplify sorting

Repeated sorting of large tables should be simplified or avoided. When the optimizer can use an index to produce the output in the required order, it skips the sorting step; several factors, however, can prevent it from doing so. To avoid unnecessary sorting, build the right indexes and merge database tables where reasonable (even though this may sometimes affect the normalization of the tables, it is worth it for the efficiency gained). If sorting is unavoidable, try to simplify it, for example by narrowing the range of the columns being sorted.

3. Eliminate sequential access to large amounts of table data

In nested queries, sequential access to tables can have a fatal effect on query efficiency. For example, with a sequential access strategy, a query nested three levels deep, each level scanning 1,000 rows, ends up examining one billion rows. The main way to avoid this is to index the join columns. For instance, given a student table (student_num, name, age, ...) and a course-selection table (student_num, course_num, grade), an index should be built on the "student_num" join field before the two tables are joined.

A UNION can also be used to avoid sequential access. Even when indexes exist on all the referenced columns, some forms of WHERE clause force the optimizer into sequential access. The following query forces a sequential scan of the orders table:

SELECT * FROM orders WHERE (customer_num = 104 AND order_num > 1001) OR order_num = 1008

Although indexes exist on customer_num and order_num, the optimizer still scans the whole table along a sequential access path, because the statement retrieves a set of separate rows. It should be rewritten as:

SELECT * FROM orders WHERE customer_num = 104 AND order_num > 1001
UNION
SELECT * FROM orders WHERE order_num = 1008

The query can then be processed along the index path.

4. Avoid correlated subqueries

If a column appears both in the outer query and in the WHERE clause of a subquery, the subquery will most likely have to be re-executed every time that column's value changes in the outer query. The deeper the nesting, the lower the efficiency, so subqueries should be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
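To make this concrete, here is a hedged sketch of rewriting a correlated subquery as a join. The customer and orders tables echo other examples in this article, but the name, customer_num and order_amount columns and the threshold value are assumptions made purely for illustration.

-- Correlated subquery: the inner query is re-evaluated for each outer row
SELECT c.name
FROM customer c
WHERE EXISTS (SELECT 1 FROM orders o
              WHERE o.customer_num = c.customer_num
                AND o.order_amount > 1000);

-- Equivalent join: the optimizer can drive it from an index
-- on the customer_num join column
SELECT DISTINCT c.name
FROM customer c, orders o
WHERE o.customer_num = c.customer_num
  AND o.order_amount > 1000;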
5. Avoid difficult forms of regular expressions

The MATCHES and LIKE keywords support wildcard matching, commonly called regular expressions, but this kind of matching is particularly time-consuming. For example:

SELECT * FROM customer WHERE zipcode LIKE "98_ _ _"

Even if an index exists on the zipcode field, a sequential scan is still used in this case. If the statement is rewritten as

SELECT * FROM customer WHERE zipcode > "98000"

the index is used when the query executes, which obviously improves the speed greatly. In addition, avoid matching on substrings that do not start at the beginning of the column. For example, in the statement

SELECT * FROM customer WHERE zipcode[2,3] > "80"

the WHERE clause uses a non-leading substring, so the statement cannot use an index.

6. Use temporary tables to speed up queries

Sorting a subset of a table into a temporary table sometimes speeds up queries: it helps avoid repeated sorting operations and in other ways simplifies the optimizer's work. For example:

SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > "98000"
ORDER BY cust.name

If this query is to be executed more than once, all the customers with outstanding balances can be found once, placed in a temporary table, and sorted by customer name:

SELECT cust.name, rcvbles.balance, ... other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance

The temporary table can then be queried as follows:

SELECT * FROM cust_with_balance WHERE postcode > "98000"

The temporary table has fewer rows than the main table, and its physical order is exactly the required order, so disk I/O is reduced and the query workload drops sharply. Note that the temporary table does not reflect changes made to the main table after it is created; when the main table's data is modified frequently, take care not to lose data.

7. Use sorting to replace non-sequential access

Non-sequential disk access is the slowest operation, showing up as back-and-forth movement of the disk access arm. SQL hides this from us, which makes it easy, when writing applications, to produce queries that access a large number of non-sequential pages. Sometimes the database's sorting ability can be used in place of non-sequential access to improve a query.
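A minimal sketch of the idea, using hypothetical tables t_detail and t_master joined on master_num: instead of probing t_master once per unordered row of t_detail, the detail rows are first sorted into a temporary table on the lookup key, so that the subsequent index lookups touch neighbouring pages rather than scattered ones. The case analysis that follows works the same technique out with concrete page counts.

-- Sort the detail rows by the key that will drive the lookups
SELECT detail_id, master_num, amount
FROM t_detail
ORDER BY master_num
INTO TEMP detail_by_master;

-- The join now walks t_master largely in key order, turning
-- scattered page reads into mostly sequential ones
SELECT d.detail_id, m.master_name, d.amount
FROM detail_by_master d, t_master m
WHERE d.master_num = m.master_num;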
Case analysis

Let us take a manufacturing company as an example to show how query optimization is carried out. The manufacturing company's database contains three tables, with the following schemas:

1. part table
Part number (part_num)    Part description (part_desc)    Other columns
102,032                   Seageat 30G disk                ...
500,049                   Novel 10M network card          ...

2. vendor table
Vendor number (vendor_num)    Vendor name (vendor_name)    Other columns
910,257                       Seageat Corp                 ...
523,045                       ...                          ...

3. parven table
Part number (part_num)    Vendor number (vendor_num)    Part quantity (part_amount)
102,032                   910,257                       3,450,000
234,423                   ...                           ...

The following query performs a join over the three tables:

SELECT part_desc, vendor_name, part_amount
FROM part, vendor, parven
WHERE part.part_num = parven.part_num
AND parven.vendor_num = vendor.vendor_num
ORDER BY part.part_num

Without indexes, the cost of this query would be enormous, so indexes are built on the part number and the vendor number; building the indexes avoids repeated scans inside the nesting. The statistics for the tables and their indexes are as follows:

Table     Rows      Data pages
part      10,000    400
vendor    ...       40
parven    15,000    50

Index     Key size    Keys per page    Leaf pages
part      4           500              20
vendor    4           500              2
parven    8           250              60

This looks like a fairly simple three-table join, yet its query cost is very large. The system tables show that there is a clustered index on part_num and on vendor_num, so those indexes are stored in physical order, while the parven table has no particular storage order. The sizes of these tables mean that the chance of a non-sequential page access being satisfied from the buffer cache is small.
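The clustered indexes referred to above could, on Informix, be declared along the following lines; this is only a sketch, assuming the column names given in the schema, with index names invented for the example.

-- A clustered index keeps the table's data pages in key order
CREATE CLUSTER INDEX ix_part_num ON part (part_num);
CREATE CLUSTER INDEX ix_vendor_num ON vendor (vendor_num);

-- parven is joined on both keys; an ordinary index on them supports the lookups
CREATE INDEX ix_parven ON parven (part_num, vendor_num);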
The optimizer's query plan for this statement is: first read the 400 pages of part sequentially, then access parven non-sequentially 10,000 times, two pages each time (one index page and one data page), for 20,000 disk pages, and finally access vendor non-sequentially 15,000 times, for another 30,000 disk pages. In total, this indexed query reads 50,400 disk pages.

In fact, the query can be made far more efficient by using temporary tables in three steps.

Step 1: read the data from parven in vendor_num order:

SELECT part_num, vendor_num, part_amount
FROM parven
ORDER BY vendor_num
INTO TEMP pv_by_vn

This statement reads parven sequentially (50 pages), writes a temporary table (50 pages) and sorts it. Assuming a sorting cost of 200 pages, this step costs 300 pages in total.

Step 2: join the temporary table with vendor, writing the result to another temporary table sorted by part_num:

SELECT pv_by_vn.*, vendor.vendor_name
FROM pv_by_vn, vendor
WHERE pv_by_vn.vendor_num = vendor.vendor_num
ORDER BY pv_by_vn.part_num
INTO TEMP pvvn_by_pn

DROP TABLE pv_by_vn

This query reads pv_by_vn (50 pages) and accesses the vendor table through its index; but because the rows arrive in vendor_num order, the vendor table is in effect read in index order (40 + 2 = 42 pages). The output table holds about 95 rows per page, some 160 pages in all; writing and accessing these pages triggers 5 * 160 = 800 reads and writes, so this step reads and writes a total of 892 pages.

Step 3: join the output with part to obtain the final result:

SELECT pvvn_by_pn.*, part.part_desc
FROM pvvn_by_pn, part
WHERE pvvn_by_pn.part_num = part.part_num

DROP TABLE pvvn_by_pn

This final query reads pvvn_by_pn sequentially (160 pages) and looks up the part table 15,000 times through its index. Thanks to the indexes, the optimized procedure performs only 1,772 disk reads and writes in all, an optimization ratio of about 30:1. The author repeated the experiment on Informix Dynamic Server and found that the optimization ratio in elapsed time was 5:1 (with more data, the ratio would probably be even larger).
Summary

Twenty percent of the code accounts for eighty percent of the running time; this famous law of programming holds for database applications as well. Optimization must focus on the key issues, and for database applications the key issue is the execution efficiency of the SQL. The crucial point in query optimization is to make the database server read less data from disk, and to read pages sequentially rather than non-sequentially.