Accelerating SQL Queries with the Characteristic Function Method

xiaoxiao2021-03-06  114

http://www.chinaitlab.com/www/news/Article_show.asp?id=1455

1. The challenge of the query problem

Speeding up queries has always been an important and practical problem for database applications, and it is often what decides whether such an application succeeds. The solutions proposed so far fall broadly into two categories: those that exploit the hardware architecture and the parallel processing capabilities of the DBMS, and those that rely on carefully designed application programs. Previously published articles recommended temporary intermediate tables and table updates as strategies for fast query processing, and also raised the idea of using program transformation to support query optimization. All of these suggestions belong to the application design category and are, in a sense, fairly general. Real applications nevertheless keep producing difficult and "tricky" queries. These are challenging, and solving them by conventional means tends to cost very expensive system resources.

The purpose of this article is to introduce the method first proposed by E. Birger: the characteristic function method for accelerating query processing. The method applies to most SQL database systems, and it can be used directly whenever the system provides a few (at least two) built-in functions such as ABS() and SIGN(). The research report by E. Birger et al. on this method presents many difficult and very typical query requirements together with their solutions, including layered conditions, finding interior boundary values, computing histograms, table transposition, finding the median, ordered-segment problems, and problems concerning boundary values. What these problems have in common is that, solved conventionally, they are hard for the system to bear in storage or processing overhead, and some of them (the median, for example) are quite difficult in their own right. This article discusses several of these interesting query problems and their solutions, and also considers other possible applications of the characteristic function.

2. The characteristic function and its representation

The characteristic function is a purely mathematical concept from point-set topology. The characteristic function of a set S is defined as

$$D_S(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{if } x \notin S \end{cases} \tag{0}$$

Here the function takes a different value depending on whether an element x belongs to the set S. The definition also implies a premise: whether any given element belongs to S is completely determined, and there is no case in which the membership of x is unknown. The characteristic function is thus an identification (or decision) device. It is this property that lets it serve as an equivalent, and more efficient, replacement for the selection criteria of a database query, which is why we call the characteristic function a technique for accelerating queries.

To apply it to database queries more directly, we recast the general form of the characteristic function into the following "database version":

$$D[\alpha] = \begin{cases} 1 & \text{if } \alpha = \text{true} \\ 0 & \text{if } \alpha = \text{false} \end{cases} \tag{1}$$

where α is a Boolean expression. When the arithmetic expression that implements D[α] is built from table attributes and the database's built-in functions, the selecting role of the characteristic function becomes very clear.

It is well known that relational databases generally use three-valued logic, so a Boolean expression may evaluate to an undetermined value ("maybe"). To simplify the presentation and highlight the essential effect of the characteristic function on query acceleration, this article does not consider the case where table attributes are null. In addition, the database (built-in) functions used to express a characteristic function, which we call its "meta-functions", vary with the system and with our own choice. For example, Sybase's Transact-SQL has two very useful built-in functions, ABS() and SIGN(), which can be used directly as meta-functions. If A and B are any two table attributes, then

    D[A != B] = ABS(SIGN(A - B))    (2)

For the meta-functions to be defined, the table attributes must be numeric. Unless stated otherwise, this article therefore treats every table attribute in its examples and general discussion as a non-null numeric value. Equation (2) follows directly from the definitions of the meta-functions:

$$\mathrm{ABS}(a) = |a| \tag{3}$$

$$\mathrm{SIGN}(a) = \begin{cases} -1 & \text{if } a < 0 \\ 0 & \text{if } a = 0 \\ 1 & \text{if } a > 0 \end{cases} \tag{4}$$

In general, the characteristic functions implemented with ABS() and SIGN() are

    D[A = B]  = 1 - ABS(SIGN(A - B))
    D[A != B] = ABS(SIGN(A - B))
    D[A < B]  = 1 - SIGN(1 + SIGN(A - B))    (5)
    D[A <= B] = SIGN(1 - SIGN(A - B))
    D[A > B]  = 1 - SIGN(1 - SIGN(A - B))
    D[A >= B] = SIGN(1 + SIGN(A - B))
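The six formulas in (5) can be checked mechanically. The following is a minimal Python sketch rather than part of the original method: the names `sign`, `D`, `PY_OPS`, and `mismatches` are illustrative, and the `<` and `>=` cases are taken in the forms 1 - SIGN(1 + SIGN(A - B)) and SIGN(1 + SIGN(A - B)).

```python
import operator


def sign(a):
    """SIGN() as defined in (4): -1, 0, or 1."""
    return (a > 0) - (a < 0)


# One arithmetic expression per comparison operator, following (5).
D = {
    "=":  lambda a, b: 1 - abs(sign(a - b)),
    "!=": lambda a, b: abs(sign(a - b)),
    "<":  lambda a, b: 1 - sign(1 + sign(a - b)),
    "<=": lambda a, b: sign(1 - sign(a - b)),
    ">":  lambda a, b: 1 - sign(1 - sign(a - b)),
    ">=": lambda a, b: sign(1 + sign(a - b)),
}

# The ordinary Boolean comparisons used as the reference.
PY_OPS = {
    "=": operator.eq, "!=": operator.ne, "<": operator.lt,
    "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}


def mismatches(lo=-3, hi=3):
    """Return every (op, a, b) where the arithmetic form disagrees with the comparison."""
    return [
        (op, a, b)
        for op in D
        for a in range(lo, hi + 1)
        for b in range(lo, hi + 1)
        if D[op](a, b) != int(PY_OPS[op](a, b))
    ]
```

Calling `mismatches()` over a small range of integers is a quick way to confirm that each expression yields 1 exactly when its comparison holds.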

Further, if α and β are any Boolean expressions, then

    D[NOT α]   = 1 - D[α]
    D[α AND β] = D[α] * D[β]             (6)
    D[α OR β]  = SIGN(D[α] + D[β])

In (5), A and B are table attributes, that is, non-null numeric values. Equation (5) gives a meta-function representation of the six simplest characteristic functions; it is not the only representation, and others are possible. Equation (6) gives the general derivation rules for the Boolean operators. For Boolean expressions built from the simplest relational expressions, (5) and (6) together are the rules for implementing the characteristic function, and for general Boolean expressions they are the basis for deriving it. A whole class of characteristic functions can be derived from (5) and (6), and some of them correspond to SQL selection operators. For example, D[a <= x <= b] clearly decides whether the variable x lies in the closed interval [a, b]. Applying the fourth characteristic function in (5) and the second derivation rule in (6) gives

    D[a <= x <= b]
      = D[(a <= x) AND (x <= b)]
      = D[a <= x] * D[x <= b]
      = SIGN(1 - SIGN(a - x)) * SIGN(1 - SIGN(x - b))    (7)

Clearly, the right-hand side of equation (7) is an equivalent expression for the selection operator BETWEEN. The other three characteristic functions for interval membership, those for the half-open and open intervals, can be obtained in the same way.
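The Boolean rules and the interval functions can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: all function names are made up for the example, the OR rule is read as SIGN(D[α] + D[β]), and D[a < b] is taken as 1 - SIGN(1 + SIGN(a - b)).

```python
def sign(a):
    """SIGN(): -1, 0, or 1."""
    return (a > 0) - (a < 0)


def d_le(a, b):
    """D[a <= b] = SIGN(1 - SIGN(a - b))."""
    return sign(1 - sign(a - b))


def d_lt(a, b):
    """D[a < b], assumed form 1 - SIGN(1 + SIGN(a - b))."""
    return 1 - sign(1 + sign(a - b))


# Derivation rules for the Boolean operators; p and q are 0 or 1.
def d_not(p):
    return 1 - p


def d_and(p, q):
    return p * q


def d_or(p, q):
    return sign(p + q)


# Interval membership built from the rules above.
def d_closed(a, x, b):       # x in [a, b], the BETWEEN case
    return d_and(d_le(a, x), d_le(x, b))


def d_left_open(a, x, b):    # x in (a, b]
    return d_and(d_lt(a, x), d_le(x, b))


def d_right_open(a, x, b):   # x in [a, b)
    return d_and(d_le(a, x), d_lt(x, b))


def d_open(a, x, b):         # x in (a, b)
    return d_and(d_lt(a, x), d_lt(x, b))
```

Each interval function differs from the closed case only in which of the two factors uses the strict comparison.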

3. Example analysis

To illustrate the role of the characteristic function in accelerating query processing, let us analyze an example. Consider a table that describes students' income:

    Students(Name, Status, ParentIncome, SelfIncome)    (8)

Here Name is the primary key and the attribute Status is a flag. When Status is 1, the student's income comes entirely from the parents; when Status is 0, it comes entirely from the student's own work. Suppose we want a query result of the form (Name, Income), where Income is the student's own income (when Status is 0) or the parents' income (when Status is 1). Judging from the structure of Students and the semantics of the desired result, the conventional way to write the query is

    SELECT Name, Income = ParentIncome FROM Students WHERE Status = 1
    UNION                                                              (9)
    SELECT Name, Income = SelfIncome FROM Students WHERE Status = 0

This is a natural and very straightforward query expression, but it is also very inefficient and resource-hungry. It is generally executed as follows: the two subqueries connected by UNION are executed first; a temporary table is then created and the results of both subqueries are stored in it; finally the temporary table is sorted to eliminate possible duplicates, which yields the final result. Besides scanning the Students table for each subquery, this process has to store and sort the intermediate results, and the cost in processing and resources is obvious. The only advantage of query (9) is that it is natural and straightforward, so that anyone can come up with it.

For this example there is a more compact and more efficient query expression. It is not hard to verify that the query

    SELECT Name, Income = ParentIncome * Status + SelfIncome * (1 - Status)
    FROM Students;    (10)

is semantically equivalent to query (9). Query (10), however, not only uses less storage but is also much more efficient, because it scans the Students table only once and avoids the costly sort. The example shows that different query expressions for the same result can differ greatly in processing efficiency and resource consumption. Looking for efficient query expressions is therefore both necessary and feasible.
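The equivalence of (9) and (10) is easy to try out. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for Sybase, so it writes `expr AS Income` instead of the Transact-SQL form `Income = expr`; the sample rows are invented for illustration, and the arithmetic query is read with a plus sign between its two terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students ("
    " Name TEXT PRIMARY KEY,"
    " Status INTEGER NOT NULL,"      # 1: supported by parents, 0: self-supporting
    " ParentIncome REAL NOT NULL,"
    " SelfIncome REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Ann", 1, 900.0, 0.0), ("Bob", 0, 0.0, 1200.0), ("Cid", 1, 700.0, 150.0)],
)

# Conventional form: two filtered subqueries combined with UNION.
union_rows = conn.execute(
    """
    SELECT Name, ParentIncome AS Income FROM Students WHERE Status = 1
    UNION
    SELECT Name, SelfIncome AS Income FROM Students WHERE Status = 0
    ORDER BY Name
    """
).fetchall()

# Arithmetic form: the condition is folded into the select list, one pass over the table.
arith_rows = conn.execute(
    """
    SELECT Name, ParentIncome * Status + SelfIncome * (1 - Status) AS Income
    FROM Students
    ORDER BY Name
    """
).fetchall()

print(union_rows == arith_rows)
```

The two result lists agree here because Name is a key, so the duplicate elimination performed by UNION has nothing to remove.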

Query expression (10) differs from a conventional expression such as (9): in the latter the query conditions are stated explicitly by two WHERE clauses and the UNION operator, while in the former they are hidden in the arithmetic expression of the SELECT clause. Whatever form is used, the example is a query of the "conditional retrieval" type. Comparing the query requirement with expression (10), it is not hard to see that such a simple and correct rewriting was possible only by a lucky coincidence. If the problem changes slightly (for example, if Status takes values other than 0 and 1, or takes more than two flag values), it is no longer so simple. Is there, then, a general and systematic method that converts any explicit condition in a WHERE clause and its related operators into an equivalent arithmetic expression? The answer is yes: it is the characteristic function method introduced below.

4. Characteristic function solutions to several typical queries

As shown above, the characteristic function does what we want: it converts explicit Boolean conditions into scalar expressions. Its most direct and easiest application is therefore conditional retrieval, but its usefulness does not stop there. To convey the role of the characteristic function in solving complex queries, this section introduces and analyzes several typical examples, and for some of them also explains where they apply. To keep the notation compact, the characteristic functions that appear in scalar expressions are not expanded into meta-functions. To run the examples, replace each characteristic function with its expansion from (5) and (6).

4.1 Conditional retrieval

Using the characteristic function, query (10) can be rewritten as

    SELECT Name, Income = ParentIncome * D[Status = 1] + SelfIncome * D[Status = 0]
    FROM Students    (11)

If the retrieval conditions were always this simple, replacing (10) with (11) would gain nothing essential. In practice, however, retrieval conditions are far more complicated. Suppose the previous example is modified slightly: Status keeps its original meaning, and an attribute Age is added with 19 and 23 as age boundaries. Students who are no older than 19 and depend on their parents form the first group; students who are older than 23 or fully self-supporting form the second group; all other students form the third group. Income in the query result now has a different meaning for each group: the parents' income for the first group, the student's own income for the second, and the arithmetic mean of the two for the third. To anyone used to handling queries conventionally, such conditions look complicated. Yet this is a natural requirement, and only a slight extension of the original problem. Following query expression (11), it is not hard to verify that

    SELECT Name,
           Income = ParentIncome * D[Status = 1] * D[Age <= 19]
                  + SelfIncome * SIGN(D[Status = 0] + D[Age > 23])
                  + ((ParentIncome + SelfIncome) / 2.0)
                    * (1 - D[Status = 1] * D[Age <= 19] - SIGN(D[Status = 0] + D[Age > 23]))
    FROM Students;    (12)

is the efficient expression this query calls for. The expression for Income corresponds one-to-one with the query conditions and has the typical cascading IF-THEN-ELSE structure. In general, with the characteristic function, however the query conditions are classified (for example, however many segments a value range is split into), queries of the conditional retrieval type have the typical structure of (11) and (12). More conditions mean more terms, but the logical structure of the arithmetic expression stays the same. Every query written this way scans the Students table only once. Solved conventionally, by contrast, each classification condition needs its own subquery, and the final result is the UNION of all subquery results. The difference in efficiency between the two approaches is very large.
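Query (12) can be exercised by expanding each D[...] into ABS() and SIGN() calls. The following is a sketch under several assumptions: SQLite via Python's `sqlite3` stands in for Sybase; the helpers `d_eq`, `d_le`, `d_gt`, and `d_or` are hypothetical string builders that generate the expansions; SIGN is registered by hand so the example does not depend on the SQLite version; the missing operators in the printed query are read as plus and minus signs; and the sample rows are invented. A CASE expression serves as the reference.

```python
import sqlite3


def d_eq(a, b):
    return f"(1 - ABS(SIGN({a} - {b})))"


def d_le(a, b):
    return f"SIGN(1 - SIGN({a} - {b}))"


def d_gt(a, b):
    return f"(1 - SIGN(1 - SIGN({a} - {b})))"


def d_or(p, q):
    return f"SIGN({p} + {q})"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute(
    "CREATE TABLE Students (Name TEXT PRIMARY KEY, Status INTEGER NOT NULL,"
    " Age INTEGER NOT NULL, ParentIncome INTEGER NOT NULL, SelfIncome INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("Ann", 1, 18, 1000, 200),   # group 1: young and supported by parents
        ("Bob", 0, 21, 400, 900),    # group 2: self-supporting
        ("Cid", 1, 25, 600, 300),    # group 2: older than 23
        ("Dee", 1, 21, 800, 400),    # group 3: everyone else
    ],
)

group1 = f"{d_eq('Status', 1)} * {d_le('Age', 19)}"
group2 = d_or(d_eq("Status", 0), d_gt("Age", 23))

rows12 = conn.execute(
    f"""
    SELECT Name,
           ParentIncome * {group1}
         + SelfIncome * {group2}
         + ((ParentIncome + SelfIncome) / 2.0) * (1 - {group1} - {group2}) AS Income
    FROM Students ORDER BY Name
    """
).fetchall()

# Reference: the same three-way rule written as an explicit CASE expression.
rows_case = conn.execute(
    """
    SELECT Name,
           CASE WHEN Status = 1 AND Age <= 19 THEN ParentIncome
                WHEN Status = 0 OR Age > 23 THEN SelfIncome
                ELSE (ParentIncome + SelfIncome) / 2.0 END AS Income
    FROM Students ORDER BY Name
    """
).fetchall()
```

Because the first two groups are mutually exclusive, subtracting both indicators from 1 gives the indicator of the third group, which is what makes the single expression behave like a cascaded IF-THEN-ELSE.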

4.2 Histogram problems

Computing a histogram is a frequent task in statistical applications. When the statistics come from a database and the data volume is large, this is not easy with conventional methods. With the characteristic function the problem can be solved both efficiently and intuitively. As a concrete example, suppose the data are stored in the table Employee(Name, Age, Dept, Kids), where Kids is the number of children each employee has. We want the values NoKids, OneKid, FewKids, and ManyKids: the number of employees with no children, with one child, with two or three children, and with more than three children. The conventional method scans Employee four times, once for each of NoKids, OneKid, FewKids, and ManyKids, and then combines the results with 3 UNION operations. If the employees are divided into 8 or more segments rather than 4, the inefficiency of the conventional method becomes even more obvious. With the characteristic function, the solution is simply

    SELECT NoKids   = SUM(D[Kids = 0]),
           OneKid   = SUM(D[Kids = 1]),
           FewKids  = SUM(D[2 <= Kids <= 3]),
           ManyKids = SUM(D[Kids >= 4])
    FROM Employee;    (13)

The correctness of this query is easy to verify. For any row in the table, if Kids = 0 then D[Kids = 0] = 1 and D[Kids = 1] = D[2 <= Kids <= 3] = D[Kids >= 4] = 0, so the row is counted in the NoKids sum and in none of the other three. The same reasoning holds for the other values of Kids, which shows that the result of (13) is the answer to the original problem. More importantly, the result is obtained efficiently, because the table is scanned only once. If the employees are divided into more segments, the query still scans the table only once; the only difference is the number of terms in the select list, and the logical complexity of the query expression does not increase.
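A runnable version of the histogram query (13) might look as follows. This is a sketch, assuming SQLite through Python's `sqlite3` in place of Sybase, a hand-registered SIGN function, hypothetical string helpers (`d_eq`, `d_le`, `d_ge`, `d_between`) that expand D[...] according to (5) and (7), and invented sample rows.

```python
import sqlite3


def d_eq(a, b):
    return f"(1 - ABS(SIGN({a} - {b})))"


def d_le(a, b):
    return f"SIGN(1 - SIGN({a} - {b}))"


def d_ge(a, b):
    return f"SIGN(1 + SIGN({a} - {b}))"


def d_between(lo, x, hi):
    # D[lo <= x <= hi] as the product of two D[<=] factors.
    return f"{d_le(lo, x)} * {d_le(x, hi)}"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute(
    "CREATE TABLE Employee (Name TEXT PRIMARY KEY, Age INTEGER NOT NULL,"
    " Dept TEXT NOT NULL, Kids INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        ("A", 24, "Ops", 0), ("B", 30, "Ops", 1), ("C", 41, "Dev", 2),
        ("D", 50, "Dev", 3), ("E", 38, "Dev", 5), ("F", 22, "Ops", 0),
    ],
)

# One pass over Employee; every row adds 1 to exactly one of the four sums.
histogram = conn.execute(
    f"""
    SELECT SUM({d_eq('Kids', 0)})           AS NoKids,
           SUM({d_eq('Kids', 1)})           AS OneKid,
           SUM({d_between(2, 'Kids', 3)})   AS FewKids,
           SUM({d_ge('Kids', 4)})           AS ManyKids
    FROM Employee
    """
).fetchone()
```

Adding a further bucket only adds one more SUM term to the select list; the table is still read once.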

The same basic problem also gives rise to a number of variations. Without the basic solution above as a foundation, these variations are often hard to solve directly.

Variation one: for the same table Employee, compute the distribution of children by department, that is, a result of the form (Dept, NoKids, OneKid, FewKids, ManyKids).

This is clearly answered by the query

    SELECT Dept,
           NoKids   = SUM(D[Kids = 0]),
           OneKid   = SUM(D[Kids = 1]),
           FewKids  = SUM(D[2 <= Kids <= 3]),
           ManyKids = SUM(D[Kids >= 4])
    FROM Employee
    GROUP BY Dept;    (14)

Variation two: compute the histogram of the distribution of children by age bracket. To be specific, divide the employees' ages into three brackets: under 25, over 45, and from 25 to 45, called age categories 1, 3, and 2 respectively. The required result has the form (AgeCategory, NoKids, OneKid, FewKids, ManyKids), where

$$\text{AgeCategory} = \begin{cases} 1 & \text{if Age} < 25 \\ 2 & \text{if } 25 \le \text{Age} \le 45 \\ 3 & \text{if Age} > 45 \end{cases} \tag{15}$$

Although this problem is considerably harder, for systems that allow expressions in the GROUP BY clause (for example, Sybase's Transact-SQL) the answer is direct. It is not hard to verify that the following query is exactly what we need:

    SELECT AgeCategory = 1 * D[Age < 25] + 2 * D[25 <= Age <= 45] + 3 * D[Age > 45],
           NoKids   = SUM(D[Kids = 0]),
           OneKid   = SUM(D[Kids = 1]),
           FewKids  = SUM(D[Kids <= 3 AND Kids >= 2]),
           ManyKids = SUM(D[Kids >= 4])
    FROM Employee
    GROUP BY 1 * D[Age < 25] + 2 * D[25 <= Age <= 45] + 3 * D[Age > 45];    (16)

Here the select list and the GROUP BY clause both use the three-bracket age expression for AgeCategory, and it is easy to check against (15) that (16) is indeed the query we need.
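Both variations can be run against a small table. The sketch below again assumes SQLite via Python's `sqlite3`, a hand-registered SIGN function, hypothetical expansion helpers, and invented rows; it also assumes that the range bucket belongs to FewKids and that the three age terms are added together.

```python
import sqlite3


def d_eq(a, b):
    return f"(1 - ABS(SIGN({a} - {b})))"


def d_lt(a, b):
    return f"(1 - SIGN(1 + SIGN({a} - {b})))"


def d_le(a, b):
    return f"SIGN(1 - SIGN({a} - {b}))"


def d_gt(a, b):
    return f"(1 - SIGN(1 - SIGN({a} - {b})))"


def d_ge(a, b):
    return f"SIGN(1 + SIGN({a} - {b}))"


def d_between(lo, x, hi):
    return f"{d_le(lo, x)} * {d_le(x, hi)}"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute(
    "CREATE TABLE Employee (Name TEXT PRIMARY KEY, Age INTEGER NOT NULL,"
    " Dept TEXT NOT NULL, Kids INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        ("A", 24, "Ops", 0), ("B", 30, "Ops", 1), ("C", 41, "Dev", 2),
        ("D", 50, "Dev", 3), ("E", 38, "Dev", 5), ("F", 22, "Ops", 0),
    ],
)

# The four histogram columns are shared by both variations.
kid_cols = (
    f"SUM({d_eq('Kids', 0)}) AS NoKids, "
    f"SUM({d_eq('Kids', 1)}) AS OneKid, "
    f"SUM({d_between(2, 'Kids', 3)}) AS FewKids, "
    f"SUM({d_ge('Kids', 4)}) AS ManyKids"
)

# Variation one: group by an ordinary column.
by_dept = conn.execute(
    f"SELECT Dept, {kid_cols} FROM Employee GROUP BY Dept ORDER BY Dept"
).fetchall()

# Variation two: group by a computed category, 1, 2, or 3 as in (15).
age_category = (
    f"1 * {d_lt('Age', 25)} + 2 * {d_between(25, 'Age', 45)} + 3 * {d_gt('Age', 45)}"
)
by_age = conn.execute(
    f"SELECT {age_category} AS AgeCategory, {kid_cols} "
    f"FROM Employee GROUP BY {age_category} ORDER BY 1"
).fetchall()
```

Exactly one of the three age indicators is 1 for any row, so the weighted sum evaluates to the category number itself.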

Following the same idea, more complex problems can be handled as well. The "wider" the histogram becomes, the more clearly the effectiveness of the characteristic function shows.

4.3 Table transposition

Table transposition is a transformation that turns a long, narrow table into a short, wide one, and it comes up often in database application design. C. J. Date noticed this long ago and gave general principles for dealing with it. Date calls the former form the "column" representation of a table and the latter the "row" representation. Because SQL's set functions are essentially column-oriented rather than row-oriented, the column representation is the more convenient one for application processing. Base tables are therefore mostly designed in the column representation.

When a base table in column representation is queried, the characteristic function is a powerful tool for carrying out the transposition. For example, the table Bonus(Name, Month, Amount) records employees' monthly bonuses. This table is clearly in column representation; its row representation could be written as Bonus'(Name, JanAmount, ..., DecAmount). To list each employee's bonus for every month, querying Bonus' is the easiest. However, the row representation is essentially suited only to this query and adapts poorly to other requirements, whereas the column representation Bonus can serve not only this query but other processing requirements as well. With the characteristic function, the requirement above is expressed as

    SELECT Name,
           JanAmount = SUM(Amount * D[Month = 1]),
           FebAmount = SUM(Amount * D[Month = 2]),
           ....,
           DecAmount = SUM(Amount * D[Month = 12])
    FROM Bonus
    GROUP BY Name    (17)

The reader may wish to work through how the characteristic function satisfies this query against the Bonus table.
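The transposition in (17) can be generated rather than typed out twelve times. A minimal sketch, assuming SQLite via Python's `sqlite3`, a hand-registered SIGN function, a hypothetical `d_eq` helper that expands D[Month = m], month columns named after three-letter abbreviations, and invented bonus rows:

```python
import sqlite3

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def d_eq(a, b):
    return f"(1 - ABS(SIGN({a} - {b})))"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute(
    "CREATE TABLE Bonus (Name TEXT NOT NULL, Month INTEGER NOT NULL, Amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Bonus VALUES (?, ?, ?)",
    [("Ann", 1, 100), ("Ann", 2, 150), ("Ann", 12, 50), ("Bob", 2, 70), ("Bob", 12, 30)],
)

# One SUM(Amount * D[Month = m]) column per month, built in a loop.
month_cols = ",\n       ".join(
    f"SUM(Amount * {d_eq('Month', m)}) AS {name}Amount"
    for m, name in enumerate(MONTHS, start=1)
)
query17 = f"SELECT Name,\n       {month_cols}\nFROM Bonus GROUP BY Name ORDER BY Name"

wide = conn.execute(query17).fetchall()
```

Each row of `wide` holds one employee followed by twelve monthly totals, with 0 for months that have no bonus row.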

4.4 Finding the median

Experimental data processing often requires a "median". As is well known, the median has two definitions: the statistical definition and the so-called "financial" definition. Under the statistical definition the median must be one of the values in the group. When the group has an even number of values, a choice must therefore be made between the two middle values, taking either the smaller or the larger; which one depends on the application. Under the financial definition, an even number of values is handled differently (the two middle values are averaged). When the number of values is odd (say n), the median is item number (n + 1) / 2 of the sorted values. Whichever definition is adopted, finding the median of a group of numbers is always a problem. Even a trained SQL programmer who handles it with general methods has to write a very complex procedure. With the characteristic function, a very simple solution is easy to obtain. To be specific, consider a table of experimental data Data(Value) and assume the data set contains no duplicate values. Assume also that all values are non-null. Under these assumptions, the median of the data set Data is the result of the following query:

    SELECT x.Value
    FROM Data x, Data y
    GROUP BY x.Value
    HAVING SUM(D[y.Value <= x.Value]) = (COUNT(*) + 1) / 2    (18)

For each x.Value, the expression SUM(D[y.Value <= x.Value]) is the number of values in the data set that are less than or equal to x.Value, so the HAVING clause selects the desired median. (Careful readers will notice that this relies on Sybase dividing two integers as integers, truncating the quotient. Also, when the data set has an even number of elements, the query returns the smaller of the two middle values, which agrees with the statistical definition of the median.)

This way of finding the median extends easily to the case where the data set is divided into subsets by some attribute. For example, consider the data set Data2(Partition, Value), where Value is still a non-null numeric value but Partition may be of any data type and divides Data2 into separate subsets. The following query returns the median of each subset:

    SELECT x.Partition, x.Value
    FROM Data2 x, Data2 y
    WHERE x.Partition = y.Partition
    GROUP BY x.Partition, x.Value
    HAVING SUM(D[y.Value <= x.Value]) = (COUNT(*) + 1) / 2    (19)

The join condition in the self-join above is needed to respect the partitioning; apart from that, the two queries do not differ in essence.
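The two median queries can be tried on small data sets. This is a sketch under assumptions: SQLite via Python's `sqlite3` replaces Sybase (it also truncates integer division), SIGN is registered by hand, `d_le` is a hypothetical helper that expands D[a <= b], the count is read as COUNT(*) + 1, the column name Partition is quoted because it is a keyword in some SQL dialects, and the values are invented.

```python
import sqlite3


def d_le(a, b):
    return f"SIGN(1 - SIGN({a} - {b}))"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)

# Whole-table median: four distinct values, so the smaller middle value is expected.
conn.execute("CREATE TABLE Data (Value INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO Data VALUES (?)", [(7,), (1,), (5,), (3,)])

median = conn.execute(
    f"""
    SELECT x.Value
    FROM Data x, Data y
    GROUP BY x.Value
    HAVING SUM({d_le('y.Value', 'x.Value')}) = (COUNT(*) + 1) / 2
    """
).fetchall()

# Per-partition median: the join condition keeps each value within its own subset.
conn.execute('CREATE TABLE Data2 ("Partition" TEXT NOT NULL, Value INTEGER NOT NULL)')
conn.executemany(
    "INSERT INTO Data2 VALUES (?, ?)",
    [("a", 10), ("a", 20), ("a", 30), ("b", 4), ("b", 8)],
)

medians = conn.execute(
    f"""
    SELECT x."Partition", x.Value
    FROM Data2 x, Data2 y
    WHERE x."Partition" = y."Partition"
    GROUP BY x."Partition", x.Value
    HAVING SUM({d_le('y.Value', 'x.Value')}) = (COUNT(*) + 1) / 2
    ORDER BY 1
    """
).fetchall()
```

The self-join compares every value with every other value, so the work grows quadratically with the size of the data set (or of each partition); the query is compact rather than cheap.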

4.5 End-value problems

In some practical problems, each row of a table contains several data items that can be compared with one another. Finding the largest or smallest of these items is what we call an "end-value problem". For example, in the exam table Scores(Name, Sat1, Sat2), Sat1 and Sat2 are a student's results on two tests. Suppose we need each student's best result, that is, a result of the form (Name, BestSat), where BestSat is the better of the student's two scores.

Some database systems (such as Oracle) have a built-in function GREATEST(value1, value2, ...) that solves this directly. In systems without such a function (such as Sybase), the usual solution scans the whole table once for the Sat1 values that satisfy Sat1 >= Sat2, scans it a second time for the Sat2 values that satisfy Sat2 > Sat1, and then combines the intermediate results with a UNION operation to obtain the final result.

With the characteristic function, the table is scanned only once and the query expression is very simple:

    SELECT Name, BestSat = Sat1 * D[Sat1 >= Sat2] + Sat2 * D[Sat2 > Sat1]
    FROM Scores;    (20)

If we also want to know which test produced the best score, it is enough to add one item to the select list of (20):

    SELECT Name,
           BestSat  = Sat1 * D[Sat1 >= Sat2] + Sat2 * D[Sat2 > Sat1],
           WhichSat = 1 * D[Sat1 >= Sat2] + 2 * D[Sat2 > Sat1]
    FROM Scores;    (21)

The result is slightly ambiguous only when Sat1 = Sat2, which does no harm (in that case (21) reports the first test). Nothing else seems to need explanation. Only the maximum has been considered here; the minimum can be handled in the same way. Interested readers may wish to consider variations of this problem and their answers.
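Query (21) can be checked against a plain row-by-row computation. A sketch, assuming SQLite via Python's `sqlite3`, a hand-registered SIGN function, hypothetical helpers `d_ge` and `d_gt` for the two comparisons, multiplication and addition where the printed query lost its operators, and invented scores:

```python
import sqlite3


def d_ge(a, b):
    return f"SIGN(1 + SIGN({a} - {b}))"


def d_gt(a, b):
    return f"(1 - SIGN(1 - SIGN({a} - {b})))"


def sql_sign(v):
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute(
    "CREATE TABLE Scores (Name TEXT PRIMARY KEY, Sat1 INTEGER NOT NULL, Sat2 INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?, ?)",
    [("Ann", 1200, 1350), ("Bob", 1400, 1300), ("Cid", 1250, 1250)],
)

first_wins = d_ge("Sat1", "Sat2")    # 1 when Sat1 >= Sat2 (ties go to the first test)
second_wins = d_gt("Sat2", "Sat1")   # 1 when Sat2 > Sat1

best = conn.execute(
    f"""
    SELECT Name,
           Sat1 * {first_wins} + Sat2 * {second_wins} AS BestSat,
           1 * {first_wins} + 2 * {second_wins}       AS WhichSat
    FROM Scores ORDER BY Name
    """
).fetchall()

# Reference computed in Python from the raw rows.
expected = [
    (name, max(s1, s2), 1 if s1 >= s2 else 2)
    for name, s1, s2 in conn.execute("SELECT Name, Sat1, Sat2 FROM Scores ORDER BY Name")
]
```

Swapping the two comparisons for `<=` and `<` would give the minimum instead of the maximum.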

5. Several questions worth further thought

First, the representation of the characteristic functions given by (5) is not the only one; other forms exist. Different meta-functions differ in computational complexity, so choosing meta-functions of lower computational complexity may improve the efficiency of the characteristic function further. Second, this article assumed from the outset that the attributes appearing in a characteristic function are non-null numeric values. This is a sufficient condition for the meta-functions ABS() and SIGN() to be defined and for all of (5) to hold, so it may be possible to weaken it. Since almost all mainstream database systems support three-valued logic (3VL), removing the non-null assumption is especially worthwhile.
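Both points can be illustrated briefly. The sketch below is an assumption-laden illustration, not a proposal from the original article: it uses SQLite via Python's `sqlite3`, registers SIGN by hand with NULL passed through, compares D[A = B] from (5) with one alternative form, 1 - SIGN(ABS(A - B)), and shows what happens to a row whose attribute is NULL. The table `T` and its rows are invented.

```python
import sqlite3


def sql_sign(v):
    # NULL in, NULL out, as for the built-in arithmetic functions.
    return None if v is None else (v > 0) - (v < 0)


conn = sqlite3.connect(":memory:")
conn.create_function("SIGN", 1, sql_sign)
conn.execute("CREATE TABLE T (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [(1, 1), (2, 3), (None, 1)])

total, equal, not_equal, equal_alt = conn.execute(
    """
    SELECT COUNT(*),
           SUM(1 - ABS(SIGN(A - B))),   -- D[A = B] as in (5)
           SUM(ABS(SIGN(A - B))),       -- D[A != B] as in (5)
           SUM(1 - SIGN(ABS(A - B)))    -- an alternative form of D[A = B]
    FROM T
    """
).fetchone()

print(total, equal, not_equal, equal_alt)
```

The alternative form agrees with (5) on the non-null rows. The row with a NULL attribute evaluates to NULL in every expression and is skipped by SUM, so it is counted neither as equal nor as unequal, which is the gap the non-null assumption hides.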

When reposting, please cite the original address: https://www.9cbs.com/read-101233.html
