Query Optimization and Paging Algorithm Scheme for Massive Database

xiaoxiao2021-03-06  107

As the "Golden Shield Project" advances and public security informatization develops rapidly, public security computer application systems are now widely used across police branches and departments. At the same time, the database at the core of these systems, which stores the system data, has grown sharply with real-world use. Some large systems, such as the population system, hold more than 10 million records, a volume that can fairly be called massive. How to query, analyze, aggregate, and extract data from such large databases has become a problem that system administrators and database administrators everywhere need to solve.

In what follows, I use an "office automation" system as an example to explore how to implement fast data extraction and data paging in an MS SQL Server database holding 10 million records. The following code shows part of the structure of the "red-header document" table in our example database:

CREATE TABLE [dbo].[TGongwen] (    -- TGongwen is the red-header document table

    [Gid] [int] IDENTITY (1, 1) NOT NULL,    -- ID of this table, also the primary key

    [title] [varchar] (80) COLLATE Chinese_PRC_CI_AS NULL,    -- title of the red-header document

    [fariqi] [datetime] NULL,    -- release date

    [neibuyonghu] [varchar] (70) COLLATE Chinese_PRC_CI_AS NULL,    -- publishing user

    [reader] [varchar] (900) COLLATE Chinese_PRC_CI_AS NULL    -- users who need to read the document, separated by commas

    -- remaining columns omitted; TEXTIMAGE_ON is valid only if they include text, ntext, or image columns

) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

Next, we add 10 million records to the database:

DECLARE @i int

SET @i = 1

WHILE @i <= 250000

BEGIN

    INSERT INTO TGongwen (fariqi, neibuyonghu, reader, title) VALUES ('2004-2-5', 'Communication Branch', 'Communication Branch, Office, Director, Secretary, Admin, Criminal Investigation Detachment, Traffic Patrol Detachment, Economic Investigation Detachment, Household Administration, Security Detachment, Foreign Affairs', 'This is the first 250,000 records')

    SET @i = @i + 1

END

GO

DECLARE @i int

SET @i = 1

WHILE @i <= 250000

BEGIN

    INSERT INTO TGongwen (fariqi, neibuyonghu, reader, title) VALUES ('2004-9-16', 'office', 'Office, Communication Branch, Director Wang, Director Liu, Secretary, Criminal Investigation Detachment, Special Detachment, Traffic Patrol Detachment, Economic Investigation Detachment, Household Administration, Foreign Affairs', 'This is the middle 250,000 records')

    SET @i = @i + 1

END

GO

DECLARE @h int

SET @h = 1

WHILE @h <= 100

BEGIN

    DECLARE @i int

    SET @i = 2002

    WHILE @i <= 2003

    BEGIN

        DECLARE @j int

        SET @j = 0

        WHILE @j < 50

        BEGIN

            DECLARE @k int

            SET @k = 0

            WHILE @k < 50

            BEGIN

                -- minutes come from @j and seconds from @k, so each (minute, second) pair is a distinct timestamp

                INSERT INTO TGongwen (fariqi, neibuyonghu, reader, title) VALUES (CAST(@i AS varchar(4)) + '-8-15 3:' + CAST(@j AS varchar(2)) + ':' + CAST(@k AS varchar(2)), 'Communication Branch', 'Office, Communication Branch, Director Wang, Director Liu, Secretary, Criminal Investigation Detachment, Special Detachment, Traffic Patrol Detachment, Economic Investigation Detachment, Household Administration, Foreign Affairs', 'This is the last 500,000 records')

                SET @k = @k + 1

            END

            SET @j = @j + 1

        END

        SET @i = @i + 1

    END

    SET @h = @h + 1

END

GO

DECLARE @i int

SET @i = 1

WHILE @i <= 9000000

BEGIN

    INSERT INTO TGongwen (fariqi, neibuyonghu, reader, title) VALUES ('2004-5-5', 'Communication Branch', 'Communication Branch, Office, Director, Secretary, Admin, Criminal Investigation Detachment, Traffic Patrol Detachment, Economic Investigation Detachment, Household Administration, Security Detachment, Foreign Affairs', 'This is the last 9 million records added')

    -- step by 1 so that the loop really inserts 9 million rows (a step of 1000000 would insert only 9)

    SET @i = @i + 1

END

GO

With the statements above, we created 250,000 records published by the Communication Branch on February 5, 2004; 250,000 records published by the office on September 16, 2004; 100 batches of 2,500 records each for 2002 and 2003, published by the Communication Branch on the same date but with different minutes and seconds (500,000 in total); and 9 million records published by the Communication Branch on May 5, 2004. That makes 10 million records in all.
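A quick check after loading can confirm that the data matches this description. The queries below are a minimal sketch against the example table; the column aliases are illustrative and not part of the original scripts.

```sql
-- Number of records per publishing user.
SELECT neibuyonghu, COUNT(*) AS record_count
FROM TGongwen
GROUP BY neibuyonghu

-- Total number of records and number of distinct release timestamps.
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT fariqi) AS distinct_fariqi
FROM TGongwen
```

The distinct count matters later, because the choice of the clustered index column depends on how many different values the column holds.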

First, establish an "appropriate" index to suit the situation

Establishing an "appropriate" index is the primary prerequisite for query optimization.

An index is, besides the table itself, another important user-defined data structure stored on physical media. When data is searched by the value of the index key, the index provides fast access to it. In fact, a database can retrieve the results of a SELECT statement without any index, but as tables grow, the effect of an "appropriate" index becomes more and more obvious. Note that we say "appropriate": if an index is used without careful thought about how it is implemented, it can just as well reduce the database's performance.

(1) Understand the index structure in depth

In fact, you can think of an index as a special kind of directory. Microsoft SQL Server provides two kinds of index: the clustered index and the nonclustered index. Below, we use an example to explain the difference between them:

The body of a Chinese dictionary is itself a clustered index. For example, to look up the character "An", we naturally open the first few pages of the dictionary, because its pinyin is "an", and a dictionary sorted by pinyin starts with the letter "a" and ends with "z", so "An" falls near the front. If you go through every entry beginning with "a" and still cannot find the character, it is not in your dictionary. Likewise, to look up "Zhang", you turn to the end of the dictionary, because its pinyin is "zhang". In other words, the body of the dictionary is itself a directory; you do not need to consult another directory to find what you are looking for. We call this kind of body content, which is itself a directory arranged by a certain rule, a "clustered index".

If you know a character, you can quickly find it by its pronunciation. But you may also meet a character you do not recognize and cannot pronounce. Then the method above does not work, and you need to look the character up by its radical and turn directly to the page number listed after it. However, the ordering of characters you get from the "radical index" and the "character lookup table" is not the ordering of the dictionary body. For example, when looking up "Zhang", the lookup table under its radical gives page 672 for "Zhang"; the entry above it is "Chi", on page 63, and the entry below it is on page 390. Clearly these characters do not sit directly above and below "Zhang" in the body. The three consecutive entries you see in the lookup table show their order in the nonclustered index, which is a mapping of the dictionary body. You can find the character you need this way, but it takes two steps: first find the entry in the directory, then turn to the page it points to.

We call this arrangement, in which the directory is purely a directory and the body is purely the body, a "nonclustered index".

From the example above, we can understand what a "clustered index" and a "nonclustered index" are.

It follows that each table can have only one clustered index, because the body can be sorted in only one way.

(2) When to use a clustered or a nonclustered index

The following table summarizes when to use a clustered index or a nonclustered index (very important).

Action description                       Clustered index    Nonclustered index
Columns often grouped or sorted          Should             Should
Returning data within a range            Should             Should not
One or very few distinct values          Should not         Should not
Small number of distinct values          Should             Should not
Large number of distinct values          Should not         Should
Frequently updated columns               Should not         Should
Foreign key columns                      Should             Should
Frequently modified index columns        Should not         Should

In fact, the table above follows from the definitions of clustered and nonclustered indexes given earlier. Take "returning data within a range". Suppose a table has a date column and the clustered index is built on it. A query for all data between January 1, 2004 and October 1, 2004 will then be very fast, because the body of your "dictionary" is sorted by date: the clustered index only needs to find the first and last rows of the range. With a nonclustered index, by contrast, the page number of each row must first be looked up in the directory, and then the content fetched by page number.
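To make the range-query case concrete, here is a minimal sketch against the example table. It assumes the table has no clustered index yet, and the index name IX_TGongwen_fariqi is hypothetical. The date literals use the unseparated YYYYMMDD form, which SQL Server interprets the same way regardless of language settings.

```sql
-- Hypothetical index name; assumes TGongwen has no clustered index yet.
-- The table rows are physically ordered by fariqi after this statement.
CREATE CLUSTERED INDEX IX_TGongwen_fariqi ON TGongwen (fariqi)
GO

-- The range predicate maps to one contiguous run of rows in the clustered index.
SELECT Gid, fariqi, neibuyonghu, title
FROM TGongwen
WHERE fariqi >= '20040101' AND fariqi <= '20041001'
GO
```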

(3) Common misconceptions about index use, seen in practice

Theory exists to be applied. Although we have just listed when a clustered or a nonclustered index should be used, in practice these rules are easily overlooked or not weighed against the actual situation. Below, drawing on problems met in practice, we discuss some misconceptions about index use, so that readers can master how to build indexes.

1. The primary key is the clustered index

I consider this idea extremely wrong and a waste of the clustered index, even though SQL Server by default creates the clustered index on the primary key. Typically, we create an ID column in every table to tell records apart; this column auto-increments, usually with a step of 1. The Gid column in our office automation example is such a column. If we make it the primary key, SQL Server makes it the clustered index by default. The benefit is that your data is physically sorted by ID in the database, but I think this is of little value.

The advantages of the clustered index are obvious, and the rule that each table can have only one makes it all the more precious.

From the definition of the clustered index discussed above, we can see that its greatest benefit is that it quickly narrows the query range according to the query conditions and avoids a full table scan. In practice, because ID numbers are generated automatically, we do not know the ID of each record, so we rarely query by ID. This makes a clustered index on the ID primary key a waste of resources. Second, a clustered index on a column whose values are all different does not follow the rule that "a clustered index should not be built on a column with a large number of distinct values". Of course, this only has a negative effect when users frequently modify record content, especially indexed columns; it has no effect on query speed.

In an office automation system, whether it is the home page showing documents and meetings awaiting the user's signature or a user's document search, every data query depends on two fields: the "date" and the user's own "user name".

Typically, the home page of an office automation system shows the documents or meetings that each user has not yet signed for. Although our WHERE clause can restrict the query to what the current user has not yet signed for, if your system has been running for a long time and holds a lot of data, then scanning the whole table every time a user opens the home page makes little sense: the vast majority of users have already read the documents from more than a month ago, so this only adds overhead to the database. In fact, when a user opens the home page, the database can query only the documents the user has not read in the last 3 months, using the "date" field to limit the table scan and speed up the query. If your office automation system has been running for 2 years, your home page will in theory display 8 times faster than before, or even faster.

I say "in theory" because if your clustered index is still blindly built on the ID primary key, your query speed will not be that high, even if you build a (nonclustered) index on the "date" field. Let's look at how various queries perform with 10 million records (250,000 of them within the last 3 months):

(1) Clustered index on the primary key only, with no time range restriction:

SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen

Time: 128470 ms (that is, 128 seconds)

(2) Clustered index on the primary key and a nonclustered index on fariqi:

SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen

WHERE fariqi > DATEADD(day, -90, GETDATE())

Time: 53763 ms (54 seconds)

(3) Clustered index on the date column (fariqi):

SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen

WHERE fariqi > DATEADD(day, -90, GETDATE())

Time: 2423 ms (2 seconds)

Although each statement extracts 250,000 records, the differences between the cases are huge, especially when the clustered index is on the date column. In fact, if your database really holds 10 million records and the clustered index is on the ID primary key, as in cases 1 and 2 above, the web page simply times out and cannot be displayed at all. This is the most important reason why I gave up using the ID column as the clustered index.
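One way to keep Gid as the primary key without spending the clustered index on it is to declare the key as nonclustered. The following is a sketch under assumptions: the constraint name PK_TGongwen_Gid is hypothetical, and Gid is assumed to have no primary key constraint yet.

```sql
-- Hypothetical constraint name; assumes Gid has no primary key constraint yet.
-- NONCLUSTERED keeps Gid unique but leaves the single clustered index free
-- for another column, such as the date column.
ALTER TABLE TGongwen
    ADD CONSTRAINT PK_TGongwen_Gid PRIMARY KEY NONCLUSTERED (Gid)
GO

-- List the indexes on the table to see which one is clustered.
EXEC sp_helpindex 'TGongwen'
GO
```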

The timings above were obtained by placing the following before each SELECT statement:

DECLARE @d datetime

SET @d = GETDATE()

and the following after the SELECT statement:

SELECT [Statement execution time (ms)] = DATEDIFF(ms, @d, GETDATE())
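Put together, the timing harness for one test looks roughly like this. It is a sketch that reuses query (3) from above. SET STATISTICS TIME is a built-in SQL Server alternative that reports parse, compile, and execution times without extra variables.

```sql
-- Manual timing: record the start time, run the query, report the difference.
-- The datetime type has a resolution of about 3 ms, so very short runs are imprecise.
DECLARE @d datetime
SET @d = GETDATE()

SELECT Gid, fariqi, neibuyonghu, title
FROM TGongwen
WHERE fariqi > DATEADD(day, -90, GETDATE())

SELECT [Statement execution time (ms)] = DATEDIFF(ms, @d, GETDATE())
GO

-- Built-in alternative: the server prints CPU and elapsed time for each statement.
SET STATISTICS TIME ON
GO
SELECT Gid, fariqi, neibuyonghu, title
FROM TGongwen
WHERE fariqi > DATEADD(day, -90, GETDATE())
GO
SET STATISTICS TIME OFF
GO
```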

2. As long as an index is created, query speed improves significantly

In fact, in the example above, statements 2 and 3 are identical and the indexed field is the same. The only difference is that the former has a nonclustered index on fariqi and the latter a clustered index on it, yet the query speeds are worlds apart. So simply building an index on any field does not necessarily improve query speed.

From the statements that populate the table, we can see that the fariqi field has 5,003 distinct values among the 10 million records. Building the clustered index on this field could not be more appropriate. In reality, we issue several documents every day, and they share the same issue date, which fully meets the requirement for a clustered index: "the values should neither be mostly identical nor be identical in only very few rows." Seen this way, building an "appropriate" clustered index is very important for improving query speed.

3. Adding every field whose queries need speeding up to the clustered index improves query speed

As discussed above, data queries cannot do without two fields: the "date" and the user's own "user name". Since both fields are so important, we can combine them into a composite index.

Many people think that adding any field to the clustered index will improve query speed, while others wonder: if the fields of a composite clustered index are queried separately, will the query slow down? With this question in mind, let's look at the following query speeds (each result set is 250,000 records). The date column fariqi is the leading column of the composite clustered index, and the user name neibuyonghu comes second.

(1) SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen WHERE fariqi > '2004-5-5'

Query speed: 2513 ms

(2) SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen WHERE fariqi > '2004-5-5' AND neibuyonghu = 'office'

Query speed: 2516 ms

(3) SELECT Gid, fariqi, neibuyonghu, title FROM TGongwen WHERE neibuyonghu = 'office'

Query speed: 60280 ms
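For reference, a composite clustered index with this column order could be created as in the sketch below. The index name is hypothetical, and the statement assumes the table has no other clustered index at the time.

```sql
-- Hypothetical index name; assumes TGongwen has no clustered index yet.
-- fariqi is the leading column, neibuyonghu the second column.
CREATE CLUSTERED INDEX IX_TGongwen_fariqi_yonghu
    ON TGongwen (fariqi, neibuyonghu)
GO
```

Rows are ordered by fariqi first and by neibuyonghu only within the same fariqi value, which is why a condition on neibuyonghu alone cannot narrow the range and query (3) is so much slower.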

From the tests above, we can see that querying on only the leading column of the clustered index is almost as fast as querying on all columns of the composite clustered index, and even slightly faster (when the result sets are the same size). If only a non-leading column of the composite clustered index is used as the condition, the index has no effect at all. Of course, statements 1 and 2 take the same time because they return the same number of rows. If all columns of the composite index are used and the result set is small, "index coverage" is achieved and performance is optimal. Also remember: whether or not you often use the other columns of the clustered index, its leading column must be the most frequently used column.

(4) Summary of index usage experience not found in other books

1. Querying by the clustered index is faster than querying by a primary key that is not the clustered index

Here are the example statements (each extracts 250,000 records):

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16'

Time: 3326 ms

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE Gid <= 250000

Time: 4470 ms

Here, the clustered index is about a quarter faster than the primary key that is not the clustered index (3326 ms versus 4470 ms).

2. ORDER BY on the clustered index is faster than ORDER BY on an ordinary primary key, especially with small data volumes

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen ORDER BY fariqi

Time: 12936 ms

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen ORDER BY Gid

Time: 18843 ms

Here, ORDER BY on the clustered index is about 3/10 faster than ORDER BY on the ordinary primary key. In fact, if the amount of data is small, sorting by the clustered index column is noticeably faster than sorting by a nonclustered index column; if the amount of data is large, say more than 100,000 rows, the difference in speed is not obvious.

3. When a time range on the clustered index is queried, search time shrinks in proportion to the share of the table that the data occupies, regardless of how many columns of the clustered index are used

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi > '2004-1-1'

Time: 6343 ms (extracts 1 million records)

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi > '2004-6-6'

Time: 3170 ms (extracts 500,000 records)

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16'

Time: 3326 ms (the same result as the previous statement; if the number of rows collected is the same, the greater-than and equals operators perform the same)

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi > '2004-1-1' AND fariqi < '2004-6-6'

Time: 3280 ms

4. A date column does not slow queries down because it carries minutes and seconds

In the following example there are 1 million records in total. The 500,000 records after January 1, 2004 have only two distinct dates, accurate to the day; the other 500,000 records have 5,000 distinct dates, accurate to the second.

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi > '2004-1-1' ORDER BY fariqi

Time: 6390 ms

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi < '2004-1-1' ORDER BY fariqi

Time: 6453 ms

(5) Other considerations

"Water can carry a boat, and it can also capsize it"; the same is true of indexes. An index helps retrieval performance, but too many or improper indexes make the system inefficient, because every index the user adds to a table gives the database more work to do. Excessive indexing can even lead to index fragmentation.

So we need to build an "appropriate" index system. The clustered index in particular deserves careful refinement, so that your database achieves high performance.

Of course, in practice, as a conscientious database administrator, you should test several schemes and find out which one is the most efficient and effective.

Second, improve SQL statements

Many people do not know how SQL statements are executed in SQL Server, and they worry that the statements they write will be misunderstood by SQL Server. For example:

SELECT * FROM table1 WHERE name = 'zhangsan' AND tid > 10000

and:

SELECT * FROM table1 WHERE tid > 10000 AND name = 'zhangsan'

Some people do not know whether these two statements execute with the same efficiency, because read literally they do look different. If tid is the clustered index column, the second statement only has to search the records after the first 10,000 rows of the table, whereas the first statement has to find every name = 'zhangsan' in the whole table and then filter the results by the condition tid > 10000.

In fact, such concerns are unnecessary. SQL Server has a "query optimizer" that evaluates the search conditions in the WHERE clause and determines which index can reduce the search space of a table scan. In other words, it optimizes automatically.
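One way to check this for yourself is to compare the execution plans of the two statements. The sketch below uses SET SHOWPLAN_TEXT, which makes SQL Server return the estimated plan instead of executing the query; table1 is the illustrative table from the text, so substitute a real table of your own.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON
GO

-- The two statements differ only in the order of the WHERE conditions.
SELECT * FROM table1 WHERE name = 'zhangsan' AND tid > 10000
GO
SELECT * FROM table1 WHERE tid > 10000 AND name = 'zhangsan'
GO

SET SHOWPLAN_TEXT OFF
GO
```

If the optimizer treats the two forms alike, the plans returned for both statements will be the same.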

Although the query optimizer can optimize automatically based on the WHERE clause, you still need to understand how it works. Otherwise, the query optimizer will sometimes not run the query the way you intended.

In the query analysis phase, the query optimizer looks at each part of the query and decides whether it helps limit the amount of data that has to be scanned. If a part can be used as a search argument (SARG), it is called optimizable, and the required data can be obtained quickly using an index.

Definition of SARG: an operation used to restrict a search, because it usually refers to a specific match, a match within a range of values, or an AND of two or more conditions. It takes the following form:

column name operator <constant or variable>

or

<constant or variable> operator column name

The column name appears on one side of the operator, and the constant or variable on the other side. For example:

Name = 'Zhang San'

Price > 5000

5000 < Price

Name = 'Zhang San' AND Price > 5000

If an expression does not satisfy the SARG form, it cannot limit the range of the search; that is, SQL Server must check every row to see whether it meets all the conditions in the WHERE clause. So an index is useless for expressions that do not satisfy the SARG form.

Having introduced SARGs, let us summarize our experience of using them, including cases where practice contradicts the conclusions of some reference material:

1. Whether a LIKE expression is a SARG depends on where the wildcard is placed

For example, Name LIKE 'Zhang%' is a SARG,

whereas Name LIKE '%Zhang' is not.

The reason is that a leading % wildcard leaves the start of the string open, so the index cannot be used.

2. OR causes a full table scan

Name = 'Zhang San' AND Price > 5000 is a SARG, whereas Name = 'Zhang San' OR Price > 5000 is not. Using OR causes a full table scan.

3. Negation operators and functions produce expressions that do not satisfy the SARG form

The most typical expressions that do not satisfy the SARG form are those containing negation operators, such as NOT, !=, <>, !<, !>, NOT EXISTS, NOT IN, and NOT LIKE. Functions applied to the column are another case. Here are a few examples that do not satisfy the SARG form:

ABS(Price) < 5000

Name LIKE '%San'

Some expressions, such as:

WHERE Price * 2 > 5000

are also treated as SARGs by SQL Server, which converts them to:

WHERE Price > 5000 / 2

But we do not recommend relying on this, because SQL Server cannot always guarantee that the conversion is completely equivalent to the original expression.
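A safer habit is to write the condition in SARG form yourself, leaving the column bare on one side of the operator. The statements below are a sketch; table1 and its price column are illustrative names, not objects from the example database.

```sql
-- Function applied to the column: not in SARG form.
SELECT * FROM table1 WHERE ABS(price) < 5000

-- Same condition with the bare column on one side of each operator.
SELECT * FROM table1 WHERE price > -5000 AND price < 5000

-- Arithmetic applied to the column: relies on the optimizer to convert it.
SELECT * FROM table1 WHERE price * 2 > 5000

-- Arithmetic moved to the constant side, written out explicitly.
SELECT * FROM table1 WHERE price > 5000 / 2
```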

4. IN is equivalent to OR

The statement:

SELECT * FROM table1 WHERE tid IN (2, 3)

and

SELECT * FROM table1 WHERE tid = 2 OR tid = 3

are the same: both cause a full table scan, and if there is an index on tid, the index is not used.

5. Use NOT as little as possible

6. EXISTS and IN have the same execution efficiency

Many sources say that EXISTS is more efficient than IN, and that NOT EXISTS should replace NOT IN wherever possible. But in my tests, with or without NOT, the two performed the same. Because a subquery is involved, this time we test with the pubs database that ships with SQL Server. Before running, we can turn on SQL Server's STATISTICS IO.

(1) SELECT title, price FROM titles WHERE title_id IN (SELECT title_id FROM sales WHERE qty > 30)

The result of this statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

(2) SELECT title, price FROM titles WHERE EXISTS (SELECT * FROM sales WHERE sales.title_id = titles.title_id AND qty > 30)

The result of the second statement is:

Table 'sales'. Scan count 18, logical reads 56, physical reads 0, read-ahead reads 0.

Table 'titles'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.

From this we can see that EXISTS and IN execute with the same efficiency.
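The full test session can be sketched as follows, assuming the pubs sample database is installed. SET STATISTICS IO ON makes SQL Server report scan counts and page reads per table for every statement that follows in the session.

```sql
USE pubs
GO

-- Report scan counts and logical/physical/read-ahead reads for each statement.
SET STATISTICS IO ON
GO

SELECT title, price FROM titles
WHERE title_id IN (SELECT title_id FROM sales WHERE qty > 30)
GO

SELECT title, price FROM titles
WHERE EXISTS (SELECT * FROM sales
              WHERE sales.title_id = titles.title_id AND qty > 30)
GO

SET STATISTICS IO OFF
GO
```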

7. The function CHARINDEX() is as efficient as LIKE with a leading wildcard

Earlier we said that a wildcard at the front of a LIKE pattern causes a full table scan, so it performs poorly. Some sources claim that replacing LIKE with the function CHARINDEX() greatly improves speed. After testing, I found that this claim is also wrong:

SELECT Gid, title, fariqi, reader FROM TGongwen WHERE CHARINDEX('Criminal Investigation Detachment', reader) > 0 AND fariqi > '2004-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.

SELECT Gid, title, fariqi, reader FROM TGongwen WHERE reader LIKE '%' + 'Criminal Investigation Detachment' + '%' AND fariqi > '2004-5-5'

Time: 7 seconds. Scan count 4, logical reads 7155, physical reads 0, read-ahead reads 0.

8. UNION is not always more efficient than OR

We have already said that using OR in the WHERE clause causes a full table scan. The sources I have seen generally recommend replacing OR with UNION. It turns out that this advice holds in most cases.

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16' OR Gid > 9990000

Time: 68 seconds. Scan count 1, logical reads 404008, physical reads 283, read-ahead reads 392163.

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16'

UNION

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE Gid > 9990000

Time: 9 seconds. Scan count 8, logical reads 67489, physical reads 216, read-ahead reads 7499.

It seems that UNION is usually much more efficient than OR.

But after further testing, I found that if the columns on both sides of the OR are the same, UNION is considerably slower than OR, even though the UNION scans an index and the OR scans the whole table.

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16' OR fariqi = '2004-2-5'

Time: 6423 ms. Scan count 2, logical reads 14726, physical reads 1, read-ahead reads 7176.

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-9-16'

UNION

SELECT Gid, fariqi, neibuyonghu, reader, title FROM TGongwen WHERE fariqi = '2004-2-5'

Time: 11640 ms. Scan count 8, logical reads 14806, physical reads 108, read-ahead reads 1144.

9. Extract only the fields you need ("as much as needed, and no more"); avoid "SELECT *"

Let's run a test:

SELECT TOP 10000 Gid, fariqi, reader, title FROM TGongwen ORDER BY Gid DESC

Time: 4673 ms

SELECT TOP 10000 Gid, fariqi, title FROM TGongwen ORDER BY Gid DESC

Time: 1376 ms

SELECT TOP 10000 Gid, fariqi FROM TGongwen ORDER BY Gid DESC

Time: 80 ms

From this we can see that each field we drop from the select list speeds up data extraction correspondingly. How much it speeds up depends on the size of the field dropped.

10. count(*) is not slower than count(field)

Some sources say that using * counts all columns and is obviously less efficient than naming a single column. This claim is unfounded. Let's see:

SELECT COUNT(*) FROM TGongwen

Time: 1500 ms

SELECT COUNT(Gid) FROM TGongwen

Time: 1483 ms

SELECT COUNT(fariqi) FROM TGongwen

Time: 3140 ms

SELECT COUNT(title) FROM TGongwen

Time: 52050 ms

As can be seen above, count(*) and count(primary key) run at about the same speed, while count(*) is faster than counting any field other than the primary key; and the longer the field, the slower the count. I think that with count(*), SQL Server may automatically find the smallest field to count. Of course, writing count(primary key) directly is more straightforward.

11. ORDER BY is most efficient when sorting by the clustered index column

Let's see (Gid is the primary key, fariqi is the clustered index column):

SELECT TOP 1000 Gid, fariqi, reader, title FROM TGongwen

Time: 196 ms. Scan count 1, logical reads 289, physical reads 1, read-ahead reads 1527.

SELECT TOP 10000 Gid, fariqi, reader, title FROM TGongwen ORDER BY Gid ASC

Time: 4720 ms. Scan count 1, logical reads 41956, physical reads 0, read-ahead reads 1287.

SELECT TOP 10000 Gid, fariqi, reader, title FROM TGongwen ORDER BY Gid DESC

Time: 4736 ms. Scan count 1, logical reads 55350, physical reads 10, read-ahead reads 775.

SELECT TOP 10000 Gid, fariqi, reader, title FROM TGongwen ORDER BY fariqi ASC

Time: 173 ms. Scan count 1, logical reads 290, physical reads 0, read-ahead reads 0.

SELECT TOP 10000 Gid, fariqi, reader, title FROM TGongwen ORDER BY fariqi DESC

Time: 156 ms. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.

From the above we can see that the speed and the number of logical reads without sorting are about the same as with "ORDER BY clustered index column", and both are much faster than "ORDER BY nonclustered index column".

Also, when sorting by a given field, ascending and descending order run at basically the same speed.

12. Efficient TOP

In fact, when querying and extracting a very large data set, the biggest factor affecting database response time is not the data lookup but the physical I/O. For example:

SELECT TOP 10 * FROM (

SELECT TOP 10000 Gid, fariqi, title FROM TGongwen

WHERE neibuyonghu = 'office'

ORDER BY Gid DESC) AS a

ORDER BY Gid ASC

In theory, the whole statement should take longer than the subquery alone, but the opposite is true. The subquery returns 10,000 records, while the whole statement returns only 10, so the dominant factor in database response time is physical I/O. One of the most effective ways to limit physical I/O here is the TOP keyword. TOP is the keyword SQL Server uses to extract the first few rows or the first few percent of the rows. In practice, I have found TOP convenient and efficient. Oracle, another large database, does not have this keyword, which is something of a pity, although other methods (such as ROWNUM) can be used there instead. We will use TOP in the later discussion of a paging stored procedure for tens of millions of records.

So far, we have discussed how to quickly query the data you need from a large database. Of course, the methods introduced here are all "soft" methods. In practice, we also have to consider various "hard" factors, such as network performance, server performance, operating system performance, and even network cards and switches.

Third, universal paging display stored procedures for realizing small data volume and massive data

Paging is an essential feature of any web application, and it is a very common problem in database processing. The classic approach is ADO recordset paging, which uses ADO's built-in paging feature (implemented with a cursor). However, this approach suits only small data volumes, because cursors have inherent drawbacks: a cursor is held in memory and consumes a great deal of it, and once a cursor is opened, the related records are locked until the cursor is closed. A cursor provides a way to scan a result set row by row; it is typically used to traverse the data and perform different operations depending on the values found. Looping over a cursor defined on multiple tables or on large tables (a large data set) can easily put the program into a long wait or even crash it.

More importantly, for a very large data model, loading the entire data source on every page retrieval, as the traditional method does, is very wasteful. The popular approach today is to retrieve only a block of data the size of one page, rather than retrieving all the data and then stepping to the current row. The earliest well-known method that extracts data by page size and page number is probably the "Russian stored procedure". It uses a cursor, and because of the cursor's limitations it has not won general acceptance.

Later, someone online reworked that stored procedure. The following paging stored procedure is a version written for our office automation example:

CREATE PROCEDURE Paging1

(@PageSize int, -- Page size, such as 20 records per page

@PageIndex int -- Current page number

)

AS

SET NOCOUNT ON

BEGIN

DECLARE @indextable TABLE (id int IDENTITY(1, 1), nid int) -- Define the table variable

DECLARE @PageLowerBound int -- Lower bound (row number) of this page

DECLARE @PageUpperBound int -- Upper bound (row number) of this page

SET @PageLowerBound = (@PageIndex - 1) * @PageSize

SET @PageUpperBound = @PageLowerBound + @PageSize

SET ROWCOUNT @PageUpperBound

INSERT INTO @indextable (nid) SELECT GID FROM Tgongwen WHERE Fariqi > DATEADD(day, -365, GETDATE()) ORDER BY Fariqi DESC

SELECT O.GID, O.MID, O.Title, O.FadanWei, O.Fariqi FROM Tgongwen O, @indextable t WHERE O.GID = t.nid

AND t.id > @PageLowerBound AND t.id <= @PageUpperBound ORDER BY t.id

END

SET NOCOUNT OFF

The above stored procedure uses a recent SQL Server feature: table variables. It is fair to say that this is also a very good paging stored procedure. Of course, you could write the table variable as a temporary table instead: CREATE TABLE #Temp. But in SQL Server, using a temporary table is clearly not as fast as using a table variable. When the author first started using this stored procedure, it felt very good, and it was faster than the original ADO paging. Later, however, the author found a better method.
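The row-numbering idea behind Paging1 can be tried out on a small scale. The following is a minimal sketch, not the author's code: it assumes Python's built-in sqlite3 module as a stand-in for SQL Server, a temporary table in place of the table variable, LIMIT in place of SET ROWCOUNT, and a hypothetical 50-row tgongwen table whose dates increase with gid.

```python
import sqlite3

# Hypothetical sample data: 50 documents whose release dates increase with gid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgongwen (gid INTEGER PRIMARY KEY, fariqi TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO tgongwen VALUES (?, ?, ?)",
    [(g, f"2004-{1 + g // 28:02d}-{1 + g % 28:02d}", f"doc {g}") for g in range(1, 51)],
)


def page_via_index_table(conn, page_size, page_index):
    """Return one page of (gid, title) rows, newest release date first."""
    lower = (page_index - 1) * page_size   # row number just before this page
    upper = lower + page_size              # last row number on this page
    cur = conn.cursor()
    # Stand-in for the table variable: id numbers the keys in sort order.
    cur.execute("CREATE TEMP TABLE indextable (id INTEGER PRIMARY KEY, nid INTEGER)")
    # LIMIT plays the role of SET ROWCOUNT @PageUpperBound.
    cur.execute(
        "INSERT INTO indextable (nid) "
        "SELECT gid FROM tgongwen ORDER BY fariqi DESC LIMIT ?",
        (upper,),
    )
    rows = cur.execute(
        "SELECT o.gid, o.title FROM tgongwen o JOIN indextable t ON o.gid = t.nid "
        "WHERE t.id > ? AND t.id <= ? ORDER BY t.id",
        (lower, upper),
    ).fetchall()
    cur.execute("DROP TABLE indextable")
    return rows


print([gid for gid, _ in page_via_index_table(conn, 10, 2)])
```

Note that the index table must hold every key up to the end of the requested page, so the work grows with the page number. This is consistent with the slowdown on deep pages that the author reports for this scheme.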

The author once read a short article online titled "A method for retrieving records n through m from a table". Its full text is as follows:

To retrieve records n through m from the publish table:

SELECT TOP m-n+1 * FROM publish WHERE (id NOT IN (SELECT TOP n-1 id FROM publish))

id is the key of the publish table

When I first saw this article, I was genuinely excited and thought the idea was very good. Later, while working on the office automation system (ASP.NET + C# + SQL Server), I suddenly remembered it and thought that, with some rework, this statement could become a very good paging stored procedure. So I searched all over the web for the article. I never found it, but I did find a paging stored procedure based on this statement, which is also one of the more popular paging stored procedures today. I regret not having turned the statement into a stored procedure first:

CREATE PROCEDURE Paging2

(

@SQL nvarchar(4000), -- SQL statement without a sort clause

@Page int, -- Page number

@RecsPerPage int, -- Number of records per page

@ID varchar(255), -- Non-repeating ID column

@Sort varchar(255) -- Sort column and direction

)

AS

DECLARE @Str nvarchar(4000)

SET @Str = 'SELECT TOP ' + CAST(@RecsPerPage AS varchar(20)) + ' * FROM (' + @SQL + ') T WHERE T.' + @ID + ' NOT IN (SELECT TOP ' + CAST((@RecsPerPage * (@Page - 1)) AS varchar(20)) + ' ' + @ID + ' FROM (' + @SQL + ') T9 ORDER BY ' + @Sort + ') ORDER BY ' + @Sort

PRINT @Str

EXEC sp_executesql @Str

GO

In fact, the above statement can be simplified to:

SELECT TOP page size *

FROM table1

WHERE (ID NOT IN

(SELECT TOP page size * (page number - 1) ID

FROM table1

ORDER BY ID))

ORDER BY ID

But this stored procedure has a fatal drawback: it contains NOT IN. It can be rewritten as:

SELECT TOP page size *

FROM table1 AS a

WHERE NOT EXISTS

(SELECT * FROM (SELECT TOP (page size * (page number - 1)) * FROM table1 ORDER BY ID) b WHERE b.ID = a.ID)

ORDER BY ID

That is, NOT EXISTS replaces NOT IN. But as we discussed earlier, there is actually no difference in execution efficiency between the two.

Even so, this method of combining TOP with NOT IN is still faster than using a cursor.
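To make the two variants concrete, here is a small sketch under stated assumptions: Python's sqlite3 stands in for SQL Server, LIMIT stands in for TOP, and table1 is a hypothetical 100-row table. It only illustrates that the NOT IN and NOT EXISTS forms select the same page; it says nothing about their relative speed on SQL Server.

```python
import sqlite3

# Hypothetical table with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)", [(i, f"doc {i}") for i in range(1, 101)]
)


def page_not_in(conn, page_size, page_number):
    """Exclude the ids of all earlier pages with NOT IN, then take one page."""
    skip = page_size * (page_number - 1)
    sql = (
        "SELECT id FROM table1 WHERE id NOT IN "
        "(SELECT id FROM table1 ORDER BY id LIMIT ?) "
        "ORDER BY id LIMIT ?"
    )
    return [row[0] for row in conn.execute(sql, (skip, page_size))]


def page_not_exists(conn, page_size, page_number):
    """Same page, with NOT EXISTS against the rows of the earlier pages."""
    skip = page_size * (page_number - 1)
    sql = (
        "SELECT a.id FROM table1 AS a WHERE NOT EXISTS "
        "(SELECT 1 FROM (SELECT id FROM table1 ORDER BY id LIMIT ?) AS b "
        "WHERE b.id = a.id) "
        "ORDER BY a.id LIMIT ?"
    )
    return [row[0] for row in conn.execute(sql, (skip, page_size))]


print(page_not_in(conn, 10, 3))
print(page_not_exists(conn, 10, 3))
```

Both functions still have to materialize every id that precedes the requested page, which is why the cost rises with the page number regardless of which operator is used.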

Although NOT EXISTS cannot rescue the efficiency of the previous stored procedure, using the TOP keyword in SQL Server is a very wise choice. The ultimate goal of paging optimization is to avoid producing oversized record sets, and as mentioned earlier, TOP lets us control the amount of data returned.

In a paging algorithm, two key factors affect query speed: TOP and NOT IN. TOP speeds up the query, while NOT IN slows it down. To speed up the whole paging algorithm, we must therefore eliminate NOT IN completely and replace it with something else. We know that for almost any field we can extract its maximum or minimum value with MAX(field) or MIN(field). So if the field contains no duplicate values, we can use its MAX or MIN as a watershed, a reference point that separates one page from the next. We can then use the operator ">" or "<" to do the job, which keeps the query in SARG form. For example:

SELECT TOP 10 * FROM table1 WHERE ID > 200

This leads to the following paging scheme:

SELECT TOP page size *

FROM table1

WHERE ID >

(SELECT MAX(ID) FROM

(SELECT TOP ((page number - 1) * page size) ID FROM table1 ORDER BY ID) AS T

)

ORDER BY ID
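The scheme above can be exercised with a short sketch. It assumes Python's sqlite3 as a stand-in for SQL Server, LIMIT in place of TOP, and a hypothetical table1 whose ids have gaps (1, 4, 7, ...), to show that the page boundary comes from MAX(id) rather than from arithmetic on the ids. Page 1 is handled separately because MAX over zero rows is NULL and "id > NULL" matches nothing.

```python
import sqlite3

# Hypothetical table with 100 non-contiguous ids: 1, 4, 7, ..., 298.
ids = list(range(1, 301, 3))
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(i, f"doc {i}") for i in ids])


def page_max_id(conn, page_size, page_number):
    """Return the ids on one page, using MAX(id) of the skipped rows as the watershed."""
    if page_number == 1:
        # No earlier rows to skip, so no watershed is needed.
        sql = "SELECT id FROM table1 ORDER BY id LIMIT ?"
        params = (page_size,)
    else:
        sql = (
            "SELECT id FROM table1 WHERE id > "
            "(SELECT MAX(id) FROM "
            "(SELECT id FROM table1 ORDER BY id LIMIT ?) AS t) "
            "ORDER BY id LIMIT ?"
        )
        params = ((page_number - 1) * page_size, page_size)
    return [row[0] for row in conn.execute(sql, params)]


print(page_max_id(conn, 10, 3))
```

The outer query is a simple range comparison on the key, so an index on id can be used to seek straight to the start of the page.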

When choosing a column that has no duplicate values and whose values are easy to compare, we usually pick the primary key. The following table shows, for the office automation table with 10 million rows, the execution times of the three paging schemes above when sorting on GID (the primary key, but not the clustered index), extracting the GID, Fariqi, and Title fields, and requesting pages 1, 10, 100, 500, 1,000, 10,000, 100,000, 250,000, and 500,000 (unit: milliseconds):

Page        Scheme 1    Scheme 2    Scheme 3
1           60          30          76
10          46          16          63
100         1076        720         130
500         (row missing in source)
1,000       17110       470         250
10,000      24796       4500        140
100,000     38326       42283       1553
250,000     28140       128720      2330
500,000     (garbled in source: "50021686")    127846    7168

From the table above, we can see that all three stored procedures are reliable and fast when paging below page 100. But the first scheme slows down once paging passes page 1,000. The second scheme begins to slow down at around page 10,000. The third scheme never degrades much and keeps plenty of stamina on deep pages.

Having settled on the third paging scheme, we can write a stored procedure around it. As everyone knows, SQL Server stored procedures are precompiled, and they execute more efficiently than SQL statements sent from a web page. The following stored procedure not only implements the paging scheme but also decides, based on a parameter passed from the page, whether to count the total number of rows.

-- Get the data of the specified page

CREATE PROCEDURE Pagination3

@tblName varchar(255), -- Table name

@strGetFields varchar(1000) = '*', -- Columns to return

@fldName varchar(255) = '', -- Sort field name

@PageSize int = 10, -- Page size

@PageIndex int = 1, -- Page number

@doCount bit = 0, -- Return the total number of records if nonzero

@OrderType bit = 0, -- Sort type: descending if nonzero

@strWhere varchar(1500) = '' -- Query condition (note: do not include WHERE)

AS

DECLARE @strSQL varchar(5000) -- Main statement

DECLARE @strTmp varchar(110) -- Temporary variable

DECLARE @strOrder varchar(400) -- Sort clause

IF @doCount != 0

BEGIN

IF @strWhere != ''

SET @strSQL = 'SELECT COUNT(*) AS Total FROM [' + @tblName + '] WHERE ' + @strWhere

ELSE

SET @strSQL = 'SELECT COUNT(*) AS Total FROM [' + @tblName + ']'

END

-- The code above means: if @doCount is not 0, count the total number of rows. All of the code below handles the case where @doCount is 0.

ELSE

BEGIN

IF @OrderType != 0

BEGIN

SET @strTmp = '<(SELECT MIN'

SET @strOrder = ' ORDER BY [' + @fldName + '] DESC'

-- If @OrderType is not 0, sort in descending order. This statement is very important!

END

ELSE

BEGIN

SET @strTmp = '>(SELECT MAX'

SET @strOrder = ' ORDER BY [' + @fldName + '] ASC'

END

IF @PageIndex = 1

BEGIN

IF @strWhere != ''

SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM [' + @tblName + '] WHERE ' + @strWhere + ' ' + @strOrder

ELSE

SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM [' + @tblName + '] ' + @strOrder

-- The first page is handled by the code above, which speeds up execution.

END

ELSE

BEGIN

-- The following code assigns @strSQL the SQL statement that is actually executed.

SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM ['

+ @tblName + '] WHERE [' + @fldName + ']' + @strTmp + '([' + @fldName + ']) FROM (SELECT TOP ' + STR((@PageIndex - 1) * @PageSize) + ' [' + @fldName + '] FROM [' + @tblName + ']' + @strOrder + ') AS tblTmp)' + @strOrder

IF @strWhere != ''

SET @strSQL = 'SELECT TOP ' + STR(@PageSize) + ' ' + @strGetFields + ' FROM ['

+ @tblName + '] WHERE [' + @fldName + ']' + @strTmp + '(['

+ @fldName + ']) FROM (SELECT TOP ' + STR((@PageIndex - 1) * @PageSize) + ' ['

+ @fldName + '] FROM [' + @tblName + '] WHERE ' + @strWhere + ' '

+ @strOrder + ') AS tblTmp) AND ' + @strWhere + ' ' + @strOrder

END

END

EXEC (@strSQL)

GO

The above is a general-purpose stored procedure; its comments are included in the code.
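To see what statement the procedure ends up executing for a given set of arguments, the string assembly can be mirrored outside the database. The sketch below is an illustration only, written in Python; the function name build_paging_sql and its parameter names are hypothetical, and it covers the paging branches but not the @doCount branch. It assumes, as the procedure does, that the table name, column names, and WHERE text come from trusted code, since they are concatenated directly into the SQL.

```python
def build_paging_sql(tbl, fld, page_size=10, page_index=1,
                     fields="*", descending=False, where=""):
    """Assemble the T-SQL text that a Pagination3-style procedure would execute."""
    if descending:
        # Descending: the next page starts below the MIN of the skipped rows.
        cmp_agg, order = "<(SELECT MIN", f" ORDER BY [{fld}] DESC"
    else:
        # Ascending: the next page starts above the MAX of the skipped rows.
        cmp_agg, order = ">(SELECT MAX", f" ORDER BY [{fld}] ASC"
    where_clause = f" WHERE {where}" if where else ""

    if page_index == 1:
        # The first page needs no watershed subquery.
        return f"SELECT TOP {page_size} {fields} FROM [{tbl}]{where_clause}{order}"

    skip = (page_index - 1) * page_size
    inner = f"SELECT TOP {skip} [{fld}] FROM [{tbl}]{where_clause}{order}"
    extra = f" AND {where}" if where else ""
    return (
        f"SELECT TOP {page_size} {fields} FROM [{tbl}] "
        f"WHERE [{fld}]{cmp_agg}([{fld}]) FROM ({inner}) AS tblTmp){extra}{order}"
    )


print(build_paging_sql("Tgongwen", "gid", page_index=3, fields="gid, title"))
print(build_paging_sql("Tgongwen", "gid", descending=True,
                       where="neibuyonghu = 'office'"))
```

Printing the generated text before running it, as Paging2 does with PRINT, is a convenient way to check quoting and spacing in dynamically built statements.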

With large data volumes, especially when querying the last few pages, the query time generally does not exceed 9 seconds, whereas the other stored procedures take considerably longer in practice. This stored procedure is therefore well suited to querying large-capacity databases.

The author hopes that this analysis of the stored procedures above offers some inspiration and improves efficiency in your work, and also hopes that peers will propose even better real-time data paging algorithms.

Fourth, the importance of the clustered index and how to choose one

In the title of the previous section, the author wrote: "a universal paging stored procedure for small data volumes and massive data". This is because, when applying the third stored procedure to the office automation system in practice, the author observed the following with small data volumes:

1. Paging speed generally stays between 1 and 3 seconds.

2. Querying the last page generally takes 5 to 8 seconds, whether the total number of pages is only 3 or as many as 300,000.

Although this paging implementation is very fast with very large data volumes, on the first few pages this 1 to 3 seconds is slower than even the first, unoptimized paging method. In the words of one user, it is "not even as fast as an Access database". That impression is enough to make users abandon the system you developed.

The author analyzed this and found the crux of the phenomenon, which is simple but important: the sort field is not the clustered index! The title of this article is "Query Optimization and Paging Algorithm Scheme". The author put the two loosely related topics of "query optimization" and "paging algorithm" together because both depend on one very important thing: the clustered index.

As mentioned in the earlier discussion, the clustered index has two major advantages:

1. It narrows the query range faster than anything else.

2. It sorts a field faster than anything else.

The first is mostly used in query optimization, while the second is mostly used for sorting data when paging.

Each table can have only one clustered index, which makes it all the more important. The choice of clustered index can be said to be the most critical factor in achieving both "query optimization" and "efficient paging".

However, it is often hard to make the clustered index column satisfy both the needs of the query columns and the needs of the sort column.

In the earlier discussion of indexes, the author used Fariqi, the date on which the user released the document, as the leading column of the clustered index, with the date precise to the day. The advantage of this approach was described earlier: for quick queries over a time range, it has a great advantage over using the ID primary key column.

However, when paging, this clustered index column contains duplicate values, so MAX or MIN cannot be used as the paging reference, and the most efficient sort cannot be achieved. And if the ID primary key column is used as the clustered index, the clustered index serves no purpose other than sorting, which wastes this valuable resource.

To resolve this conflict, the author later added a date column whose default value is getdate(). When a user writes a record, this column automatically records the current time, precise to the millisecond. Even so, to rule out the unlikely chance of a collision, a UNIQUE constraint is created on this column. This date column is then used as the clustered index column.

With this time-based clustered index column, users can query the records inserted during a given time period, and the column can also serve as a unique column for MAX or MIN, becoming the reference for the paging algorithm.

After this optimization, the author found that whether the data volume is large or small, paging generally takes tens of milliseconds, sometimes even 0 milliseconds. And queries that narrow the range by date are no slower than before.
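The dual role of such a column can be sketched on a small scale. The example below is an approximation under stated assumptions, not the author's schema: it uses Python's sqlite3, a hypothetical docs table, and SQLite's WITHOUT ROWID option, which stores rows ordered by the primary key and so behaves roughly like a unique clustered index. Timestamps are generated by the script at millisecond precision rather than by a getdate() default.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
# The unique timestamp is the primary key; WITHOUT ROWID keeps rows in key order.
conn.execute(
    "CREATE TABLE docs (created_at TEXT PRIMARY KEY, title TEXT) WITHOUT ROWID"
)
base = datetime(2004, 2, 5, 8, 0, 0)
stamps = [
    (base + timedelta(milliseconds=37 * i)).isoformat(timespec="milliseconds")
    for i in range(100)
]
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)", [(s, f"doc {i}") for i, s in enumerate(stamps)]
)


def titles_between(conn, start, end):
    """Range query on the timestamp key: start inclusive, end exclusive."""
    sql = ("SELECT title FROM docs WHERE created_at >= ? AND created_at < ? "
           "ORDER BY created_at")
    return [row[0] for row in conn.execute(sql, (start, end))]


def page_titles(conn, page_size, page_number):
    """Page on the same key, using MAX(created_at) of the skipped rows."""
    if page_number == 1:
        sql, params = "SELECT title FROM docs ORDER BY created_at LIMIT ?", (page_size,)
    else:
        sql = (
            "SELECT title FROM docs WHERE created_at > "
            "(SELECT MAX(created_at) FROM "
            "(SELECT created_at FROM docs ORDER BY created_at LIMIT ?) AS t) "
            "ORDER BY created_at LIMIT ?"
        )
        params = ((page_number - 1) * page_size, page_size)
    return [row[0] for row in conn.execute(sql, params)]


print(titles_between(conn, stamps[10], stamps[13]))
print(page_titles(conn, 10, 4))
```

Because the key is unique, the MAX-based watershed never splits a group of equal values across pages, which is the property the day-precision Fariqi column lacked.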

The clustered index is so important and so precious that the author's conclusion is this: be sure to build the clustered index on

1. the field you most frequently use to narrow the query range;

2. the field you most frequently need to sort on.

Conclusion:

This article distills the author's recent experience with databases, accumulated through practical work on the "office automation" system. I hope it not only helps you in your work but also shows you a method of analyzing problems. Most importantly, I hope it serves as a modest spur that draws out better ideas, arouses your interest in studying and discussing these topics, and helps us all do our best together for the cause of strengthening the police through science and technology and for the Golden Shield Project.

Finally, it should be noted that during testing I found that, for queries over large data volumes, the factor that most affects database speed is not memory size but the CPU. When testing on my P4 2.4 machine and watching the system resource monitor, the CPU often stayed at 100%, while memory usage changed little or not at all. Even on our HP ML 350 G3 server, CPU usage peaked at 90% and generally stayed around 70%.

The test data in this article all come from our HP ML 350 server. Server configuration: dual Intel Xeon Hyper-Threading CPUs at 2.4 GHz, 1 GB of memory, Windows Server 2003 Enterprise Edition, SQL Server 2000 SP3. You can visit the following public security network site or Internet site to try our "Trianda" database office automation system (ASP.NET, C#).

http://10.59.121.11:90

http://www.xx110.net/oa

Microsoft MCSE System Engineer

Microsoft MCDBA Database Engineer

Xinxiang City Public Security Bureau Communication Party Yulong

