Application Summary: Database Technology in an Official Document Management System
The database is a core part of today's application software systems, and how to make database-based application systems secure, reliable, and efficient has long been a research topic in software development. This article discusses the issue in the context of our document management system product, covering the selection of the database, database design, query optimization, and security control.
Key words: Web server, DBMS, query optimization, security.
With the development of information technology, office automation in administrative organs is being rolled out further, and electronic document management has become a strategic topic for government agencies. However, current document management technology and standards are not yet mature. To further advance the construction of government informationization, we must study and develop Internet and intranet document management systems suited to the new era, so as to improve the efficiency of document handling and enhance government performance.
Against this market opportunity, I, as a technical backbone of my company, led the development of our official document management system product, the DOCMAN document management system (hereinafter DOCMAN), and participated in product planning, requirements analysis, design, and part of the coding.
Below I describe the database-related issues encountered during development and the strategies we adopted to address them.
I. System Platform and Database Management System Selection
The DOCMAN document management system serves government agencies and is a main component of e-government, so DOCMAN and the other e-government subsystems face problems of cross-platform deployment, distribution, heterogeneity, and integration with agencies' existing application systems. To meet these varied application needs, our DOCMAN document management system uses a multi-tier B/S architecture (client browser layer, web server layer, application server layer, database layer) and adopts J2EE and EJB technology to achieve distribution, heterogeneity, and cross-platform operation. To support diverse environments, we try to support the popular operating systems (the Win32 series, UNIX series, and Linux series), web servers (Tomcat 4.0, IBM WebSphere 4.0, BEA WebLogic 5.0), and database management systems (Oracle, SQL Server, Sybase, Informix, DB2, etc.). For large agencies we adopted proxy servers, multiple parallel web servers, and multiple application servers to implement load balancing and traffic management; but because distributed database applications are not yet mature, we use centralized database technology to store an agency's data.
II. Database Design
1. Cross-DBMS design of the product
Because the system must support many different databases, cross-DBMS design was one of the key issues in the physical design of the system database. To solve this problem we first compared the various DBMSs and then mainly adopted the following policies.
1) Selection of database field types. To suit the various databases, we use only three field types: long integer, string, and binary, and we do not use auto-increment field types. Boolean values are replaced by strings, and date types are replaced by long integers: if accuracy to the day is sufficient, a single 8-digit integer field is used, e.g. 20020203; if finer accuracy is needed, two 8-digit integer fields are used, the first holding the date in the same structure and the second holding the time of day on a 24-hour clock accurate to 0.01 second, e.g. 21533203 (a sketch of these conventions follows point 4).
2) Selection of DDL and DML. Since the system itself creates all databases, tables, views, indexes, and so on during initialization, we can simply use different DDL for different databases, so DDL usage is not restricted. For DML, because the various database vendors do not strictly follow the SQL-92 standard, there are considerable interoperability problems between the different database management systems (DBMSs). We therefore stipulate that database operations use only standard, simple, generic SQL statements wherever possible; if a complex query is really needed, switch/case statements in the application code issue different query statements for different DBMSs (an illustration appears at the end of this subsection).
3) Use of database functions. For database functions, we use only those that are standard and common to all the supported DBMSs, and avoid non-standard ones.
4) Stored procedures. Since the stored-procedure mechanisms of the various DBMSs differ, we do not use stored procedures.
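A minimal sketch of the field-type conventions in point 1, using a hypothetical table rather than the actual DOCMAN schema; the exact binary type keyword still varies slightly between DBMSs:
CREATE TABLE DOC_EXAMPLE (
    DOC_ID      NUMERIC(10) NOT NULL,  -- long integer key assigned by the application (no auto-increment)
    DOC_TITLE   VARCHAR(200),          -- string
    IS_URGENT   VARCHAR(1),            -- boolean replaced by a string flag, e.g. 'Y' / 'N'
    CREATE_DATE NUMERIC(8),            -- date as an 8-digit integer, e.g. 20020203
    CREATE_TIME NUMERIC(8),            -- time of day, 24-hour clock to 0.01 s, e.g. 21533203
    DOC_BODY    BLOB,                  -- binary content; the keyword differs per DBMS (IMAGE, BLOB, ...)
    PRIMARY KEY (DOC_ID)
);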
Under these provisions the database system may appear underused, serving only as data storage, but this design also happens to reduce the load on our centralized database.
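As an illustration of point 2, even a query as simple as "the 10 most recent documents" needs different SQL on different DBMSs, which is why the application branches on the DBMS type (a sketch, reusing the hypothetical DOC_EXAMPLE table above):
-- SQL Server branch
SELECT TOP 10 DOC_ID, DOC_TITLE
FROM DOC_EXAMPLE
ORDER BY CREATE_DATE DESC;
-- Oracle branch (ROWNUM applied to an ordered subquery)
SELECT DOC_ID, DOC_TITLE
FROM (SELECT DOC_ID, DOC_TITLE FROM DOC_EXAMPLE ORDER BY CREATE_DATE DESC)
WHERE ROWNUM <= 10;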
2. Design to reduce the load on the database management system
Since our system uses centralized database technology, centralized data storage will become the main bottleneck of the system when it is deployed in large agencies. We must therefore try to reduce the load on the database as much as possible, and we mainly use the following policies to do so.
1) Moving application computation. Since our system can use multiple web servers and multiple application servers, we move system logic to the application layer wherever possible, rather than pushing it down to the database layer.
2) In fact, the other database design measures described in the following subsections also help to mitigate the DBMS load.
3. Normalization versus denormalization
After a database is normalized, data redundancy is reduced, the amount of stored data shrinks, and rows become narrower. Each DBMS page can then hold more rows, each I/O brings in more data, table scans are faster, and the query performance of a single table improves. However, when a query involves several tables, many join operations are needed to combine information from the tables, which leads to high CPU and I/O overhead. Much of the time we therefore have to maintain a balance between normalization and denormalization, using appropriate redundancy to reduce system overhead and trading space cost for time cost.
For example, an order detail table ORDERDETAIL records category, consignee, payer, item, price policy, and customer information, which are stored in the consignee, payer, item, price policy, and customer tables respectively. If ORDERDETAIL is queried in its fully normalized form, such a query must join all these tables or use nested queries; when ORDERDETAIL holds millions of rows, a single query may take hours. In fact, much of this information can simply be stored redundantly in ORDERDETAIL at design time, and this redundant data greatly improves query efficiency by reducing CPU and I/O work.
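For instance, a query against the normalized design must join several tables, whereas the denormalized design answers it from a single table; the column names below are illustrative, not the actual ORDERDETAIL schema:
-- Normalized design: customer and item names live in their own tables,
-- so every detail query needs joins.
SELECT D.ORDERNO, C.CUSTOMERNAME, I.ITEMNAME, D.PRICE
FROM ORDERDETAIL D, CUSTOMER C, ITEM I
WHERE D.CUSTOMERID = C.CUSTOMERID
  AND D.ITEMID = I.ITEMID;
-- Denormalized design: CUSTOMERNAME and ITEMNAME are stored redundantly in
-- ORDERDETAIL, so the same information comes from a single-table scan.
SELECT ORDERNO, CUSTOMERNAME, ITEMNAME, PRICE
FROM ORDERDETAIL;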
4. Table partitioning design
When the number of records in a table exceeds a certain size, even the most basic query operations are affected, and the table needs to be divided. It can be divided horizontally by date, separating recent, frequently used data from historical, rarely used data, or by geographic location, department, and so on. There is also vertical division, which splits the attribute columns into several smaller tables, for example putting the currently used attributes in one table and the rarely used ones in another. This speeds up table scans and improves efficiency. For example, DOCMAN's document entity has more than 30 attributes, and operations on this entity are the most frequent in the system; for a municipal government, INSERT operations on the entity can reach 10,000 a day (each receipt or creation of a document produces an insert). We therefore first apply vertical division, putting the commonly used attributes in one table and the rarely used ones in another, and then divide horizontally by time, storing half a year of data per table. For a municipal organ, a document table then never exceeds about 3 million records, which greatly improves operational efficiency.
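A minimal sketch of this combined vertical and horizontal split, with hypothetical table and column names:
-- Vertical split: commonly used document attributes in one table,
-- rarely used attributes in a companion table sharing the same key.
CREATE TABLE DOC_MAIN (
    DOC_ID      NUMERIC(10) NOT NULL,
    DOC_TITLE   VARCHAR(200),
    CREATE_DATE NUMERIC(8),            -- e.g. 20020203
    PRIMARY KEY (DOC_ID)
);
CREATE TABLE DOC_EXT (
    DOC_ID      NUMERIC(10) NOT NULL,  -- same key as DOC_MAIN
    REMARK      VARCHAR(500),          -- rarely used attributes go here
    PRIMARY KEY (DOC_ID)
);
-- Horizontal split by time: one physical table per half year of data,
-- e.g. DOC_MAIN_2002H1, DOC_MAIN_2002H2, each with the same structure
-- as DOC_MAIN, so no single table exceeds a few million rows.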
5. Primary key design
Using integer primary keys improves efficiency: comparing character values costs much more than comparing integers, so character-type primary keys slow down insertion, update, and query. When the data volume is small this slowdown may not be noticeable, but when the data volume is large even a small improvement speeds up the system's response. We therefore use integers as primary keys when designing the database.
6. Index selection considerations
Indexes are important data structures in a database, and their fundamental purpose is to improve query efficiency. Most database products use the ISAM index structure first proposed by IBM. Indexes allow fast, direct, ordered access to data. Although building an index speeds up queries, it slows down data updates, because new data must be written not only to the table but also to its indexes; indexes also require extra disk space and maintenance overhead. Indexes must therefore be used judiciously (a sketch follows the list below):
● Build indexes on attribute columns that are frequently used in joins but are not specified as foreign keys.
● Build indexes on columns that are frequently sorted or grouped (i.e., used in GROUP BY or ORDER BY); sorting or grouping by an indexed column improves efficiency.
● Build indexes on columns that are often used in conditional expressions; do not build indexes on columns with few distinct values.
● If multiple columns are to be sorted together, a composite index can be built on them, i.e., an index composed of several fields.
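A minimal sketch of these guidelines, with hypothetical table and column names rather than the actual DOCMAN schema:
-- Composite index for a column pair that is frequently sorted or grouped together.
CREATE INDEX IDX_DOC_DEPT_DATE ON DOC_MAIN (DEPT_ID, CREATE_DATE);
-- Index on a join column that is not declared as a foreign key.
CREATE INDEX IDX_DETAIL_ORDERNO ON ORDERDETAIL (ORDERNO);
-- No index is built on a low-selectivity flag column such as a 'Y'/'N' IS_URGENT column.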
III. Data Query Optimization
In the system coding phase, the SQL statements written by programmers have a significant impact on the performance and response time of the system. It is hard to expect a poorly written query to perform well even after DBMS optimization, so the quality of the SQL written during coding is critical. We therefore issued the following guidelines so that programmers write SQL statements that are as efficient as possible.
1. Sort
In many cases, sorting of large tables should be simplified or avoided. When an index can be used to produce the output directly in the required order, the sorting step can be avoided. Sorting cannot be omitted in the following cases:
● The index does not cover one or more of the columns to be sorted;
● The order of the columns in the GROUP BY or ORDER BY clause differs from the order of the index columns;
● The sorted columns come from different tables.
To avoid unnecessary sorting, indexes must be built correctly and database tables consolidated reasonably; although this may sometimes affect the normalization of the tables, it is worthwhile for the sake of efficiency. If sorting is unavoidable, it should be simplified, for example by narrowing the range of the sort.
2. Nested query
In SQL, a query block can appear as an operand of a predicate in another query block, so SQL queries can be nested in layers. For example, in a large distributed database system there is an order table ORDER and an order detail table ORDERDETAIL. Suppose the two tables must be queried together:
SELECT CREATEUSER
FROM ORDER
WHERE ORDERNO IN
    (SELECT ORDERNO
     FROM ORDERDETAIL
     WHERE PRICE = 0.5)
This query finds the creators of the orders that contain items with a unit price of 0.5 yuan. The lower-level query returns a set of values to the upper-level query, which then continues according to the values provided by the lower block. In this nested query, for each ORDERNO value in the upper query the lower query must scan the ORDERDETAIL table, so execution efficiency is obviously low. With two levels of nesting, if each level examines 1,000 rows, the query touches 1,000,000 rows of data. In the system overhead, the scan of table ORDER accounts for 82% and the search of table ORDERDETAIL for 16%. If we replace it with a join, i.e.:
SELECT CREATEUSER
FROM ORDER, ORDERDETAIL
WHERE ORDER.ORDERNO = ORDERDETAIL.ORDERNO AND PRICE = 0.5
then the scan of table ORDER accounts for 74% of the overhead and the search of ORDERDETAIL for 14%.
Moreover, if a column appears both in the main query and in the WHERE clause of the subquery, the subquery may have to be re-executed every time the column value in the main query changes. The deeper the nesting, the lower the efficiency, so subqueries should be avoided where possible. If a subquery is unavoidable, filter out as many rows as possible within it.
3. Wildcard
In SQL, the LIKE keyword supports wildcard matching, but such matching is particularly time-consuming. For example: SELECT * FROM ORDER WHERE CREATEUSER LIKE 'M___'. Even if an index exists on the CREATEUSER field, a sequential scan is performed in this case; with 1,000 records in the ORDER table, more than 1,000 comparisons are needed. If the statement is changed to SELECT * FROM ORDER WHERE CREATEUSER > 'M' AND CREATEUSER < 'N', the index is used when the query is executed, which obviously improves the speed greatly.
4. Distinct
The DISTINCT keyword is used to ensure that there are no duplicate values in the result set, but DISTINCT generates a work table and sorts it to remove duplicate records, which greatly increases the query's computation and I/O. The DISTINCT keyword should therefore be avoided where it is not truly needed.
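For instance, in the ORDER/ORDERDETAIL example above, the join fans out one row per matching detail line and forces a DISTINCT; if one result row per qualifying order is acceptable, an EXISTS form avoids the fan-out and the duplicate-elimination sort (a sketch, reusing the earlier tables):
-- With DISTINCT: the join produces one row per matching detail line,
-- and a work table must be built and sorted to remove those duplicates.
SELECT DISTINCT CREATEUSER
FROM ORDER, ORDERDETAIL
WHERE ORDER.ORDERNO = ORDERDETAIL.ORDERNO AND PRICE = 0.5;
-- Without DISTINCT: EXISTS yields at most one row per order, so the
-- duplication caused by the join fan-out never arises.
SELECT CREATEUSER
FROM ORDER
WHERE EXISTS (SELECT 1 FROM ORDERDETAIL
              WHERE ORDERDETAIL.ORDERNO = ORDER.ORDERNO AND PRICE = 0.5);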
5. Negative logic
Negative logic such as !=, <>, and NOT IN causes the DBMS to perform a full table scan to complete the query. When the table is large, this seriously affects system performance, so other operators should be used instead wherever possible.
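For example, assuming a hypothetical STATUS column that only takes the small non-negative codes 0 to 4, the negative forms below are typically executed as table scans, while the positive rewrites can use an index on STATUS:
-- Negative logic: typically executed as a full table scan.
SELECT DOC_ID FROM DOC_MAIN WHERE STATUS <> 0;
SELECT DOC_ID FROM DOC_MAIN WHERE STATUS NOT IN (3, 4);
-- Positive rewrites, equivalent under the assumption that STATUS is in 0..4.
SELECT DOC_ID FROM DOC_MAIN WHERE STATUS > 0;
SELECT DOC_ID FROM DOC_MAIN WHERE STATUS IN (0, 1, 2);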
6. Temporary table
When a temporary table is used, the database creates a corresponding data structure on disk. Since memory is accessed far faster than external storage, importing the intermediate results of a complex query into a temporary table incurs disk operations that greatly reduce query efficiency. In addition, in a distributed system the use of temporary tables brings synchronization problems between multiple query processes. It is therefore best not to use temporary tables in complex queries. Moreover, since we restrict DDL during system operation and complete all DDL statements during system initialization, we have in fact forbidden the use of temporary tables altogether.
IV. Database Security Design
Many factors affect security, and the software system itself is a key one. Software system security depends mainly on the security of the application system itself and on the security of the database management system. In DOCMAN we mainly use the following strategies to ensure the security of the system and its data:
System permission control. Permission control is divided into two categories: system identity authentication and system authorization control.
System identity authentication requires users to enter a username and password when using the DOCMAN system, so that the computer can confirm the user's true identity and prevent impersonation; it also serves as the basis for authorization control.
System authorization control: after a user has been authenticated and has logged in to DOCMAN, whenever a program or data item is requested, the DOCMAN rights management module checks the user's privileges and controls access to the item according to the user's permissions.
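The rights check itself lives in the application layer; purely as an illustration of the idea, a check against hypothetical permission tables (not the actual DOCMAN schema) might look like:
SELECT COUNT(*)
FROM SYS_USER U, SYS_USER_PRIV P
WHERE U.USER_ID = P.USER_ID
  AND U.USER_NAME   = 'zhang3'     -- the authenticated user
  AND P.RESOURCE_ID = 1024         -- the document or function being requested
  AND P.PRIV_TYPE   = 'READ';
-- A count greater than zero means the access is permitted.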
System data integrity constraints. In DOCMAN, data integrity is guaranteed mainly in two ways: by the constraint mechanisms of the database itself and by the system's entity classes. These ensure that DOCMAN configuration parameters are not illegally changed, that DOCMAN data is not illegally modified or deleted, and that no semantically invalid data enters the database, preventing the input and output of erroneous information, i.e., the proverbial "garbage in, garbage out".
System data backup. The system's data backup mainly includes data security backup and periodic data dumping.
Data security backup. In addition to the backup/recovery mechanisms provided by the database management system itself, the DOCMAN system provides its own data backup/recovery, with two sub-features: full backup and incremental backup. A full backup exports the data of the currently running database into a separate backup database. For incremental backup, the system's log management module records, in real time, every successful database update operation; system administrators can periodically take these recorded SQL statements and re-execute them against the backup library, thereby achieving incremental backup. Through this backup mechanism the DOCMAN system not only provides security backup but also approaches 7 x 24 operation: if the machine running the primary database fails, DOCMAN can switch its database connections to the backup library and continue running.
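A sketch of the incremental-backup idea under the same conventions, with a hypothetical log table; the log module appends each successfully executed update statement, and the administrator replays the new entries against the backup library in order:
CREATE TABLE SYS_SQL_LOG (
    LOG_ID    NUMERIC(10) NOT NULL,   -- assigned in ascending order by the log module
    EXEC_DATE NUMERIC(8),             -- date of execution, e.g. 20020203
    EXEC_TIME NUMERIC(8),             -- time of day, same encoding as elsewhere
    SQL_TEXT  VARCHAR(2000),          -- the successful INSERT/UPDATE/DELETE statement
    PRIMARY KEY (LOG_ID)
);
-- Statements recorded since the last backup point (a hypothetical LOG_ID here)
-- are exported and re-executed, in order, against the backup database.
SELECT SQL_TEXT FROM SYS_SQL_LOG WHERE LOG_ID > 500000 ORDER BY LOG_ID;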
Periodic data dumping. The DOCMAN system runs in large government agencies, where the daily transaction volume is considerable and may reach tens of thousands of operations a day, so after a period of operation the database accumulates a massive amount of data, and dragging this mass of data along seriously affects the system's operating efficiency. For a document management system, documents and their handling data are useful only while the documents are being processed and appraised; afterwards they are rarely used. We therefore provide a periodic dump feature in DOCMAN, allowing system administrators to dump old data regularly; for example, on a half-yearly schedule, data older than the previous half year that is no longer used is dumped to files or to another database.
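A minimal sketch of such a periodic dump, assuming the hypothetical DOC_MAIN table above and an archive table of the same structure:
-- Move document records older than the cut-off date to the archive,
-- then remove them from the working table.
INSERT INTO DOC_MAIN_ARCHIVE
SELECT * FROM DOC_MAIN WHERE CREATE_DATE < 20010701;
DELETE FROM DOC_MAIN WHERE CREATE_DATE < 20010701;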
System log management. In DOCMAN the system log management module records every user's insert, modify, and delete operations together with their results, and also records queries on important data. This makes users' operations on system data traceable and non-repudiable.
V. Conclusion