Introduction to data warehouse technology
Author: Joseph Chow Published: 2001/07/21
The data warehouse is a new kind of database application that has emerged in recent years. The major database vendors have all announced support for data warehousing and put forward complete product lines for building and using data warehouses, setting off a wave of enthusiasm in the industry. Examples include Informix's data warehouse solution, Oracle's data warehouse solution, and Sybase's interactive data warehouse solution. This has also aroused great interest in academia: many important international conferences, such as the International Conference on Very Large Data Bases (VLDB) and the International Conference on Data Engineering, now feature papers devoted to data warehousing (DW), online analytical processing (OLAP), and data mining (DM). Many companies in our country are troubled by the same questions when establishing or developing their own information systems: Why build a data warehouse on top of the existing database? Can a data warehouse replace the traditional database? How is a data warehouse built? And so on. This chapter briefly introduces the technical background of the data warehouse and, drawing on a design example of a data cleaning system, aims to explain the practical significance of data warehouse technology.
§1 From database to data warehouse
Traditional database technology is centered on a single data resource, the database, and carries out all kinds of data processing on it, from transaction processing and batch processing to decision analysis. This processing falls into two broad categories: operational processing and analytical processing (or informational processing). Operational processing, also called transaction processing, is the routine online operation of the database. It usually queries and modifies one record or a small set of records, mainly serves a company's specific applications, and is concerned with response time, data security, and integrity. Analytical processing supports managers' decision analysis and often needs to access large amounts of historical data. Traditional database systems are well suited to a company's day-to-day transaction work but struggle to meet the requirements of analytical processing, so they can no longer satisfy the growing diversity of data processing needs. The separation of operational and analytical processing is inevitable.
In recent years, as database technology has been applied and developed, people have tried to reprocess the data in the DB into a comprehensive, analysis-oriented environment that better supports decision analysis; the result is data warehouse technology (Data Warehousing, abbreviated DW). A data warehouse system used as a decision support system (DSS) comprises: (1) data warehouse technology; (2) online analytical processing technology (OLAP); (3) data mining technology (Data Mining, DM).
The data warehouse makes up for the shortcomings of the original database and develops the original database-centered data environment into a new one: the architected (systematized) environment, as shown in Figure 1.1.
Figure 1.1 The architected (systematized) data warehouse environment
1.1 What is a data warehouse
W. H. Inmon, recognized in the industry as the founder of the data warehouse concept, defines the data warehouse in "Building the Data Warehouse" as follows: a data warehouse is a subject-oriented, integrated, non-volatile (stable), and time-variant collection of data that supports the decision-making process in business management.
Data in the data warehouse is subject-oriented, in contrast to the application-oriented data of traditional databases. A subject is a standard for classifying data at a higher level, and each subject corresponds to a macro-level area of analysis. The integrated character of the data warehouse means that data must be processed and integrated before it enters the warehouse. This is a key step in building a data warehouse: contradictions in the raw data must first be reconciled, and the structure of the raw data must be transformed from application-oriented to subject-oriented. Stability (non-volatility) means that the data warehouse holds a record of historical data rather than the data of daily transaction processing; once data has been processed and integrated into the warehouse, it is rarely or never modified. Time variance means that the data warehouse is a collection of data from different times: the retention period of data in the warehouse must meet the needs of decision analysis, and all data in the warehouse must be marked with the historical period it belongs to.
The most fundamental characteristic of the data warehouse is that it physically stores data, and that this data is not the latest or exclusive to the warehouse but is derived from other databases. The data warehouse is not built to replace the database. It is built on a more comprehensive and complete information application base to support high-level decision analysis, whereas the transactional database carries out the day-to-day operational tasks in the enterprise information environment. The data warehouse is a new application of database technology, and to this day data warehouses are still managed with relational database management systems.
1.2 The emergence of the data warehouse
It has been more than 30 years since the function of computer systems extended from numerical calculation to data management. The earliest form of data management was mainly the file system, occasionally with some associations and semantics added between data segments; data access depended on specific programs, and the access methods were fixed and rigid. In 1969, Dr. E. F. Codd published his famous paper on the relational data model, and data management entered the new era of the relational database.
Over the following decades, a great many new techniques and ideas emerged and drove the development and implementation of relational database systems: client/server architecture, stored procedures, multithreaded concurrent kernels, asynchronous I/O, cost-based optimization, and so on. Together they made the processing power of relational database systems no less than that of the traditional closed database systems. The advantages of relational databases in access logic and application development go well beyond this, and the spread of SQL became an unstoppable trend. Combined with the great gains in the processing power of computer hardware in recent years, the relational database finally became the mainstay of online transaction processing systems. Throughout the 1980s and into the early 1990s, online transaction processing remained the mainstream of database applications.
Applications, however, keep advancing. Once online transaction systems had reached a certain stage, users found that online transaction processing alone was not enough to win an advantage in market competition. They needed to analyze their own business operations and the state of the whole market and related industries in order to make favorable decisions. Such decisions require analyzing large amounts of business data, including historical business data.
In today's fiercely competitive market, this kind of decision analysis based on business data, which we call online analysis, is more important than ever. If traditional online transaction processing emphasizes updating the database, that is, putting information into the database, then online analytical processing is about getting information out of the database and using it. As the well-known data warehouse expert Ralph Kimball wrote: "We spent more than 20 years putting data into databases, and now it is time to take it out."
Applying large amounts of business data to analysis and statistics is in fact a simple and natural idea. In practice, however, people found that obtaining useful information is far from easy, mainly for the following reasons:
· Online transaction processing systems emphasize the performance of intensive data updates and the reliability of the system, and pay little attention to the convenience and speed of data queries. Online analysis and transaction processing place different demands on a system, and it is hard for a single database to satisfy both, even in theory.
· Business data is often stored in scattered, heterogeneous environments and is not easy to query uniformly, and a large amount of historical data is kept offline, where it is effectively out of reach.
· The schema of business data is designed for the transaction processing system, and the format and description of the data are not suited to analysis and queries by people who are not computer professionals.
Some people lament: twenty years ago we could not find the data because there was too little of it; today we cannot find it because there is too much. In response, people envisaged a dedicated data center for business statistical analysis, whose data comes from online transaction processing systems, from heterogeneous external data sources, from offline historical business data, and so on. This data center is an online system built specifically for statistical analysis and decision support applications, and it can supply everything that decision support and online analysis applications need.
This data center is called a data warehouse. The concept was proposed in the early 1990s. If a definition is needed, a data warehouse is a structured data environment that serves as the data source for decision support systems and online analytical applications. The problem the data warehouse studies and solves is how to obtain information from the database.
So what is the relationship between the data warehouse and the database (mainly the relational database)? Recall that in the past, closed systems were built with a bias toward transaction processing, whereas people chose the relational database precisely to make information easier to obtain.
Open Dr. C. J. Date's classic "An Introduction to Database Systems" and we find that what today's data warehouse sets out to provide is exactly what the relational database advocated back then. However, because relational database systems were so successful in online transaction processing, people unconsciously confined them to the category of transaction processing. The excessive focus on improving transaction processing capacity meant that new problems appeared when relational databases were faced with online analysis: today's data warehouse places higher demands on the online analysis capabilities of the relational database, and an ordinary relational database used as a data warehouse falls short in both function and performance, so it must be specially improved. The difference between the data warehouse and the database therefore lies not only in the method and purpose of the application but also in the products and their configuration.
From a dialectical point of view, the rise of the data warehouse is actually a return to the original goal of data management, a spiral ascent. Today's database is like the hierarchical and network databases of the past: it is oriented toward transaction processing. Today's data warehouse is like the relational database of those years: it is oriented toward online analysis. The difference is that today's data warehouse need not strain to accommodate the characteristics of online transaction processing; thanks to technical specialization, it can concentrate on developing and exploring the field of online analysis.
The data warehouse concept was first applied in finance, telecommunications, insurance, and other traditionally data-intensive industries. Many large domestic data warehouses were established in 1996-1997. So, which industries most need a data warehouse and are best placed to build one?
There are two basic conditions. First, the industry has mature online transaction processing systems, which provide the objective conditions for a data warehouse. Second, the industry faces the pressure of market competition, which provides the external motivation for building one.
§2 Data organization in the data warehouse
§1 of this chapter introduced the four basic characteristics of data in the data warehouse. This section examines the following questions: What data is stored in the data warehouse? How is the data organized and stored? What forms of organization are there? By describing the data content stored in the data warehouse, this section answers these questions and deepens the understanding of those four characteristics.
2.1 Data organization structure of the data warehouse
A typical data organization structure for a data warehouse is shown in Figure 1.2. Data in the data warehouse is divided into four levels: the older detail level, the current detail level, the lightly summarized level, and the highly summarized level. After source data has been integrated, it first enters the current detail level and is then summarized further as needed into the lightly and highly summarized levels; aged data moves to the older detail level. The different levels of summarization in the data warehouse are generally referred to as "granularity". The larger the granularity, the lower the level of detail and the higher the degree of summarization.
The data warehouse also holds another important kind of data: metadata. Metadata is "data about data"; the data dictionary of a traditional database is one example. In the data warehouse environment there are two kinds of metadata. The first is the metadata created for the conversion from the operational environment to the data warehouse, which contains all source data item names, ...
Figure 1.2 DW data organization structure
2.2 Granularity and partitioning
1. Granularity
Granularity is an important concept in the data warehouse. It takes two forms. The first is a measure of the degree of summarization of data in the warehouse; it affects both the volume of data in the warehouse and the kinds of queries the warehouse can answer. Multiple levels of granularity are essential in a data warehouse. Since the main purpose of the data warehouse is DSS analysis, most queries run against data summarized to some degree, and only a few touch the detail. Therefore coarse-grained data should be stored on fast devices such as disk, while fine-grained data can be stored on slow devices such as tape.
The second form of granularity is the sample database. It extracts a subset from the detail database at a given sampling rate. Granularity in a sample database is therefore graded not by level of summarization but by sampling rate, and sample databases with different sampling granularity can have the same degree of summarization.
2. Partitioning
Partitioning (segmentation) is another important concept in the data warehouse, and its purpose is likewise to improve efficiency. It disperses data into separate physical units so that each can be processed independently. Many criteria are available for partitioning data, such as date, geography, or business area, and they can also be combined. In general the partitioning criterion should include a date item, since date is a natural and uniform way to divide the data.
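As a rough illustration of the two ideas above, the following Python sketch partitions a handful of detail records by month and then summarizes them at two coarser levels of granularity. The record layout, the sample values, and the function names (`partition_by_month`, `summarize`) are assumptions made for this example only; a real warehouse would do this inside the database rather than in application code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical detail-level records: (sale_date, region, amount).
DETAIL = [
    (date(2001, 6, 29), "north", 120.0),
    (date(2001, 6, 30), "south", 80.0),
    (date(2001, 7, 1), "north", 50.0),
    (date(2001, 7, 1), "north", 25.0),
    (date(2001, 7, 2), "south", 40.0),
]


def partition_by_month(rows):
    """Partitioning: split rows into independent units keyed by (year, month)."""
    parts = defaultdict(list)
    for row in rows:
        sale_date = row[0]
        parts[(sale_date.year, sale_date.month)].append(row)
    return dict(parts)


def summarize(rows, key):
    """Coarser granularity: total amount for each group produced by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Each monthly partition can be loaded, archived, or queried on its own.
partitions = partition_by_month(DETAIL)

# Lightly summarized: one total per day and region.
daily_by_region = summarize(DETAIL, key=lambda r: (r[0], r[1]))

# Highly summarized: one total per month; detail is no longer recoverable.
monthly = summarize(DETAIL, key=lambda r: (r[0].year, r[0].month))
```

The monthly totals answer far fewer kinds of questions than the detail rows, but they are much smaller, which is the trade-off the text describes.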
2.3 Data organization forms of the data warehouse
This section briefly introduces the forms in which data is organized in the data warehouse:
1. Simple cumulative file: the data extracted from the database and processed each day is accumulated and stored day by day.
2. Rolling summary file: over the seven days of a week, data is recorded day by day in the daily data sets; the seven days of data are then summarized and recorded in a weekly data set, and the following week the daily data sets are reused for new data. In the same way, once there are five weekly data sets, their data is summarized again and recorded in a monthly data set, and so on. The rolling summary structure is very compact and holds far less data than the simple cumulative file. The price, of course, is lost detail: the older the data, the more detail is lost.
3. Simple direct file: similar to the simple cumulative file, but it is a snapshot of the database taken at intervals, for example once a week or once a month.
4. Continuous file: a continuous file is generated by comparing two consecutive simple direct files. A continuous file can also be combined with a new simple direct file to generate a new continuous file.
Whatever the file structure, its final implementation still relies on the most basic structure of the relational database, the table.
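The rolling summary file can be sketched in a few lines of Python. This is only an illustration under assumed names (`RollingSummary`, `record_day`) and with a single number standing in for each day's data; the seven-day and five-week thresholds follow the description above.

```python
class RollingSummary:
    """Daily totals roll up into weekly totals, weekly totals into monthly ones."""

    DAYS_PER_WEEK = 7
    WEEKS_PER_MONTH = 5  # the text summarizes after five weekly data sets

    def __init__(self):
        self.daily = []    # reused every week
        self.weekly = []   # reused every five weeks
        self.monthly = []  # grows slowly; detail below this level is gone

    def record_day(self, total):
        """Record one day's total and roll up when a level is full."""
        self.daily.append(total)
        if len(self.daily) == self.DAYS_PER_WEEK:
            self.weekly.append(sum(self.daily))
            self.daily.clear()  # daily slots are reused for the next week
        if len(self.weekly) == self.WEEKS_PER_MONTH:
            self.monthly.append(sum(self.weekly))
            self.weekly.clear()


summary = RollingSummary()
for day_total in range(1, 36):  # 35 hypothetical daily totals, i.e. five weeks
    summary.record_day(day_total)
```

After the loop only one monthly figure remains, which shows both the space saving and the loss of detail for older data.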
2.4 Appending data to the data warehouse
How to append data to the data warehouse is also an important technique. The data in the data warehouse comes from OLTP databases, so the question is how to tell which data has been newly generated since the last append. Commonly used techniques are:
· Timestamps: new data is identified by the timestamp it carries. But not all data in the database contains timestamps.
· Delta file: generated by the application, it records everything the application has changed. Using a delta file is efficient because it avoids scanning the whole database, but the problem is that applications that produce delta files are not common.
· Modifying application code: the application code is changed so that the application records new data automatically as it generates it. However, applications number in the thousands and modifying code is very tedious, so this method is hard to carry out.
· Before and after image files: a snapshot of the database is taken before and after data is extracted, and the two snapshots are compared to determine the new data. This consumes a great many resources and has a large impact on performance, so it has little practical value.
· Log file: the most promising technique is probably the log file, because it is an inherent mechanism of the DB and does not affect OLTP performance. It also shares the advantage of the delta file: extraction is confined to the log file, with no need to scan the whole database. Of course, the format of the raw log file is dictated by the needs of the DB system, and much of what it contains may be redundant for the data warehouse. For example, when a record is updated several times, the log file records every change, whereas the data warehouse needs only the final result. On balance, however, the log file remains the most feasible choice.
§3 Key technologies of the data warehouse
So, what components and key technologies make up a data warehouse?
Unlike the relational database, the data warehouse has no rigorous mathematical theory behind it; it leans more toward engineering. Because of this engineering character, its technology can be divided along its workflow into four aspects: data extraction, data storage and management, data presentation, and data warehouse technical consulting. We discuss each in turn.
3.1 Data extraction
Data extraction is the entrance through which data enters the warehouse. Because the data warehouse is an independent data environment, it must import data from online transaction processing systems, external data sources, and offline storage media through an extraction process. Technically, data extraction involves interconnection, replication, incremental capture, transformation, scheduling, and monitoring. Data in the warehouse does not need to stay synchronized in real time with the online transaction processing system, so extraction can run on a schedule; however, the timing and relative order of multiple extraction operations are critical to the validity of the information in the warehouse.
In terms of technical development, the individual techniques involved in data extraction are relatively mature, and some of them cannot avoid programming, but overall integration is still lacking. Most of what the market offers today is data extraction tools. These tools automatically generate extraction code from the mapping the user specifies between source data and target data. However, the kinds of data that extraction tools support are limited. Extraction also involves data transformation, which is closely tied to the actual application, and its complexity means that tools that cannot embed user programming often fail to meet requirements. As a result, real data warehouse projects do not necessarily use extraction tools.
What matters more is whether the use of tools brings the whole extraction process under effective management, scheduling, and maintenance. In terms of market development, data warehouse vendors whose main products are data extraction and heterogeneous interconnection tools are quite likely to be acquired by companies that own database products. In the world of data warehouses, they can only play a supporting role.
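Incremental extraction from a log, as discussed in 2.4, can be sketched as follows. The log format (sequence number, operation, key, value), the sample entries, and the function names are hypothetical; real database logs are product-specific and far more detailed. The sketch only shows the two ideas from the text: read nothing older than the last load, and keep only the final result of each record.

```python
def extract_since(entries, last_seq):
    """Incremental extraction: keep only entries newer than the last load."""
    return [entry for entry in entries if entry[0] > last_seq]


def compact_log(entries):
    """Reduce a change log to the final state of each key.

    entries: iterable of (sequence_no, op, key, value) in a hypothetical
    format, where op is "insert", "update", or "delete".
    Returns (upserts, deletes): rows to write and keys to remove.
    """
    final = {}
    for _seq, op, key, value in sorted(entries, key=lambda e: e[0]):
        final[key] = (op, value)  # later entries overwrite earlier ones
    upserts = {k: v for k, (op, v) in final.items() if op != "delete"}
    deletes = {k for k, (op, _v) in final.items() if op == "delete"}
    return upserts, deletes


# Hypothetical log: customer c1 is updated twice, c2 is deleted.
LOG = [
    (1, "insert", "c1", {"balance": 100}),
    (2, "insert", "c2", {"balance": 50}),
    (3, "update", "c1", {"balance": 70}),
    (4, "update", "c1", {"balance": 90}),
    (5, "delete", "c2", None),
    (6, "insert", "c3", {"balance": 10}),
]

# Suppose entries up to sequence number 2 were loaded in the previous run.
new_entries = extract_since(LOG, last_seq=2)
upserts, deletes = compact_log(new_entries)
```

Only the last balance of c1 reaches the warehouse, and the whole source database is never scanned.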
3.2 Data storage and management
The real key to the data warehouse is data storage and management. The way a data warehouse is organized and managed is what distinguishes it from a traditional database, and it also determines how the warehouse presents data to the outside. To decide which products and technologies should form the core of a data warehouse, one must start from an analysis of the technical characteristics of the data warehouse.
The first problem the data warehouse faces is the storage and management of large amounts of data. The volume involved is much larger than in conventional transaction processing, and it keeps accumulating over time. Among existing technologies and products, only the relational database system can take on this task. After nearly 30 years of development, the relational database is very mature in data storage and management, and no other data management system can match it. Many relational database systems now support data partitioning, which spreads a large database table across multiple physical storage devices and further extends the system's capacity to manage large data volumes. It is now routine to manage hundreds of GB or even TB of data with a relational database. Some vendors have also given special consideration to backing up very large volumes of data, although data warehouses do not place high demands on online backup.
The second problem the data warehouse must solve is parallel processing. In conventional online transaction applications, user accesses are short and dense; for a multiprocessor system the key is to balance user requests across processors, which is concurrent operation. In a data warehouse system, user accesses are large and sparse: each query or statistical request is complex, but the frequency of access is not high.
In this situation the system needs to be able to mobilize all its processors to serve a single complex query, handling the request in parallel. Parallel processing is therefore more important than ever in the data warehouse. It is worth noting that the TPC-D benchmark for data warehouses adds, compared with earlier benchmarks, a test in a single-user environment, called "system power" (QppD). The parallel processing capability of a system has an important effect on its QppD value. Relational database systems today can parallelize the decomposition of a query and parallelize on the basis of data partitioning, and they support cross-platform multiprocessor machines and MPP environments, handling hardware systems with up to hundreds of processors while maintaining performance scalability.
The third problem for the data warehouse is the optimization of decision support queries. This problem applies mainly to relational databases, since other data management environments are not yet complete enough. Technically, optimizing for decision support touches many parts of the database system: the indexing mechanism, the query optimizer, the join strategy, data sorting, and sampling. An ordinary relational database uses B-tree indexes, which are of little use on fields with many repeated values, such as gender, age, or region. Extended relational databases introduce the bitmap index, which represents the state of a field with binary bits, so that a query becomes a filtering process and a single basic machine operation can filter many records. Because the volume of data often varies enormously from table to table in a data warehouse, the query path that an ordinary optimizer considers best is not necessarily optimal. Relational databases built for decision support have therefore also improved the query optimizer and, on the basis of the index characteristics, added the ability to scan multiple indexes.
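As a rough illustration of the bitmap index described above, the sketch below uses Python integers as bit strings: bit i of a bitmap is set when row i holds the indexed value, so combining two conditions becomes a single bitwise AND. The table contents and the function names are assumptions for this example; production databases use compressed bitmaps and much more elaborate machinery.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitmap.

    Bit i of the bitmap is 1 when row i holds that value.
    """
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)


def matching_rows(bitmap):
    """Return the row numbers whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical customer table with two low-cardinality columns.
CUSTOMERS = [
    {"gender": "F", "region": "east"},
    {"gender": "M", "region": "east"},
    {"gender": "F", "region": "west"},
    {"gender": "F", "region": "east"},
]

gender_idx = build_bitmap_index(CUSTOMERS, "gender")
region_idx = build_bitmap_index(CUSTOMERS, "region")

# "gender = F AND region = east" is one AND over all rows at once.
hits = gender_idx["F"] & region_idx["east"]
rows_found = matching_rows(hits)
```

A column with only a few distinct values needs only a few bitmaps, which is why this structure suits fields such as gender or region where a B-tree index helps little.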
A data warehouse built on a relational database meets a large number of joins between tables in use, and the join is a time-consuming operation for a relational database. An extended database can define join operations in advance as what is called a join index, so that the database can fetch the data directly without carrying out the actual join at query time. Data warehouse queries often need only some of the records in the database, for example the 50 largest customers. An ordinary relational database does not provide this kind of query, so the records of the whole table have to be sorted, which consumes a great deal of time. Relational databases built for decision support have been improved in this respect and provide the feature. In addition, data warehouse queries do not need to be as exact as those of a transaction system, but they do need a sufficiently short response time in a large-volume data environment. Some database systems have therefore added the ability to query sampled data, which greatly improves query efficiency within the limits of acceptable accuracy.
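Two of the improvements above, top-N queries and queries over sampled data, can be illustrated with the Python standard library. This is a sketch under assumptions: the table is a synthetic in-memory list, and `top_n` and `estimate_total` are names invented here, not features of any particular database. `heapq.nlargest` keeps only the n best candidates while scanning, so the whole table is never sorted; `random.Random.sample` draws a simple random sample without replacement.

```python
import heapq
import random


def top_n(rows, n, key):
    """Return the n largest rows by `key` without sorting the whole table."""
    return heapq.nlargest(n, rows, key=key)


def estimate_total(values, sample_size, seed=0):
    """Estimate sum(values) from a random sample scaled up to the full size.

    The fixed default seed is only there to make the sketch repeatable.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return sum(sample) * len(values) / sample_size


# Hypothetical (customer_id, sales) table with 100 synthetic rows.
SALES = [("c%03d" % i, (i * 37) % 101) for i in range(100)]

# The three largest customers, found in a single pass.
largest = top_n(SALES, 3, key=lambda row: row[1])

# An approximate total from a 20-row sample instead of all 100 rows.
amounts = [row[1] for row in SALES]
approx_total = estimate_total(amounts, sample_size=20)
```

The estimate trades accuracy for speed: it reads a fifth of the rows, and its error shrinks as the sample grows, reaching zero when the sample is the whole table.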
In short, a great deal of work goes into turning an ordinary relational database into a server suited to the data warehouse, and this has become an important research topic and development direction for relational database technology. Extending decision support is clearly an important technical measure by which traditional relational databases enter the data warehouse market.
The fourth problem for the data warehouse is support for multidimensional analysis, one of its toughest challenges. Users use a data warehouse in ways very different from a traditional relational database. Access to the warehouse is usually not a simple matter of tables and records but is based on an analysis model of the user's business, namely online analysis. As shown in Figure 1.3, its characteristic is to picture the data as a multidimensional cube: a user's query amounts to applying conditions to some of the dimensions (edges) and slicing the cube, and the result is a numerical matrix or vector, which is presented graphically or fed into statistical analysis algorithms.
The relational database itself does not provide this multi-dimensional query function, and in the early days of the development of the data warehouse, people send