Data warehouses provide convenient access to enterprise data and powerful tools for analyzing it. The goal of a data warehouse is to extract valuable information from enterprise data: information that can increase profits, improve operational efficiency, guide business decisions, and uncover competitive advantages. Achieving these goals has a clearly positive impact on the business. The data warehouse is a hot topic in client/server computing today, and understanding it is valuable for any client/server developer.
This article describes the data warehouse and offers techniques for achieving its goals, covering data collection, decision support systems, online analytical processing, and the warehouse itself. I define each concept, introduce the processing mechanisms of the data warehouse, and discuss the role of metadata in creating and maintaining it.
Today's client/server systems
Early client/server work consisted mainly of developing online transaction processing (OLTP) applications. Automating basic business operations, such as order entry, inventory, or reservation systems, was the focus of development until recently. Most applications of this type use a simple two-tier structure: a front-end GUI application accessing a back-end database server. Transaction logic is implemented either in code on the client workstation or in stored procedures on the server.
Today, most companies have completed their large and small client/server projects, or at least the initial systems. Three-tier systems, which separate transaction logic from both the GUI and the data source, are becoming the focus of new development. Businesses often need three tiers to handle larger and more complex applications involving thousands of clients and dozens of servers. OLTP applications demand short response times, accuracy, freedom from errors, 100% availability, and real-time behavior; the basic business operations they implement are essential to running the business efficiently.
Data collection and access
Most organizations have accumulated a great deal of data and have automated its collection well; in terms of accumulation and storage, things are working. The data resides mainly in relational databases, where the large volumes of data and the data models reflect the company's past activities and performance. SQL is the mainstream data access language today: it is set-oriented and provides a good interface between applications and databases. Transactions are managed by the database server or by a transaction processing (TP) monitor; large applications generally use TP monitors to achieve data consistency and high performance. Most data operations work on single records or small related sets. Well-defined relational operations yield predictable results, and the result of an RDBMS query is generally an aggregate or a value computed on the fly.
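A minimal sketch of this kind of set-oriented SQL access, using Python's built-in sqlite3 module as a stand-in for a relational database server (the table and column names are hypothetical):

```python
import sqlite3

# In-memory database standing in for an OLTP order-entry system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 75.0)])

# A set-oriented query: the result is an aggregate computed on the fly,
# not a single stored record.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 175.0), ('West', 250.0)]
```

The query result is derived from the stored rows rather than stored itself, which is exactly the predictable, computed-on-demand behavior described above.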
Decision support systems
Once an organization has a complete data acquisition and storage process in place, a new need emerges: how to exploit the valuable information in the database. In an increasingly competitive market, it is essential to analyze this information and generate reports that support business decisions. The sooner you understand your own business, the sooner that understanding translates into a measurable competitive advantage or operational improvement.
Compared with OLTP systems, a decision support system (DSS) needs mainly read-only access to the database, which it uses to build reports and query results that support important business decisions. These systems do not require fast response; they run at their own pace and may be used only occasionally. Users also expect to export DSS information to other desktop software tools.
It is important to recognize that a DSS takes a fundamentally different view of corporate data. The data is treated as a historical record, and the first step of analysis is to group and consolidate it. Mathematical operations are applied to subsets of the data (by month, year, region, or product), and graphical display can further aid interpretation of the results.
DSS users generally work with DSS tools directly, and those tools must be matched to a data model. A DSS usually starts with the same database the OLTP system uses, but the DBA often needs to create summary tables and views for the new functions and tasks. In some cases, replication is used to move data from the running system into the DSS database; the production data must be frozen during the transfer to avoid problems with concurrent updates. This is no loss to DSS users, because they do not care about individual recent transactions, only about long-term performance, trends, and summaries. This work requires analytical functions, and in most cases it also involves data from different locations, which may be in different formats or not directly obtainable at all. The job is usually not just programming, though: the data must also be put into a uniform format so that queries become more convenient and flexible.
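The summary tables mentioned above can be sketched as follows, again using Python's sqlite3 with hypothetical table names; the idea is simply to materialize grouped totals so reports need not re-scan the detail rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("1999-01", "widgets", 120.0),
    ("1999-01", "gadgets", 80.0),
    ("1999-02", "widgets", 150.0),
])

# A summary table materializes the grouped totals once, so DSS reports
# read the small aggregate table instead of the detail rows.
conn.execute("""
    CREATE TABLE sales_by_month AS
    SELECT month, SUM(amount) AS total
    FROM sales GROUP BY month
""")
totals = dict(conn.execute("SELECT month, total FROM sales_by_month"))
print(totals)  # {'1999-01': 200.0, '1999-02': 150.0}
```

The trade-off is the one the article describes: the summary is fast to query but must be regenerated when the underlying data changes.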
For such analysis systems, you need to consider improving performance and providing aggregate functions. Reports process large numbers of records; individual records matter only in rare cases. Access paths are somewhat less predictable than in an OLTP system, but most of them can still be established in advance.
Replication
The growth of distributed systems, high performance requirements, and the special nature of DSS systems have all driven the rapid development of replication. Replication has quickly become a basic technique for overcoming the performance, concurrency, availability, and capacity limits of large data systems. Its drawback is that the replica is not the live data, only a copy of it; the replica is therefore only as current as its last refresh (though systems using a two-way replication mechanism can propagate changes in both directions satisfactorily).
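The "only a copy" trade-off can be illustrated with a small point-in-time snapshot sketch in plain Python (the data structure and function names are hypothetical, not any vendor's replication API):

```python
import copy
import datetime

# Hypothetical production data store (stands in for the live OLTP tables).
production = {"orders": [{"id": 1, "amount": 100.0}]}

def snapshot(source):
    """Take a point-in-time copy for the replica/DSS side.

    The copy is frozen: later changes to production do not appear in it,
    which is exactly the 'timely copy, not live data' trade-off."""
    return {"taken_at": datetime.datetime.now(datetime.timezone.utc),
            "data": copy.deepcopy(source)}

replica = snapshot(production)
production["orders"].append({"id": 2, "amount": 50.0})  # live system moves on

# The replica still reflects the moment of the snapshot.
print(len(production["orders"]), len(replica["data"]["orders"]))  # 2 1
```

A real replication product would refresh the replica on a schedule or propagate changes continuously; the staleness between refreshes is the cost being described.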
Online analytical processing
Online analytical processing (OLAP) takes DSS to a higher level. OLAP is an analysis technology that draws information from the collected business data using mathematical operations and data manipulation techniques. This information can be presented flexibly and interactively as statistics, trend analyses, and forecasts. OLAP users want the original strengths of the enterprise data systems, such as fast access, concurrency, and consistency, combined with the analytical power of DSS; the goal is to obtain enterprise data quickly and efficiently. Although OLAP may process data shared with OLTP, OLAP is inherently different from both OLTP and DSS systems. You strive to build a data warehouse and to retrieve and select (or mine) data on demand. That is, the data must be accessible, organized according to a flexible schema, and modifiable (replaceable). None of this happens by accident; it requires great effort and the cooperation of staff across the enterprise.
For a large company, the most complex project under way may well be the design and implementation of the data warehouse system. The users' new focus is that they want the system to present information, not merely provide access to data. Some people may still think of a data warehouse as essentially the same as an operational database, but a data warehouse is designed from the outset with access, analysis, and reporting in mind. The data warehouse can therefore be optimized further than a DSS system, and the production data is kept separate from the data in the warehouse. This optimization can only go so far, however; flexibility must be preserved. Information of this kind can be obtained with tools such as Business Objects from Business Objects, BrioQuery from Brio Technology, and Cognos PowerPlay; Sybase's S-Designer 6.0 will include data warehouse design features.
Multidimensional data
An OLAP system must be flexible, able to access data in an ad hoc manner, and usable interactively. The key problem is how to express complex queries that find (mine) trends, summaries, evaluations, and relationships. "What if" queries are a basic technique familiar from spreadsheet software; "why" and "how" queries are equally important. These queries are usually combined with other variables that define the context of the information; example query variables are product type, time period, or location. The use of such analysis variables is what makes the data multidimensional: during analysis, different values of these variables produce different comparisons. Forecasting is also part of the processing. Once an analysis has been produced, the user can drill down to see more detail (a subset of the whole) or move up to display the next higher level of the analysis.
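Drill-down along analysis dimensions can be sketched in a few lines of Python; the fact records and dimension names here are hypothetical, and a real OLAP engine would precompute many of these aggregates:

```python
from collections import defaultdict

# Hypothetical fact records: three analysis dimensions plus one measure.
facts = [
    {"year": 1998, "region": "East", "product": "A", "sales": 10.0},
    {"year": 1998, "region": "West", "product": "A", "sales": 20.0},
    {"year": 1999, "region": "East", "product": "A", "sales": 15.0},
    {"year": 1999, "region": "East", "product": "B", "sales": 5.0},
]

def rollup(facts, dims):
    """Aggregate the sales measure over the chosen dimensions.

    Fewer dimensions = a higher summary level; more dimensions = drill-down."""
    totals = defaultdict(float)
    for f in facts:
        key = tuple(f[d] for d in dims)
        totals[key] += f["sales"]
    return dict(totals)

# Summary level: totals by year.
print(rollup(facts, ["year"]))  # {(1998,): 30.0, (1999,): 20.0}
# Drill down: the yearly totals split out by region and product.
print(rollup(facts, ["year", "region", "product"]))
```

Each choice of dimension list is one "view" of the same multidimensional data; drilling down just adds a dimension to the key.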
The warehouse database
A data warehouse is a system consisting of one or more applications and a database used for analysis and reporting. The database is loaded with data from other data sources; typically there is an initial bulk load, followed by periodic snapshots or incremental updates. The data in the warehouse is organized by subject and, done properly, provides exactly the information needed for important business decisions.
The data warehouse server provides access to the data. The applications for analysis and report generation can be custom-built or purchased (I recommend Cognos Impromptu and PowerPlay). Ready-made tools for data collection and accumulation are also available, and this market will continue to expand.
Building the data warehouse
The members of the team that builds a data warehouse need both technical knowledge and business knowledge, so that they have a thorough top-to-bottom understanding. A data warehouse project involves the data sets (including history), the design (the data model), documentation, and database maintenance. It also requires defining the system's metadata (covered in the next section) and considering data distribution (developing applications and purchasing tools).
If there are multiple data sources, maintaining the data sets becomes extremely difficult. Each source may have its own formats, platforms, standards, meanings, historical influences, and identifiers. Multiple instances of the same object may occur across sources, or even within a single source. The data is often incomplete or encoded, and you must understand and record all of these details. Building a data warehouse is usually a large project lasting months, possibly years. When the goals are not yet clear, it is best to sketch a solution, build a model warehouse from it, and then run a variety of tools against the model to check whether the original requirements were correct. Tests of the model should cover both typical and special cases. This step shows you what works and how it works in your enterprise. Interview the users, especially those who truly understand the data and its sources. The final system may require reports, certain special tools, perhaps even newly developed applications; but in any case, your responsibility is to provide users with data and tools.
In essence, access to the data warehouse should be mostly read-only, although some analyses do write results back into the warehouse. The warehouse must be updated on schedule to respond to user requirements and to changes in the source data. This raises new maintenance issues: checking, regenerating, and deleting analysis records. Production data must flow into the warehouse, while data in the warehouse may in turn be replicated out to business units or departments.
First, define the scope of the initial data warehouse system, starting with the part of the business you understand best, and enlist an analyst as an assistant. Many analysts build data warehouses on top of departmental databases: those databases are already thoroughly understood, and the time is ripe to remodel them.
Performance is a major issue. Summary tables, the results of past analyses, and the like all add to the maintenance load, so you should test a variety of extreme cases and re-examine your plans from as many angles as possible. The design questions of a data warehouse are not something one person or one group can answer alone.
Metadata: data about data
One of the most important steps in building a data warehouse is defining and creating the metadata. Metadata exists at three levels: the data sources, the data warehouse, and the users (for whom you can also define business views). Metadata provides a catalog listing the warehouse's data sources and the data that flows into the warehouse. My view is: if something is not in the catalog, it (effectively) does not exist. User metadata can cover computed fields, summaries, and other objects available for access. Carefully document the data sources, the warehouse structure, and the user views, such as report formats; use these documents to validate the data and confirm the correctness of the information. If necessary, documents can also be used to define a working data item and check its validity.
Metadata plays a role throughout the creation and maintenance of the data warehouse. Defining it is the best-founded part of the work: ultimately you define metadata for every object type in the warehouse. The metadata refines the data structures and the relationships among the data (derived from database views, or from business rules and data flows). You should also record aliases, code tables, defaults, integrity constraints, units of measure (dollars or pounds), algorithms, and related information.
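A tiny catalog entry of the kind described above might look like this; the field names and the column it describes are purely illustrative, not a standard:

```python
# A minimal, hypothetical metadata record for one warehouse column.
metadata = {
    "sales.amount": {
        "source": "oltp.orders.amt",       # where the value comes from
        "unit": "USD",                      # dollars vs. pounds matters
        "aliases": ["revenue", "amt"],      # alternate names in the sources
        "derivation": "SUM over line items",
        "nullable": False,                  # an integrity constraint
    }
}

def describe(column):
    """Render a human-readable catalog line for a warehouse column."""
    m = metadata[column]
    return f"{column} ({m['unit']}) from {m['source']}"

print(describe("sales.amount"))  # sales.amount (USD) from oltp.orders.amt
```

In a real warehouse this catalog would itself live in database tables, so tools and users can query it like any other data.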
In the metadata that describes the data in the warehouse, you express your understanding of the business rules. For example, you can analyze and record security access rights to the objects and data elements in your system. In the process you must determine the representation of each final object (class) and the procedure for deriving its instances in the warehouse (final data types, data conversions, data sources, and expected time limits). For data accumulation, the details to define include: "Which fields are absolutely required?" and "If the data cannot be obtained, which alternative source or procedure should be tried?"
All data entering the warehouse database must be described by metadata; even the metadata has metadata of its own. These documents should describe how the metadata in your system is represented.
Metadata server
Like every complex system, the metadata will change; fortunately, the change is gradual. When the business or a process changes, the metadata must react. You need version control, so that data generated at different times can have different formats. In many large systems I have built a metadata server that provides access to data whose structure, format, or version may have changed. With a metadata server, you can deliver the data together with its metadata as the result of a particular query. Thus, given the metadata, a display system can be built that renders data in any format. The information provided by the metadata server also helps users drill down through the system.
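A version-aware lookup of this kind can be sketched as follows; the schema registry and record layout are hypothetical, standing in for whatever the metadata server actually stores:

```python
# Schemas keyed by version: data written at different times has different
# formats, and the metadata server knows which schema applies to which row.
SCHEMAS = {
    1: {"fields": ["id", "amount"]},
    2: {"fields": ["id", "amount", "currency"]},  # format changed later
}

records = [
    {"version": 1, "row": (1, 100.0)},
    {"version": 2, "row": (2, 80.0, "USD")},
]

def fetch_with_metadata(record):
    """Return the data paired with the schema that was current when it
    was written, so a generic display layer can render any version."""
    return record["row"], SCHEMAS[record["version"]]

row, schema = fetch_with_metadata(records[1])
print(schema["fields"])  # ['id', 'amount', 'currency']
```

Because each result carries its own description, a single display component can handle every historical format without hard-coding any of them.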
Updating the data
You must plan for loading and refreshing the data warehouse, and metadata plays a key role in this process. As development progresses, the data model evolves, and each new model must be tested against the content it is intended to serve. Depending on how current the data must be kept, you can settle on uniform refresh specifications. To give the data additional meaning, it is important to seek out the relationships within the system and record them.
Consider the timing and history of every data item in the warehouse. How long does the data live? For which items must changes be preserved? Is there a version control system? You must plan for change, because the warehouse will change: allow for changes in the data sources, in user needs, and in the data model.
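One common answer to "which items must keep their change history?" is to append dated versions instead of overwriting values. A minimal sketch, with hypothetical item names:

```python
import datetime

# History-keeping updates: instead of overwriting a changed attribute,
# append a new dated version so past analyses stay reproducible.
history = []

def record_change(item, value, when):
    history.append({"item": item, "value": value, "effective": when})

def value_as_of(item, when):
    """Return the value that was current on a given date."""
    candidates = [h for h in history
                  if h["item"] == item and h["effective"] <= when]
    return max(candidates, key=lambda h: h["effective"])["value"]

record_change("price", 9.99, datetime.date(1998, 1, 1))
record_change("price", 12.49, datetime.date(1999, 6, 1))

print(value_as_of("price", datetime.date(1999, 1, 1)))   # 9.99
print(value_as_of("price", datetime.date(1999, 12, 1)))  # 12.49
```

The cost is storage and slightly more complex queries; the benefit is that a report run against last year's data gives last year's answer.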
Documentation and training
A good set of documentation is very important, and so are easy-to-use front-end tools. The key is to avoid having to handle every user request yourself: give users tools with which they can do some of the work on their own. Where a tool exposes data, it should do so at an understandable level. Users do not need to know the database structure; they need to know only its external appearance, and they should access the system through that view. This principle should be established early and reflected broadly in the requirements analysis.

Possible solutions
The most likely host for a data warehouse is a large mainframe database system, which provides functionality, control, security, performance, and consistency. Distributed database systems are also quite feasible. Sybase offers a very powerful distributed system: with Sybase, the data sources and the actual locations of the data are transparent, and data can be moved from place to place as needed; Sybase IQ is an interactive query product specifically optimized for decision support systems.
If a consensus can be reached, standardizing on one DBMS, and on common platforms, operating systems, and networks, helps simplify the problem. A better way is to choose a product suite that provides the DBMS, data warehouse, communications, and query tools together; outside such a tightly packaged solution, integration may be difficult and sometimes costly. If you choose this approach, candidates include Oracle, Sybase, IBM, Information Builders, and Hewlett-Packard.
A more complex option is a heterogeneous system built around a TP monitor. Large client/server applications may require a TP monitor to control data operations. TP monitors such as Tuxedo and Encina have interfaces to the major application tools, and in the future TP monitors may be embedded in development environments or operating systems. The transaction is the TP monitor's unit of work; in a large distributed system, some controller must take on the synchronization and processing of the events that interact within the system.
There may be still more complex requirements. Graphics, the Internet, and multimedia are becoming increasingly important; they will add to the complexity of the requirements and greatly increase the volume of data transferred.
OLAP databases are almost always very large, so performance is a serious concern. Remedies include partitioning the data, distributing it, filtering it, and perhaps building and maintaining files that hold commonly requested information. If the data warehouse is used mainly for DSS, much of the copying work can be handed over to a client/server replication system.
Developing OLAP applications
Developing OLAP applications can be very difficult, and they should be written in an object-oriented style. Many of the processes in OLAP are very similar, so you can develop object classes for them: in most OLAP applications, data access, data storage, analysis (functions), and report generation are basically the same. Object-oriented techniques improve reusability. OLAP applications should also be extensible, and should be designed as scalable systems.
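The reuse idea can be sketched as a small class hierarchy: the shared OLAP steps (data access, analysis, report generation) live in a base class, and a concrete application overrides only what differs. The class and method names here are illustrative only:

```python
class OlapReport:
    """Base class bundling the steps most OLAP applications share."""

    def fetch(self):
        # Data access step; shared by all reports (canned data here).
        return [10.0, 20.0, 15.0]

    def analyze(self, rows):
        # Analysis step; subclasses override this.
        return sum(rows)

    def render(self, result):
        # Report-generation step; shared by all reports.
        return f"total = {result}"

    def run(self):
        return self.render(self.analyze(self.fetch()))

class AverageReport(OlapReport):
    """Only the analysis function differs; everything else is reused."""
    def analyze(self, rows):
        return sum(rows) / len(rows)

print(OlapReport().run())     # total = 45.0
print(AverageReport().run())  # total = 15.0
```

Each new report subclass costs only the code that is actually different, which is the reusability benefit claimed above.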
The system should use and manage metadata and a data dictionary. An interpreted-language scheme lets you avoid recompiling after changes. OLAP applications must provide a complete set of functions, which may require a programming language of their own.
Summary
In most large companies today, the electronic data is already collected. People therefore recognize that valuable information can (and should) be extracted from it to support business decisions. The data warehouse is the structure and system for obtaining that information from the data. For information professionals, building and maintaining data warehouses will soon become an important part of the job.