About author
Laura Haas Laura Haas is the senior manager of IBM Software Group, which is responsible for the development of DB2 UDB query compiler, including key technologies for information integration and life science, such as federal databases and XML queries. Previously, Dr. Haas was a research member and manager of the IBM Almaden Research Center.
Eilen Lin Eileen Lin is a senior software engineer in San Jose, California. She used to promote DataJoiner (Federal Database Products, which is one of the first members of the V7 DB2 Federal System). Dr. Lin is currently the technical person in charge of DB2 and DiscoveryLinkTM Federal features.
Introduction
In large modern enterprises, departments in the organization use different database management systems to store and search their important data, which is almost inevitable. Competition, continuous development of technologies, mergers, acquisitions, geographical distributions, and unavoidable dispersion in extensions, etc. will cause this diversity. However, only the information in these systems is combined, and companies will recognize the overall value of these systems.
For example, in the financial industry, merging is almost very common. The newly created entity follows the data store of the original agency. Many such storage are relational database management systems, but these systems often come from different manufacturers; for example, a company may mainly use Sybase, while another company uses Informix® IDs. They may have one or more document management systems (such as Documentum or IBM Content Manager) for storing text documents (for example, loan copies, etc.). Each system may have some applications to calculate important information (eg, a specific customer's loan risk) or excavate information about customer purchase mode.
After enterprise merge, they need to access all customer information from two stores, using existing and new applications to analyze their new asset portfolios, usually, but also use two institutions through a public interface. The combined resources are combined. Although different companies may identify their customers with a completely different identification keys, they need to be able to identify their public customers, merging these customers' accounts. In these cases, federal techniques effectively solve this problem by providing a unified interface of heterogeneous data.
IBM has conducted a lot of investment in federal technologies, making it ahead of the market. Now, federal technologies can unify any digital information represented in any format (structured and unstructured) in any information store. Now, through a variety of IBM products, including DB2®UDB (and DB2 RELATIONAL Connect), DB2 DataJoiner®, and IBM Enterprise Information Portal (EIP), there is also the latest INFORMATION INTEGRATOR- can use these federal technologies. This group of federal technologies will continue to be enhanced, and our customers will continue to have practical business value in all of these products.
This paper mainly discusses advanced database federal technologies, which are implemented by techniques "GARLIC", which represents the next generation of information federal enhancements in IBM software. These enhancements will enable client access and integration data to specifically calculate various relationships and non-relational data sources. Over time, GARLIC technology will continue to integrate into IBM all provide federal technology software products. Customers can be relieved, not only they are protected by the investment of existing products, but they will be able to take advantage of the advanced technologies described herein.
IBM's federal database system provides powerful tools for combining information from multiple data sources. IBM's federal database technology is built on the best technologies of early products DB2 DataJoiner [3], and in terms of scalability and performance, some of them are enhanced by some of the frontier characteristics of GARLIC research items [2], these Technology is unique in the industry. DB2 DataJoiner introduces the concept of virtual database, which is created by a federal multiple heterogeneous data sources. Users of DB2 DataJoiner can query data stored in any location in the federal system without worrying about the location of the data, the SQL language of the actual data source system or the ability to store. Conversely, for any data in the federal database, the user can operate in a DB2 mode. The GARLIC project exhibits this idea to build the feasibility of the federal database system, which can effectively use a variety of different, may be query capabilities of non-relational data sources. In these systems (such as today's DB2), the middleware query processor promotes optimization execution scenarios and makes up for the function of the data sources. In this article, we describe the main features of IBM federal technology: transparency, isomer, advanced function, autonomous, scalability, openness, and optimized performance of federal data sources. Then we go back to show you how IBM's database federal technology works. We demonstrate how to use federal technologies in various circumstances and infer some of the future development trends.
Characteristics of IBM Federal Solutions
Transparency
If the federal system is transparent, it hides the difference, the characteristics, and implementation of the underlying data source. The most ideal situation is that it makes a group of federal data sources to users. Users should do not need to know where data is stored (location transparency); there is no need to know where the data source supports the language or programming interface (call transparency); if you use SQL, you don't have to know which SQL language support (language transparency); no need I know which physical store is stored, or whether the data is partitioned and / or copied (physical data independence, segmentation, transparency); or does not need to know what network protocol (network transparency). Users should see a unified interface, including a single set of error code (error code transparency). IBM provides all of these features such that all data is in a database when writing applications, although in fact, data may be stored in a heterogeneous data source collection.
Isomeric
Isomerism refers to the difference between the data sources. Data sources can be different in many respects. They can run on different hardware, which can use different network protocols, and use different software to manage their data storage. They may have different query languages, different query capabilities, even different data models. The way they handle errors may vary, or provide different transaction semantics. They may be very similar to the two Oracle instances: one runs Oracle 8i, the other runs Oracle 9i, and the mode may be the same or different. Or, their diversity may be similar to this: a powerful relational database, a simple structured flat file, one can use URL form to query and can send back semi-structured XML sites according to some DTDs, a web service And an application that responds to a specific function call set. IBM's federal database can accommodate all of these systems, encapsulate these systems in a seamless transparent federal body.
Advanced Features
IBM's federal technology provides users with two full-specific features: for all data in the federal body, it has rich, compliant with standard DB2 SQL capabilities, and all features of the underlying data source. The SQL of DB2 supports many complex query features, including internal connection and external connections, nested subquers, table expressions, recursive, user-defined functions, aggregation, statistical analysis, automatic summary tables, and other factors Characteristics listed in this one by one. Many data sources may not provide all of these features. However, users can still use DB2 SQL's full feature of DB2 SQL because of functional compensation. Functional compensation means that if a data source cannot perform a particular query function, the federal database retrieves the necessary data and applies this feature. For example, a file system is usually not arbitrarily sorted. However, the user may still require data from that data source (ie, a subset), or require the elimination of repetition data in some order. Federal Databases only retrieve relevant data and sort themselves. Although many data sources do not provide all of DB2 SQL, there is indeed many data sources have a special feature lack of IBM Federal Database. For example, a document management system typically has a score function, for the user's search, with these functions to estimate the retrofitable documentation. In the financial industry, time sequence data is very important, and there are some systems: they can compare, draw, analyze, and divide time series data in a unique way. In the pharmaceutical industry, new drugs are based on existing compounds with special properties. Some systems with special purpose can compare chemical structures or simulate binding of two molecules. Although these functions can be implemented directly, it is often more efficient and more cost-saving utilization of the data source and application systems. IBM allows users to point out the functions of their own interested in federal data, then use them in the query, so that users of the federal system will not lose the original features of the data source.
Scalability and openness of federal
All systems need to develop over time. In the federal system, new data sources may be required to meet the needs of user business. IBM makes it easy to add new data sources. The federal database engine accesss the data source by software components called the wrapper. Access the new data source by getting or creating a wrapper for that data source. The packaging architecture supports the creation of new packaging. Once there is a wrapper, a simple data definition (DDL) statement allows the data source to be added to the federal body without stopping the ongoing query or transaction.
You can package any data sources. IBM supports ANSI SQL / MED standard [1] (MED represents management of external data, management of external data). This standard records the protocol used by the federal server and external data source communication. Any wrapper written to the SQL / MED interface can be used with the IBM federal database. Therefore, the wrapper can be written by a third party and IBM, and then used in conjunction with the IBM's federal database.
Autonomy of data sources
Typically, the data source has existing applications and users. Therefore, when the data source is introduced into the federal body, it is important to affect its operation. IBM's federal database does not affect local operations of existing data sources. The operation of existing applications does not change, neither modifying data and does not move data, and the interface remains the same. Although the implementation of global queries to the federal system may involve various data sources, the data source processing data request is not affected. Similarly, when the data source enters or leaves the federal body, it does not affect the consistency of the local system. The only exception is influenced during the implementation of the two-stage submission processing of the federal data source to the federal body. (However, there is no such problem in DB2 V7, and the federal two-stage submission is used in DataJoiner.) The data source involved in the same work unit will need to participate in the submission process, if necessary, may Require rollback related changes.
Unlike other products, our packaging architecture does not need to install any software on the machine of the supervisor data source. We communicate through the client server architecture with the data source by using the client server system structure. In this way, the IBM's federal data source looks like another application in this data source. Optimized performance
The optimizer is a component of the relational database management system that determines the best way to perform each query. Relationship queries are non-processed, and each relational operator usually has several different implementations, and in the feasible order of the operators that can be selected when performing a query. Although some optimizers use heuristic rules to select an execution policy, IBM's federal database consider various possible policies to model each policy, and then select a cost-off policy. (Usually, the cost is measured according to the consumed system resources.)
In a federal system, the optimizer must decide to be a federal server to perform these operations or data originations involved in the query or by storing data. It must also determine the order of operation, and which implementations are used to perform the local part of the query. In order to make these decisions, the optimizer must understand what you can do in some way and understand its cost. For example, if the data source is a file, think it is intelligent, and if you want to perform sort or apply a function, it is clearly meaningless. On the other hand, if the data source is a relational database system, it can apply some predicates, and can do some connection operations, then if the function of this data source can reduce the amount of data returned to the federal engine, it should be a good idea. . This usually depends on the specific details of each query. The IBM optimizer evaluates various possibilities for the various data sources involved in the query. Typically, in implementing strategies, good decisions and poor decisions can reach several orders of magnitude. IBM's federal database can use the wrapper to federal query cost into different data sources, in this regard, it is unique in the database industry. Therefore, users can expect to achieve optimal performance from their federal systems.
To further enhance performance, each package implements the active regulator provided by each data source using the local API of each data source. For example, a plurality of generated rows are formed into a message (also named block access) is a common performance adjustment. The query compiler communicates with the wrapper to point out which query segments can be used to access, so that the best performance is achieved at runtime without losing query semantics.
Configure a federal system
Create a federal system by installing the federal engine and then configuring it with the data source. Adding a new data source to the federal system includes the following steps. First, the wrapper for the data source must be installed, and then you must tell IBM's federal database to find the wrapper. This can be done by the CREATE WRAPPER statement. If you expect multiple data sources that are added belong to the same type, just a wrapper. For example, even if the federal system will contain five Oracle database instances that may be on different machines, only one Oracle wrapper is required, so there is only a CREATE WRAPPER statement. However, each individual data source must also be identified to the system. This can be done by the CREATE Server statement. If there are five Oracle database instances, you must issue five CREATE Server statements.
For example, assume that there is a specific site for accessing the wrapper and a user wishing to access its data. You can notify the federal database information about the wrapper by the following statement:
Create Wrapper Web_Wrapper
Library "/u/haas/wrappers/libweb.a"
This statement essentially tells the federal database where to find the code of Web_Wrapper. Next, the actual website is identified as a server associated with Web_Wrapper, which tells the federal database about the information about the site.
Create Server Weather_ServerWrapper Web_Wrapper Options (URL 'WWW. ****. Com ")
The role of the Options clause is to customize the basic CREATE Server statement with the information required to access the data source type.
After defining the wrapper and the server, the data located in the remote data source must be described based on the data model of the federal middleware. Since the federal database supports the object relational data model, each data set from the external data source must be described to the federal engine, and it is described as a table with the corresponding type column. Modeling the external data set of the table called an alias (Nickname), the application will use this table name and column name in SQL submitted by the federal body. Define an alias by the CREATE NICKNAME statement. Below this statement sets an alias for information on the weather, and defines some columns that can be used in the query.
Create Nickname Weather
(zone Integer, Climate Varchar (10), Yearly_Rainfall Float)
Server Weather_Server Options (query_method 'get')
The "Options" clause will play a function of the information you need to pass the wrapper, which is to handle queries for the alias.
In addition to storing data, many data sources can also perform special search or other calculations. These capabilities can be represented in SQL to user-defined functions. For example, users may wish to predict the weather based on the geographical and dates to determine customers who purchase air conditioners.
Select C.Name, C.Address
From Customers C, Stores S, Weather W
WHERE TEMP_FORECAST (W.ZONE,: DATE)> = 85 andc.shopsat = s.id and s.location = w.zone
Here, use the TEMP_FORECAST function to indicate that the data source can foresear the temperature of the weather. We call users defined functions implemented by external data sources as mapping functions. Similarly, the mapping function is identified to the federal system through the DDL statement. The CREATE FUNCTION statement tells the federal database to use this function in the SELECT statement.
Create Function Temp_Forecast (Integer, Date) Returns Float As Template
DETERMINistic No External Action
The AS Template clause tells the federal database that this function is not implemented locally. Next, the CREATE FUNCTION MAPPING statement tells the federal database which server can value this function. Several function mappings can be created for the same function. For example, the following statement can complete such a mapping:
Create Function Mapping Tf1 for Temp_Forecast Server Weather_Server
The above DDL statement generates metadata, which describes information about the alias and characterists of the mapping function. The federal query processing engine uses this metadata, which is stored in the global directory of the federal database.
Query processing
After configuring the federal system, the application can submit a query written in SQL to the federal server. The federal server optimizes the query, producing an execution scenario, where this query is decomposed into a segment that can be performed on each data source. As mentioned above, there may be a variety of decompositions for this query, and the optimizer makes choices from a variety of possible estimates according to the minimum resource estimation. Once a solution is selected, the federal database will begin to execute, call the wrapper to perform the clip assigned to them. To perform a code segment, the wrapper performs any required data source operations, which may be a series of function calls, or submitted to the data source for a query executed by its native query statement. The generated data stream is returned to the federal server, which combines them by the federal server, performs any other processing that cannot be completed by the data source, and then returns the final result to the application. The core of the IBM federal query processing method is that the federal server's optimizer and wrapper together form a scheme for performing a query [4]. The optimizer is responsible for studying the space of these possible query schemes. In the connection enumeration, dynamic programming is the default method, the optimizer first generates a single table access, and then generates a two-way connection. At each level, the optimizer considers a variety of connection sequences and connection methods. If all tables are located in a common data source, it believes that this connection can be performed in the data source or on the federal server. Figure 2 shows this process.
Figure 2. Query scheme for connection
The optimizer is different from the use of relationships and non-relational wrappers. The optimizer is used in detail the information used by the relational data source, and the information used by the model is provided by the package, which is used to generate an execution plan, and these plans indicate how the optimizer expects how the data source is expected.
However, since the non-relational data source does not have a public operation set or a public data model, it requires a more flexible arrangement for these data sources. Therefore, the optimizer uses non-relational wrappers:
1. If the candidate query segment called "Request" is applied to a single data source, the IBM Federal Database Submit the query segment to the package.
2. After receiving a request in a non-relational wrapper, it determines which section in the corresponding query segment (if any) can be executed by the data source.
3. The wrapper returns an answer, which describes the part of the query segment has been accepted. The response also includes an estimate of the number of rows to be generated, an estimate of the entire execution time and a wrapper scheme: the wrapper will need to know the package representation of all content to execute the part of the query segment.
4. Federal Database Optimizer combines this response into a global solution while introducing other necessary operators to make up for the query segment portion that the wrapper is unacceptable. Using the cost and the base information in the answer to estimate the total cost of this solution, select the total cost of the total cost from all candidates. After the selected scheme, you do not need to perform the scheme immediately; you can store it in the database directory, and use the scheme once or multiple times when the query is executed. Even if you use this scheme immediately, you don't need to execute it during the same process of creating this scenario.
For example, consider a website that broadcasts stock information (including transaction price, opening price and closing price and trading volume). This website can be searched by a form, which is very common, and the value of the partial attribute can be restrained in the form, but the value of all attributes cannot be bound. Consider the following query:
SELECT Tickersymbol, Stockname
From stocks
Where exchange = 'NYSE' and closing - Opening> 5
Since this is a single table query, there is no need to consider the connection method or order, only one segment contains the entire query, which will be submitted as a request to the web data source wrapper. If this form allows the user to specify a value that matches the "Exchange" attribute of the underlying data without providing a way to specify the difference between the opening price and the closing price, the wrapper will generate a response, accept "Exchange" Predicate, but refused predicate closing - Opening> 5. The optimizer will evaluate the following predicate on the federal server by introducing an operator to compensate for the work that the data source cannot be completed. Typically, the wrapper must be able to return any single column values requested in the SELECT list, but can reject any predicate or more complex SELECT list expression. In addition to pointed out that the data source will only be evaluated to "Exchange", the response will include a wrapper scheme, which contains sufficient information to perform the query represented by the response. For example, the wrapper scheme may contain a URL with parameters, which is equivalent to the URL that the browser will generate if the user fills in the query form. At the time of execution, the federal server returns the scheme to the wrapper, the wrapper will extract the URL and submit it to the website, parse the generated data stream, and then return the requested value to the federal server.
Use the federal body database system
Why is the federal system useful? How do customers use federal technologies? Typically, the federal system is useful when there is a plurality of data sources need to combine information of each data source. In this section, we have studied how customers are using IBM federal technologies to solve their current business problems.
Distributed operation: a large pharmaceutical company
Many companies are global companies that need to coordinate activities around the world. For example, pharmaceutical companies may have its research laboratories in Europe and the United States. Every laboratory scientists are looking for new drugs to treat certain special diseases. These scientists can access databases related to compounds that are stored in systems with specialized uses, which can be searched for special features or chemical structures (structural similarities) of these compounds. In both laboratory, scientists screened these compounds at high speeds of different organisms. These experiments were stored in the relational database of each laboratory. Other data sources that scientists can access include some large-scale flat files, patented databases, data and analysis of chromosomes and protein information, and images and text documents.
Scientists in these two labs have different tasks, and the treatment and healing methods they use are different. This makes them do different experiments, focusing on the specific compound collection. However, it is often useful to use the same compound to be useful for different goals, and sometimes one experiment may reveal the results of other experiments. Therefore, for scientists located in one of the laboratory, it is important to access the data from another laboratory, which avoids duplication of labor. Although this problem can be solved by establishing a large data warehouse (all compound data and experimental results inside), this method has several defects. First, the experimental results data change is very fast, and thousands of records from both sides of the Atlantic are added every day, which causes difficulties in maintenance. Second, the data warehouse must be replicated on both sides, otherwise the database of sides must endure the sluggish when accessing data. Replication causes the cost of this solution to increase, and increases the complexity of maintenance. Again, it is necessary to migrate the current stored data data stored in a dedicated repository to the relational database, including re-implementing search algorithms and any existing applications.
Federal solutions have eliminated these issues. Data remain in an existing data source while retaining their native access paths, and the current application has not changed. However, the data in any data source can be accessed in the case where the local domain is not considered. To quickly access, local data is still in local. If you need, you can still access remote data that is not regularly accessed, and the federal server will optimize queries to ensure that these data can be retrieved as much as possible. The part of the data that is often accessed frequently for both laboratory, if you are willing to copy. Heterogeneous replication
Many companies choose to retain multiple copies of their data. For example, a large retailer has some point of sale in the United States, which requires backup data in data warehouses located in various regions. Retail points use a relational database management system; use another DBMS with another scalability to implement data warehouses. However, this causes how to send data from the data source to the data warehouse. IBM federated technology not only makes mobile data, but also becomes convenient to insert data from data source, but also makes re-shaping data, as well as before being inserted into the data warehouse.
IBM provides a replication product DB2 DataPropagatorTM to help you integrate distributed database environments by using federated databases with feature of the federal database. DataPropagator automatically copies data between remote systems to avoid manually unloading and loading databases. For non-DB2 relational data sources, the Capture trigger is defined to capture changes to the data source and write the change to the desk. The Apply program running on the IBM Federal Database Server uses the keeper of the schedule to copy those changes from the desk table to the IBM federation database or another non-DB2 relational database. It is easy to make heterogeneous replication due to federal technologies.
Distributed data warehouse
A distributed data warehouse has always demonstrated high availability and lower cost. Enterprises can create several data fairs to store only high-level summary data, which are from data warehouses. With IBM's federal technology, although the data market and data warehouse can be on a separate system, users of the data market can easily drill into the data warehouse from the summary data of their local level. Federal technology is provided by providing a virtual data warehouse that users do not need to know that the data warehouse is distributed.
Space geographic app
Bank needs to select a place for its new branch. This location selected must maximize the expected profit. To this end, banks need to consider demographic information around each location (whether the demographic information is in line with the target customer base?), Consider whether the crime rate in this area (for retail business, low crime rate is important), consider whether it is close to Main roads (in order to attract customers near areas), consider whether it is close to competitor (lack of competition), considering whether it is close to any known issue area, these areas must be avoided (surrounding garbage piles) Or other people who are annoying will have a negative impact on the business). Some of these necessary information will come from banks' own databases. Other information must be retrieved from external data storage (including information about this community). This application shows the need to integrate spatial geographic data and traditional business data. It requires advanced query analysis functions to associate data, requiring the final user tool for displaying data in a visual manner in space geographic context.
Typically, spatial geographic data has been managed by a dedicated geographic information system (GIS), but it cannot integrate spatial data with other business data stored in company RDBMS, and external data sources. DB2 Spatial Extender is a product developed by IBM and its business partner Environmental Systems Research Institute (ESRI). DB2 Spatial Extender provides two full-specific solutions to customers using IBM federal databases. Customers can use built-in in DB2 Spatial Extender and combine geographical space intelligence of a large number of available business information in the federal system. This allows organizations to enhance the understanding of their own business, using the value of existing data, build complex new applications, enabling companies to success. Conclude
Although many research communities are quite focus on this field, few commercial database management systems have solved this problem in integrating relationships and non-relational data sources into federal. With this federal technology, IBM has taken a big step toward this goal. IBM's unique federal query processing technology allows users to experience all powerful functions of DB2 SQL and the power combined with powerful functions of each data source. It provides users with these benefits: transparency, isomer, advanced features, autonomous, scalability, openness, and optimized performance of the underlying federal data source. Today, we are using the federal body to solve many important business needs.
In the future, we will continue to work to improve the performance and functionality of the federal body. For example, a form of cache has been implemented using the Auto Summary Table (AST) mechanism, which enables administrators to define the avatar views of the data in a set of underlying tables - ie alias. For some types of queries, the database can automatically determine if the AST answer query is used without accessing the basic table. In addition to continuous improvement, we are also studying some tools to help configuration, tuning and managing federal systems. We are developing tools for generating statistics from non-relational data sources, as well as tools for monitoring federal system behavior. We currently develop tools that help wrapper developers.
Finally, even if the design of a good federal database management system and its set of tools are just a partial solution for data integration. Complete solutions will have to integrate applications and data, and solve some of the higher levels of problems, such as data quality, annotation, terminology differences and business rules that indicate when and how to combine information. IBM is focused on this broader information integration requirement to enable customers to implement their own business integration needs, and the database style federality is one of the key integration technologies.