How should I go about building a data warehouse?
Generally speaking, building a data warehouse involves two main aspects:
1. Designing the interface to the operational database.
2. Designing the data warehouse itself.
It seems very simple, but it is not. Suppose I am the database designer. I can simply plunge ahead: load part of the data first, let the DSS analyst (don't forget, this is the person whose needs drive the design of the data warehouse) analyze it, and wait for his feedback. It is not too late to adjust after that.
Below, I go through the problems one by one, as in a self-study class.
What are the main difficulties in building a data warehouse?
First, let me correct a widespread misconception: that building a data warehouse is just a matter of extracting data from the operational data. This is wrong, mainly because most operational data is not integrated (who has ever seen a billing program that can produce statistics over several years of billing entries?). You cannot pull out what you really need, for example this month's average cost, or the number of days Ma Lei worked overtime this month. You know this without my saying so: operational data mainly serves applications, and each system or application has its own "independence". At development time, who thinks about going through the old accounts later?
OK, let's look at the question from a new angle: if it is not just extraction, what else is there? The following:
The first problem: system integration. When hundreds or thousands of tables are put in front of you and you need to produce statistics, do you dare to assert that a field in one table and a field with the same name in another table mean the same thing? Or conversely, do you dare to assert that a field in one table and a differently named field in another table are unrelated? These issues come down to one problem: the systems lack integration! Apart from designing your databases better, this can only be solved with patience. There is also the problem of field conversion. Look at the following example: gender (SEX) is encoded in many ways in databases. It can be written as m/f, or as 0/1 for male/female, and so on. What should I do? To make sure the data arriving in the data warehouse is correct, we must establish a mapping for each encoding (put simply: values of the same nature are all expressed in one uniform form). This too takes patience!
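The field conversion described above can be sketched as a lookup table per source system. This is only a minimal illustration: the source system names ("billing", "payroll") and the warehouse form "M"/"F" are assumptions made up for the example, not something taken from the book.

```python
# Hypothetical source systems and their gender encodings (assumed for illustration).
GENDER_MAPS = {
    "billing": {"m": "M", "f": "F"},   # source writes m / f
    "payroll": {"0": "M", "1": "F"},   # source writes 0 / 1 for male / female
}


def unify_gender(source_system, raw_value):
    """Translate a source-specific gender code into the single warehouse form."""
    mapping = GENDER_MAPS[source_system]
    key = str(raw_value).strip().lower()
    if key not in mapping:
        # Report unknown codes instead of guessing what they mean.
        raise ValueError(f"unmapped gender code {raw_value!r} from {source_system}")
    return mapping[key]
```

Raising an error for an unmapped code is a design choice of this sketch: a silently wrong value in the warehouse is harder to find later than a failed load.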
The second problem: the efficiency of accessing the data of existing systems. This is to be expected: when there are many tables and files to scan, who can know exactly whether a given file has already been scanned? If the existing system holds a lot of data and you have to scan the entire database every time just to get part of it, that is a tragedy. I believe no one wants that to happen; specific solutions are proposed below.
Before asking how to avoid these problems, first figure out what you have to load from the operational environment into the data warehouse (which one would you choose?):
- Loading archived data. (Think of the old, dust-covered ledgers in the filing room.)
- Loading data that currently exists in the operational systems. (It is still in the system and has not been archived yet.)
- Loading the changes (updates) made in the operational environment since the data warehouse was last refreshed.
The first option looks very simple (who couldn't do it?), yet in practice it is often not done at all. As a DSS analyst with current data at hand, would you be willing to analyze data from ten years ago? Many companies find that in many environments using the old data is not worth the cost. The second option is not difficult because the data only needs to be loaded once. Usually we can unload the operational data into a sequential file, and this sequential file can then be loaded into the data warehouse without disturbing the online environment. (Seems quite good.)
The third option is more complicated, because the operational data keeps changing while you load, and capturing those changes is not easy. Scanning the existing files (or tables) efficiently therefore becomes the main problem for the data warehouse architect. What to do, what to do... Actually there are several techniques, five in all:
1. Scan data that is time-stamped. You then know exactly which data was updated recently, and you can at least avoid inconsistent data. (Unfortunately, not much data is time-stamped.)
2. Scan a delta file. (I am not sure exactly what a delta file is, but it is certainly generated by the application and records only the changes that were made.) Unfortunately, not many applications produce a delta file. :(
3. Scan the audit file or log file. These two are the same in nature, and similar to a delta file, except that they are large and contain much more data, so the interface program is hard to write. No other drawbacks. :)
4. Modify the application code. (This seems excessive: in order to design the data warehouse, you actually make other people rewrite their applications.) This is not common, because application code is usually old and hard to modify. :(
5. The fifth method is to have no method! Just joking. All the materials, this book included, advise against this one, so I will say only a couple of sentences: take image files at regular intervals and compare their differences. It is best not to use it; I also find the method troublesome, complex, and hungry for resources. So do not use it unless you have no other choice! :)
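Two of the five techniques above are easy to sketch: scanning time-stamped data (method 1) and comparing a before image with an after image (method 5). The field name `updated_at` and the representation of an image as a dictionary keyed by primary key are assumptions made for this illustration.

```python
from datetime import datetime


def changed_since(rows, last_refresh):
    """Method 1: keep only the rows stamped after the last warehouse refresh."""
    return [row for row in rows if row["updated_at"] > last_refresh]


def diff_images(before, after):
    """Method 5: compare two full images, each a dict keyed by primary key."""
    inserted = {k: v for k, v in after.items() if k not in before}
    deleted = {k: v for k, v in before.items() if k not in after}
    updated = {k: after[k] for k in after if k in before and after[k] != before[k]}
    return inserted, updated, deleted
```

The contrast shows why the book treats image comparison as a last resort: `changed_since` only needs the time stamp of each row, while `diff_images` needs two complete copies of the data.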
The third problem: change over time, which is hard to handle. Existing operational data is usually current-value data: it is accurate as of the moment of access and can be updated. Data in the data warehouse, however, cannot be updated, so it must carry an element of time. In practice, a large-scale transformation must be applied to the data as it passes from the operational system to the data warehouse. At this point you must also consider condensing the data. There is no way around it: data keeps accumulating over time, and the space of the data warehouse is limited!
So far we have covered three problems and their solutions, but that is not enough to build a data warehouse. Having studied this far, we still have not learned a concrete method. The following section will...!
Data / Process Models and the Architecture Design Method
First, two concepts: process modeling and data modeling. Put simply, process modeling is like the flowcharts we draw when programming: there is a beginning and an end. Data modeling is like being handed cabbage, radish, vinegar, salt and so on, and then being asked what you can make. Your natural answer: cabbage in vinegar sauce and radish soup. Don't ask why you should make these; with these ingredients, this is all you can make. :)
Process modeling is definitely not suitable for designing a data warehouse, because process modeling is driven by requirements: it assumes the requirements are already known when detailed design begins, but when building a data warehouse that assumption does not hold! The data model is much better, since it applies on both sides! (Hee hee, like universal glue.) When building the data model, there is no need to consider the differences between the existing operational systems and the data warehouse. What needs to be done seems very simple: build a corporate data model, build a data warehouse model, and preferably an operational data model as well. You can picture it like this:
Enterprise Model → Operational Model → Data Warehouse Model
All three are very important, and they differ from one another. (A bit like the chicken-and-egg relationship.)
As for the data model itself, modeling is divided into three levels: high-level modeling (the entity relationship diagram, ERD), mid-level modeling (the data item set, DIS) and low-level modeling (the physical model). Construction proceeds from the top down: first everyone sits together and agrees on a general architecture, then the design of the middle level starts (because the data required by the ERD cannot simply be extracted and needs some integration), and then the physical model is designed from the middle level (the data of the physical model can be obtained from the operational data).
I will not discuss this further and leave something for you to ponder yourself (this book is not a textbook that specializes in modeling, after all).
Feeling a little dizzy? What data modeling, what three levels? Don't worry: take these questions with you when you read the book and they will soon disappear. I recommend writing your questions down so that you do not forget them while reading. :)
Data modeling is also like doing a jigsaw puzzle. Each design is a unique piece, and once you have enough pieces, the puzzle comes together. (That is one task done.)
What has been introduced above is the design method for the data warehouse: data modeling. Now for several details of data warehouse design (this may be rather boring):
Normalization / denormalization
The purpose of this operation is to reduce the system's I/O time. The method can be summarized in two sentences: starting from the normalized tables, merge the tables that are usually accessed together, or introduce redundant data. Both are forms of denormalization.
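As a rough sketch of introducing redundant data, the example below copies customer columns into every order row, so that a later read needs one lookup instead of two. The table names, column names and values are all made up for illustration.

```python
# Normalized form: two separate tables (hypothetical sample data).
customers = {101: {"name": "Ma Lei", "city": "Beijing"}}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 101, "amount": 80.0},
]


def denormalize(orders, customers):
    """Copy the customer's columns into each order row (redundant on purpose)."""
    merged = []
    for order in orders:
        customer = customers[order["customer_id"]]
        merged.append({
            **order,
            "customer_name": customer["name"],
            "customer_city": customer["city"],
        })
    return merged


wide_orders = denormalize(orders, customers)
```

The price is that the customer's name and city are now stored once per order. That is acceptable in a warehouse, where data is loaded and read but not updated in place.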
Data warehouse snapshots
A snapshot is a detailed record of an event. Example: you spent a lot of money on something you love, then suddenly found that there was nothing left for living expenses for the second half of the month. That is the event, and the snapshot generated looks like this:
1. Time | 2. Key | 3. Location, items, ..., mood at the time of purchase | 4. Account balance, ..., mood after buying
It is not hard to see that segment 3 is the discrete primary data, and segment 4 is secondary data related to the event (associated circumstances, optional). A snapshot should be a faithful record of an event, and it should contain the following:
- A key.
- A unit of time.
- Primary data that relates only to the key.
- Secondary data captured at the moment the snapshot is taken, which has no direct relationship to the items above.
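The four components listed above map naturally onto a small record type. This is only a sketch: the class name `Snapshot`, the field names and the sample values are my own, chosen to match the shopping example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass(frozen=True)
class Snapshot:
    """One warehouse snapshot: key, unit of time, primary and secondary data."""
    key: str
    taken_at: datetime
    primary: Dict[str, Any]
    secondary: Dict[str, Any] = field(default_factory=dict)  # optional part


# Hypothetical snapshot of the shopping event described earlier.
purchase = Snapshot(
    key="purchase-0001",
    taken_at=datetime(2024, 3, 15, 18, 30),
    primary={"location": "mall", "item": "headphones", "mood": "happy"},
    secondary={"account_balance": 12.50, "mood_after": "worried"},
)
```

Marking the dataclass as frozen reflects the idea that a snapshot, once written, is not updated.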
Metadata
Metadata is data about the data (its history): for example, when data was loaded into the data warehouse the first time, the second time, and so on; where the source data is; what the data structures are; the extract history; and more.
Managing reference tables in the data warehouse
Reference data in the data warehouse is a kind of data yearbook. Its purpose is to provide a basis for reference, so generating reference data periodically can reduce the volume of data in the data warehouse. This is not hard to understand: with reference data available, there is naturally no need to keep all the old detail.
There are two ways to maintain a reference table:
1. Take a snapshot of the entire reference table at regular intervals.
2. Take one snapshot of the reference table as a baseline, then record every subsequent modification.
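The second approach (one baseline plus a log of modifications) can be sketched as follows. The log layout `(change_date, code, description)` and the use of `None` to mark a deleted code are assumptions made for this illustration.

```python
from datetime import date


def table_as_of(baseline, changes, when):
    """Rebuild the reference table as it looked on `when`.

    baseline: dict of code -> description (the single full snapshot)
    changes:  list of (change_date, code, description); None means deleted
    """
    table = dict(baseline)  # copy, so the baseline itself is never modified
    for change_date, code, description in sorted(changes, key=lambda c: c[0]):
        if change_date > when:
            break  # later changes had not happened yet
        if description is None:
            table.pop(code, None)
        else:
            table[code] = description
    return table
```

Compared with periodic full snapshots, this stores much less data, at the cost of replaying the log whenever an old state is needed.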
Data cycle
The so-called data cycle is the length of time from a change in the operational environment until that change is reflected in the data warehouse. For example, a bank customer moves; his new address is entered into the operational data, and some time later the data warehouse is updated as well. That interval is one data cycle.
So the question is: how long should this delay be? In principle, at least 24 hours. This is for the sake of data stability and cost.
Complexity of conversion and integration
There is a lot of material here, all of it piecemeal, more like a collection of experience, so I will leave it for you to study. (I am lazy.) This is one approach to building the database.
Triggering a data warehouse record
Writing a record to the data warehouse is triggered by an event, and it should be an important event. (Important enough that you cannot ignore it, heh, just as clicking a button pops up a dialog box.) When you capture the event, add a snapshot of it to the data warehouse. Very simple, isn't it? Perhaps you want to know what kind of event, and how it triggers? For example, an important customer calls you to change a delivery location. OK! Your reaction is to first find the shipping record and the customer record (this is the snapshot), modify the delivery location (secondary data), and write it to the data warehouse. Understood?
Managing the data warehouse
The purpose of management is to let go of the data that should go, keep the data that should stay, and summarize what should be summarized, so that data does not occupy valuable space. Heh, easier said than done: everyone knows that one day a user will go crazy and want to check the old accounts, and if something is wrong, that will be bad. So the correct method is: ·#¥%...!#. Didn't understand? Ah, sorry, that was a foreign language, hehe. To summarize in two points:
1. Use simple summary records to summarize and integrate the data. There is a question of the level of summarization: do not summarize everything in one pass, and do not throw away all the detail. Let the simple records provide the basis for the next round of summarization.
2. Keep backups of the data at the same time. This is the safest method: find a disk or a magnetic tape, write the data out, and lock it in the safe. What? Charge for it? I think that is a great idea: when the user wants to check old data, I can charge her. I even earn something. :)
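The first point, summarizing in stages so that each level of summary feeds the next, might look like the sketch below. The row layout `(day, amount)` and the choice of month and year as the two levels are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def summarize_by_month(daily_rows):
    """First pass: turn (day, amount) detail rows into per-month records."""
    summary = defaultdict(lambda: {"total": 0.0, "count": 0})
    for day, amount in daily_rows:
        bucket = summary[(day.year, day.month)]
        bucket["total"] += amount
        bucket["count"] += 1
    return dict(summary)


def summarize_by_year(monthly):
    """Second pass: built from the monthly records, not from the detail."""
    summary = defaultdict(lambda: {"total": 0.0, "count": 0})
    for (year, _month), record in monthly.items():
        summary[year]["total"] += record["total"]
        summary[year]["count"] += record["count"]
    return dict(summary)
```

Because the yearly summary only reads the monthly records, the daily detail can be moved to backup media once the monthly pass has run.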
After the discussion above, have you built up a general framework? Do you know what has to be considered for a data warehouse, and what kind of table structure suits it? To be blunt, I still cannot fully grasp what a data model is. Is it similar to an object in C or a structure in data structures? One thing I learned back in middle school: when designing, what you must know is what to consider, not how it is done. So understanding this thing all at once is impossible. It only comes through continuous practice; it is a process of accumulating experience, and even then you may still not be fully able to go and design a data warehouse. :) Rather disappointing? It doesn't matter. This is a process that needs to be repeated, and a 50% success rate is already good, so there is no need to worry. :P Well then, suppose we have considered every situation and built a perfect data warehouse (a bit shameless, hee hee) and start accessing it. You must keep one fact in mind: the data warehouse must contain the data you need, otherwise it has to be developed further. Then you can count, extract, calculate and so on, whatever you want!
Let's simulate: you are a bank employee who has received a customer's loan request. You must determine this customer's creditworthiness, personal assets and employment situation in order to decide whether to grant the loan. A very complex program does this in the background, and the corresponding data has also been prepared for this request in the data warehouse. The review is both comprehensive and very fast. At this point you must consider:
1. Repayment history.
2. Personal property.
3. Financial management.
4. Net worth.
5. Total income.
6. Total expenses.
7. Other intangible assets.
......
After complex calculations, the final result of the review is obtained, and much of the data needed for this process was prepared in the data warehouse. OK, now do you see that the data warehouse really is useful?
But consider the form of this data... Have you noticed that the final data is a synthesis that integrates many kinds of information? It is like a big pot of Laba porridge whose ingredients come from different places. In fact, this is an inevitable phenomenon in the data warehouse, called a star join. Its parts have names: the integrated table in the middle is the "fact table", and the ones surrounding it are the "dimension tables". There is one more thing to note: the primary keys of the dimension tables are included in the fact table. You may not have taken it in yet, but that is how it is.
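A star join can be sketched with plain dictionaries: the fact rows carry the primary keys of the dimension tables, and the join looks each key up. The loan example, the table names and all the values are made up for illustration.

```python
# Dimension tables, keyed by their primary keys (hypothetical data).
customer_dim = {1: {"customer_name": "Ma Lei", "occupation": "engineer"}}
branch_dim = {10: {"branch_city": "Beijing"}}

# Fact table: each row holds measures plus the dimension keys.
loan_facts = [
    {"customer_id": 1, "branch_id": 10, "loan_amount": 5000.0},
    {"customer_id": 1, "branch_id": 10, "loan_amount": 1200.0},
]


def star_join(facts, customer_dim, branch_dim):
    """Attach the matching dimension attributes to every fact row."""
    joined = []
    for fact in facts:
        row = dict(fact)
        row.update(customer_dim[fact["customer_id"]])
        row.update(branch_dim[fact["branch_id"]])
        joined.append(row)
    return joined


report = star_join(loan_facts, customer_dim, branch_dim)
```

Each row of `report` is the "porridge": one record that combines the measure from the middle with attributes from the surrounding tables.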
That covers the access techniques of the data warehouse.
Think about it, and if you figure it out, you can teach me. :)
Having understood the major elements of the data warehouse, OK, let's go on! The following sections go deeper into topics such as design details and management details. After reading them you need to think hard in order to get at the author's original meaning. Translation problems are one of the main reasons for that.
Let's take a look at the first question:
Data warehouse granularity
Granularity in the data warehouse refers to the level of detail of the data. To describe a situation I can use a great deal of data, or only the data that is required. This comes down to storage: if we had an infinitely large hard drive, there would be nothing we could not keep. Therefore, estimating the maximum and the minimum number of rows a table will hold within one year is the designer's biggest problem. One concept here: estimating upper and lower bounds. (Don't ask me, I don't understand it either.)
Then, by simple calculation, you can tell roughly how large the database will be, and adjust the strategy accordingly. Depending on the result, we can use dual granularity or single granularity.
Dual granularity is the best way to reduce the volume of data, and most companies use it. To analyze it: dual granularity comprises a lightly summarized level and a true archival detail level. Note that it is meaningless to build lightly summarized data at a very low level of detail; conversely, it is useless to build it at too high a level of detail. So the granularity must be evaluated before the best summarization plan can be found. The ridiculous thing is that this is guesswork, with no guarantee of correctness. Nothing can be done about that; who told us to work on something whose answer we do not know in advance, unlike an equation whose result is known. But you can show your result to the user and let her judge the quality. Don't expect 100% approval; 50% is already very good. :)
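The "simple calculation" can be sketched as rows times bytes per row, once for the lower and once for the upper row estimate. The decision rule at the end, with its caller-supplied threshold, is purely a placeholder of mine; the book's own criteria are not reproduced here.

```python
def size_bounds(min_rows, max_rows, bytes_per_row, years=1):
    """Return (lower, upper) estimated table size in bytes over the horizon."""
    lower = min_rows * bytes_per_row * years
    upper = max_rows * bytes_per_row * years
    return lower, upper


def suggest_granularity(upper_bound_bytes, threshold_bytes):
    """Hypothetical rule: go dual once the upper bound exceeds the threshold."""
    return "dual" if upper_bound_bytes > threshold_bytes else "single"
```

For instance, a table estimated at 1,000 to 10,000 rows per year with 100 bytes per row gives bounds of 100,000 and 1,000,000 bytes for one year, and it is the upper bound that should drive the decision.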
Some feedback techniques and an example are given on page 90 of the book; you can refer to them.
If data granularity teaches you how to build a data warehouse, the next topic teaches you how to manage it!
The data warehouse and technology
There are many management techniques here that I do not understand, hee hee, for example: through addressing, through indexing, through extension of data, through effective overflow management... Management here includes the ability to manage large amounts of data and the ability to manage the capacity of the data warehouse. Any technology that claims to support data warehouses must meet the requirements of efficiency.
You have to manage multiple media: main memory, expanded memory, cache, DASD (such as hard drives), optical disks, tape...
The soul of the data warehouse is flexible and unpredictable access to data. Don't understand? Put simply, it can examine all past data and provide a basis for analysis. If the data in the data warehouse cannot be indexed conveniently, the data warehouse is not a success. Secondary indexes, dynamic indexes, temporary indexes and so on are commonly used.
Interfaces to many technologies: I do not need to explain this; you should understand it.
Control over the placement of data: the data warehouse must have a complete mechanism for deciding where data is stored, preferably an automatic one!
Parallel storage and management of data: assuming that accesses to the data are evenly distributed, the performance improvement is proportional to the number of physical devices across which the data is spread.
Metadata management: remember this fact, however good the house, without the key you cannot get in!! Hence managing the metadata is even more important than managing the data in the warehouse. Metadata includes table structures, attributes, source data (the system of record), mappings, the data model description, the extract log, and shared routines.
Language interface: an SQL language interface. You need a front-end program that can insert, delete...
Efficient loading of data: think about it yourself. (What? The teacher is lazy, so I am lazy too.) There is not much to say here; different environments require different handling.
Efficient use of indexes, data compression, compound keys, variable-length data, lock management, fast recovery: I will not say more, as this is beyond what I understand.
DBMS types and the data warehouse
The multidimensional database management system (commonly known as the "data mart") provides an information system structure that allows very flexible access to the data. If I have not misunderstood, the data mart provides a scheme for managing and examining data and sits on top of the data warehouse, so the data warehouse can be said to be the main data source for the data mart. The difference between the two is a difference in granularity: the granularity in the data warehouse is fine, while that of the multidimensional DBMS is coarse. Of course, this is deliberate: it is not only about how long data is kept, but also about how much data can be held. There are many other points here as well:
For example:
- Supports dynamic joining of data.
- Supports general-purpose update processing.
- The relationships are clear.
So is it perfect? Actually not! There are many shortcomings to overcome:
- Cannot support as much data as a relational database.
- Does not support general-purpose update processing.
- Loading takes a long time.
- The structure is not flexible.
- Dynamic support is problematic.
These are a few impressions after reading about the data warehouse. Let's study it together, haha...