Reading Notes on Data Mining: Concepts and Techniques


1. The challenge of the times

Over the past decade, the use of information technology to produce and collect data has grown dramatically. Thousands of databases serve business management, government administration, scientific research, engineering development, and more, and this momentum will continue. This raises a new challenge: in the era of the information explosion, information overload is a problem almost everyone must face. How can we avoid drowning in the flood of information, discover useful knowledge within it, and improve the utilization of information? For data to truly become a company's resource, it must be put fully to work for the company's business decisions and strategic development; otherwise, masses of data may become a burden, even garbage. Necessity is the mother of invention. Facing the challenge of being "drowned in data yet starved of knowledge," data mining and knowledge discovery in databases (DMKD) came into being, and it is flourishing, increasingly showing its powerful vitality.

The knowledge referred to here is not universal truth that holds everywhere, nor newly discovered laws of natural science, pure mathematical formulas, or machine-generated theorem proofs. In fact, all discovered knowledge is relative: it has specific premises and constraints and is oriented toward a specific domain, and it should be easy for users to understand, ideally expressed in natural language.

2. Historical necessity

The evolution from business data to business information has proceeded in stages, each built on the previous one (see the table below). From the table we can see that the fourth evolutionary step is revolutionary because, from the user's point of view, data technology at this stage can quickly answer many business questions.

The table also makes clear that the birth of data mining was a historical inevitability, in line with the objective development of human society's understanding and use of data. Seen in this light, the prospects of data mining, which is only just beginning to become popular, remain very promising.

| Evolutionary stage | Business question | Supporting technology | Product vendors | Characteristics |
|---|---|---|---|---|
| Data collection (1960s) | "What was my total revenue over the past five years?" | Computers, tapes, and disks | IBM, CDC | Retrospective, static data delivery |
| Data access (1980s) | "What were the sales in New York last year?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at the record level |
| Data warehousing and decision support (1990s) | "What were the sales in New York, and what were the sales last March? What conclusions can I draw from this?" | Online analytical processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, MicroStrategy | Retrospective, dynamic data delivery at multiple levels |
| Data mining (becoming popular today) | "What are sales in Los Angeles likely to be next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, and other startups | Prospective, predictive information delivery |

3. Definition of data mining

Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. Many terms carry a similar meaning, such as knowledge discovery in databases (KDD), data analysis, data fusion, and decision support. People regard the raw data as the source of knowledge, just as ore is the source of metal. The raw data may be structured, such as records in a relational database; semi-structured, such as text, graphics, and image data; or even heterogeneous data distributed across a network. The methods for discovering knowledge may be mathematical or non-mathematical, deductive or inductive. The discovered knowledge can be used for information management, query optimization, decision support, and process control, or to maintain the data itself. Data mining is therefore a very broad interdisciplinary field, bringing together scholars and engineers from different areas, especially databases, artificial intelligence, mathematical statistics, visualization, and parallel computing.

In short, data mining is a deep form of data analysis. Data analysis itself has a history of many years, but in the past, data collection and analysis served scientific research; moreover, limits on computing power greatly restricted the use of complex analysis methods on large data sets. Now, owing to business automation, commercial sectors produce vast amounts of business data that are no longer collected for the purpose of analysis but arise as a byproduct of business operations. Analyzing such data is no longer simply for research; above all, it provides genuinely valuable information for business decisions and, in turn, yields profit. But a common problem faced by all enterprises is that the volume of enterprise data is enormous while the truly valuable information is scarce; the task is therefore to extract, from massive data, the knowledge that benefits business operations and improves competitiveness, like extracting gold from ore. Hence the name "data mining."

4. Types of knowledge discovered by data mining

4.1 Generalization Knowledge (Generalization)

Generalized knowledge is a concise, general description of the characteristics of a category. It generalizes the microscopic properties of the data into characteristic, universal knowledge at higher conceptual levels, from the meso level up to the macro level, reflecting the common nature of similar things; it is a summarization, refinement, and abstraction of the data.

There are many methods and implementation techniques for discovering generalized knowledge, such as the data cube and attribute-oriented induction. The data cube goes by other names, such as "multidimensional database," "materialized view," and "OLAP." The basic idea of this approach is to materialize the computation of commonly used aggregate functions, such as count, sum, average, and max, and to store these materialized results in a multidimensional database. Since many aggregate functions are needed repeatedly, storing the precomputed results in the multidimensional data cube guarantees fast response and can flexibly provide data views from different angles and at different levels of abstraction. Another method for discovering generalized knowledge is the attribute-oriented induction approach proposed at Simon Fraser University in Canada. This method expresses data mining queries in an SQL-like language, collects the relevant data set from the database, and then applies a series of data generalization techniques to it, including attribute removal, concept tree ascension, attribute threshold control, and the propagation of counts and other aggregate functions.
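As an illustration of the data cube idea, here is a minimal sketch using pandas (a tool choice of this note, not one named in the text): one commonly needed aggregate is precomputed and stored as a multidimensional view. The column names and figures are invented toy data.

```python
import pandas as pd

# Toy sales records; in a real cube these would come from a fact table.
df = pd.DataFrame({
    "region":  ["NY", "NY", "LA", "LA", "NY", "LA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "sales":   [100, 150, 80, 120, 60, 40],
})

# Precompute an aggregate view (region x quarter -> total sales),
# analogous to materializing one cuboid of a data cube.
cube = pd.pivot_table(df, values="sales", index="region",
                      columns="quarter", aggfunc="sum", margins=True)
print(cube)
```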

4.2 Association Knowledge (Association)

Association knowledge reflects dependencies or associations between one event and other events. If an association exists between two or more attributes, the value of one attribute can be predicted from the values of the others. The best-known association rule discovery method is the Apriori algorithm proposed by R. Agrawal. Association rule discovery can be divided into two steps: the first step iteratively identifies all frequent itemsets, requiring that the support of each frequent itemset be no lower than a user-set minimum; the second step constructs, from the frequent itemsets, the rules whose confidence is no lower than a user-set minimum. Identifying all frequent itemsets is the core of the association rule discovery algorithm and accounts for the bulk of its computation.
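A minimal sketch of the second step, in Python with invented itemsets and supports (assumed to be the output of step one): rules are kept only when their confidence clears the minimum.

```python
from itertools import combinations

# Hypothetical frequent itemsets with their supports (step 1 output).
supports = {
    frozenset(["bread"]): 0.6,
    frozenset(["milk"]): 0.7,
    frozenset(["bread", "milk"]): 0.5,
}

def rules(supports, min_conf=0.7):
    """Step 2: derive rules X -> Y with confidence >= min_conf."""
    out = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / supports[lhs]  # conf(X->Y) = sup(X u Y) / sup(X)
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

print(rules(supports))
```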

4.3 Classification and Cluster Knowledge (Classification & Clustering)

Classification and clustering knowledge reflects the common features of things of the same class and the distinguishing features between things of different classes. The most typical classification method is based on decision trees: it constructs a decision tree from examples and is a supervised learning method. The method first builds a decision tree from a subset of the training set (called the window). If the tree cannot assign all objects to the correct class, some of the misclassified objects are added to the window, and the process is repeated until a correct decision set is formed. The final result is a tree whose leaf nodes are class names, whose internal nodes are branching attributes, and whose edges are the possible values of those attributes. The most typical decision tree learning system is ID3, which uses a top-down, non-backtracking strategy that guarantees a simple tree. The algorithms C4.5 and C5.0 are both extensions of ID3, extending classification from categorical attributes to numeric attributes. Data classification methods also include statistical approaches and rough sets. Linear regression and linear discriminant analysis are typical statistical models. To reduce the cost of decision tree generation, interval classifiers have also been proposed. Recently, the use of neural network methods to extract classification knowledge and rules from databases has also been studied.
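The heart of ID3 is choosing the branching attribute by information gain. The following is a minimal Python sketch of that computation on invented toy data; it is only the attribute-selection step, not the full windowed tree-building loop described above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Information gain from splitting on the attribute at attr_index."""
    base = entropy(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return base - remainder

# Toy data: attributes (outlook, windy); label = whether to play.
rows = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["no", "no", "yes", "no"]
print(info_gain(rows, labels, 0), info_gain(rows, labels, 1))
```

ID3 would branch on whichever attribute yields the highest gain, then recurse on each branch.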

4.4 Predictive Knowledge (Prediction)

Predictive knowledge infers future data from time-series data, estimating future values from historical and current values; it can also be viewed as association knowledge in which time is the key attribute.

Current time-series prediction methods include classical statistical methods, neural networks, and machine learning. In 1968, Box and Jenkins proposed a fairly complete theory and methodology for time-series modeling and analysis. These classical mathematical methods predict time series by building stochastic models such as the autoregressive model, the autoregressive moving-average model, the autoregressive integrated moving-average model, and seasonal adjustment models. Since a large number of time series are non-stationary, their characteristic parameters and data distributions change over time; therefore, a single neural network prediction model trained only on a fixed stretch of historical data cannot accomplish accurate prediction. For this reason, retraining methods based on statistics and prediction accuracy have been proposed: when the current prediction model no longer fits the current data, the model is retrained to obtain new weight parameters, and a new model is established. Many systems also perform time-series prediction with the help of parallel algorithms.
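As a toy stand-in for the Box-Jenkins family, here is a minimal autoregressive AR(p) fit by ordinary least squares with NumPy; the signal is synthetic and the order p = 3 is an arbitrary assumption.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model x_t = a1*x_{t-1} + ... + ap*x_{t-p} + c
    by ordinary least squares (a minimal sketch, not full Box-Jenkins)."""
    X = np.column_stack([series[p - i - 1:len(series) - i - 1] for i in range(p)])
    X = np.column_stack([X, np.ones(len(X))])  # intercept column
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(series, coef, p):
    lags = series[-1:-p - 1:-1]            # x_{t-1}, ..., x_{t-p}
    return lags @ coef[:-1] + coef[-1]

t = np.arange(100)
series = np.sin(0.3 * t) + 0.05 * np.random.randn(100)  # toy signal
coef = fit_ar(series, p=3)
print("next value forecast:", predict_next(series, coef, p=3))
```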

4.5 Deviation Knowledge (Deviation)

In addition, other types of knowledge can be discovered, such as deviation knowledge (Deviation), which describes differences and extreme cases and reveals abnormal departures of things from the norm, such as special cases outside a standard class or outliers beyond the data clusters.

All of these kinds of knowledge can be discovered at different conceptual levels; as the conceptual level rises from micro through meso to macro, they can meet the decision-making needs of different users at different levels.

5. Common techniques for data mining

5.1 Artificial Neural Network

Artificial neural networks are nonlinear predictive models that imitate the structure of biological neural networks and are used for pattern recognition. Roughly speaking, a neural network is a set of connected input/output units, where each connection is associated with a weight. In the learning phase, the network learns by adjusting the weights so that it can correctly predict the class labels of the input samples. Because of the connections between units, neural network learning is also called connectionist learning. Its advantages include a high tolerance for noisy data and the ability to classify patterns on which it has not been trained.
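A minimal sketch of the weight-adjustment idea, using a single perceptron unit learning logical AND; the learning rate and epoch count are arbitrary toy settings.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Single-unit network: adjust weights whenever the unit's
    prediction disagrees with the training label (y in {0, 1})."""
    w = np.zeros(X.shape[1] + 1)            # weights plus a bias term
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w[:-1] + w[-1] > 0 else 0
            err = target - pred
            w[:-1] += lr * err * xi          # weight update rule
            w[-1] += lr * err                # bias update
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                   # logical AND
print("weights:", train_perceptron(X, y))
```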

5.2 Decision Tree

"What is the determination tree?" Decision Tree is a tree structure similar to the flowchart. It is similar to the concept of binary decision tree in the data structure. Each internal node represents a test on an attribute, each branch represents a test output, while each leaf node represents a class or class distribution. The top of the tree is the root node.

5.3 Genetic Algorithm

Genetic algorithms are optimization techniques based on evolutionary theory, adopting designs modeled on genetic recombination, genetic mutation, and natural selection. Following the principle of survival of the fittest, a new population is formed from the fittest rules in the current population together with the offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples. Offspring are created by genetic operations such as crossover and mutation. A toy sketch follows.
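This sketch evolves bit strings toward all ones, with the count of 1-bits standing in for rule fitness; population size, mutation rate, and generation count are arbitrary assumptions.

```python
import random

def evolve(pop_size=20, length=16, gens=50, p_mut=0.05):
    """Toy GA maximizing the number of 1-bits in a bit string."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    fitness = sum                              # fitness = count of 1-bits
    for _ in range(gens):
        # Selection: keep the fitter half of the population as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with small probability.
            child = [bit ^ (random.random() < p_mut) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())
```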

5.4 Nearest-Neighbor Algorithm

The nearest-neighbor method classifies each record in a data set by analogy with the records most similar to it. Nearest-neighbor classifiers are instance-based, or lazy, learners: they store all the training samples and do not build a classifier until a new (unlabeled) sample needs to be classified. The method can also be used for prediction, that is, returning a real-valued estimate for a given unknown sample.
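A minimal Python sketch of the lazy strategy: all training samples are stored, and classification happens only when a query arrives, by majority vote among the k nearest points. The coordinates and labels are invented.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Lazy learner: store all samples; classify a query by majority
    vote among its k nearest training points (Euclidean distance)."""
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_classify(train, (1.1, 1.0)))  # -> "A"
```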

5.5 Apriori Algorithm

Apriori is the most influential algorithm for mining frequent itemsets for Boolean association rules. The algorithm's name is based on the fact that it uses prior knowledge of the properties of frequent itemsets. It employs an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; that set is then used to find the set of frequent 2-itemsets, and so on iteratively, until no more frequent k-itemsets can be found. Finally, association rules are generated from the frequent itemsets.
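A minimal level-wise sketch in Python on invented transactions; for brevity it performs only the join step when forming candidates, omitting Apriori's subset-based prune step.

```python
def apriori_itemsets(transactions, min_support=0.5):
    """Level-wise search: frequent k-itemsets generate (k+1)-candidates."""
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in level}
    k = 2
    while level:
        # Join step: union pairs of frequent itemsets into k-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent

tx = [frozenset(t) for t in (["bread", "milk"], ["bread", "beer"],
                             ["bread", "milk", "beer"], ["milk"])]
print(apriori_itemsets(tx))
```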

5.6 Frequent Pattern Growth (FP-growth)

In contrast with the method above, FP-growth mines frequent itemsets without generating candidates. It constructs a highly compressed data structure (the FP-tree) that compresses the original transaction database. It focuses on frequent-pattern growth, avoiding costly candidate generation and achieving better efficiency.
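FP-tree construction is too long to sketch here, but assuming the third-party mlxtend library is available (an assumption of this note, not a tool named in the text), mining without candidate generation looks roughly like this on invented transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"], ["bread", "beer"],
                ["bread", "milk", "beer"], ["milk"]]

# One-hot encode the transactions, then mine frequent itemsets
# without explicit candidate generation.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```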

Specialized analysis tools employing the above techniques have been in development for roughly a decade, but the volumes of data those tools faced were usually small. Today these techniques have been integrated directly into many large industry-standard data warehouse and online analysis systems.

6. Functions of data mining

By predicting future trends and behaviors, data mining supports forward-looking, knowledge-based decisions. The goal of data mining is to discover implicit, meaningful knowledge in databases; its main functions are the following.

6.1 Automatic prediction of trends and behaviors

Data mining automatically searches large databases for predictive information; questions that used to require extensive manual analysis can now be answered quickly and directly from the data itself. A typical example is the market-prediction problem: data mining can use past data to identify the users likely to yield the greatest return on future investment. Other predictable problems include forecasting bankruptcy and identifying the groups most likely to respond to a given event.

6.2 Association analysis

Data association is an important type of knowledge that exists in databases. If there is regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to uncover the association network hidden in a database. Since the association functions among the data are often unknown, or known only with uncertainty, the rules produced by association analysis carry confidence and support measures.

6.3 Correlation analysis

Many attributes in the data may be irrelevant to the classification or prediction task. For example, the day of the week on which a bank loan application was filed is probably unrelated to the success of the application. Other attributes may be redundant. Correlation analysis can therefore be performed to delete irrelevant or redundant attributes during the learning process; in machine learning this process is called feature selection.
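A minimal filter-style sketch of feature selection: keep only features whose absolute Pearson correlation with the target clears a threshold. The threshold and data are invented toy settings.

```python
import numpy as np

def select_features(X, y, threshold=0.2):
    """Keep features whose absolute correlation with the target
    exceeds a threshold -- a simple filter-style selection."""
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
informative = rng.normal(size=100)    # relevant attribute
noise = rng.normal(size=100)          # irrelevant attribute
X = np.column_stack([informative, noise])
y = 2 * informative + 0.1 * rng.normal(size=100)
print("selected feature indices:", select_features(X, y))  # likely [0]
```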

6.4 Cluster Analysis

Records in a database can be divided into a series of meaningful subsets, i.e., clusters. Clustering enhances people's understanding of objective reality and is a prerequisite for concept description and deviation analysis. Clustering techniques mainly include partitioning methods, hierarchical methods, density-based methods, and model-based methods. Some clustering algorithms also combine the ideas of several of these approaches.
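A minimal sketch of a partitioning method, k-means, alternating assignment and centroid-update steps on synthetic two-cluster data; k and the iteration count are arbitrary.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: assign each point to its nearest center,
    then recompute each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)
```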

6.5 Concept Description

Concept description characterizes the intension of a class of objects and summarizes the relevant features of that class. Concept description divides into characteristic description and discriminant description: the former describes the common features of a class of objects, while the latter describes the differences between distinct classes of objects. Generating a characteristic description of a class involves only the commonalities of the objects in that class, abstracting a large task stated at lower conceptual levels up to higher conceptual levels. Methods that remain effective and flexible on large data sets fall into two categories: (1) the data cube (OLAP) approach and (2) the attribute-oriented induction approach. There are many ways to generate discriminant descriptions, such as decision tree methods and genetic algorithms.

6.6 Deviation detection

Databases often contain anomalous records, and detecting these deviations from the database is meaningful. Deviations include much potential knowledge, such as anomalous instances outside the classes, exceptions to rules, differences between observed and model-predicted values, and shifts of quantities over time. The basic method of deviation detection is to look for meaningful differences between observed values and reference values.
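A minimal sketch of that basic method, using the sample mean as the reference value and flagging observations more than three standard deviations away; the data and threshold are invented.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag observations far from the reference value (here, the mean),
    measured in units of standard deviation."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.flatnonzero(np.abs(z) > threshold)

data = np.concatenate([np.random.normal(100, 5, 500), [160.0]])  # one anomaly
print("outlier indices:", zscore_outliers(data))
```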

