Data Mining Article - Data Mining Technology Introduction [Repost]


I have recently been learning about data mining; some articles are collected here for study and reference.

Summary: Data mining is an important new research area. This paper introduces the concept, purpose, common methods, process, and software of data mining, and gives an introduction to, and outlook on, the issues facing the data mining field.

Keywords: data mining; data collection

1. Introduction

Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. With the rapid development of information technology, the amount of data people accumulate has grown sharply, and how to extract useful knowledge from massive data measured in terabytes has become a pressing problem. Data mining is a data processing technology that emerged to meet this need, and it is a key step of knowledge discovery in databases (KDD).

2. Data mining tasks

The main tasks of data mining are association analysis, cluster analysis, classification, prediction, time-series pattern mining, and deviation analysis.

(1) Association analysis

Association rule mining was first proposed by Rakesh Agrawal et al. An association is a regularity that exists between the values of two or more variables. Data associations are an important class of knowledge discoverable in databases. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to find the association networks hidden in a database. Two thresholds, support and confidence, are commonly used to screen association rules, and further measures such as interest and correlation have been introduced so that the mined rules better match users' requirements.
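As an illustration, the following minimal Python sketch computes support and confidence for single-item rules over a made-up market-basket data set; the transactions, items, and thresholds are assumptions chosen for illustration, not examples from this article:

```python
# A minimal support/confidence sketch for association rules.
# The transactions and thresholds below are made-up assumptions.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6
n = len(transactions)

# Count how often each single item and each item pair occurs.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))

# Keep rules lhs -> rhs whose support and confidence pass both thresholds.
for (a, b), count in pair_counts.items():
    support = count / n
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
            print(f"{lhs} -> {rhs}: support={support:.2f}, "
                  f"confidence={confidence:.2f}")
```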

(2) Cluster analysis (Clustering)

Clustering groups data into several classes according to similarity, so that data in the same class are as similar as possible while data in different classes differ as much as possible. Cluster analysis can establish macro-level concepts, discover the distribution patterns of the data, and reveal possible relationships among data attributes.
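The article does not prescribe a particular algorithm; as one common example, here is a minimal k-means sketch over made-up two-dimensional points (the points, k, and iteration count are all assumptions):

```python
# A minimal k-means clustering sketch; points and k are made-up assumptions.
import random

def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centers, clusters = kmeans(points, k=2)
print(centers)   # two centers, one near each group of similar points
```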

(3) Classification

Classification finds a concept description for a given class of data that represents the overall information of that class, i.e., its intensional description, and uses this description to construct a model, generally expressed as rules or as a decision tree. Classification derives classification rules from a training data set by means of some algorithm, and can be used both for rule description and for prediction.
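As a sketch of building such a model, the following assumes the scikit-learn library is available and fits a small decision tree to a made-up, numerically encoded training set; the features and labels are assumptions for illustration only:

```python
# A minimal classification sketch; assumes scikit-learn is installed.
# The tiny encoded training set below is a made-up assumption.
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded training records: [outlook (0=sunny, 1=rain), temperature]
X = [[0, 30], [0, 25], [1, 20], [1, 15], [0, 18], [1, 28]]
y = ["no", "no", "yes", "yes", "yes", "no"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(model, feature_names=["outlook", "temperature"]))
print(model.predict([[1, 22]]))  # classify a new, unseen record
```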

(4) Prediction

Prediction uses historical data to find patterns of change and build a model, and then uses this model to predict the types and characteristics of future data. Prediction is concerned with accuracy and uncertainty, which are usually measured by the prediction variance.
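A minimal sketch of this idea follows: it fits a simple linear model to made-up historical values by least squares and reports the residual variance as a rough uncertainty measure (the series and the forecast horizon are assumptions):

```python
# A minimal prediction sketch on made-up historical values.
xs = [1, 2, 3, 4, 5]            # assumed time periods
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # assumed observed values

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

predict = lambda x: intercept + slope * x
# Residual variance serves as a simple measure of prediction uncertainty.
variance = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / n

print(f"forecast for period 6: {predict(6):.2f}, "
      f"residual variance: {variance:.3f}")
```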

(5) Time-series pattern (Time-Series Pattern)

A time-series pattern is a pattern whose rate of repeated occurrence is high, found by searching through time-series data. Like regression, it also uses known data to predict future values; the difference is that here the data are ordered with time as the variable.
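A minimal sketch of the idea: count how often fixed-length subsequences repeat in a discretized series (the symbol sequence, window length, and threshold are all made-up assumptions):

```python
# A minimal repeating-pattern sketch over an assumed discretized series.
from collections import Counter

series = list("ABABCABABCABAB")   # e.g. made-up daily states
WINDOW, MIN_COUNT = 3, 3

# Count every subsequence of length WINDOW and report the frequent ones.
counts = Counter(tuple(series[i:i + WINDOW])
                 for i in range(len(series) - WINDOW + 1))
for pattern, count in counts.items():
    if count >= MIN_COUNT:
        print("".join(pattern), "occurs", count, "times")
```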

(6) Deviation analysis (Deviation)

Deviations contain much useful knowledge, and database data contain many abnormal situations; discovering these anomalies is very important. The basic method of deviation detection is to find the difference between an observed value and a reference value.
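A minimal sketch of this: flag observations that deviate from a reference value (here the sample mean) by more than two standard deviations; the numbers and the threshold are made-up assumptions:

```python
# A minimal deviation-detection sketch on made-up observations.
from statistics import mean, stdev

observations = [10.1, 9.8, 10.3, 10.0, 25.7, 9.9, 10.2]
mu, sigma = mean(observations), stdev(observations)

for x in observations:
    z = (x - mu) / sigma
    if abs(z) > 2:   # more than two standard deviations from the reference
        print(f"possible anomaly: {x} (z = {z:.2f})")
```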

3. Data mining objects

According to the storage format of the information, the objects of mining include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Web.

4. Data mining process

(1) Defining the problem: clearly define the business problem and determine the purpose of data mining.

(2) Data preparation: this includes data selection, i.e., extracting the target data set for mining from large databases and data warehouses, and data preprocessing, i.e., repairing the data, which covers checking data integrity and consistency, de-noising, filling in missing fields, deleting invalid data, and so on.
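A minimal preprocessing sketch in Python follows; it assumes the pandas library is available, and the small customer-style table, the valid age range, and the fill strategies are all made-up assumptions:

```python
# A minimal data-preprocessing sketch; assumes pandas is installed.
# The table, valid range, and fill strategies are made-up assumptions.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 41, 41, 230],   # a missing value and an invalid value
    "income": [3000, 4200, None, None, 5100],
})

df = df.drop_duplicates()                               # remove duplicate records
df = df[df["age"].between(0, 120) | df["age"].isna()]   # delete invalid data
df["age"] = df["age"].fillna(df["age"].mean())          # fill missing fields
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```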

(3) Data mining: select an appropriate algorithm according to the type of function required and the characteristics of the data, and perform mining on the cleaned and transformed data set.

(4) Result analysis: interpret and evaluate the mining results, and convert them into knowledge that can ultimately be understood by users.

(5) Knowledge application: integrate the knowledge obtained from the analysis into the organizational structure of the business information system.

5. Data mining methods

(1) Neural network method

Because of its good robustness, self-organizing adaptability, parallel processing, distributed storage, and high fault tolerance, the neural network is well suited to solving data mining problems, so it has received more and more attention in recent years. Typical neural network models fall into three categories: feedforward neural network models, represented by the perceptron, the BP back-propagation model, and functional networks, used for classification, prediction, and pattern recognition; feedback neural network models, represented by Hopfield's discrete and continuous models, used for associative memory and optimization computation; and self-organizing models, represented by the ART model and the Kohonen self-organizing map, used for clustering. The drawback of the neural network approach is its "black box" nature: it is difficult for people to understand the network's learning and decision-making process.
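The simplest feedforward model mentioned above, the perceptron, can be sketched in a few lines; the training set (the logical AND function), the learning rate, and the epoch count are assumptions chosen purely for illustration:

```python
# A minimal perceptron sketch; the AND data set and learning rate are
# made-up assumptions for illustration.
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0),
        ((1.0, 0.0), 0), ((1.0, 1.0), 1)]   # logical AND
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):                          # training epochs
    for (x1, x2), target in data:
        out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = target - out                   # perceptron learning rule
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

print(w, b)   # the learned weights separate the two classes
```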

(2) Genetic algorithm

The genetic algorithm is a random search algorithm based on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Genetic algorithms possess implicit parallelism and combine easily with other models, which has led to their application in data mining.

Sunil has successfully developed a data mining tool based on the genetic algorithm and used it to perform data mining experiments on real databases of two aircraft crashes; the results indicate that the genetic algorithm is one of the effective methods for data mining. Applications of genetic algorithms are also reflected in their combination with neural networks and rough sets: for example, using a genetic algorithm to optimize a neural network's structure, deleting redundant connections and hidden units without increasing the error rate; or combining the genetic algorithm with the BP algorithm to train a neural network and then extracting rules from the network. However, the genetic algorithm is relatively complex, and its problem of premature convergence to local minima has not yet been resolved.
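A toy genetic-algorithm sketch follows to make the selection, crossover, and mutation loop concrete; the objective function, population size, and mutation scale are all assumptions, not parameters from the experiments cited above:

```python
# A tiny genetic-algorithm sketch maximizing an assumed objective.
import math
import random

def fitness(x):
    return x * math.sin(x)        # assumed objective on [0, 10]

pop = [random.uniform(0, 10) for _ in range(30)]
for _ in range(50):               # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]            # selection: keep the fittest
    children = []
    while len(children) < 20:
        a, b = random.sample(parents, 2)
        child = (a + b) / 2                 # crossover: blend two parents
        child += random.gauss(0, 0.3)       # mutation: small random change
        children.append(min(max(child, 0.0), 10.0))
    pop = parents + children

best = max(pop, key=fitness)
print(f"best x = {best:.3f}, fitness = {fitness(best):.3f}")
```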

(3) Decision tree method

The decision tree is an algorithm commonly used in predictive models; it finds valuable, potential information by purposefully classifying a large amount of data. Its main advantages are a simple description and fast classification, making it especially suitable for large-scale data processing. The most influential and earliest decision tree method is Quinlan's famous ID3 algorithm, based on information entropy. Its main problems are: ID3 is a non-incremental learning algorithm; the ID3 decision tree is a single-variable decision tree, which makes it difficult to express complex concepts; the relationships among attributes of the same class are not emphasized enough; and its noise resistance is poor. To address these problems, many improved algorithms have appeared, such as the ID4 incremental learning algorithm designed by Schlimmer and Fisher, and the IBLE algorithm proposed by Zhong Ming, Chen Wenwei, et al.
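The information-entropy calculation at the heart of ID3 can be sketched as follows; the miniature weather-style data set is a made-up assumption for illustration:

```python
# A minimal sketch of ID3's entropy and information-gain computation.
# The tiny weather-style data set is a made-up assumption.
from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# records: (outlook, play?)
records = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"),
           ("rain", "yes"), ("overcast", "yes"), ("rain", "no")]
labels = [label for _, label in records]

# Information gain of splitting on the "outlook" attribute.
split = {}
for value, label in records:
    split.setdefault(value, []).append(label)
gain = entropy(labels) - sum(len(s) / len(records) * entropy(s)
                             for s in split.values())
print(f"information gain of 'outlook': {gain:.3f}")
```

ID3 computes this gain for every candidate attribute and splits on the one with the largest value, recursing until the subsets are pure.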

(4) Rough set method

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. It has several advantages: it requires no additional information; it simplifies the expression space of the input information; and its algorithms are simple and easy to operate. The object processed by rough sets is an information table similar to a two-dimensional relational table, and mature relational database management systems together with newly developed data warehouse management systems have laid a solid foundation for rough-set-based data mining. However, the mathematical foundation of rough sets is set theory, which makes it difficult to process continuous attributes directly, yet continuous attributes are commonplace in real information tables. The discretization of continuous attributes is therefore the difficulty restricting the practical application of rough set theory. Some rough-set-based tools have already been applied, such as KDD-R, developed by the University of Regina in Canada, and LERS, developed by the University of Kansas in the USA.
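The following minimal sketch shows the lower and upper approximations that rough set theory builds from an information table; the table and the target concept are made-up assumptions:

```python
# A minimal rough-set sketch: lower/upper approximations of a concept.
# The information table and target set are made-up assumptions.
records = {   # object -> value of one condition attribute
    "o1": "high", "o2": "high", "o3": "low", "o4": "low", "o5": "mid",
}
target = {"o1", "o2", "o3"}   # the concept to approximate

# Equivalence classes: objects indistinguishable by the attribute.
classes = {}
for obj, value in records.items():
    classes.setdefault(value, set()).add(obj)

lower = {o for c in classes.values() if c <= target for o in c}
upper = {o for c in classes.values() if c & target for o in c}

print("lower approximation:", lower)   # certainly in the concept
print("upper approximation:", upper)   # possibly in the concept
```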

(5) Covering positive examples and rejecting negative examples

This method finds rules by covering all positive examples while excluding all negative examples. First, a seed is chosen arbitrarily from the set of positive examples and compared one by one with the negative examples: selectors formed from field values that are compatible with a negative example are discarded, while incompatible ones are retained. Cycling through all positive-example seeds in this way yields the rules for the positive examples (conjunctions of selectors). Typical algorithms include Michalski's AQ11 method, Hong Jiarong's improved AQ15 method, and his AE5 method.

(6) Statistical analysis method

There are two kinds of relationships between database field items: functional relationships (deterministic relationships that can be expressed by function formulas) and correlation relationships (relationships that cannot be expressed by function formulas but are still deterministic in a related sense). Analyzing them relies on statistical methods, i.e., using statistical principles to analyze the information in the database. Common operations include statistical aggregation over large amounts of data (maximum, minimum, sum, mean, etc.), regression analysis (using a regression equation to express the relationship among variables), correlation analysis (using the correlation coefficient to measure the degree of correlation between variables), and difference analysis (using sample statistics to determine whether there is a difference among overall population parameters).
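A minimal sketch of two of these operations, aggregation and correlation analysis, computed on made-up advertising and sales figures (the data and field names are assumptions):

```python
# A minimal statistical-analysis sketch on made-up figures.
from statistics import mean

ads   = [10, 20, 30, 40, 50]   # assumed advertising spend per period
sales = [12, 25, 31, 45, 52]   # assumed sales per period

# Aggregation: maximum, minimum, sum, mean.
print("min/max/sum/mean:", min(sales), max(sales), sum(sales), mean(sales))

# Correlation analysis: Pearson correlation coefficient of the two fields.
mx, my = mean(ads), mean(sales)
cov = sum((x - mx) * (y - my) for x, y in zip(ads, sales))
r = cov / (sum((x - mx) ** 2 for x in ads) ** 0.5 *
           sum((y - my) ** 2 for y in sales) ** 0.5)
print(f"correlation coefficient: {r:.3f}")
```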

(7) Fuzzy set method

This method uses fuzzy set theory to perform fuzzy evaluation, fuzzy pattern recognition, and fuzzy cluster analysis. The higher a system's complexity, the stronger its fuzziness; fuzzy set theory generally portrays fuzzy things by means of membership degrees. Based on traditional fuzzy theory and probability statistics, Li Deyi proposed a qualitative-quantitative uncertainty conversion model, the cloud model, and formed cloud theory.
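A minimal sketch of a fuzzy membership function, the basic device fuzzy set theory uses to portray fuzzy things; the concept "warm" and its temperature breakpoints are assumptions for illustration:

```python
# A minimal fuzzy-set sketch: triangular membership in the fuzzy
# concept "warm"; the breakpoints are made-up assumptions.
def warm_membership(t, low=10.0, peak=22.0, high=30.0):
    if t <= low or t >= high:
        return 0.0
    if t <= peak:
        return (t - low) / (peak - low)     # rising edge
    return (high - t) / (high - peak)       # falling edge

for t in (5, 15, 22, 27, 35):
    print(f"{t} degrees C is warm to degree {warm_membership(t):.2f}")
```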

6. Considerations for evaluating data mining software

More and more software vendors are joining the competition in the data mining area. How users correctly evaluate a commercial software product and choose the right one has become the key to effective data mining.

Evaluating a data mining software product should mainly consider the following four aspects:

(1) Computing performance: whether the software can run on different commercial platforms; the software architecture; whether it can connect to different data sources; whether performance grows linearly or exponentially when operating on large data sets; computational precision; whether the architecture is easy to extend; the stability of operation; and so on.

(2) Functionality: whether the software provides sufficiently diverse algorithms; whether it avoids turning the mining process into a black box; whether the algorithms provided can be applied to multiple types of data; whether the user can adjust the algorithms and their parameters; whether the software can randomly extract data from the data set to build preliminary models; and whether it can present the mining results in different forms.

(3) Usability: whether the user interface is friendly; whether the software is easy to learn and use; which users the software targets (beginners, advanced users, or experts); whether error reporting is of real help in user debugging; and the application scope of the software (whether it specializes in a single field or serves multiple fields).

(4) Auxiliary functions: whether the user is allowed to correct erroneous values in the data set or perform data cleaning; whether global substitution of values is allowed; whether continuous data can be discretized; whether subsets can be extracted from the data set according to user-defined rules; whether null values in the data can be replaced with the mean or another user-specified value; whether the results of one analysis can be fed into another analysis; and so on (two of these functions are sketched after this list).
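As a small illustration of two of the auxiliary functions just listed, discretization and null replacement, the following sketch assumes the pandas library is available; the "age" values and bin boundaries are made-up assumptions:

```python
# A minimal sketch of discretization and null replacement;
# assumes pandas is installed, and the data is made up.
import pandas as pd

ages = pd.Series([22, 35, None, 58, 41, None, 67])
ages = ages.fillna(ages.mean())                      # replace nulls with the mean
bins = pd.cut(ages, bins=[0, 30, 50, 100],
              labels=["young", "middle", "senior"])  # discretize continuous data
print(pd.concat([ages, bins], axis=1))
```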

7. Conclusion

Data mining is a young and promising research area, and the powerful driving force of commercial interests will continue to promote its development. New data mining methods and models appear every year, and research into them keeps broadening and deepening. Even so, data mining technology still faces many problems and challenges, such as the efficiency of data mining methods, especially on large-scale data sets; the development of methods adapted to multiple data types and resistant to noise, to address the mining of heterogeneous data; the mining of dynamic data; and data mining in networked and distributed environments. In addition, with the rapid development of multimedia databases, mining technology and software for multimedia databases will become a research and development hotspot in the future.

Please cite the original source when reposting: https://www.9cbs.com/read-49714.html
