Introduction to Data Mining Technology


Abstract: Data mining is an important emerging research area. This paper introduces the concept and purpose of data mining, its common methods, the data mining process, and data mining software, and discusses the problems facing the field along with its prospects.

Keywords: data mining; data collection

1. Introduction

Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. With the rapid development of information technology, the volume of data people accumulate has grown explosively, raising the question of how to extract useful knowledge from masses of data measured in terabytes. Data mining is a data processing technology born to meet this need, and it is a key step of knowledge discovery in databases (KDD).

2. Data mining tasks

The main tasks of data mining are association analysis, cluster analysis, classification, prediction, time-series pattern mining, and deviation analysis.

(1) Association Analysis

Association rule mining was first proposed by Rakesh Agrawal et al. An association is a regularity that holds between the values of two or more variables. Data associations are an important class of discoverable knowledge in databases. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to find the association networks hidden in a database. Association rules are usually screened by two thresholds, support and confidence, and additional parameters such as interestingness and correlation are continually being introduced so that the mined rules better match real requirements.
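The two thresholds above can be illustrated with a minimal sketch. The transaction data and the candidate rule {bread} -> {milk} are invented for the example:

```python
# Toy market-basket data (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

s = support({"bread", "milk"}, transactions)       # 3 of 5 baskets -> 0.6
c = confidence({"bread"}, {"milk"}, transactions)  # ~0.75: 3 of the 4 bread baskets also hold milk
```

A rule is kept only when both values clear the user-chosen thresholds, which is exactly the screening step described above.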

(2) Cluster analysis (Clustering)

Clustering groups data into several classes by similarity: data within a class resemble each other, while data in different classes differ. Cluster analysis can establish macro-level concepts, discover the distribution patterns of the data, and reveal possible relationships among data attributes.

(3) Classification

Classification finds a concept description for each class that represents the overall information of that class's data, i.e., the class's intensional description, and uses this description to construct a model, generally expressed as rules or a decision tree. Classification derives classification rules from a training data set through some algorithm, and can be used both for rule description and for prediction.
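Learning a rule from training data can be shown with the simplest possible case, a one-rule "decision stump" on a single numeric attribute; the training set and labels are invented for the sketch:

```python
# Toy training data: (attribute value, class label), invented for illustration.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"),
         (6.0, "B"), (7.0, "B"), (8.0, "B")]

def learn_stump(train):
    """Pick the split threshold (midpoint between adjacent sorted values)
    that makes the fewest training errors."""
    best = None
    xs = sorted(train)
    for (x1, _), (x2, _) in zip(xs, xs[1:]):
        t = (x1 + x2) / 2
        left = [lbl for x, lbl in train if x <= t]
        right = [lbl for x, lbl in train if x > t]
        def majority(lbls):
            return max(set(lbls), key=lbls.count)
        errors = (sum(l != majority(left) for l in left)
                  + sum(l != majority(right) for l in right))
        if best is None or errors < best[0]:
            best = (errors, t, majority(left), majority(right))
    return best[1], best[2], best[3]

threshold, left_label, right_label = learn_stump(train)
predict = lambda x: left_label if x <= threshold else right_label
# The learned rule reads: "if x <= 4.0 then A else B".
```

This is the degenerate one-node form of a decision tree; real classifiers recurse on each side of the split.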

(4) Prediction

Prediction uses historical data to find laws of change and build a model, then uses this model to predict the types and characteristics of future data. Prediction is concerned with accuracy and uncertainty, usually measured by the prediction variance.
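A minimal version of "build a model from historical data, then extrapolate" is an ordinary least-squares line fit; the history points are invented for the sketch:

```python
# Invented historical observations (x = time step, y = observed value).
history = [(1, 2.1), (2, 4.0), (3, 6.2), (4, 7.9)]

def fit_line(points):
    """Least-squares fit of y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_line(history)
forecast = a * 5 + b  # extrapolate one step ahead
```

The residual variance of the fitted line would serve as the uncertainty measure mentioned above.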

(5) Time-series pattern (Time-Series Pattern)

A time-series pattern is a pattern whose frequency of repetition is high, found by searching over time-series data. Like regression, it uses known data to predict future values; the difference is that these data are ordered in time.
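One simple form of "patterns that repeat often across time-ordered data" is counting how many sequences contain each ordered pair of events; the event sequences are invented for this sketch:

```python
from collections import Counter

# Invented customer event sequences, each already ordered in time.
sequences = [
    ["login", "browse", "buy"],
    ["login", "browse", "logout"],
    ["login", "buy"],
]

def pair_support(sequences):
    """Count, for each ordered pair (a before b), how many sequences
    contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                seen.add((a, b))
        counts.update(seen)
    return counts

counts = pair_support(sequences)
# counts[("login", "buy")] == 2: the pattern occurs in 2 of 3 sequences.
```

Full sequential-pattern miners extend this to longer subsequences and prune by a support threshold, just as in association analysis.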

(6) Deviation analysis (Deviation)

Deviations contain much useful knowledge. Databases often hold anomalous cases, and discovering them is very important. The basic method of deviation detection is to look for meaningful differences between observations and reference values.
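The "difference between observation and reference" idea can be sketched with z-scores, flagging values that lie far from the mean; the data and the threshold of 2 are invented for illustration:

```python
def z_scores(values):
    """Standardized distance of each value from the sample mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

data = [10, 11, 9, 10, 12, 10, 45]  # one obvious anomaly
outliers = [v for v, z in zip(data, z_scores(data)) if abs(z) > 2]
# outliers == [45]
```

Here the reference is the sample mean; in practice it may be a model's predicted value or a domain-given norm.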

3. Data mining objects

According to the information storage format, the objects mined include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.

4. The data mining process

(1) Problem definition: Clearly define the business problem and determine the purpose of the data mining.

(2) Data preparation: This includes selecting data (extracting the target data set for mining from large databases and data warehouses) and data preprocessing (preparing the data for mining, including checking data integrity and consistency, removing noise, filling in lost fields, deleting invalid data, and so on).

(3) Data mining: Select the appropriate algorithm according to the kind of function required and the characteristics of the data, and mine the purified and transformed data set.

(4) Result analysis: Interpret and evaluate the mining results, converting them into knowledge the end user can understand.

(5) Knowledge application: Integrate the knowledge obtained from the analysis into the organizational structure of the business information system.

5. Data mining methods

(1) Neural network methods

Because of their good robustness, self-organizing adaptivity, parallel processing, distributed storage, and high fault tolerance, neural networks are well suited to solving data mining problems, and they have attracted increasing attention in recent years. Typical neural network models fall into three main classes: feedforward neural network models, represented by the perceptron, the BP back-propagation model, and function-type networks, used for classification, prediction, and pattern recognition; feedback neural network models, represented by Hopfield's discrete and continuous models, used for associative memory and optimization computation; and self-organizing mapping methods, represented by the ART model and the Kohonen model, used for clustering.
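The feedforward family mentioned above begins with the perceptron, whose learning rule fits in a few lines. The tiny AND-gate training set and the learning rate are invented for this sketch:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Classic perceptron rule: w += lr * (target - output) * x."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Step activation.
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

# Logical AND is linearly separable, so the perceptron converges on it.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(samples)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

BP networks generalize this by stacking layers and propagating the error backward through differentiable activations.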
The drawback of neural network methods is their "black box" nature: it is hard for people to understand the network's learning and decision-making process.

(2) Genetic algorithms

The genetic algorithm is a random search algorithm based on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Properties such as implicit parallelism and ease of combination with other models have led to its application in data mining. Sunil successfully developed a data mining tool based on genetic algorithms and used it to perform mining experiments on data from two aircraft crashes; the results indicate that the genetic algorithm is one of the effective methods for data mining [4]. Genetic algorithms are also applied in combination with neural networks and rough sets, for example optimizing a neural network's structure with a genetic algorithm to delete redundant connections and hidden units without increasing the error rate, or training a neural network with a combination of the genetic algorithm and the BP algorithm and then extracting rules from the network. However, the genetic algorithm is relatively complex, and its problem of premature convergence to local minima has not yet been solved.

(3) Decision tree methods

The decision tree is an algorithm commonly used in predictive models; by purposefully classifying large amounts of data, it finds valuable, potentially useful information. Its main advantages are a simple description and fast classification, making it especially suitable for large-scale data processing. The most influential and earliest decision tree method is Quinlan's famous ID3 algorithm, based on information entropy. Its main problems are: ID3 is a non-incremental learning algorithm; the ID3 decision tree is a single-variable decision tree, which makes complex concepts difficult to express; the mutual relationships among attributes are not emphasized enough; and its noise resistance is poor.
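The information-entropy criterion at the heart of ID3 can be shown directly: at each node the algorithm picks the attribute with the highest information gain. The toy weather-style rows and labels below are invented for the sketch:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(rows, attr_index, labels):
    """Entropy reduction from splitting on attribute attr_index."""
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Attributes: (outlook, temperature); invented for illustration.
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
gain_outlook = information_gain(rows, 0, labels)  # 1.0: splits perfectly
gain_temp = information_gain(rows, 1, labels)     # 0.0: uninformative
```

ID3 would place "outlook" at the root here, since it yields the larger gain.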
In response to these problems, many improved algorithms have appeared, such as the ID4 incremental learning algorithm designed by Schlimmer and Fisher, and the IBLE algorithm proposed by Zhong Ming, Chen Wenwei, and others.

(4) Rough set methods

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. It has several advantages: it requires no additional information; it simplifies the expression space of the input information; and its algorithms are simple and easy to operate. The object a rough set processes is an information table resembling a two-dimensional relational table. Mature relational database management systems and newly developed data warehouse management systems have laid a solid foundation for rough set data mining. However, the mathematical foundation of rough sets is set theory, which has difficulty directly handling continuous attributes, and continuous attributes are ubiquitous in real information tables.

Therefore, the discretization of continuous attributes is a difficulty restricting the practical use of rough set theory. Some rough set-based tools have already been applied, such as KDD-R, developed by the University of Regina in Canada, and LERS, developed by the University of Kansas in the USA.

(5) Covering positive examples and excluding negative examples

This method uses the idea of covering all positive examples while excluding all negative examples to find rules. First, a seed is selected from the set of positive examples and compared with the negative examples one by one: selectors formed from fields that are compatible with a negative example are discarded, while the incompatible ones are retained. Cycling through all the positive example seeds in this way yields the rules for the positive examples (conjunctions of selectors). Typical algorithms include Michalski's AQ11 method, Hong Jiarong's improved AQ15 method, and his AE5 method.

(6) Statistical analysis methods

Two kinds of relationships exist between database field items: functional relationships (deterministic relationships that can be expressed by a function formula) and correlational relationships (deterministic relationships that cannot be expressed by a function formula but are still relevant). Both can be analyzed by applying statistical principles to the information in the database, for example common statistics (maximum, minimum, sum, mean, and so on over large amounts of data), regression analysis (using a regression equation to express the quantitative relationship between variables), correlation analysis (using the correlation coefficient to measure the degree of correlation between variables), and analysis of variance (using the differences between sample statistics to determine whether there are differences between population parameters).

(7) Fuzzy set methods

This approach uses fuzzy set theory to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on real problems. The higher the complexity of a system, the stronger its fuzziness; fuzzy set theory generally portrays fuzzy things by means of membership degrees.
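Among the statistical measures mentioned above, the correlation coefficient is the easiest to sketch; the paired data below is invented (deliberately perfectly linear) for the illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # exactly linear in xs
r = pearson(xs, ys)    # ~1.0: perfect positive correlation
```

Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate none, which is the "degree of correlation" the text refers to.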
On the basis of traditional fuzzy theory and probability statistics, Li Deyi proposed a qualitative-quantitative uncertainty conversion model, the cloud model, and developed it into cloud theory.

6. Evaluating data mining software

More and more software vendors are joining the competition in the data mining field, so how users can correctly evaluate a commercial package and choose the right software has become the key to successful data mining. Evaluating a data mining package should mainly consider the following four aspects:

(1) Computational performance: whether the software can run on different commercial platforms; the software's architecture; whether it can connect to different data sources; whether performance varies linearly or exponentially when operating on large data sets; its precision; whether the component structure is easy to extend; its running stability; and so on.

(2) Functionality: whether the software can be applied to multiple types of data; whether the user can adjust the algorithms and their parameters; whether the software can build predictive models from randomly extracted samples of the data set; whether mining results can be presented in different forms; and so on.

(3) Usability: whether the user interface is friendly; whether the software is easy to learn; which users the software targets (beginners, advanced users, or experts); whether error reports help the user debug; the software's application domain (a dedicated field, or multiple fields); and so on.

(4) Auxiliary functions: whether the user is allowed to change erroneous values in the data set or perform data cleaning; whether outlying values can be replaced globally; whether continuous data can be discretized; whether subsets can be extracted from the data set according to user-defined rules; whether null values in the data can be replaced with an appropriate mean or a user-specified value; whether the results of one analysis can be fed into another analysis; and so on.

7. Conclusion

Data mining technology is a young and promising research field, and the powerful driving force of commercial interest will continue to promote its development.
New data mining methods and models appear every year, and research in the field keeps broadening and deepening.
