Data Mining Technology

Data mining draws on many disciplines and methods, and it can be classified in several ways. By mining task, it can be divided into classification or predictive model discovery, data summarization, clustering, association rule discovery, sequential pattern discovery, dependency or dependency model discovery, and anomaly and trend discovery. By mining target, the data sources include relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases, heterogeneous databases, legacy databases, and the World Wide Web. By mining method, it can be roughly divided into machine learning methods, statistical methods, neural network methods, and database methods. Machine learning methods can be subdivided into inductive learning (decision trees, rule induction, etc.), example-based learning, genetic algorithms, and so on. Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discrimination, Fisher discrimination, nonparametric discrimination, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.), and exploratory analysis (principal component analysis, correlation analysis, etc.). Neural network methods can be subdivided into feedforward neural networks (the BP algorithm, etc.) and self-organizing neural networks (self-organizing feature maps, competitive learning, etc.). Database methods are mainly multidimensional data analysis, or OLAP, together with attribute-oriented induction.

This article takes the perspective of mining tasks and mining methods and focuses on four very important discovery tasks: data summarization, classification discovery, clustering, and association rule discovery.

1. Data summarization

The purpose of data summarization is to condense the data and give a compact description of it.
The traditional, and simplest, way to summarize data is to compute statistics such as sums, averages, and variances, or to present the data graphically with histograms, pie charts, and the like. Data mining is mainly concerned with data summarization from the perspective of data generalization. Data generalization is the process of abstracting data from a low level to a higher level. The data or objects in a database always carry the most primitive, basic information (so that no potentially useful information is lost). People sometimes want to process or browse the data from a higher-level view, so the data must be generalized at different levels to suit various query requirements. There are two main techniques for data generalization: multidimensional data analysis and attribute-oriented induction.

Multidimensional data analysis is a data warehouse technique, also known as online analytical processing (OLAP). A data warehouse is an integrated, stable collection of historical data from different points in time, built for decision support. Decision-making presupposes data analysis, and data analysis makes frequent use of aggregation operations whose computational cost is particularly high. A natural idea, therefore, is to precompute the aggregation results and store them for use by the decision support system. The place where the aggregation results are stored is called a multidimensional database. Multidimensional data analysis has been applied successfully in decision support systems: the well-known SAS data analysis software package, Business Objects' decision support system BusinessObjects, and IBM's decision analysis tools all use it.

Summarizing data with multidimensional data analysis targets the data warehouse, and a data warehouse stores offline historical data.
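The idea of precomputing and storing aggregation results can be sketched as follows. This is a minimal illustration, not how any particular OLAP product works: the sales records, the dimension names, and the helper functions `build_cube` and `lookup` are all hypothetical, and the only aggregate stored is a sum.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records: (region, product, year, amount).
SALES = [
    ("north", "bread", 2023, 120.0),
    ("north", "milk", 2023, 80.0),
    ("south", "bread", 2023, 50.0),
    ("south", "bread", 2024, 70.0),
    ("north", "milk", 2024, 90.0),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(records, dimensions):
    """Precompute the sum of the measure for every subset of dimensions."""
    cube = defaultdict(float)
    indices = range(len(dimensions))
    for *keys, amount in records:
        for size in range(len(dimensions) + 1):
            for subset in combinations(indices, size):
                # A cell is identified by the (dimension, value) pairs it fixes.
                cell = tuple((dimensions[i], keys[i]) for i in subset)
                cube[cell] += amount
    return dict(cube)


def lookup(cube, **fixed):
    """Answer an aggregate query from the stored results, without rescanning."""
    cell = tuple((d, fixed[d]) for d in DIMENSIONS if d in fixed)
    return cube.get(cell, 0.0)


CUBE = build_cube(SALES, DIMENSIONS)
```

A query such as `lookup(CUBE, region="north")` then reads a stored total instead of scanning the records again. The price is storage: every record contributes to one cell per subset of dimensions, which is why this approach suits offline historical data.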
To handle online data, researchers proposed attribute-oriented induction. Its idea is to generalize directly the data view the user is interested in (which can be obtained with an ordinary SQL query), rather than storing pre-generalized data in advance as multidimensional data analysis does. The proposers of the method call this data generalization technique attribute-oriented induction. Applying generalization operations to the original relation yields a generalized relation, which summarizes the low-level original relation from a higher level. Given the generalized relation, various further operations can be applied to it to produce the knowledge required, such as characteristic rules, discriminant rules, classification rules, and association rules.

2. Classification discovery

Classification is a very important task in data mining and is currently widely used in commercial applications. The purpose of classification is to learn a classification function or classification model (often called a classifier) that maps data items in the database to one of a given set of categories. Both classification and regression can be used for prediction. The purpose of prediction is to derive automatically, from historical data records, a generalized description of the given data, so that future data can be predicted. Unlike regression, classification outputs a discrete category value, whereas regression outputs a continuous value. Regression is not discussed here.

Constructing a classifier requires a training sample data set as input. The training set consists of a group of database records, or tuples. Each tuple is a feature vector made up of the values of the relevant fields (also called attributes or features), together with a category label. A sample takes the form (v1, v2, ..., vn; c), where vi is a field value and c is the category.
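As a small illustration of the sample form (v1, v2, ..., vn; c) and of a classifier as a function from feature vectors to categories, here is a sketch in Python. The training data and the function name `train_nearest_neighbor` are hypothetical, and the nearest-neighbor rule with Euclidean distance is chosen only because it is short; it is one possible construction method, not the one the article prescribes.

```python
import math

# Hypothetical training set: each sample (v1, ..., vn; c) is stored
# as a (feature_vector, category) pair.
TRAINING_SET = [
    ((1.0, 1.2), "low"),
    ((0.8, 1.0), "low"),
    ((3.9, 4.1), "high"),
    ((4.2, 3.8), "high"),
]


def train_nearest_neighbor(samples):
    """Return a classifier: a function mapping a feature vector to a category."""
    stored = list(samples)

    def classify(features):
        # Assumption: closeness is measured by Euclidean distance.
        _, category = min(stored, key=lambda s: math.dist(s[0], features))
        return category

    return classify


classify = train_nearest_neighbor(TRAINING_SET)
```

Calling `classify((1.1, 0.9))` returns the category of the closest stored sample. The output is always one of the discrete labels seen in training, which is what separates classification from regression.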
Methods for constructing a classifier include statistical methods, machine learning methods, neural network methods, and others. Statistical methods include Bayesian methods and nonparametric methods (nearest-neighbor learning or instance-based learning), whose knowledge representations are discriminant functions and prototype instances, respectively. Machine learning methods include decision tree methods and rule induction; the former yields a decision tree or discrimination tree, and the latter generally yields production rules. The main neural network method is the BP algorithm, whose model is the feedforward neural network (an architecture made up of nodes representing neurons and edges carrying connection weights); the BP algorithm is essentially a nonlinear discriminant function. In addition, a newer method has recently emerged: rough sets, whose knowledge representation is production rules.

Different classifiers have different characteristics. There are three criteria for evaluating or comparing classifiers: (1) predictive accuracy; (2) computational complexity; (3) conciseness of the model description. Predictive accuracy is a commonly used criterion, especially for predictive classification tasks, and the generally accepted way to estimate it is 10-fold cross-validation. Computational complexity depends on implementation details and the hardware environment. Because data mining operates on very large databases, space and time complexity is a very important issue. For descriptive classification tasks, the more concise the model description, the more welcome it is; for example, classifiers expressed as rules are more useful, whereas the results produced by neural network methods are hard to understand.

It should also be noted that classification performance generally depends on the characteristics of the data: some data are noisy, some have missing values, some are sparsely distributed, some have strongly correlated fields or attributes, and some attributes are discrete while others are continuous or mixed.
It is generally believed that no single method suits data with every kind of characteristic.

3. Clustering

Clustering groups a set of individuals into several categories according to similarity, that is, "like gathers with like". Its purpose is to make the distance between individuals in the same category as small as possible and the distance between individuals in different categories as large as possible. Clustering methods include statistical methods, machine learning methods, neural network methods, and database-oriented methods.

In statistics, clustering is called cluster analysis and is one of the three major methods of multivariate data analysis (the other two being regression analysis and discriminant analysis). It mainly studies clustering based on geometric distance, such as Euclidean distance and Minkowski distance. Traditional statistical cluster analysis methods include hierarchical clustering, decomposition methods, joining methods, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering. This kind of clustering is based on global comparison: it must examine all individuals before it can decide how to divide them into categories. It therefore requires all the data to be given in advance and cannot add new data objects dynamically. Cluster analysis methods do not have linear computational complexity, so they are hard to apply to very large databases.

In machine learning, clustering is called unsupervised learning, or learning without a teacher. In classification learning, the examples or data objects carry category labels, whereas the examples to be clustered are unlabeled and their categories must be determined automatically by the clustering algorithm. In much of the artificial intelligence literature, clustering is also called conceptual clustering, because the distance is no longer the geometric distance of statistical methods but is determined from the descriptions of concepts.
When the objects to be clustered can be added dynamically, conceptual clustering is called concept formation.

In neural networks, there is a class of unsupervised learning methods, the self-organizing neural network methods, such as the Kohonen self-organizing feature map network and competitive learning networks. In the data mining field, the neural network clustering method most often reported is the self-organizing feature map; IBM's published data mining white paper specifically mentions using it for clustering-based database segmentation.

4. Association rule discovery

An association rule is a rule of the form "90% of the customers who buy bread and butter also buy milk" (bread + butter ⇒ milk). The main object of association rule discovery is the transactional database, and its typical target is retail sales data, also known as basket data. A transaction generally consists of the transaction time, the set of items the customer purchased, and sometimes a customer identification number (such as a credit card number). Customers' purchasing behavior provides very valuable information. For example, it can help decide how to arrange goods on the shelves (such as placing together goods that customers often buy together) and how to plan marketing (how to bundle goods with one another). Discovering association rules from transaction data is therefore very important for decision-making in commercial activities such as retailing.

Let I = {i1, i2, ..., im} be a set of items (a shopping mall may carry 10,000 items), and let D be a set of transactions (called the transaction database). Each transaction T is a set of items, with T ⊆ I. A transaction T is said to support an itemset X if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
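The definitions above can be made concrete with a short sketch. The basket data and function names are hypothetical. The names `support` and `confidence` are the usual terms for the two ratios computed here and are introduced as an assumption, since the text so far has only defined what it means for a transaction to support an itemset; `confidence` corresponds to the "90%" in the bread-and-butter example.

```python
# Hypothetical basket data: each transaction T is a set of items, T ⊆ I.
TRANSACTIONS = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]


def supports(transaction, itemset):
    """A transaction T supports an itemset X if X ⊆ T."""
    return set(itemset) <= transaction


def support(transactions, itemset):
    """Fraction of transactions in D that support the itemset."""
    return sum(supports(t, itemset) for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions supporting X, the fraction that also support Y."""
    x, y = set(x), set(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    covered = [t for t in transactions if supports(t, x)]
    if not covered:
        return 0.0
    return sum(supports(t, y) for t in covered) / len(covered)
```

With this data, three of the five transactions support {bread, butter}, and two of those three also contain milk, so `confidence(TRANSACTIONS, {"bread", "butter"}, {"milk"})` evaluates to two thirds.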