The relationship between data mining and statistics
1, What is data mining?
Data mining is a scientific approach that uses techniques from mathematics, statistics, and artificial intelligence, such as memory-based reasoning, cluster analysis, association analysis, decision trees, neural networks, and genetic algorithms, to extract implicit, previously unknown, and potentially valuable relationships, patterns, and trends from large amounts of data. That knowledge is then used to build models, tools, and processes for predictive decision support.
Data mining draws on many disciplines and serves many functions. The main ones at present are:
(1) Classification: Assign the objects under analysis to different groups according to their attributes and features. For example, a bank divides its customers into categories based on historical data, and can then place each new loan applicant in a category and offer the corresponding loan plan.
(2) Clustering: Identify the intrinsic regularities in the objects under analysis and divide the objects into several classes according to them. For example, loan applicants can be divided into high-risk, medium-risk, and low-risk applicants.
(3) Association rules: An association is a link whereby the occurrence of one thing tends to be accompanied by another. For example, people who buy beer every day may also tend to buy cigarettes; how strong that link is can be described by the rule's support and confidence.
(4) Prediction: Grasp the pattern by which the object under analysis develops and anticipate future trends. For example, forecasting future economic development.
(5) Deviation detection: Describe the few extreme cases among the objects under analysis and uncover their underlying causes. For example, suppose 500 of a bank's 1 million transactions are fraudulent. To operate soundly, the bank needs to discover the factors behind those 500 cases and so reduce the risk in future business.
There are other functions as well, such as time series analysis. Note that the functions of data mining do not exist independently of one another; in a data mining project they are interrelated and work together.
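The support and confidence measures mentioned in item (3) can be made concrete with a small sketch. The baskets below are invented for illustration, and the helper names `support` and `confidence` are hypothetical rather than taken from any particular mining tool. Support is taken here as the fraction of transactions containing all items of the rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"beer", "cigarettes", "chips"},
    {"beer", "cigarettes"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"cigarettes", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


# Rule under test: "customers who buy beer also buy cigarettes".
rule_support = support({"beer", "cigarettes"}, baskets)
rule_confidence = confidence({"beer"}, {"cigarettes"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

With these five baskets, two contain both beer and cigarettes, so the support is 2/5; three contain beer, so the confidence is 2/3. Real tools typically search for all rules above chosen support and confidence thresholds rather than testing a single rule.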
2, The connection between data mining and statistical analysis
Data mining is a new discipline that combines computer technology, artificial intelligence, and statistics. It grew out of statistical analysis but also differs from it. Data mining is not meant to replace traditional statistical analysis; on the contrary, it is an extension and expansion of statistical methods. Most statistical techniques rest on well-developed mathematical theory and considerable skill, and their predictive accuracy is satisfactory, but they demand a lot of knowledge from the user. As computing power has grown, data mining can achieve the same functions with relatively simple, fixed procedures. Newer algorithms such as neural networks and decision trees let people obtain good analytical and predictive results without having to understand the complex principles behind them.
Because data mining and statistical analysis are so deeply connected, data mining tools usually provide statistical analysis functions, either as options or built in. These functions are needed to explore the data before mining and to summarize and analyze the results afterwards. Statistical methods such as analysis of variance, hypothesis testing, correlation analysis, linear prediction, and time series analysis are used in the early stage of a project to explore the data, discover mining topics, identify mining objectives, determine the variables involved, and sample the data source. All of this preparatory work has a significant impact on the outcome. The results of data mining also need to be described in detail with descriptive statistics (maximum, minimum, mean, variance, quartiles, count, probability distribution) so that users can understand them. Statistical analysis and data mining are therefore complementary processes, and sensible cooperation between the two is an important condition for the success of data mining.
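As a rough illustration of the statistical description applied to mining results, the sketch below summarizes one group of values using only Python's standard `statistics` module. The function name `describe` and the loan amounts are hypothetical, invented for this example; the quartiles use the module's default "exclusive" method, and other tools may compute them slightly differently.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "variance": statistics.variance(ordered),  # sample variance
        "quartiles": (q1, q2, q3),
    }


# Hypothetical loan amounts (in thousands) for one customer group
# produced by a mining step; the numbers are made up.
loan_amounts = [12, 15, 15, 18, 20, 22, 30]
summary = describe(loan_amounts)
for name, value in summary.items():
    print(name, value)
```

A summary like this can be produced per class or cluster, so that a reader can compare the groups a mining step has found without inspecting every record.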
3, Data mining and statistics
Statistics currently shows a trend toward ever greater rigor. This is no bad thing: rigor helps avoid mistakes and uncover the truth. Statisticians tend to require a proof before using a method, whereas computer science and machine learning rely more on experience. Sometimes researchers in other fields working on the same problem come up with a method that clearly works, but statisticians cannot prove it (or have not proved it yet). Statistical journals tend to publish mathematically proven methods rather than ad hoc ones. Data mining, as a synthesis of several disciplines, has inherited its experimental attitude from machine learning. This does not mean that data mining practitioners do not value rigor; it only means that a method is abandoned if it fails to produce results.
Because of its mathematical rigor, statistics focuses on inference. Although some branches of statistics concentrate on description, a look at statistical papers shows that their core problem is how to infer the population from an observed sample. This is, of course, often a concern in data mining as well. A distinguishing feature of data mining is that it handles large data sets. For reasons of feasibility, traditional statistics often has only a sample to work with and must use it to describe the large data set the sample came from. Data mining problems, however, can often obtain the whole population, such as all employee records of a company, all customer information in a database, or all of last year's data. In that case, statistical inference has little value.
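The contrast between inferring from a sample and computing on the whole population can be sketched as follows. Everything here is assumed for illustration: the salary figures are synthetic, the sample size of 100 is arbitrary, and the interval uses the normal critical value 1.96 as an approximation.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "whole population": one salary figure per employee.
population = [rng.gauss(50_000, 8_000) for _ in range(10_000)]

# Data mining setting: every record is available, so the mean is simply
# computed. There is nothing left to infer.
population_mean = statistics.fmean(population)

# Classical statistics setting: only a sample of 100 records is available,
# so the population mean has to be estimated with a margin of error.
sample = rng.sample(population, 100)
sample_mean = statistics.fmean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * std_error   # approximate 95% interval
ci_high = sample_mean + 1.96 * std_error

print(f"population mean (exact):  {population_mean:.0f}")
print(f"sample estimate (approx): {sample_mean:.0f} "
      f"in [{ci_low:.0f}, {ci_high:.0f}]")
```

The interval expresses the uncertainty that comes from seeing only part of the data. When all records are in hand, that particular uncertainty disappears, which is the sense in which inference loses its value in the passage above.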
In many cases, the essence of data mining is the largely accidental discovery of unexpected but very valuable information. This shows that the data mining process is essentially experimental and exploratory. That sets it apart from confirmatory analysis. (Strictly speaking, one can never fully confirm a theory, only provide evidence for or against it.) Confirmatory analysis focuses on fitting the most suitable model: it sets up a proposed model and asks whether that model explains the observed data well. Most statistical analysis is confirmatory.
If the main purpose of data mining is discovery, then it is not concerned with how data should be collected in order to answer a specific question, which is the business of experimental design and survey design. Data mining essentially assumes that the data have already been collected, and concerns itself only with how to uncover the secrets they hold.
(In addition, the core of statistics is the model, whereas data mining places more emphasis on criteria. I did not fully understand this part and hope someone can add to it.)
Reference: an article on an SPSS forum
A translation from a data mining discussion group:
Statistics and data mining