Data Mining and Statistics: What's the Connection?

J. H. Friedman
Department of Statistics and Stanford Linear Accelerator Center, Stanford University

Summary: Data mining (DM) is a discipline concerned with discovering patterns and relationships in data, with an emphasis on processing large observational data bases. It sits at the common frontier of several fields, including data base management, artificial intelligence, machine learning, pattern recognition, and data visualization. From a statistical perspective it can be viewed as computer-automated exploratory analysis of large, complex data sets. The importance of the discipline is currently somewhat exaggerated, but the field is having a major impact on business, industry, and scientific research, and it offers a great deal of research opportunity for the development of new methods. Despite the obvious connections between data mining and statistical analysis, most data mining methods have so far been produced outside the statistical disciplines. This article examines that phenomenon and explains why statisticians should pay attention to data mining. Statistics could have a great impact on data mining, but this may require statisticians to change some of their basic ideas and operating principles.

1 Preface

Disclaimer: The views expressed in this article are the author's own; they do not necessarily reflect those of his editor, his sponsors, Stanford University, or his colleagues.

The theme of this year's Interface conference (May 1997, Houston, TX) is data mining and the analysis of large data sets. This was also the theme of a conference organized by Leo Breiman, sponsored jointly by the ASA and IMS, twenty years ago. Twenty years later, it is entirely fitting to discuss what has happened in the intervening period. This article discusses the following questions:

a) What is data mining?
b) What is statistics?
c) What is the connection between them (if any)?
d) What (if anything) can Statistics do about it?

2 What is data mining?

The definition of data mining is rather vague; the definitions given to it depend on the viewpoint and background of the definer. Here are some definitions from the DM literature:

• Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. — Fayyad
• Data mining is the process of extracting previously unknown, comprehensible, and actionable information from large data bases and using it to make crucial business decisions. — Zekulin
• Data mining is a set of methods used in the knowledge discovery process to identify previously unknown relationships and patterns existing in data. — Ferruzza
• Data mining is the process of discovering advantageous patterns in data. — John
• Data mining is a decision support process in which we search large data bases for unknown information patterns. — Parsaye
• Data mining is ...? Decision trees? Neural networks? Rule induction? Nearest neighbors? Genetic algorithms? — Mehta

Although these definitions of data mining are somewhat intangible, data mining itself has become big business. As in the gold rushes of the past, the goal is to "mine the miners": the biggest profits come from selling tools to the miners rather than from actual mining. The concept of data mining is being used as a device to sell computer hardware and software.
Hardware manufacturers stress the high computational demands of data mining. Very large data bases must be stored, rapidly read and written, and computationally intensive methods must be applied to the data. This requires large-capacity disks and computers with fast processors and large amounts of built-in RAM. Data mining opens new markets for such hardware. Software vendors emphasize competitive advantage: "Your competitors are using it, so you had better keep up." At the same time they stress the added value data mining brings to traditional data bases. Many organizations maintain large data bases for routine business such as inventory, billing, and accounting. Creating and maintaining these data bases involves enormous cost.
Now, for a relatively small additional investment in a data mining tool, one can "discover highly profitable information hidden in this data." The current aim of hardware and software suppliers is to rush data mining products to market before it saturates. If a company invests $50,000 in a data mining package, the purchase is unlikely to be treated as experimental, and people will not switch before a new product has convincingly demonstrated a large advantage over the old one. Here are some current data mining products:

• SPSS Inc.: Clementine
• IBM: Intelligent Miner
• Tandem: Relational Data Miner
• Angoss Software: KnowledgeSeeker
• Thinking Machines Corporation: Darwin
• NeoVista Software: ASIC
• DataMind Corporation: DataMind Data Cruncher
• Silicon Graphics: MineSet
• California Scientific Software: BrainMaker
• WizSoft Corporation: WizWhy
• Lockheed Corporation: Recon
• SAS Corporation: SAS Enterprise Miner

In addition to these "integrated" packages there are many specialized products, and many consulting firms specializing in data mining have been formed. In this field, the difference between a statistician and a computer scientist is that when a statistician has an idea, he writes a paper; a computer scientist starts a company.

Current data mining products feature an attractive graphical user interface on top of a data base (query language) and a suite of data analysis procedures: a windows-based interface; flexible, convenient input; point-and-click icons and menus; convenient dialog boxes; utility graphics; sophisticated graphical output; a large variety of plots and charts; flexible graphical interaction; trees, networks, and fly-throughs; and ease of use. These packages present themselves as a "data mining expert in a box" for decision makers.

The statistical analysis procedures included in current data mining packages are:

• decision tree induction (C4.5, CART, CHAID) — see the sketch after this list
• rule induction (AQ, CN2, Recon, etc.)
• nearest neighbor methods
• clustering methods (data segmentation)
• association rules (market basket analysis)
• feature extraction
• visualization

Some also include:

• neural networks
• Bayesian belief networks (graphical models)
• genetic algorithms
• self-organizing maps
• neuro-fuzzy systems

Almost no packages include:

• hypothesis testing
• experimental design
• response surface modeling
• ANOVA, MANOVA, etc.
• linear regression
• discriminant analysis
• logistic regression
• generalized linear models
• canonical correlation
• principal components analysis
• factor analysis

(Note: SPSS's Clementine now includes linear regression, logistic regression, principal components analysis, and factor analysis; the main parts of the standard statistical packages can be reached through its integration with SPSS.)
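To make the first item in the procedures list concrete, here is a minimal sketch of decision tree induction in the CART style. It is illustrative only: the scikit-learn library and the tiny customer table are assumptions of this sketch, not anything referenced in the paper.

```python
# Minimal CART-style decision tree sketch. The data are invented and
# scikit-learn is an assumption; the paper names no particular software.
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up customer records: [age, purchases_per_year]; target: high-margin or not.
X = [[25, 3], [40, 18], [35, 2], [52, 25], [23, 1], [47, 30]]
y = [0, 1, 0, 1, 0, 1]  # 1 = high profit margin customer

tree = DecisionTreeClassifier(max_depth=2)  # a shallow tree keeps the rules readable
tree.fit(X, y)

# The induced rules are human readable: the main selling point of
# tree induction in these packages.
print(export_text(tree, feature_names=["age", "purchases_per_year"]))
print(tree.predict([[30, 20]]))  # classify a new customer
```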
Therefore, most of the methodology currently being marketed in data mining packages was produced and developed outside the statistical disciplines. The methods at the core of Statistics have been largely ignored.

3 Why now? What's the rush?

The idea of learning from data has been around for a long time. Why has people's interest suddenly become so intense? The main reason is the field's newly acquired connection to data base management. Data, particularly in large amounts, are kept in data base management systems. Traditional DBMS are oriented toward on-line transaction processing; that is, the data are organized for rapid storage and retrieval of individual records. They are typically used to track inventory, payroll records, billing records, shipments, and so on.

Recently the data base management community has become interested in using data base management systems for decision support. Such decision support systems allow statistical queries against data originally collected for on-line transaction processing applications, for example: "How many diapers did we sell last month in all of our chain stores?" Decision support requires a "data warehouse" structure. A data warehouse consolidates the data of an organization's various departments, in a common format, into a single central data base (often 100 GB or larger). Sometimes smaller sub-bases are built for special analyses; these are called "data marts." Decision support systems are designed around on-line analytical processing (OLAP), often in its multidimensional form ("multidimensional analysis"). Such an OLAP data base is logically organized around quantities of interest (dimensions) and their attributes (variables); the resulting data cube can be viewed as a high-dimensional contingency table. OLAP supports queries of the following kind: "Show total spring-season sales for the sporting goods departments of all stores on the main commercial streets of major California cities, comparing profit margins." (A rough sketch of such a query appears at the end of this section.)

OLAP is query-driven: the user poses a suspected relationship as a query; the answer suggests further questions, which are posed in turn. This analysis process continues until no more questions arise, or until the analyst is exhausted or out of time. Used this way, OLAP requires an experienced user who never sleeps and never tires, since the user must continually formulate and repeat queries. Data mining can instead be performed by a data mining system (software) that requires the user to provide only vague instructions and then automatically searches for the corresponding patterns, presenting important items, predictions, or anomalous records. (What characterizes the customers with high profit margins? If we decide to develop a certain product, predict its profit margin. Find the characteristics of the items whose sales are most accurately predicted.)

Not all large data bases are commercial; there are a great many existing data bases in science and engineering, usually associated with computer-automated data collection, for example:

a) astronomy (sky surveys)
b) meteorology (climate and environmental pollution monitoring stations)
c) satellite remote sensing
d) high energy physics
e) industrial process control

These data can also benefit from data mining technology (methodology).
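As promised above, here is a rough sketch of an OLAP-style query. Nothing in it comes from the paper: the pandas library, the miniature sales table, and all of the column names are assumptions made for illustration. The point is that a slice of the "data cube" is just a filter on dimension values followed by aggregation of the measures.

```python
# OLAP-style aggregation sketch: dimensions become group keys,
# measures become aggregates. Table and column names are invented.
import pandas as pd

sales = pd.DataFrame({
    "department": ["sporting goods", "sporting goods", "apparel", "sporting goods"],
    "season":     ["spring", "spring", "spring", "fall"],
    "state":      ["CA", "CA", "CA", "TX"],
    "revenue":    [1200.0, 950.0, 400.0, 700.0],
    "margin":     [0.21, 0.18, 0.33, 0.25],
})

# "Show total spring sales and mean profit margin for the sporting goods
#  departments of stores in California."
mask = ((sales["department"] == "sporting goods")
        & (sales["season"] == "spring")
        & (sales["state"] == "CA"))
cube_slice = (sales[mask]
              .groupby(["department", "season", "state"])
              .agg(total_revenue=("revenue", "sum"),
                   mean_margin=("margin", "mean")))
print(cube_slice)
```

Each answer of this kind typically suggests the next query, which is exactly the iterative, user-driven process the section describes.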
4 Is data mining an intellectual discipline?

The current interest in data mining has raised this question in academic circles. Data mining is clearly viable as a commercial enterprise, but can it become a viable intellectual discipline? There are certainly important connections to computer science. These include:

a) efficient computation of aggregates (ROLAP)
b) fast multidimensional lookup
c) adapting DBMS methods for incremental, on-line, query-based computation (sketched below)
d) data mining algorithms
e) disk-based rather than RAM-based implementations
f) parallel implementations of basic data mining algorithms
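As a rough illustration of item c), incremental computation maintains an aggregate as each new record arrives, rather than rescanning the whole table for every query. The Python sketch below is a toy under invented assumptions (the record format and group keys are made up), not anything described in the paper.

```python
from collections import defaultdict

class IncrementalAggregate:
    """Maintain per-group count, sum, and mean incrementally, so each
    new record is folded in without rescanning the whole table."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, group, value):
        # Fold a single new record into the running aggregates.
        self.count[group] += 1
        self.total[group] += value

    def mean(self, group):
        # Answer the query from the maintained state, without a table scan.
        return self.total[group] / self.count[group]

# Hypothetical stream of (store, sale_amount) records.
agg = IncrementalAggregate()
for store, amount in [("palo_alto", 19.95), ("houston", 5.00), ("palo_alto", 7.50)]:
    agg.update(store, amount)
print(agg.mean("palo_alto"))  # (19.95 + 7.50) / 2 = 13.725
```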
From the viewpoint of statistical data analysis, we can ask whether data mining methodology yet amounts to an intellectual discipline. So far, arguably, it does not. The procedures popularized in data mining packages come from machine learning, pattern recognition, neural networks, and data visualization. They emphasize "look and feel" and presentation; the aim appears to be capturing market share quickly rather than achieving specific technical performance. Most current work in the field focuses on refining existing machine learning methods and accelerating existing algorithms.

In the future, however, data mining will almost certainly become an intellectual discipline. Whenever a technology becomes much more efficient, people are forced to seriously rethink how to apply it. Consider the historical progression from walking to flight: each roughly tenfold increase in speed has reorganized how we use transportation. Chuck Dickens (former head of SLAC computing) once said: "Every time computing power increases by a factor of ten, we should totally rethink how and what we compute." A corresponding statement might be: "Every time the amount of data increases by a factor of ten, we should totally rethink how we analyze it." Since the time when most of the data mining tools in current use were invented, the computer's processing power and the volume of data have both grown by several orders of magnitude. Future data mining methods will be more intelligent, and more academic (rather than purely commercial).

5 Should data mining be a part of statistics?

We have seen that data mining methods arose, for the most part, outside Statistics; the question is whether Statistics as a discipline should care about their development. Should we regard data mining as part of Statistics? What would that mean? At the very least it would mean that we:

• publish such articles in our journals;
• teach some of the material in our undergraduate courses;
• teach related research topics to our graduate students;
• reward those who do well in this area (jobs, tenure, prizes).

The answer is not obvious. Throughout the history of Statistics, many new methodological developments related to data have been ignored. Below are examples of related fields. Those marked with an asterisk sprouted within statistical science, but most of their subsequent methodological development was ignored by it:

• pattern recognition* — CS/Engineering
• data base management — CS/Library science
• neural networks* — Psychology/CS/Engineering
• machine learning* — CS/AI
• graphical models* (Bayes nets) — CS/AI
• chemometrics* — Chemistry
• data visualization** — CS/Scientific computing

One can certainly affirm that individual statisticians have devoted themselves to these areas, but the fields were never embraced (or never enthusiastically embraced) by Statistics as a whole.

6 What is statistics?

Since some of these data-related topics are held not to be Statistics, we cannot help asking: what, then, is Statistics? If these matters are not Statistics, what is? So far, the definition of Statistics has depended on a set of tools, namely those we teach in our current graduate courses: probability theory, real analysis, measure theory, asymptotic theory, decision theory, Markov chains, ergodic theory, and so on. Statistics is then implicitly defined as the family of problems that can be productively attacked with these and related tools. These tools have certainly been useful in the past and will remain useful in the future. As Brad Efron reminds us, statistics has been the most successful information science; those who ignore it are condemned to reinvent its methods in practice. Some people believe that since data (and their associated applications) are growing exponentially while the number of statisticians cannot keep up with that growth, Statistics should concentrate on the part of information science we do best, namely that based on mathematical probability.
This is a highly conservative point of view, and of course it may turn out to be the best strategy. If we accept it, however, statisticians will gradually fade from the wave of the "information revolution" (becoming minor actors on that stage). One clear advantage of this strategy is that it demands very little innovation from us; we need only stick to the rules.

Another point of view, expressed as early as 1962 by John Tukey [Tukey (1962)], holds that Statistics should be concerned with data analysis. The field should be defined in terms of problems rather than tools, namely those problems that pertain to data. If this view were to become the mainstream one, it would require substantial changes in both our practice and our academic programs.

First (and most importantly), we would have to keep pace with computation. Where there is data, there is computation. Once we regard computational methodology as a basic statistical tool (rather than merely a way to apply our ready-made tools), there is no longer any data-related field that is not relevant to us; such fields would become part of our domain. It is also important to treat computational tools as objects of serious study rather than simply to use statistical packages. If computation is to be a basic research tool for us, our students should of course learn the relevant computational science, including numerical linear algebra, numerical and combinatorial optimization, data structures, algorithm design, operating systems, programming methodology, data base management, parallel systems, and programming. We would also have to expand our curricula to include modern computer-oriented data analysis methods, most of which have been developed outside the statistical disciplines.

If we want to compete with other data-related fields for academic and commercial market share, some of our basic paradigms will have to change; we may have to moderate our romance with mathematics. Mathematics (like computation) is just one tool of statistics: although very important, it is not the only tool that can validate statistical methods. Mathematics is not equivalent to theory, nor vice versa. Theory has to do with ideas and explanations; mathematics, important as it is, is not the only way to create them. For example, the germ theory of disease contains no mathematics, yet it gave us a far better understanding of many medical phenomena. We would have to recognize that empirical validation, although it has limitations, is indeed validation.

We may also have to change our culture. Every statistician who has worked in another data-related field is struck by the "cultural gap" between that field and Statistics. In other fields, "ideas" matter more than mathematical technique. An inspired idea is presumed valuable; whether more detailed validation (theoretical or empirical) follows is discussed as a question of its ultimate worth. Their attitude is "innocent until proven guilty," which is at odds with ours. In the past, if a new method's effectiveness was not proved mathematically, we tended to trash it; and even when we did not go that far, we would not accept the method. That attitude was reasonable when data sets were small and the noise level relative to the information was high. In particular, we should change our habit of disparaging methods that perform well (usually in other fields) but are not yet understood by us.

7 Which way to go?

Statistics may currently be at a crossroads: we can decide whether to accept change or to refuse it. As discussed above, both views are strongly persuasive.
Although opinions are plentiful, no one knows for certain which strategy will keep our field healthy and vital. Most statisticians seem to believe that Statistics plays an ever smaller role in information science; they simply disagree about whether anything should be done about it. The prevailing view holds that we have a marketing problem: our customers and colleagues in other fields do not understand our value and importance. This also appears to be the view of our main professional organization, the American Statistical Association.