A first look at "data mining"
Recently, in the magazine "Electrochemical Education Research", I came across a 2002 article titled "Data Mining in Data Personalization Service", and it was my first introduction to data mining. It seems to me that data mining is useful not only in technical fields but also has a place in social research. I do not yet have a specific idea of how to apply it to social science research, but I want to note the thought down here for later reference.
In the article, the author describes data mining as follows:
Definition of data mining: the process of extracting, from large volumes of incomplete, noisy, fuzzy, and random real-world data, information and knowledge that is implicit in the data, not known in advance, and potentially useful. It discovers knowledge without starting from an explicit hypothesis.
The information obtained by data mining should have three features: it is previously unknown, valid, and practical.
Data mining features: (1) Automatic prediction of trends and behavior. Data mining automatically searches large databases for predictive information, inferring future data from time-series data, that is, from historical and current data. A typical example of predictive data mining is target marketing: data mining tools can use the large amount of data from past mailings to find the customers most likely to respond to future mail-order promotions. Current prediction methods include classical statistical methods, neural networks, and machine learning.
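To make the idea of inferring future data from historical data concrete, here is a minimal sketch of one classical statistical method, a least-squares linear trend. It is only an illustration and not the method of the article; the function names and the monthly_responses figures are invented for the example.

```python
def fit_linear_trend(values):
    """Fit a least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extrapolate the fitted line steps_ahead periods past the last value."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical history: responses to the last four mailings.
monthly_responses = [100, 110, 120, 130]
next_month = forecast(monthly_responses)
print(next_month)  # 140.0
```

A straight line is the simplest possible choice; real series with seasonality or sudden changes would need a richer model.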
(2) Association analysis. This reflects knowledge about dependencies or associations between one event and other events. If two or more attributes are associated, the value of one attribute can be predicted from the values of the others. In other words, data mining uncovers the network of associations hidden in the database in order to guide decision making. For example, 90% of people who buy bread and butter also buy milk. Goods that are often purchased together can therefore be placed together to increase sales.
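A figure such as the 90% in the bread-and-butter example is usually called the confidence of an association rule, and the share of all transactions that contain the items is called the support. The sketch below computes both on a handful of made-up shopping baskets; the data and function names are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that
    also contain consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Hypothetical baskets, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]

# Rule: {bread, butter} -> {milk}
rule_conf = confidence(baskets, {"bread", "butter"}, {"milk"})
print(round(rule_conf, 3))  # 0.667
```

Here three baskets contain bread and butter and two of those also contain milk, so the rule holds with confidence 2/3.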
(3) Clustering. Clustering is the everyday idea that "like attracts like": it groups a set of individuals into several categories according to similarity. The aim is to make the distance between individuals in the same category as small as possible and the distance between individuals in different categories as large as possible. It reflects knowledge about the common characteristics of similar things and the distinguishing characteristics of different things. By clustering, the records in a database can be divided into a series of meaningful subsets. Clustering improves people's understanding of objective reality and is a prerequisite for concept description and deviation analysis. Clustering techniques mainly include traditional pattern recognition methods and mathematical taxonomy.
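As an illustration of making within-group distances small and between-group distances large, here is a minimal sketch of k-means, one common clustering algorithm. The article does not name a specific algorithm, so this choice is an assumption, as are the sample points and the decision to pass the starting centers in by hand.

```python
import math


def k_means(points, centers, iterations=10):
    """Plain k-means, starting from the given initial centers.

    Returns (labels, centers), where labels[i] is the index of the
    center assigned to points[i].
    """
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centers


# Hypothetical 2-D records forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(points, centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centers; practical implementations usually try several random starts and keep the best one.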
(4) Concept description. A concept description describes the connotation of a class of objects and summarizes the relevant characteristics of that class. It is divided into characteristic description and distinguishing description: the former describes the common features of a class of objects, and the latter describes the differences between different classes of objects. Generating a characteristic description of a class involves only what all objects in that class have in common. There are many methods for generating distinguishing descriptions, such as decision trees and genetic algorithms.
(5) Deviation detection. The data in a database often contains abnormal records, and detecting these deviations is worthwhile. Deviations carry a great deal of potential knowledge, such as anomalous instances in a classification, special cases that do not satisfy the rules, discrepancies between observed results and model predictions, and changes in values over time. The basic method of deviation detection is to find meaningful differences between observations and reference values.
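One simple way to look for meaningful differences between observations and a reference value is to take the mean as the reference and measure each difference in units of the standard deviation. The sketch below does this; the threshold of 2.0 and the sample figures are assumptions chosen for the example, not values from the article.

```python
import statistics


def find_deviations(observations, threshold=2.0):
    """Return (index, value) pairs lying more than threshold
    standard deviations away from the mean of the observations."""
    reference = statistics.mean(observations)
    spread = statistics.pstdev(observations)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, x)
        for i, x in enumerate(observations)
        if abs(x - reference) / spread > threshold
    ]


# Hypothetical daily order counts with one abnormal record.
daily_orders = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_deviations(daily_orders))  # [(7, 50)]
```

Because an extreme value also inflates the mean and standard deviation, a median-based reference is often more robust on small samples.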
Data mining process: (1) Problem definition. The first step before starting data mining is to become familiar with the background knowledge and understand the users' needs. Without background knowledge, you cannot clearly define the problem to be solved, prepare high-quality data for mining, or interpret the results correctly. To get the full value out of data mining, you must have a clear definition of the goal, that is, decide what exactly you want to do.
(2) Establishing a data mining library. To perform data mining, you must collect the data resources to be mined. It is generally recommended to collect the data to be mined into a separate database rather than using the original database or data warehouse directly. This is because in most cases the data to be mined needs to be modified, and external data will often be brought in as well; in addition, data mining performs a variety of complex statistical analyses on the data, and a data warehouse may not support the data structures these require.
(3) Analyzing the data. Analyzing the data means investigating it in depth: identifying patterns and trends in the data set and using cluster analysis to distinguish categories. The ultimate aim is to sort out the complex relationships among multiple factors and discover the correlations between them.
(4) Adjusting the data. Having gained a better understanding of the state and trends of the data through the steps above, you should now be able to make the requirements of the problem more explicit and, as far as possible, more quantified. Data is removed according to the needs of the problem, and new variables are combined or generated based on the new understanding gained during the data mining process, so as to describe the state effectively.
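The adjustment step might look like the following minimal sketch, which drops records that cannot be used and derives a new variable from two existing ones. The field names (total_spent, visits, spend_per_visit) and the records themselves are hypothetical.

```python
def adjust(records):
    """Drop unusable records and add a derived variable."""
    cleaned = []
    for record in records:
        # Remove records with a missing amount or with no visits.
        if record.get("total_spent") is None or not record.get("visits"):
            continue
        row = dict(record)  # copy, so the raw data stays untouched
        # New variable combining two existing ones.
        row["spend_per_visit"] = row["total_spent"] / row["visits"]
        cleaned.append(row)
    return cleaned


# Hypothetical customer records.
raw = [
    {"customer": "A", "total_spent": 120.0, "visits": 4},
    {"customer": "B", "total_spent": None, "visits": 2},
    {"customer": "C", "total_spent": 90.0, "visits": 0},
    {"customer": "D", "total_spent": 60.0, "visits": 3},
]
adjusted = adjust(raw)
print([(r["customer"], r["spend_per_visit"]) for r in adjusted])
# [('A', 30.0), ('D', 20.0)]
```

Working on a copy reflects the earlier recommendation to mine a separate database rather than modify the original data.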
(5) Modeling. Based on the more clearly defined problem and the further adjusted data structure, a model for forming knowledge can now be built. This step is the core of data mining; models are generally built using neural networks, decision trees, mathematical statistics, time series analysis, and similar methods.
(6) Evaluating and explaining. The patterns and models obtained above may have no practical meaning or value, may fail to reflect the true meaning of the data accurately, and in some cases may even run contrary to the facts, so they need to be assessed to determine which are valid and useful. One way to evaluate is to check against the data in the mining database established earlier; another is to find a separate batch of data and test on it; a third is to test on fresh data taken from the real environment.
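The second evaluation approach, testing on a separate batch of data, can be sketched by holding back part of the mining database before the model is built. The toy threshold model, the 25% hold-out fraction, and the data below are all assumptions made for illustration.

```python
import random


def split_holdout(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold back a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(train):
    """Toy model: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, batch):
    """Share of records in batch that the threshold classifies correctly."""
    correct = sum(1 for x, y in batch if (x > threshold) == (y == 1))
    return correct / len(batch)


# Hypothetical (value, label) records with two well-separated classes.
data = [(x, 0) for x in range(0, 10)] + [(x, 1) for x in range(20, 30)]
train, test = split_holdout(data)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(len(train), len(test), test_accuracy)  # 15 5 1.0
```

Checking only against the training records tends to flatter the model, which is why the held-back batch, or fresh data from the real environment, gives a more honest estimate.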