Technology, Methods, and Applications of Data Mining and Knowledge Discovery



Keywords:

Data Mining, Knowledge Discovery in Databases, DM, KDD, CRISP-DM, Internet

concept

The development of Internet-based global information systems has given us unprecedented amounts of data. This flood of information brings convenience but also problems: first, there is too much information to digest; second, it is hard to tell true information from false; third, information security is difficult to guarantee; fourth, information comes in inconsistent forms and is hard to process uniformly. "Rich in data, poor in knowledge" has become a typical complaint. The purpose of data mining is to extract the knowledge we need from massive data effectively, realizing the transition "data -> information -> knowledge -> value".

Data mining refers to the process of extracting potential, valuable knowledge (models or rules) from massive data. It has several synonyms: knowledge discovery in databases (KDD), information discovery, intelligent data analysis, exploratory data analysis, information harvesting, data archaeology, and so on.

The development of data mining is approximately as follows:

• 1989 IJCAI workshop: Knowledge Discovery in Databases

- Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

• 1991-1994: KDD workshops

- Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

• 1995-1998: KDD International Conferences (KDD'95-98)

- Journal of Data Mining and Knowledge Discovery (1997)

• 1998: ACM SIGKDD, the SIGKDD'1999-2002 conferences, and SIGKDD Explorations

• More international conferences on data mining

- PAKDD, PKDD, SIAM Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

Data mining is one of the most active branches of database research, development, and application. It is a multidisciplinary field that draws on database technology, artificial intelligence, machine learning, neural networks, mathematics, statistics, pattern recognition, knowledge-based systems, knowledge acquisition, information extraction, high-performance computing, parallel computing, data visualization, and more.

From the start, data mining has been application driven. It is not a simple search or query against a particular database; it applies statistics, analysis, synthesis, and reasoning to data at the micro, meso, and even macro level to guide the solution of practical problems, tries to find the interrelationships between events, and even uses existing data to predict future activity. For example, the BC provincial telephone company in Canada asked the KDD research group at Simon Fraser University to summarize and analyze more than ten years of its customer data and to propose new charging and management schemes, producing preferential policies that benefit both the company and its customers. In this way, the use of data is raised from low-level query operations to decision support for managers at every level. Such demand is a far more powerful driver than database queries. At the same time, the knowledge to be discovered here is not required to be a universal truth, a new law of natural science, a pure mathematical formula, or a machine-proved theorem. All discovered knowledge is relative: it holds under specific premises and constraints and for a specific domain, and it should be easy to understand, preferably expressible in natural language. The results of data mining research are therefore highly practical.

technology

The main tasks of data mining include data summarization, concept description, classification, clustering, association analysis, deviation analysis, modeling, and so on. The main techniques include:

Statistical Analysis

Common statistical methods include regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discriminant, Fisher discriminant, non-parametric discriminant, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.), and exploratory analysis (principal component analysis, correlation analysis, etc.). Processing can be divided into three phases: collecting data, analyzing data, and drawing inferences.
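As a small illustration of the statistical toolbox just listed, the sketch below fits a multiple regression by ordinary least squares and performs a principal component analysis with plain NumPy; the data values and variable names are invented purely for illustration.

import numpy as np

# Toy data: two explanatory variables and one response (made-up numbers).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Multiple regression: add an intercept column and solve least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and coefficients:", coef)

# Principal component analysis: eigen-decomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
print("explained variance:", eigvals[order])
print("principal directions:")
print(eigvecs[:, order])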

Decision Tree

A decision tree is a tree structure: the root node represents the entire data set; each internal node is a test on a single attribute that splits the data into two or more subsets; each leaf node holds records of a single class. A decision tree is first grown from a training set and then pruned with a test set. Its function is to predict which class a new record belongs to.

Decision trees come in two kinds: classification trees and regression trees. Classification trees make decisions over discrete variables, while regression trees make decisions over continuous variables.

A decision tree is constructed by recursive partitioning. (1) Find the initial split: the whole training set forms the root of the tree, and the training records must be labeled; decide which attribute (field) is the best splitting criterion, usually by trying every attribute, quantifying the quality of each possible split, and choosing the best one; a common quantitative criterion is a diversity measure of the split, such as the Gini index. (2) Grow the tree to a full tree: repeat the first step until the records within each leaf node all belong to the same class. (3) Prune the tree: remove data that may be noise or anomalies.

The basic (greedy) algorithm works top-down: all records start at the root node; attributes are categorical (continuous attributes are discretized first); records are partitioned recursively according to the selected attribute; attribute selection is based on a heuristic or a statistical measure (e.g., information gain). Partitioning stops when all records at a node belong to the same class, or when no attribute is left to split on.

The pseudocode is:

procedure BuildTree(S) {
    initialize root node R with data set S
    initialize queue Q with root node R
    while Q is not empty do {
        dequeue the first node N from Q
        if N is not pure {
            for each attribute A
                estimate the information gain of splitting N on A
            split N on the best attribute into N1 and N2, and enqueue N1 and N2
        }
    }
}

The statistical measures used for attribute selection are: information gain, which assumes all attributes are categorical (it can be applied to numeric fields after modification); and the Gini index (used in IBM Intelligent Miner), which can be applied to both categorical and numeric fields.
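A minimal sketch of how these two measures can be computed for a candidate split; the toy class labels below are invented for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Toy example: a split of 10 records into two child nodes.
parent = ["yes"] * 6 + ["no"] * 4
children = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]
print("information gain:", information_gain(parent, children))
print("Gini of parent:", gini(parent))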

Association Rules

Association rules reflect statistical dependencies among attributes or data items. Their general form is

X1 ∧ ... ∧ Xn => Y [C, S]

meaning that X1 ∧ ... ∧ Xn predicts Y with confidence C and support S.

Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction has a unique identifier, its TID. A transaction T contains an itemset X ⊆ I if X ⊆ T.

An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The support of the rule X => Y in the transaction database D is the fraction of transactions in D that contain both X and Y, written support(X => Y), i.e.

support(X => Y) = |{T : X ∪ Y ⊆ T, T ∈ D}| / |D|

The confidence of the rule X => Y in the transaction set is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, written confidence(X => Y), i.e.

confidence(X => Y) = |{T : X ∪ Y ⊆ T, T ∈ D}| / |{T : X ⊆ T, T ∈ D}|

Given a transaction set D, the association rule mining problem is to generate all association rules whose support and confidence are at least the minimum support (minsup) and minimum confidence (minconf) specified by the user.
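A minimal sketch of these two definitions on a made-up transaction database:

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs together with rhs) divided by support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Toy transaction database (invented for illustration).
D = [{"bread", "milk"},
     {"bread", "diaper", "beer"},
     {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper", "beer"},
     {"bread", "milk", "diaper"}]

print(support({"diaper", "beer"}, D))          # 0.6
print(confidence({"diaper"}, {"beer"}, D))     # 0.75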

Association rules can be classified by the type of variable they handle: Boolean rules deal with discrete, categorical values and show relationships between such variables, while quantitative rules deal with numeric fields (possibly combined with multidimensional or multilevel rules) by partitioning the numeric values dynamically or by processing the raw data directly, and may of course also contain categorical variables. By the level of abstraction of the data, rules can be divided into single-level and multilevel association rules: single-level rules ignore the fact that real data may have several levels of abstraction, whereas multilevel rules take this hierarchy fully into account. Rules can also be single-dimensional or multidimensional: single-dimensional rules involve only one dimension of the data, such as the items a user purchased, whereas multidimensional rules involve several dimensions.

Agrawal et al. first posed the problem of mining association rules among itemsets in customer transaction databases in 1993; the core of their approach is a recursive method based on frequent itemsets. Many researchers have since studied the mining of association rules extensively. Their work includes optimizing the original algorithm, for example by introducing random sampling and parallelism to improve mining efficiency, and proposing variants such as generalized association rules and cyclic association rules, which broaden the applications of association rules. The basic algorithm designed by Agrawal et al. in 1993 decomposes association rule mining into two subproblems, following the two-phase frequent itemset idea:

1) Find all itemsets whose support is not less than the minimum support; these are called frequent itemsets.

2) Use the frequent itemsets found in step 1 to generate the desired rules.

The second step is relatively simple. If Y = {i1, i2, ..., ik}, k ≥ 2, ij ∈ I, is a frequent itemset, then only rules whose right-hand side contains a single item are generated from the items of Y (at most k such rules), i.e., rules of the form (Y - {ij}) => ij, 1 ≤ j ≤ k. Once these rules are generated, only those whose confidence exceeds the user-specified minimum confidence are kept. Rules with more than one item on the right-hand side were studied in later work. To generate all frequent itemsets, a recursive method is used; its core idea is as follows:

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++)
{
    Ck = apriori-gen(Lk-1);            // new candidate sets
    for all transactions t ∈ D
    {
        Ct = subset(Ck, t);            // candidates contained in transaction t
        for (all candidates c ∈ Ct)
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ minsup}
}
Answer = ∪k Lk;

Frequent 1-itemsets L1 are produced first, then frequent 2-itemsets L2, and so on, until some value r makes Lr empty, at which point the algorithm stops. In the k-th pass, the procedure first generates the set of candidate k-itemsets Ck; each itemset in Ck is produced by joining two frequent (k-1)-itemsets in Lk-1 that share exactly (k-2) items. The itemsets in Ck are the candidates for frequent itemsets, and the final frequent set Lk must be a subset of Ck. Every element of Ck has to be verified against the transaction database to decide whether it joins Lk, and this verification is the performance bottleneck of the algorithm. The method may require many scans of a possibly very large transaction database: if the longest frequent itemset contains 10 items, the database must be scanned 10 times, which imposes a heavy I/O load.

Agrawal et al. introduced pruning to reduce the size of the candidate set Ck, which significantly improves the performance of generating all frequent itemsets. The pruning strategy is based on the property that an itemset can be frequent only if all of its subsets are frequent. Therefore, if any (k-1)-subset of a candidate in Ck does not belong to Lk-1, that candidate can be pruned, and the cost of counting its support is saved.
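A compact sketch of the Apriori loop described above, including the subset-based pruning; the item names and the minimum support are invented for illustration, and minsup is taken here as an absolute count rather than a fraction.

from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) whose support count >= minsup."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup}
    answer = set(Lk)
    k = 2
    while Lk:
        # apriori-gen: join Lk-1 with itself, then prune by the subset property.
        Ck = set()
        for a in Lk:
            for b in Lk:
                cand = a | b
                if len(cand) == k and all(frozenset(s) in Lk for s in combinations(cand, k - 1)):
                    Ck.add(cand)
        # Count how many transactions contain each surviving candidate.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup}
        answer |= Lk
        k += 1
    return answer

D = [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
print(apriori(D, minsup=3))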

Even with these optimizations, the Apriori-style frequent itemset method has some inherent drawbacks that cannot be overcome. 1) A huge number of candidate sets: when there are 10^4 frequent 1-itemsets, the number of candidate 2-itemsets exceeds 10^7, and to produce a long rule the number of intermediate itemsets that must be generated is also enormous. 2) Rare information cannot be analysed: because the parameter minsup is used, events rarer than minsup cannot be analysed, and if minsup is set to a very low value the efficiency of the algorithm becomes a serious problem. Two methods address these two problems. One approach to problem 1 is FP-growth, which adopts a divide-and-conquer strategy: after the first scan, the frequent items of the database are compressed into a frequent-pattern tree (FP-tree) that still retains the association information. The FP-tree is then divided into conditional pattern bases, each associated with a frequent item of length 1, and these conditional bases are mined separately. When the raw data set is very large, partitioning can also be combined so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to frequent patterns of different lengths and is far more efficient than the Apriori algorithm.

The second problem is approached from this idea: the Apriori algorithm mines frequent relationships, but in real applications we may need to find highly correlated elements even when they are not frequent. In Apriori the decisive quantity is support; here confidence is put first, and rules with very high confidence are mined. The overall algorithm has three steps: compute features, generate candidate sets, and filter candidates. The key point among the three steps is the use of hashing when computing features. When evaluating a method, several indicators matter: time and space efficiency, the error rate, and the miss rate. There are two basic families of methods: min-hashing (MH) and locality-sensitive hashing (LSH). The basic idea of min-hashing is to use the positions of the first k 1's in a column as the hash function. The basic idea of locality-sensitive hashing is to partition the whole database probabilistically so that similar columns are more likely to fall together and dissimilar columns are less likely to. Comparing the two, MH has a miss rate of zero and its error rate can be strictly controlled through k, but its time and space efficiency is relatively poor; LSH cannot reduce its miss rate and error rate simultaneously, but its time and space efficiency is better. The choice therefore depends on the situation. The reported experimental data also show that this approach produces some useful rules.

Case-Based Reasoning

Case-based reasoning solves a given problem by directly reusing past experience or solutions. A case is usually a specific problem that has been encountered before and already has a solution. When a new problem is given, the case base is searched for similar cases; if a similar case exists, its solution is adapted to solve the new problem, and the new problem together with its solution is then added to the case base, ready for future retrieval.

Fuzzy Set

Fuzzy sets are an important way to represent and handle uncertain data. They can not only handle incomplete, noisy, or imprecise data, but also help develop models of data uncertainty, and they can deliver smarter and smoother performance than traditional methods.

Rough Set

Rough sets are a relatively new mathematical tool for handling vagueness and uncertainty, and they can play an important role in data mining. A rough set is defined by a lower approximation and an upper approximation: every member of the lower approximation is certainly a member of the set, while anything outside the upper approximation is certainly not a member. The upper approximation is the union of the lower approximation and the boundary region; members of the boundary region may be members of the set, but not certainly. A rough set can thus be regarded as a fuzzy set with a three-valued membership function. Rough sets are usually used together with association rules, classification, and clustering methods rather than on their own.

Bayesian Belief Network

A Bayesian belief network is a graphical representation of a probability distribution. It is a directed acyclic graph whose nodes represent attribute variables and whose arcs represent probabilistic dependencies between them; associated with each node is a conditional probability distribution that describes the relationship between the node and its parents.

Definition 1: Given a set of random variables X = {x1, x2, ..., xn}, where each xi is an m-dimensional vector, a Bayesian belief network specifies a joint conditional probability distribution over X. It is defined by two parts.

The first part is a directed acyclic graph G whose vertices correspond to the random variables x1, ..., xn. An arc represents a functional dependency: if there is an arc from variable xi to variable xj, then xi is a parent (direct predecessor) of xj and xj is a descendant of xi. Once its parents are given, each variable in the graph is independent of its non-descendants. The set of all parents of xi in G is denoted Pa(xi).

The second part is a set of parameters that quantify the network. For each xi and each value of Pa(xi), there is a conditional probability P(xi | Pa(xi)), which gives the probability of xi given its parents. A Bayesian belief network therefore represents the joint conditional probability distribution of the set of variables as

P(x1, x2, ..., xn) = ∏i P(xi | Pa(xi)).

The construction (learning) problem for a Bayesian network can be stated as follows: given a set of training samples D = {d1, d2, ..., dm}, where each dj is an instance of X, find the Bayesian belief network that best matches the samples. Common learning algorithms introduce a scoring function, use it to evaluate the fit between every possible network structure and the samples, and search for an optimal structure among them. Commonly used scoring functions are the Bayesian score metric and the minimal description length (MDL) function.
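A tiny numeric illustration of the factorization above, using a made-up three-node chain network (Cloudy -> Rain -> WetGrass); all probabilities are invented.

# Conditional probability tables of a made-up chain network: Cloudy -> Rain -> WetGrass.
P_cloudy = {True: 0.5, False: 0.5}
P_rain_given_cloudy = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_wet_given_rain = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    """P(cloudy, rain, wet) = P(cloudy) * P(rain | cloudy) * P(wet | rain)."""
    return (P_cloudy[cloudy]
            * P_rain_given_cloudy[cloudy][rain]
            * P_wet_given_rain[rain][wet])

# Probability that it is cloudy, raining, and the grass is wet.
print(joint(True, True, True))   # 0.5 * 0.8 * 0.9 = 0.36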

Support Vector Machine (SVM)

The support vector machine (SVM) is based on the structural risk minimization principle of computational learning theory. Its main idea is, for a two-class classification problem, to find a hyperplane in a high-dimensional space that separates the two classes while guaranteeing the smallest classification error. An important advantage is that it can handle linearly inseparable cases.
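A minimal usage sketch, assuming the scikit-learn library is available; the toy points and labels are invented, and an RBF kernel is chosen to illustrate the linearly inseparable case.

from sklearn import svm

# Toy two-class data set (invented): two clusters in the plane.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]]
y = [0, 0, 0, 1, 1, 1]

# An RBF kernel lets the SVM handle cases that are not linearly separable.
clf = svm.SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.15, 0.2], [1.05, 0.95]]))   # expected: [0 1]
print("number of support vectors per class:", clf.n_support_)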

Hidden Markov Model

A Markov process is a way of describing a system, introduced by the Russian mathematician Markov; it is specified by the system's states, the initial state probabilities, and the state transition probabilities. A hidden Markov model is specified by the number of states, the number of output symbols, the state set, the state transition probability distribution, the output (emission) probability distribution of each state, and the initial state distribution. Hidden Markov models have three basic problems: evaluation (given an output sequence and a model, what is the probability that the model generated the sequence?), decoding (given an output sequence and a model, what is the most likely state sequence that could have generated it?), and training (given an output sequence and a topology, how should the model parameters, including the state transition and output probability distributions, be adjusted so that the probability of the observed sequence is maximized?).
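A small sketch of the evaluation problem, solved with the standard forward algorithm; the two-state weather model and all of its probabilities are made up for illustration.

def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence under an HMM (forward algorithm)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: sum(alpha[-1][p] * trans_p[p][s] for p in states) * emit_p[s][o]
                      for s in states})
    return sum(alpha[-1].values())

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3}, "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(forward(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))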

Neural Network

A neural network is a network formed by a large number of interconnected neurons, somewhat like the Internet. The human brain has on the order of a hundred billion neurons, each connected on average to about ten thousand others; these interconnections are the direct material basis of human intelligence. A neuron consists of a cell body, dendrites (inputs), and an axon (output), and it has two working states, excitation and inhibition. The strength of the connection from one neuron to another (how strongly the latter reacts to the former) changes with external stimulation, and this is the basis of the learning function.

How an artificial neural network works: before it can be used, a network must first be trained according to some learning rule. Take recognizing the two handwritten letters "A" and "B" as an example: the network should output "1" when "A" is presented and "0" when "B" is presented. The learning rule is therefore: if the network makes a wrong judgment, learning should make the network less likely to make the same mistake again. First the connection weights are given random values in the interval (0, 1). "A" is presented to the network, which forms a weighted sum of the input pattern, compares it with a threshold, and then applies a nonlinear operation to obtain its output. At this stage the network outputs "1" or "0" with probability 50% each, i.e., completely at random. If the output is "1" (correct), the relevant connection weights are increased so that the network can still judge correctly the next time "A" is presented. If the output is "0" (wrong), the connection weights are adjusted in the direction of decreasing the weighted-sum input, so that the next time "A" is presented the network is less likely to make the same mistake. After the network has been trained in this way on "A" and "B" a number of times, the accuracy of its judgments improves greatly. This shows that the network has learned the two patterns successfully and has stored them, distributed across its connection weights; when either pattern is presented again, it can recognize it quickly and accurately. In general, the more neurons a network contains, the more patterns it can memorize and recognize.
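A minimal single-neuron sketch of the learning rule just described; the two "letter" bitmaps, the threshold, and the learning rate are all made up for illustration.

import random

# Two made-up 3x3 bitmaps standing in for the letters "A" and "B".
A = [0, 1, 0, 1, 1, 1, 1, 0, 1]
B = [1, 1, 0, 1, 1, 1, 1, 1, 0]
training = [(A, 1), (B, 0)]          # target: 1 for "A", 0 for "B"

random.seed(0)
w = [random.random() for _ in A]     # weights start as random values in (0, 1)
theta, lr = 2.0, 0.1                 # threshold and learning rate (assumed values)

def output(x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > theta else 0     # weighted sum compared with the threshold

for _ in range(20):                  # repeat the learning procedure a few times
    for x, target in training:
        err = target - output(x)
        # If the answer is wrong, nudge each weight so that the same mistake
        # becomes less likely next time (increase for target 1, decrease for 0).
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]

print(output(A), output(B))          # expected: 1 0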

The characteristics of neural network models are: they use large numbers of simple computing units (neurons) to achieve massively parallel computation; storage is distributed, with information held in the connection weights of the whole network, so memory is associative and complete information can be recovered from partial information; and they are self-organizing and self-learning. Their working mechanism is to change the connection strengths between neurons through learning. The basic architectures of artificial neural networks are recurrent (feedback) networks and feedforward networks.

Common neural network models include the Hopfield network, the Hamming network, the Carpenter/Grossberg classifier, the single-layer perceptron, the multilayer perceptron, the Kohonen self-organizing map, and the back-propagation (BP) network.

Multilayer perceptron (error back-propagation network): in 1986 the error back-propagation learning algorithm was fully described in Parallel Distributed Processing by Rumelhart and McClelland and became widely accepted. A multilayer perceptron is a classic neural network with three or more layers. A typical multilayer perceptron is a three-layer feedforward network consisting of an input layer I, a hidden (middle) layer J, and an output layer K. Neurons in adjacent layers are fully connected, that is, every neuron of one layer connects to every neuron of the next layer, and there are no connections between neurons within the same layer.

Kohonen neural network: it is inspired by the behaviour of the human retina and cerebral cortex. Neurobiological results show that there are many specialized cells that are especially sensitive to particular patterns (input modes), so that a specific input strongly excites particular cells in the cortex while the excitation of neighbouring cells is suppressed. For a given input pattern, only one corresponding output neuron is activated through competition in the output layer; different input patterns activate different output neurons, forming a "feature map" that reflects the input data. A competitive neural network is trained without a teacher: it classifies input patterns automatically through its own training. Compared with other kinds of neural networks and learning rules, competitive networks have distinctive features. Structurally, they have neither the purely one-way connections between layers of classic hierarchical networks nor the lack of obvious layering of fully connected networks; they are generally two-layer networks consisting of an input layer (simulating retinal neurons) and a competition layer (simulating cortical neurons, also called the output layer). The neurons of the two layers are fully connected in both directions, there is no hidden layer, and the neurons of the competition layer are also laterally connected to one another. The basic idea is that the neurons of the competition layer compete for the chance to respond to the input pattern; in the end only one neuron wins, and only the connection weights related to the winning neuron are corrected, adjusted in a direction that makes it even more likely to win such competitions. When the trained network is working, for a given input pattern the competition-layer neuron corresponding to the most similar learned pattern produces the largest output; that is, the winning neuron represents the classification result. Achieving this by competition is, in effect, the process of recall in the network.

In 1986 the American physicist J. Hopfield published several papers proposing the Hopfield neural network. Using the energy-function method from the theory of nonlinear dynamical systems, he studied the stability of artificial neural networks and used this method to set up system equations for optimization problems. The basic Hopfield network is a fully connected single-layer feedback system built from nonlinear units: every neuron transmits its own output to all other neurons through its connections and at the same time receives the information passed by all of them, so the output state of each neuron at time t is indirectly related to the outputs of the other neurons at time t-1. The Hopfield network is therefore a feedback network, and its state changes can be described by difference equations. An important property of a feedback network is that it has stable states: when the network reaches a stable state, its energy function is at a local minimum. The energy function here is not energy in the physical sense; it borrows the concept formally to characterize the trend of change of the network state, which keeps changing under the Hopfield operating rule until a minimum of the objective is reached. Network convergence means that the energy function reaches a minimum. If the objective function of an optimization problem is mapped onto the network energy function and the problem variables onto the network states, the Hopfield network can be used to solve combinatorial optimization problems. While a Hopfield network runs, the connection weights of its neurons stay fixed and only the output states of the neurons are updated. The operating rule of the Hopfield network is: select a neuron ui from the network, compute its weighted input according to formula (1), then compute ui's output at time t+1 according to formula (2); the outputs of all neurons other than ui remain unchanged; return to the first step and repeat until the network reaches a stable state. For a network of a given structure, when the network parameters (the connection weights and thresholds) change, the number and size of the minima of the energy function (the stable equilibrium points of the network) also change. Therefore the patterns to be memorized can be designed as stable equilibrium points that determine the network state; if the network has m equilibrium points, it can memorize m patterns. When the network is started from a state close to a memorized pattern (a pattern with some modifications or noise, i.e., only partial information is supplied), running the Hopfield rule updates the state until it settles at the minimum of the energy function, completing the association from partial information. The back-propagation training algorithm, first developed by Werbos, is an iterative gradient algorithm that minimizes the mean squared error between the actual and desired outputs of a feedforward network. The BP network is a multilayer mapping network that propagates errors backwards and corrects them; with suitable parameters it converges to a small mean squared error and is one of the most widely used network models. The shortcomings of BP networks are long training times and the tendency to get trapped in local minima.
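Returning to the Hopfield network described above, a small sketch of associative recall with Hebbian weights and asynchronous sign updates; the stored patterns and the noise are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Two made-up bipolar patterns (+1/-1) to be stored as stable states.
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
n = patterns.shape[1]

# Hebbian weight matrix; no self-connections.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)

def recall(state, steps=50):
    """Asynchronous updates: each step moves one neuron toward lower energy."""
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(n)
        state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Start from a noisy version of the first pattern (two bits flipped).
noisy = patterns[0].copy()
noisy[0] *= -1
noisy[3] *= -1
print(recall(noisy))          # usually settles back to the first stored pattern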

Artificial neural networks may never replace the human brain, but they can help extend human understanding and intelligent control of the external world. For example, the GMDH network was originally proposed by Ivakhnenko (1971) for predicting fish populations in rivers and seas, and was later successfully applied to supersonic aircraft (Shrier, 1987) and to power system load forecasting (Sagara and Murata, 1988). The human brain is extremely sophisticated but is not suited to processing huge volumes of data and complex computation; artificial neural network models inspired by the brain, combined with high-speed computers, will greatly improve our understanding of the objective world.

Genetic algorithms

The genetic algorithm (GA) is an optimization strategy modelled on the principle of natural evolution. GA received little attention at first, but in recent years it has been applied to learning, optimization, adaptive systems, and other problems.

The GA process is, briefly, as follows. First a set of points is taken in the solution space as the first generation. Each point (chromosome) is expressed as a binary string, and its quality is measured with an objective (fitness) function. When evolving to the next generation, each string of the current generation is copied into a mating pool with a probability determined by its fitness value: good strings are copied with high probability and inferior strings are eliminated. The strings in the mating pool are then paired, and crossover is performed on each pair to produce new offspring strings. Finally the new strings are mutated. This yields a new generation. Repeating the same procedure over a number of generations produces, in the final generation, the global optimum or an approximation to it.
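A compact sketch of the loop just described (fitness-proportional selection, single-point crossover, bit-flip mutation); the objective, maximizing the number of 1 bits in the string, is chosen only for illustration.

import random

random.seed(42)
LENGTH, POP, GENS = 16, 20, 40

def fitness(bits):
    return sum(bits)                      # toy objective: count of 1 bits

def select(pop):
    # Roulette-wheel selection: copy into the mating pool with probability
    # proportional to fitness.
    return random.choices(pop, weights=[fitness(b) + 1 for b in pop], k=len(pop))

def crossover(a, b):
    point = random.randint(1, LENGTH - 1)  # single-point crossover
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate=0.01):
    return [1 - x if random.random() < rate else x for x in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pool = select(population)
    nxt = []
    for i in range(0, POP, 2):
        c1, c2 = crossover(pool[i], pool[i + 1])
        nxt += [mutate(c1), mutate(c2)]
    population = nxt

print(max(fitness(b) for b in population))   # close to LENGTH after evolution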

The greatest strength of GA is its simplicity: it has only three operators, reproduction (copying), crossover, and mutation. During the search, selection of the fittest, recombination, and mutation steadily produce better and better candidate solutions. In data mining, genetic algorithms are mainly used to form hypotheses about dependencies between variables.

Time Series

A time series is a sequence of values that change over time. Processing time-series data includes trend analysis (long-term trend movements, cyclic movements or variations, seasonal movements or variations, and irregular or random movements), similarity search, sequential pattern mining, periodicity analysis, and related tasks.

Trend analysis: a variable Y, for example a daily closing price, can be viewed as a function of time t, Y = F(t); such a function can be represented by a time-series graph.

How do we analyse such time-series data? Four components deserve attention. (1) Long-term movements: the overall trend over a long period, which can be shown with a "trend curve" or "trend line". (2) Cyclic movements or cyclic variations: oscillations of the trend line or curve, which need not be strictly periodic and need not follow a law of equal time intervals. (3) Seasonal movements or variations: for example, sales of chocolate and flowers jump just before Valentine's Day; these are patterns that recur in the same periods year after year. (4) Irregular or random movements, caused by sporadic accidental events. These movements are denoted by the variables T, C, S, and I respectively, and time-series analysis decomposes a series into these four basic components; the series variable Y can be modelled as the product or the sum of the four variables.

"Given a set of values of Y, how do we analyse the trend of the data?" A very common method is to average over a sliding window, called a moving average of order n; if weights are applied, it is a weighted moving average of order n. For example, given a series of 9 values, we compute its moving average of order 3 and its weighted moving average of order 3 (weights 1, 4, 1). This can be shown in the following table:

original series:                    3    7    2    0    4    5    9    7    2
moving average of order 3:               4    3    2    3    6    7    6
weighted moving average (1, 4, 1):       5.5  2.5  1    3.5  5.5  8    6.5

Averaging smooths the series; the weights emphasize the middle value of each window.
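A small sketch that reproduces the two averaged rows of the table above:

def moving_average(series, n):
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

def weighted_moving_average(series, weights):
    n, total = len(weights), sum(weights)
    return [sum(w * x for w, x in zip(weights, series[i:i + n])) / total
            for i in range(len(series) - n + 1)]

series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(series, 3))                    # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(series, [1, 4, 1]))   # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]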

Are there other ways to estimate the trend? One is the freehand method: a curve is fitted to approximate the data, the best-fitting curve being the one that minimizes the sum of the di, where di is the difference between the curve value and the actual value yi. Is there a way to adjust for seasonal fluctuations? In actual business use, seasonal fluctuations are usually summarized and removed with the method of seasonal index numbers.

Mining sequential patterns: sequential pattern mining looks for patterns that occur frequently with respect to time or other orderings. An example of a sequential pattern is "a customer who bought a PC nine months ago is likely to buy a new CPU within one month". Since much data comes in such temporal sequences, sequential patterns can be used for market trend analysis, customer retention, weather forecasting, and so on. Application areas include prediction of customer purchase behaviour, web access pattern prediction, disease diagnosis, natural disaster prediction, and DNA sequence analysis.

Examples and parameters for sequential pattern mining: several parameters affect the mining results. The first is the duration T of the time series, i.e., the valid period of the series or a period selected by the user, for example the year 1999; sequential pattern mining is then defined over the data in that period. The second is the time folding window W: events that occur within the window can be treated as simultaneous. If W is set to the whole duration T, we find association-like patterns such as "in 1999, customers who bought a PC also bought a digital camera" (regardless of order). If W is set to 0, sequential patterns concern events occurring at different times, e.g., "customers who bought a PC and then memory are likely to buy a CD-ROM later". If W is set to some interval (for example a month or a day), transactions within that interval are treated as simultaneous in the analysis. The third parameter is the time interval INT between the events of a discovered pattern. With INT = 0, the window W must be considered; for example, if W is set to a week, event A and event B must happen within the same week. Other settings, such as requiring min_interval ≤ INT ≤ max_interval, constrain how far apart the events of a pattern may be.

Visualization

Visualization is the process of transforming data, information, and knowledge into a visual form. Its characteristics are: the focus of information visualization is the information itself; the volume of data is large; and the sources of information are varied.

method

Mining process

The general procedure of data mining is as follows:

1. Identify the business objective. Clearly defining the business problem and recognizing the purpose of data mining is an important first step. The final result of mining is unpredictable, but the problem to be explored should be foreseeable; mining purely for the sake of mining is blind and will not succeed.

2. Data preparation. 1) Data selection: search all internal and external data related to the business objective and select the data suitable for the mining application. 2) Data preprocessing: examine the quality of the data in preparation for further analysis, and determine the type of mining operation to be performed. 3) Data transformation: convert the data into an analysis model built for the mining algorithm. Establishing an analysis model that truly suits the mining algorithm is the key to the success of data mining.

3. Data mining. Mine the transformed data. Apart from selecting and refining an appropriate mining algorithm, the remaining work can be done automatically.

4. Result analysis. Interpret and evaluate the results. The analysis method generally depends on the data mining operation, and visualization techniques are commonly used. 5. Knowledge assimilation: integrate the knowledge obtained into the organization's business information system.

Carrying out the data mining process step by step requires people with different expertise, who fall into three groups. Business analysts: they must understand the business, be able to interpret the business objects, and translate each business objective into data definitions and requirements on the mining algorithms. Data analysts: they are proficient in data analysis technology and reasonably skilled in statistics, and can turn business requirements into data mining operations and choose suitable techniques for each step. Data managers: they are proficient in data management technology and collect the data from databases or data warehouses. Data mining is thus a process of cooperation among different kinds of experts, and a process requiring substantial investment in funds and technology.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining and was developed by SPSS, NCR, and Daimler-Benz starting in 1996. CRISP-DM is one of the generally accepted standards in today's data mining industry; it emphasizes that data mining in business applications exists to solve business problems. Its process is as follows:

Business understanding (discover the problem and identify the business goals; evaluate existing resources and determine whether the problem can be solved with data mining; set the data mining targets)

Data understanding (determine the data required for mining; describe the data; explore the data initially; check data quality)

Data preparation (select data; clean data; reconstruct data; adjust data formats)

Modeling (evaluate candidate models; select the data mining model; build the model)

Evaluation (assess the data mining results; review every step of the data mining process; decide what to do next: release the model, or adjust the mining process further and build a new model)

Deployment (deliver the results of the data mining model to the appropriate managers; monitor and maintain the model day to day; update the data mining model regularly)

Data summary

The purpose of data summarization is to condense the data and give a compact description of it. The usual approach is to compute various statistics over the data and present them in graphs and tables. Data summarization in data mining, simply put, presents data at a higher conceptual level so as to meet various user needs. There are two main methods: multidimensional data analysis and attribute-oriented induction.

Multidimensional data analysis is a data warehouse technique, also known as online analytical processing (OLAP). The concept of OLAP was first proposed in 1993 by E. F. Codd, the father of the relational database. Codd held that online transaction processing (OLTP) could not satisfy end users' needs for query-based analysis, and that simple SQL queries against large databases could not meet their analytical requirements: decision analysis demands extensive computation over relational data, and the results of plain queries do not answer decision makers' questions. He therefore proposed the concepts of the multidimensional database and multidimensional analysis, i.e., OLAP. Codd put forward 12 rules to characterize OLAP systems:

Rule 1: Multidimensional conceptual view

Rule 2: Transparency

Rule 3: Accessibility

Rule 4: Consistent reporting performance

Rule 5: Client/server architecture

Rule 6: Generic dimensionality

Rule 7: Dynamic sparse matrix handling

Rule 8: Multi-user support

Rule 9: Unrestricted cross-dimensional operations

Rule 10: Intuitive data manipulation

Rule 11: Flexible reporting

Rule 12: Unlimited dimensions and aggregation levels

A data warehouse is a collection of integrated, stable, time-variant historical data organized for decision support. Decision making relies on data analysis, and the aggregation operations used in such analysis are computationally expensive, so a natural idea is to precompute and store the aggregation results for use in the decision support system; the place where these aggregated results are stored is a multidimensional database. Multidimensional data analysis is oriented toward the data warehouse, which stores offline historical data. To handle online data, researchers proposed attribute-oriented induction. Its idea is to generalize, on the fly, the data view that the user is interested in (obtainable with an ordinary SQL query), rather than storing pre-aggregated data in advance as the multidimensional approach does; this data generalization technique is called attribute-oriented induction. The generalization operation summarizes the original relation at a higher conceptual level, and from the generalized relation various further analyses can produce the knowledge required, such as characteristic rules, discriminant rules, classification rules, and association rules.

Concept description

Concept description covers characterization, which gives a concise and clear description of the selected data, and comparison (discrimination), which gives the result of comparing two or more sets of data.

The basic techniques are: data focusing, selecting the data relevant to the current analysis, including the relevant dimensions; attribute removal, dropping an attribute if it has a large number of distinct values and either there is no generalization operator on it or its higher-level concepts can be expressed with other attributes; attribute generalization, applying a generalization operator to an attribute that has a large number of distinct values; attribute threshold control, which limits the number of distinct values per attribute (typically 2-8, specified or default); and generalized relation threshold control, which limits the size of the final generalized relation.

The basic algorithm is: InitialRel, collect the relevant data and form the initial relation; PreGen, decide, from the number of distinct values of each attribute, whether to drop the attribute or to generalize it; PrimeGen, generalize each attribute to the level decided in the previous step, accumulate the counts, and produce the prime generalized relation; presentation, express the results using the generalized relation, cross-tabulations, or 3-D cubes.

Association analysis

Association rules are rules of the form "if X then Y". The databases targeted by this application are transaction databases, also called basket data. A transaction generally consists of the transaction time, the set of items purchased, and sometimes a customer identification number (such as a credit card number).

Thanks to bar-code technology, retailers can collect large amounts of sales data through point-of-sale terminals. Analysing such historical transaction data can provide extremely valuable information about customers' purchasing behaviour: for example, it can help decide how to arrange goods on the shelves (such as placing items that are often bought together near each other) and how to plan promotions (which items to bundle together). Finding association rules in transaction data is therefore very important for the decision making of retailing and similar business activities.

Let I = {i1, i2, ..., im} be a set of items (for example, the items sold in a store), and let D be a set of transactions (the transaction database). Each transaction T in D is a set of items, so T ⊆ I. A transaction T supports an itemset X if X ⊆ T. An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.

(1) The itemset X has support s in D if s% of the transactions in D support X;

(2) The association rule X => Y has support s in the transaction database D if the support of the itemset X ∪ Y is s;

(3) The rule X => Y has confidence c in the transaction database D if c% of the transactions in D that support the itemset X also support the itemset Y.

If support and confidence are not taken into account, there are infinitely many association rules in a transaction database. In fact, people are usually interested only in rules that satisfy certain support and confidence requirements; in the literature, rules that meet such requirements (sufficiently high support and confidence) are called strong rules. To find meaningful association rules, two thresholds are therefore needed: minimum support and minimum confidence. The former, specified by the user, is the minimum support that an association rule must satisfy and indicates how statistically significant an itemset must be; the latter, also specified by the user, is the minimum confidence that a rule must satisfy and reflects the minimum reliability of the rule.

In practice, a more useful kind of association rule is the generalized association rule. Items form hierarchies: for instance, ski jackets belong to the class "jacket", and jackets and shirts belong to the class "clothes". With such a hierarchy one can discover more meaningful rules, for example "buying a jacket implies buying shoes" (jacket and shoes here are items or concepts at a higher level, so this is a generalized association rule). Because a store or supermarket carries thousands of items, the average support of a single item (say, skis) is low and useful rules are sometimes hard to find; but at a higher level (say, jackets) the support is higher, and useful rules may emerge.

In addition, the idea behind association rule discovery can also be applied to sequential pattern discovery. When purchasing, besides the associations described above, there is also a time or sequence dimension: customers often buy something this time that is related to what they bought last time, and then buy something else later.

Classification and prediction

Classification is a very important task in data mining and is widely used in commercial applications. The purpose of classification is to learn a classification function or classification model (also called a classifier) that maps the data items in a database to one of several given classes. Both classification and regression can be used for prediction, whose purpose is to derive automatically from historical data a generalized description of the data that can be used to predict future data. Unlike regression, classification outputs discrete class labels, whereas regression outputs continuous values.

To construct a classifier, a training sample data set is needed as input. The training set consists of database records or tuples; each tuple is a feature vector of field (attribute or feature) values together with a class label. A training sample therefore has the form (v1, v2, ..., vn; c), where the vi are field values and c is the class.

Classifiers can be constructed with statistical methods, machine learning methods, neural network methods, and others. Statistical methods include Bayesian methods and non-parametric methods (nearest-neighbour or instance-based learning), whose knowledge representations are discriminant functions and prototypes respectively. Machine learning methods include decision tree methods and rule induction; the former are represented by decision (discrimination) trees, while the latter generally produce rules. The main neural network method is the BP algorithm, whose model is a feedforward network (an architecture of nodes representing neurons and weighted edges representing connections); the BP algorithm is essentially a nonlinear discriminant function. In addition, a newer approach, the rough set, has been proposed; its knowledge representation is production rules.

Different classifiers have different characteristics. Classifiers are evaluated or compared along three dimensions: (1) predictive accuracy, (2) computational complexity, and (3) simplicity of the model description. Predictive accuracy is the most common yardstick, especially for predictive classification tasks, and ten-fold cross-validation is the widely accepted way of estimating it. Computational complexity depends on the implementation details and the hardware environment; in data mining the objects of computation are huge databases, so space and time complexity are very important. For descriptive classification tasks, the more concise the model description the better: a classifier expressed as rules, for example, is easy to use, whereas the results produced by a neural network are hard to interpret. Classification performance is also related to the characteristics of the data: some data are noisy, some have missing values, some are sparse, some fields or attributes are strongly correlated, and some attributes are discrete while others are continuous or mixed. It is generally accepted that no single classification method suits data of every kind.

Classification has two steps. (1) Model construction: build a model over data whose classes are already determined; each record is assumed to belong to a predefined class, recorded in a class-label attribute; the data set used for building the model is the training set; the model can be expressed as classification rules, a decision tree, or a mathematical formula. (2) Model use: apply the model to classify future or unlabelled records, and estimate its accuracy by predicting on a test set and comparing the results with the actual values. Note that the test set must be independent of the training set.

Cluster analysis

Clustering groups individuals into classes according to similarity between "objects". Its purpose is to make the distance between individuals of the same class as small as possible and the distance between individuals of different classes as large as possible. Clustering methods include statistical methods, machine learning methods, neural network methods, and database-oriented methods.

In statistics, clustering is called cluster analysis and is one of the three main methods of multivariate data analysis (the other two being regression analysis and discriminant analysis). It mainly studies clustering based on geometric distance, such as Euclidean distance or Minkowski distance. Traditional statistical cluster analysis methods include hierarchical clustering, decomposition methods, joining methods, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering. These methods perform a global comparison and require all individuals to be available before the classification is determined, so all data must be given in advance and new data objects cannot be added later. Moreover, their computational complexity is not linear, which makes them hard to apply to very large databases.
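As an illustration of distance-based clustering, a minimal k-means sketch (k-means is a typical representative of the dynamic clustering methods mentioned above) on made-up two-dimensional points:

import random

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to the nearest centre, then recompute centres."""
    centres = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centres[j][0]) ** 2 + (p[1] - centres[j][1]) ** 2)
            clusters[i].append(p)
        centres = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

random.seed(3)
# Two made-up groups of points around (0, 0) and (5, 5).
pts = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)] + \
      [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(20)]
centres, clusters = kmeans(pts, k=2)
print(centres)                     # roughly (0, 0) and (5, 5)
print([len(c) for c in clusters])  # roughly [20, 20]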

In machine learning, clustering is called unsupervised learning or learning without a teacher: unlike in classification learning, the examples to be clustered have no class labels, and the classes must be determined automatically by the clustering algorithm. In much of the artificial intelligence literature clustering is also called conceptual clustering, because the distance used is no longer the geometric distance of statistical methods but is determined from descriptions of concepts. When the objects to be clustered arrive incrementally, conceptual clustering is called concept formation.

Among neural networks there is a class of unsupervised learning methods, the self-organizing neural networks, such as the Kohonen self-organizing feature map and competitive learning networks. In the data mining field, the neural clustering method most often reported is the self-organizing feature map.

A good clustering method produces high-quality clusters, which must have the following two properties:

high intra-cluster similarity and low inter-cluster similarity. The quality of a clustering result depends on the similarity measure used by the method and on its implementation, and also on whether the method can discover some or all of the hidden patterns. Data mining places particular requirements on clustering: scalability; the ability to handle attributes of different types; the ability to discover clusters of arbitrary shape; minimal need for domain knowledge when setting input parameters; the ability to handle noise and outliers; insensitivity to the order of the input data objects; the ability to handle high-dimensional data; the ability to produce good clustering results that satisfy user-specified constraints; and results that are interpretable, understandable, and usable. Typical applications of clustering include: pattern recognition; spatial data analysis (in GIS, building thematic indexes by clustering feature spaces; in spatial data mining, detecting and explaining clusters in space); image processing; economics (especially market research); and, on the WWW, document classification and the analysis of web log data to discover groups with similar access patterns.

Deviation analysis

Deviation analysis is also called outlier analysis. Outlier analysis is an important aspect of data mining, used to find "small patterns" (in contrast to clustering), i.e., data objects that are significantly different from the rest of the data.

Hawkins (1980) gave the definition of an outlier: an outlier is an observation that deviates so much from the other observations as to arouse suspicion that it was not generated randomly but by a completely different mechanism.

Outliers may be discovered during data entry or checking; for example, a recorded age of 999 is found when the database is inspected. Outliers may also be inherent in the data rather than errors; for example, a CEO's salary is naturally much higher than that of ordinary employees.


Many data mining techniques try to minimize, or even completely eliminate, the influence of outliers. Doing so, however, may lose important hidden information, because what is "noise" to one party may be highly significant to another. In other words, these "exceptions" may play a special role, for example in finding fraud. Discovering and analysing such exceptional behaviour is therefore a very meaningful data mining task, called outlier mining.

Outlier mining has a wide range of applications. Besides the fraud detection mentioned above, it can also discover customers whose income is unusually low or whose spending is unusually high. The outlier mining problem can be described as follows: given n records and a number k (the expected number of outliers), find the k records that differ most from the remaining records. This can be seen as two subproblems: (1) define what kind of record counts as an "exception"; (2) according to that definition, find an efficient way to discover these exceptions. A minimal distance-based sketch follows.
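As an illustration of subproblem (1), one common distance-based definition scores each record by its distance to its nearest neighbour and declares the k records with the largest scores to be the exceptions. The sketch below is an assumption for illustration, not the article's specific algorithm, and uses a brute-force computation in NumPy.

```python
# Distance-based outlier sketch: flag the k records farthest from their nearest neighbour.
# Brute-force O(n^2) distances; in practice the m-th nearest neighbour is often used
# so that small groups of outliers cannot mask each other.
import numpy as np

def top_k_outliers(X, k):
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # ignore distance to itself
    nn_dist = dist.min(axis=1)               # distance to each record's nearest neighbour
    return np.argsort(nn_dist)[-k:][::-1]    # indices of the k most isolated records

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[0] += 10.0                                 # plant three obvious, well-separated exceptions
X[1] -= 10.0
X[2] += np.array([10.0, -10.0])
print("suspected outliers:", top_k_outliers(X, 3))
```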

Its applications include:

- Telecom and credit card fraud detection (checking purchase amounts, purchase frequency, etc.)

- Loan approval

- Drug research

- Weather forecasting

- Finance (detecting abnormal behaviour such as money laundering)

- Customer classification

- Network intrusion detection, etc.

Outlier analysis algorithms fall into the following categories:

- Statistical-based methods

- Distance-based methods

- Deviation-based methods

- Density-based methods

- Outlier analysis for high-dimensional data

The application of statistical outlier detection is mainly limited to scientific computation, chiefly because the distribution characteristics of the data must be known in advance, which limits its range of application.

Compared with statistical algorithms, distance-based algorithms do not require the user to have any domain knowledge, and compared with the "sequential exception" approach they are more intuitive. More importantly, the distance-based notion of an outlier is closer to Hawkins's characterization of the nature of outliers. The concept of sequential exceptions used in deviation-based outlier detection has not been universally accepted, because it still has conceptual shortcomings and misses many outliers.

The density-based view of outliers is even closer to Hawkins's definition than the distance-based view, so it can detect a class of outliers that distance-based algorithms miss: local outliers. The local-outlier view abandons the earlier absolute notion that a point either is or is not an outlier, and is more consistent with real-life applications. A minimal sketch of this idea is given below.
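A widely used density-based formulation of the local-outlier idea is the local outlier factor (LOF). The sketch below uses scikit-learn's implementation on assumed synthetic data; it illustrates the concept only and is not the specific algorithm discussed above.

```python
# Density-based local outlier sketch using the Local Outlier Factor (LOF).
# A point lying in a sparse region relative to its neighbours gets a high LOF score.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))   # tight cluster
sparse = rng.normal(loc=4.0, scale=1.5, size=(50, 2))   # looser cluster
local_outlier = np.array([[0.0, 2.0]])                  # odd only relative to the tight cluster
X = np.vstack([dense, sparse, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_  # larger score = more outlying

print("points flagged as outliers:", int((labels == -1).sum()))
print("LOF score of the planted point:", round(scores[-1], 2))
```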

Real data often contain much noise, so outlier patterns tend to exist only in low-dimensional subspaces and are hard to determine in the full-dimensional space; moreover, the performance of the earlier algorithms drops sharply as the dimensionality grows. Aggarwal and Yu (SIGMOD 2001) therefore proposed a method for outlier detection in high-dimensional data, using a genetic optimization algorithm to obtain good computational performance.

application

Data mining techniques can support tasks such as decision making, process control, information management and query processing. A well-known application example is the "diapers and beer" story. To analyse which goods customers are most likely to buy together, Wal-Mart applied automatic data mining tools to the large volume of data in its databases and unexpectedly found that the product most often bought together with diapers was beer. Why would two seemingly unrelated products be bought together? It turned out that young mothers often asked their husbands to buy diapers for the children on the way home from work, and the husbands would pick up a couple of bottles of beer while buying the diapers. Since diapers and beer were bought together, the store placed them next to each other, and the sales of both grew. In general, data mining is applied as follows. Telecommunications: churn analysis; banking: clustering (segmentation) and cross-selling; department stores and supermarkets: shopping basket analysis (association rules); insurance: segmentation, cross-selling and churn (cause analysis); credit cards: fraud detection and segmentation; e-commerce: website log analysis; tax authorities: detection of tax evasion; police: analysis of criminal behaviour; medicine: health care. The details are as follows.

Data mining in e-government

Building e-government and promoting its development is an inevitable trend in government administration. Practical experience shows that government departments increasingly rely on scientific analysis of data. In developing e-government, a decision support system can use the large volume of data stored in the comprehensive e-government databases: by establishing correct decision models, it can provide a scientific basis for decisions at all levels of government, improve the scientific soundness and rationality of policies, raise the efficiency of government work and promote economic development. To this end, government decision support must continuously absorb new information processing technologies, and data mining is a core technology for implementing such support. Government decision support systems relying on data mining will play an important role.

Among the five fields actively advocated by many countries (e-government, e-commerce, distance education, telemedicine and electronic entertainment), e-government comes first, indicating that government informatization is the basis of social informatization. E-government covers government information services, e-commerce, electronic administration, coordination among government departments, and public participation in government. Introducing network data mining techniques into e-government can greatly raise the level of government information services and promote the informatization of society as a whole. This is reflected in the following aspects. 1) Government electronic trade. Pattern information is hidden in server and browser-side log records; web usage mining can automatically discover access patterns and user behaviour patterns and use them for predictive analysis. For example, by evaluating how long a user spends browsing a particular information resource, one can judge how interested the user is in it; domain-name data collected from log files can be classified by country or type; and cluster analysis can identify users' access motives and access trends. This technology has already been used effectively in government e-commerce.

2) Website design. By mining the content of a website, mainly its text, the site's information can be organized effectively, for example by arranging it hierarchically with automatic classification techniques. Combined with mining of user access log records, the site can learn users' interests, which helps it push information and provide customized personal information services, attracting more users.

3) Search engines. Network data mining is a key direction in the development of network information retrieval. For example, by mining web pages, network information can be classified and clustered to make retrieval more precise; by analysing users' query histories, queries can be refined and expanded to improve search results; and network content mining can improve keyword weighting algorithms, raising the precision of network information and thus the quality of search.

4) Decision support. Data mining provides support for major government decisions. For example, by mining data on various economic resources, the future state of the economy can be estimated, so that corresponding macroeconomic regulation policies can be formulated.

Marketing data mining

Data mining technology is already quite widely applied in corporate marketing. It is based on the market-segmentation principle of marketing, whose basic assumption is that "a consumer's past behaviour is the best predictor of their future consumption."

By collecting and processing large amounts of information on consumer behaviour, a company can determine the interests, consumption habits, tendencies and needs of particular consumer groups or individuals, infer their likely future behaviour, and then carry out targeted marketing with specific content towards the identified groups. Compared with mass marketing that does not distinguish between consumers, this greatly reduces marketing costs and improves the marketing effect, bringing the enterprise more profit.

Commercial consumption information comes from many channels in the market. For example, whenever we pay by credit card, businesses can collect consumption information during credit card settlement, recording the time and place, the goods or services we were interested in, the prices we were willing to accept and our ability to pay. When we apply for a credit card or a driving licence, or fill in any other kind of form, our personal information is stored in the corresponding business databases; companies may even buy such information from other firms or organizations for their own use.

Data from these various channels is combined and processed with supercomputers, parallel processing, neural networks, modelling algorithms and other information processing techniques, to provide business decision makers with information about specific consumer groups or individuals. How is this information applied? A simple example: when a bank mines its business data and finds that an account holder has suddenly applied for a joint two-person account, and confirms that this is the customer's first such application, the bank may infer that the customer is about to get married, and can then market long-term investment products to that customer, such as housing loans or savings for children's education; the bank may even sell this information to companies specializing in wedding goods and services. Data mining builds competitive advantage. In countries and regions with developed market economies, many companies have begun to mine the business information accumulated in their existing information systems in order to build competitive advantage and expand turnover. American Express has a database recording credit card transactions with a volume of 5.4 billion characters, still being updated as business proceeds. By mining this data, American Express developed its "Relationship Billing" promotion strategy: a customer who buys a set of clothes in a particular shop and then also buys a pair of shoes in the same shop can receive a large discount, which increases both the shop's sales and the use of the card in that shop. As another example, if a cardholder living in London has just flown British Airways to Paris, he may receive a discount card for a weekend air ticket.

Marketing based on data mining often sends consumers sales material related to their previous purchasing behaviour. Kraft Foods has built a database of 30 million customers, collected from customers who responded positively to coupons and other promotions the company issued; by mining this data, Kraft derives the interests and tastes of specific customers, sends them coupons for particular products, and recommends recipes using Kraft products that suit their tastes and health. The American Reader's Digest publisher runs a business database built up over forty years, holding more than 100 million subscribers worldwide; the database runs 24 hours a day, ensuring that the data is constantly updated. Relying on the advantage of mining its customer database, Reader's Digest was able to expand from a popular magazine into publishing and distributing specialist magazines, books and audio-visual products, greatly enlarging its business.

Marketing based on data mining is also instructive for market competition in China today. On busy commercial streets we often see manufacturers handing out large quantities of product advertisements indiscriminately to passers-by; the result is that people who do not need them throw them away, while those who do need them may never receive them. If, instead, a kitchenware maker mailed its material to customers who had just bought home appliances in a store, or the manufacturer of a special drug mailed its material to the relevant hospital outpatient patients, the marketing effect would certainly be far better.

Data mining in the retail industry

With barcodes, coding systems, sales management systems and customer data management, business data about merchandise sales, customers, suppliers and stores can be collected. Data gathered from the various application systems is concentrated in a data warehouse, allowing senior managers, analysts, purchasing staff, marketing staff and advertisers to access it, providing them with an efficient scientific decision tool. Shopping basket analysis, for example, examines which goods customers tend to buy together; the "beer and diapers" classic of the retail world is a typical case of using data mining to discover regularities between people and goods. In the retail field, data warehousing and data mining perform well in many respects (a market-basket sketch is given after the list below):

1. Understanding the overall sales picture: through category information (product type, sales quantity, store location, price and date), managers can understand daily operations and the financial situation, sales trends, inventory changes and the effect of promotions. When selling goods, a retail store must constantly check whether its product structure is reasonable, for example whether the sales proportions of the various categories are roughly balanced, and adjust the product mix for seasonal changes in demand and for the product structure of competitors.

2. Product grouping and layout: analyse customers' purchasing habits, consider the probability that different goods are bought together in the store and the time and place of purchase, and so master the relationships between products; through analysis of sales by variety and association analysis, for example with principal component analysis, establish the best product assortment and the best layout of goods.

3. Reducing inventory costs: the data mining system concentrates sales data and inventory data and, through data analysis, determines how much the stock of each product should be increased or decreased, keeping inventory at the right level. The data warehouse system can also send forecast information derived from inventory and sales data to suppliers through electronic data interchange (EDI); intermediaries are then eliminated, suppliers take responsibility for replenishing stock regularly, and retailers can reduce their own burden.

4. Market and trend analysis: use data mining tools and statistical models to study the data warehouse carefully and analyse customers' purchasing habits, advertising success rates and other strategic information. By retrieving several years of sales data from the data warehouse and analysing and mining it, a retailer can predict seasonal and monthly sales volumes and the trends of product varieties and inventory, decide which items to mark down, and make decisions on quantities and operations.

5. Effective product promotion: by analysing the sales of a manufacturer's goods, customer statistics and historical data, the effectiveness of sales and advertising can be determined. Analysis of customers' purchasing preferences identifies the target customers of a promotion so that promotional campaigns can be designed; and, using the results of association analysis of purchases (see the sketch below), cross-selling and up-selling methods can tap customers' purchasing power and implement precise product promotion.
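To make the association analysis mentioned in points 2 and 5 concrete, the sketch below computes support and confidence for item pairs over a few assumed toy transactions in pure Python; it is a "beer and diapers"-style illustration under assumed data, not real retail data or a full Apriori implementation.

```python
# Market-basket sketch: support and confidence for item pairs over toy transactions.
from itertools import combinations
from collections import Counter

transactions = [                       # assumed toy baskets for illustration
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "bread"},
]
n = len(transactions)

item_count = Counter()
pair_count = Counter()
for basket in transactions:
    item_count.update(basket)
    pair_count.update(combinations(sorted(basket), 2))

min_support, min_confidence = 0.4, 0.6
for (a, b), cnt in pair_count.items():
    support = cnt / n                  # fraction of baskets containing both items
    if support < min_support:
        continue
    for x, y in ((a, b), (b, a)):      # consider both rule directions
        confidence = cnt / item_count[x]
        if confidence >= min_confidence:
            print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```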

Banking industry data mining

Finance involves large amounts of data, and the position of banks in the financial sector, the nature of their work, the characteristics of their business and the fierce market competition give them more urgent requirements for informatization and electronic systems than other fields. Data mining technology can help a bank's product development department describe customers' past demand trends and predict the future. American commercial banks are a model for commercial banking in developed countries, and many of their practices are worth studying and borrowing for China.

Data mining technology is widely used in the US banking and financial sector. Financial services need to collect and process large amounts of data, analyse them to discover their patterns and characteristics, and may then discover the financial and commercial interests of a particular customer, consumer group or organization, and observe changes in the financial markets. For a commercial bank, profits and risks coexist; to ensure maximum profit and minimum risk, accounts must be scientifically analysed and classified, and credit assessments performed. Mellon Bank used data mining software to improve the accuracy of selling and pricing financial products such as home equity loans. There are two main types of retail credit customers: those who rarely use their credit limit (low revolvers) and those who maintain a high unpaid balance (high revolvers). Each category presents its own sales challenge. Low revolvers generate costs but bring very little or even negative net income, because their service cost is almost the same as that of high revolvers; banks often offer them programmes encouraging them to use their credit limit more, or look for opportunities to cross-sell high-profit products to them. High revolvers consist of high-risk and medium-risk segments. The high-risk segment has the potential for payment default and write-off. For the medium-risk segment, the focus of sales projects is to retain the profitable customers and to win new customers who can bring similar profits. According to a newer perspective, however, user behaviour changes over time, and analysing cost and income over the whole customer life cycle shows who has the greatest profit potential. Mellon Bank wanted to "customize for a certain part of the market", finding the end users and positioning the market towards them; this requires information about the characteristics of those end users, and data mining tools gave Mellon Bank that information. The bank's sales department used an intelligent agent to search the data from an earlier mining project, with the main purpose of determining which existing Mellon customers had a propensity to purchase a specific additional product, a home equity credit line, and used the tool to generate models for detection. According to bank officials, data mining can strengthen business intelligence, for example through interaction, classification or regression analysis; with these capabilities the bank can sell purposefully to people with a higher propensity to purchase its products and services. The officials believe the software feeds high-quality information back for analysis and decision making, which is then fed into the products, and that data mining also offers customization capabilities.

US Firstar Bank uses data mining tools to decide which products to offer customers based on their consumption patterns. Firstar's market research and database marketing department found that public databases store a great deal of information about each consumer; the key is to analyse consumers thoroughly with respect to new products and find patterns in the database, so as to find the most suitable consumers for each new product. The data mining system can read 800 to 1000 variables and assign them, grouping consumers according to their home mortgage, credit card, certificates of deposit or other savings and investment products, and then uses data mining tools to predict when to offer which product to each consumer. Anticipating customers' needs is the competitive advantage of US commercial banks.

Securities industry data mining

Typical applications include:

1. Customer analysis

Build a data warehouse storing the information and transaction data of all customers, pre-define customer groups, and implement subject-oriented information extraction by mining and correlating these data. Classify customers' demand patterns and profit value, identify the most valuable customer groups and those with profit potential together with the services they most need, so as to configure resources better, improve service and firmly retain the most valuable customers.

By mining customer resource information from multiple angles, a brokerage can understand customers' indicators (such as asset contribution, loyalty, profitability and position ratio), grasp customer complaints and customer churn, capture the signals before a customer leaves, and take timely measures to retain them.

2. Consulting services

According to the collected market and transaction data, combined with market analysis, predict future market trends, discover the regularities linking trading behaviour and market movements, and give customers targeted advice based on these regularities.

3. Risk prevention

By analysing funds data, business risk can be controlled, the original pattern in which financial control resided only at company headquarters can be changed, the movement of funds can be understood in time, and risk warnings can be given.

4. Analysis of business conditions

Through data mining, important information such as business conditions, funds, profit and the distribution of customer groups can be understood. Combined with broader market trends, the most profitable way of doing business under different conditions can be identified. At the same time, horizontal comparison of the operations of the various branches allows the business conditions and prospects of each branch to be analysed.

Telecommunications data mining

The telecommunications industry has rapidly evolved from simply providing local and long-distance telephone services to offering integrated services such as voice, fax, paging, mobile telephony, images, e-mail, computer and web data transmission, and other data communication services. The integration of telecommunications, computer networks, the Internet and various other means of communication and computing is the current trend. Moreover, with the opening of telecom markets in many countries and the emergence of new computing and communication technologies, the telecom market is expanding rapidly and becoming more competitive. Using data mining techniques therefore helps operators understand business behaviour, determine telecommunication patterns, capture fraudulent behaviour, make better use of resources and improve service quality. Analysts can study information such as call origin, call destination, call volume and daily usage patterns, and can identify abnormal patterns by mining, so that fraud can be discovered as early as possible and losses reduced. A minimal sketch of flagging abnormal call volume is given below.
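As a minimal illustration of spotting abnormal usage (an assumption for illustration, not any operator's actual system), the sketch below flags days on which a subscriber's call volume deviates strongly from that subscriber's own historical baseline.

```python
# Fraud-signal sketch: flag days whose call volume is far from a subscriber's own baseline.
import numpy as np

rng = np.random.default_rng(3)
# Assumed 90 days of call counts for one subscriber, with a suspicious burst on the last day.
calls = rng.poisson(lam=12, size=90).astype(float)
calls[-1] = 95.0

baseline = calls[:-7]                      # history used to estimate normal behaviour
mean, std = baseline.mean(), baseline.std()
z = (calls - mean) / std                   # z-score of each day against the baseline

threshold = 4.0                            # assumed alerting threshold
suspicious_days = np.where(z > threshold)[0]
print("days with abnormal call volume:", suspicious_days.tolist())
```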

Data mining in mobile communications

In the course of informatization, the mobile communications industry has successively developed and widely applied computer application systems such as the operational network system, the integrated business system, the billing system and office automation, and has accumulated a large amount of historical data. In many cases, however, this massive data cannot be refined and distilled into useful information in the original operational systems and provided to business analysts and management decision makers. On the one hand, the online operational systems become more and more cumbersome as data grows, and ever-increasing investment in system resources cannot keep up; on the other hand, managers and decision makers can obtain only limited operational and business information from fixed, periodic reports, which cannot meet the needs of fierce market competition.

With the further opening of the telecom industry, structural adjustment in its development, rising requirements for the quality of telecom services, and the growth of theft-of-service and fraud, mobile communication operators face a more complex situation and sharply rising operating costs. How to make full use of existing equipment to reduce costs and improve efficiency, while still satisfying customer demand and service quality, has therefore become a topic for decision makers.

According to the development experience and history of foreign telecom markets, the keys to the successful operation of a telecom company in market competition are: (1) retaining existing customers with high-quality service; (2) raising both traffic volume and equipment utilization, winning new customers and expanding market share at a lower cost than competitors; (3) abandoning customers who bring no profit and have no credit, so as to reduce business risk and cost.

For a relatively mature mobile operator, the massive historical data accumulated in its operation and support systems is undoubtedly a valuable asset, and data mining is the most effective method and means of using these valuable resources to achieve the above three goals.

Data mining in sports

1. Physical fitness data analysis

At present, China attaches great importance to national fitness, and many related physical fitness tests are carried out each year, accumulating a large amount of data year after year. Almost all the analysis of this data has used statistical methods, including the fitness analysis and evaluation software of many institutions, which mainly performs mean analysis of fitness data and evaluation according to preset formulas. These methods certainly contribute to fitness data analysis in sport, but their role is limited to comparing the sizes of the data themselves, the results can be understood only by professionals, and statistical methods alone can mine only limited relationships in the data. Mining fitness data with data mining techniques can easily produce results that statistical methods cannot. For example, as data accumulates and continues to be collected, combining fitness data with nutritional knowledge can reveal the nutritional status of different regions, and combining fitness data with medical knowledge can reveal people's health status and even suggest the likely causes of possible diseases, giving better guidance for people's self-care and fitness. In addition, mining the early fitness data of famous athletes can reveal their common characteristics and provide a strong basis for selecting athletes. Fitness data is like a treasure mine; with data mining technology, many unimagined treasures can surely be dug out of it.

2. Applications in the sports industry

Data mining was first applied in the commercial sector, and the sports industry is a typical business. In general commercial data mining, DM techniques determine which customers are the most valuable and adjust product promotion strategies accordingly (promoting products to those who need them), obtaining the best sales at the lowest cost. Take sports advertising as an example: by mining the databases of different sports advertising customers, one can find the common characteristics of the organizations or companies that purchase a certain kind of sports advertisement, and then sell that kind of advertisement to other companies or organizations that share these characteristics but are not yet customers. Similarly, if the common characteristics of lost customers are found, targeted remedies can be applied before customers with similar characteristics are lost as well. In this way the returns of sports advertising can be improved to some extent. Timely and effective use of DM can therefore create more wealth for the sports industry.

3. Applications in competitive sports

In competitive sports, especially high-level competition, not only actual ability but also tactics and strategy are very important; sometimes tactics effectively decide the outcome. Having recognized what data mining can do, people abroad have applied it to competitive sport. For example, coaches of NBA teams in the United States use data mining tools provided by IBM to help decide on substitutions, with good results. System analysis showed that when the Orlando Magic's first-choice backcourt of Anfernee Hardaway and Brian Shaw was on the floor in the first two games of a playoff series, the team was outscored by 17 points; when Hardaway was paired with reserve guard Darrell Armstrong, however, the Magic outscored their opponents by 14 points. In the next game the Magic increased Armstrong's playing time, and it paid off: Armstrong scored 21 points, Hardaway scored 42, and the Magic won. In the fourth game the Magic put Armstrong in the starting line-up and again defeated the Heat. Although this line-up supported by data mining could not hold off the Heat in the fifth game, data mining had helped the Magic extend the series to five games before the outcome was finally decided. At present, about 20 NBA teams use software systems developed by IBM to optimize their tactical combinations. Similarly, data mining techniques can be used to analyse football, volleyball and other competitive sports, find the opponents' weaknesses, and develop more effective tactics.

Postal data mining

China Post has established the largest physical delivery network in the country and has accumulated a large amount of user data. How to use this data and provide a scientific basis for decisions on developing postal business is an issue of great concern to the postal department. Data mining technology can solve these problems. Using it, one can analyse customer deposit balances, deposit structure, average deposit interest rates, balances of different deposit types, customers of different deposit types, and various business statistics. Take customer deposit analysis as an example; the dimensions of analysis include the following. Business outlet: using the outlet as an analysis dimension, the performance of each savings office can be judged. Customer age: analysing deposit balances by customer age shows which age groups are the premium customers and which are the focus of future development. Customer address: analysing deposit balances by customer address reveals the economic situation of each region and its familiarity with postal savings, providing a basis for future business expansion. Purpose of deposit: residents' purposes for savings deposits are complex, but understanding the regularities of deposits helps postal savings think from the customer's point of view and get closer to customers, and also provides powerful information for developing new business. Time period: analysing changes in customer savings over time allows postal business processes to be adjusted appropriately; for example, according to changes in customer deposits, the amount of ready funds can be predicted so that postal savings can be allocated in time, preventing financial risk while ensuring maximum investment.

Call center data mining

The call centre is gradually becoming a main channel for information collection. After collecting large amounts of data, how to organize and analyse it to support the company's scientific decision making is a major problem, and data mining technology provides a new solution.

Providing a basis for decisions: introducing data mining technology into the call centre is very important. The various kinds of information in the business process are reflected in data; by analysing this data, the regularities of the operating process can be discovered, providing scientific guidance for the enterprise's production and marketing activities.

The call centre currently solves the problem of information exchange between the enterprise and the external market, but the large amount of data generated can only be reflected as general information through reports. Through data mining, many deep regularities that cannot be discovered manually can be found, helping the company gain the upper hand in a fiercely competitive environment. To provide targeted services to users, data mining can classify customers according to their consumption behaviour, find the consumption characteristics of each class, and then provide more personalized service through the call centre, thus improving the company's service level and its social and economic benefits.

Improving the scientific quality of decisions: at present, many corporate decisions are made rather blindly. If data mining techniques are applied to the data generated in the company's own operations and scientific analysis is carried out, more reliable predictions can be obtained and decision failures reduced. Through data mining, the company's decisions are grounded in its own business, yielding judgments that are closer to reality.

Value-added applications: data mining has many applications in the call centre. Some help simplify management, and some provide business-related data that help the call centre carry out its business better and create added value. The value-added applications are as follows. Analysing customer behaviour for cross-selling: among the various customers of the call centre, association analysis can be performed according to their consumption characteristics, estimating the probability that a customer who buys one product will also buy another; based on these associations, cross-selling can be carried out. Analysing customer loyalty to avoid customer churn: customer analysis identifies many important large customers; using data mining, the large customers who have already been lost can be analysed to find data patterns and discover why they left, and then service quality can be improved to avoid further churn and reduce economic losses.

Simplifying management: the operational management of a call centre has become unprecedentedly important, because even if a centre is well built and technologically advanced, its advantages will not be realized if it is poorly managed. For many call centres management is a very difficult threshold, and data mining can help simplify it.

Predicting traffic volume and scheduling agents: in a call centre, traffic volume is an important indicator, and the company arranges the number of agent seats according to its size; but traffic fluctuates and has been difficult to predict. Time-series analysis in data mining makes reasonably accurate prediction possible, so the number of seats can be arranged more sensibly, reducing the company's operating cost without lowering the call centre's connection rate. A minimal forecasting sketch is given below.
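The following is a minimal sketch of the kind of time-series forecast described here, using assumed daily call volumes and a simple trailing moving average rather than any particular call-centre product; the staffing ratio is also an assumption for illustration.

```python
# Traffic-forecast sketch: predict next-day call volume from a trailing moving average.
import numpy as np

rng = np.random.default_rng(4)
days = np.arange(120)
# Assumed daily call volumes with a weekly cycle plus noise.
traffic = 500 + 80 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 20, size=days.size)

window = 7
forecast = traffic[-window:].mean()        # naive forecast: mean of the last week

agents_per_100_calls = 3                   # assumed staffing ratio for illustration
print(f"forecast calls tomorrow: {forecast:.0f}")
print(f"suggested agents on duty: {int(np.ceil(forecast / 100 * agents_per_100_calls))}")
```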

Performing association analysis to reduce operating costs: an operational call centre often provides many business services, and agents are scheduled according to the volumes of these services. Association analysis in data mining can reveal which services are strongly correlated; in the express delivery industry, for example, certain services tend to be requested together. When scheduling agents, the agents for two related services can then be combined, reducing the number of staff and the call centre's operating cost.

Data mining in digital libraries

Web mining is a very promising tool. We know that the information indexed by traditional, inefficient search engines is often incomplete and contains a great deal of irrelevant or unverified material. Enabling users to retrieve relevant, reliable information from the Web quickly and easily is the most basic requirement of such a system. Web mining not only discovers information in the large amount of data on the WWW, but also monitors and predicts users' access habits, giving designers more reliable information when designing web sites. Web mining techniques help librarians design sites, saving time and improving efficiency. They give librarians an advanced tool with which they can organize more and better high-quality information according to each user's requirements or habits. For example, an institution's librarians can apply web mining to retrieve relevant information from the WWW for the institution's different disciplines. The technology can automatically retrieve information and classify it according to subject field, making it easier to access. Librarians can define a set of features for different topics and have information retrieved and classified against these features, ensuring that the information obtained is reliable and authoritative. Since web mining discovers and organizes information from the WWW automatically, librarians need only a small amount of time to maintain the database. Users are well served because they no longer need to spend a great deal of time browsing hundreds of documents, and, more importantly, they can access the service anywhere at any time. In effect, this is how librarians move their reference services from the desktop onto the Internet.

Website data mining

With the development of Web technology, e-commerce websites of all kinds have sprung up. Building an e-commerce website is not difficult; what is difficult is making it profitable. To be profitable, a site must attract customers and increase the loyalty of the customers who bring in revenue. Competition in e-commerce is fiercer than in traditional business; one reason is that a customer can switch from one e-commerce site to a competitor with just a few mouse clicks. The site's content and structure, its wording and titles, its reward programmes and its services may all be factors that attract customers, or lose them. At the same time, an e-commerce site may handle millions of online transactions a day, generating large numbers of log files and registration forms. Analysing and mining this data to fully understand customers' preferences, purchasing patterns and even momentary impulses, and designing personalized websites that meet the needs of different customers, has almost become a necessity for raising competitiveness. To survive the competition, you must know your customers better than your competitors do.

When mining data on a website, the required data comes mainly from two sources. One is the customers' background information, which comes mainly from their registration forms; the other comes mainly from the browsers' click streams and is used to examine customers' behaviour. Sometimes, however, customers are very protective of their background information and refuse to fill in that part of the registration form, which makes analysis and mining inconvenient. In that case the customers' background information has to be inferred from their browsing behaviour before it can be used. A minimal sketch of summarizing click-stream behaviour is given below.
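The sketch below illustrates one simple way of inferring interest from click-stream records; the log format, field names and records are assumptions for illustration only.

```python
# Click-stream sketch: estimate each visitor's interest from time spent per content category.
from collections import defaultdict

# Assumed simplified click-stream records: (visitor_id, category, seconds_on_page).
clicks = [
    ("u1", "laptops", 120), ("u1", "laptops", 45), ("u1", "books", 5),
    ("u2", "toys", 60),     ("u2", "diapers", 90), ("u2", "beer", 30),
]

time_spent = defaultdict(lambda: defaultdict(int))
for visitor, category, seconds in clicks:
    time_spent[visitor][category] += seconds

for visitor, cats in time_spent.items():
    top = max(cats, key=cats.get)          # category with the most viewing time
    print(f"{visitor}: likely interested in '{top}' ({cats[top]} s viewed)")
```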

Data mining in biomedicine and DNA analysis

Mining biological information or gene data is of no small significance to humanity. For example, gene combinations are endlessly varied: how much does the gene of a person with a certain disease differ from a normal person's gene? Can the differences be found, and can the abnormal gene then be changed so that it becomes normal? This requires the support of data mining technology.

Compared with mining ordinary data, mining biological information or gene data is far more complex, both in the amount of data to be analysed and in the patterns to be established. New and better analysis algorithms are needed, and many vendors are working on this, but in terms of both technology and software the field is still far from mature.

Data mining for Internet screening

Recently, many data mining products that screen news on the Internet and protect users from junk e-mail and unwanted commercial messages have appeared, and they are very popular.

Data mining in meteorological forecasting

Agricultural production is closely related to climate and weather. China is an agricultural country, and agricultural production affects the national economic lifeline and people's lives. The weather system is a complex system with many influencing factors and great variation in time and space; meteorological data contain complex nonlinear dynamical mechanisms, and the relationships among the various factors are intricate and spatially variable, so it is difficult to establish the relationship between agricultural production and meteorological elements. Research on this problem has practical significance, and data mining, an application-driven and demand-driven technology, can be used to address it.

Methods used abroad in meteorological prediction include neural networks, classification and clustering. Some researchers have combined wavelet analysis with knowledge representation based on linguistic fields, and proposed a new method for discovering categorical knowledge based on the combination of wavelet analysis and chaos theory: after wavelet transformation of the weather data, characteristic data representing the weather system can be extracted, and the relationships between these characteristic data and agricultural production data (such as yield and pest density) can be mined. The mining methods include classification, clustering, association rules and similar patterns, and a practical, scalable and easy-to-operate meteorological research application system has been constructed from the perspective of mining unstructured data.

Data mining of hydrological data

The rapid development of information acquisition and analysis technology, especially the application of telemetry, remote sensing, networks and databases, has vigorously promoted hydrological data collection and processing, extending it to varying degrees in temporal and spatial scale and in the types of features covered. Because of the special role of water in human survival, various new technologies have been applied to hydrological data, and mining knowledge from hydrological data has become a new hotspot in the development of hydrological science. The proposal of digital hydrological systems is one of the hallmarks of this era in hydrology; its core is how to form digital data products that cover a whole designated geographical space, at multiple temporal and spatial scales and for multiple elements, and that are useful for hydrological analysis.

Hydrological data mining is an important foundation for accurate hydrological forecasting and hydrological data analysis. In China the total volume of hydrological data already exceeds 7000 MB, and with the meteorological, geographical and other data also required, the amount of data needed for hydrological analysis is very large. Accurately mining the knowledge that meets the needs from such large volumes of complex data is often constrained by computing power, storage capacity and the lack of suitable algorithms, so efficient hydrological data mining technology is needed. Data mining technology can serve hydrological information services in many ways.

Data mining generally involves feature types such as association rules, sequential patterns, classification analysis and clustering. Depending on the application goal, data mining can adopt or borrow from various existing theories and algorithms, such as information theory, rough sets, evolutionary computation, neural networks and statistics; many example-based algorithms can be applied in the implementation of data mining systems. Hydrological data mining can apply decision trees, neural networks, covering positive examples while excluding negative examples, rough sets, concept trees, genetic algorithms, formula discovery, statistical analysis and fuzzy theory, and, with the support of visualization technology, hydrological data mining application systems serving different purposes can be constructed.

Data mining of video data

At present, multimedia data is gradually becoming the main form of information in the field of information processing, especially video data, because it records and preserves content across space and time; its content is rich and lets people obtain details in a way closest to nature. Applications of video data are increasing in daily life, producing a large number of digital video libraries, and current research focuses on the organization, management and use of these libraries, especially content-based video retrieval techniques. Although content-based video retrieval solves the problem of video search and resource discovery to some extent, it can only obtain the video "information" that the user requires; it cannot analyse the "knowledge" carried by the video medium that is contained in large amounts of video data. It is therefore necessary to study analysis methods at a higher level than search and query, namely video mining. Video mining analyses the audio-visual characteristics, temporal structure, event relations and semantic information of video data to find implicit, valuable and understandable video patterns, derive the trends and associations that the video exhibits, and raise the level of intelligence of video information management. Based on data cubes and multidimensional analysis, data mining methods can find the useful information and patterns implicit in video data; the common mining methods are classification, clustering and association methods.

Classification is a commonly used form of data analysis. Classifying ordinary data is a two-step process. In the first step, a model is built to describe a predefined set of classes; the classification model is constructed by analysing part of the data in the database, and the data used to build the model is called the training data set, which may be selected at random. In the second step, the model is used for classification. First the classification accuracy of the model is evaluated, using a test data set to check its feasibility; if the accuracy is considered acceptable, the model can be used to classify the other data in the database. Classifying video objects means dividing a set of video objects (shots, representative frames, scenes, extracted target objects, text, and so on) into several classes, such that the similarity between data belonging to the same class is as large as possible and the similarity between data of different classes is as small as possible. Various features can be chosen for classification, such as the colour histogram of a video shot or the semantic description of the video. A minimal two-step sketch is given below.
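The two-step process just described can be sketched with a decision tree in scikit-learn; the synthetic feature vectors (standing in, say, for compact colour-histogram descriptors of shots) and the class labels are assumptions for illustration.

```python
# Two-step classification sketch: (1) build a model on training data, (2) test it, then reuse it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(5)
# Assumed 8-dimensional feature vectors for two classes of video shots.
X = np.vstack([rng.normal(0.2, 0.1, size=(150, 8)), rng.normal(0.7, 0.1, size=(150, 8))])
y = np.array([0] * 150 + [1] * 150)

# Step 1: learn the model from a training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Step 2: evaluate on held-out data; if acceptable, apply the model to new objects.
print("test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
new_object = rng.normal(0.65, 0.1, size=(1, 8))
print("predicted class of a new object:", int(model.predict(new_object)[0]))
```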

Cluster analysis first analyses the data in the video database, gathers data with the same characteristics together and partitions it reasonably, and then determines the category of each data object. Clustering differs from classification in that the number of classes into which the data will be divided is not known in advance. Clustering algorithms are generally divided into probabilistic clustering algorithms and distance-based clustering algorithms. Clustering of video objects plays an important role in video structure analysis; for example, a clustering algorithm can gather similar shots into a higher-level structural unit, the scene.

Association rule mining finds useful connections within a given data set. The most typical association rules in ordinary databases come from shopping basket analysis, that is, analysing customers' buying habits by discovering associations among the items they place in their shopping baskets. Mining associations between video objects treats each video object as a data item and finds patterns that occur frequently among different video objects, for example two video objects that often appear together, the frequency of shot transitions, or associations between video types.

Video mining can be applied in government agencies, enterprise management, commercial information management, military intelligence and command, public utilities, public safety, national security and other command and decision departments. Applications in companies and government administration can bring direct or indirect economic benefits, while applications in military, commercial and government departments can solve problems involving implicit patterns that routine methods cannot handle. The technical value, economic benefits and social role of video mining can be seen from the following typical and potential applications. 1. Real-time analysis and mining of traffic-surveillance video streams: analysing motion characteristics and mining traffic-condition and congestion patterns to provide decision support for traffic control and command agencies. 2. Video news mining: analysing and mining the large number of domestic and international video news items broadcast each day, including analysis of event relations, analysis of disasters (floods, fires, epidemics, etc.), and military deployments or troop movements, displayed along temporal or spatial dimensions; for example, analysing and mining the terrorist incidents in many years of video news can yield valuable action patterns and event associations. 3. Video-stream mining for video surveillance systems: applied to the video streams of banks, shopping malls, workshops and hotels to analyse accident patterns, customer-flow patterns and so on. 4. Video data mining for digital libraries: classification and clustering of large-scale, multi-subject video to improve video classification and indexing. 5. TV programme mining: narrative-pattern analysis, style analysis, and high-level analysis and mining of attributes such as production date, quantity, type and length, for use in programme management. 6. Mining of visual e-commerce transactions within enterprises: the user interacts with a multimedia e-commerce interface constructed with MPEG-4, and this technology can be used to analyse and mine users' selection and ordering behaviour; in addition, video advertisements can be analysed and mined to find effective association patterns.

Personal data mining

The scope of personal data mining is very wide. For example, one can mine company records to choose the best partner; mine the medical history of one's family to determine hereditary medical patterns and so make optimal decisions about lifestyle and health; or mine stock and company performance data to choose the best investments.

Data mining tool evaluation criteria

How do you choose a data mining tool that suits you? To evaluate a data mining tool, the following aspects need to be considered.

1. The types of patterns generated.

2. The ability to solve complex problems.

Growth in the amount of data, and in the required refinement and accuracy of patterns, increases the complexity of the problem. A data mining system can offer the following means of dealing with complex problems:

Combining multiple kinds of patterns can help discover useful patterns and reduce problem complexity. For example, first grouping the data with a clustering method and then mining predictive patterns within each group is more effective and more accurate than operating directly on the whole data set.

Multiple algorithms: many patterns, especially those related to classification, can be produced by different algorithms, each with its own advantages and disadvantages and suited to different needs and environments. A data mining system that provides several ways to produce the same kind of pattern offers more capable solutions. Verification methods: when evaluating patterns, several verification methods are possible; the more mature ones, such as n-fold cross-validation and bootstrapping, can be controlled to achieve maximum accuracy (a minimal cross-validation sketch is given below).
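The following is a minimal sketch of n-fold cross-validation as a verification method; the synthetic data and labelling rule are assumptions, and any classifier could stand in for the model.

```python
# Verification sketch: estimate model accuracy with 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # assumed labelling rule for illustration

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 2))
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```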

Data selection and transformation: patterns are usually hidden by a large number of data items. Some data are redundant and some are completely irrelevant, and the presence of these items affects the discovery of valuable patterns. An important function of a data mining system is to handle this complexity of the data and to provide tools for selecting the right data items and transforming data values.

Visualization tools provide an intuitive and concise mechanism for representing large amounts of information. They help locate important data and evaluate the quality of models, thereby reducing the complexity of modelling. Scalability: to improve efficiency when processing large amounts of data, the scalability of the data mining system is very important. The questions to ask are: can the data mining system make full use of hardware resources? Does it support parallel computing? Is the algorithm itself designed to be parallel, or does it rely on the DBMS's parallel capability? Which parallel computers does it support, SMP servers or MPP servers? When the number of processors increases, does performance scale accordingly? Does it support parallel data storage? A data mining algorithm written for a single processor will not automatically run faster on a parallel computer; to take full advantage of parallel computing, algorithms that support parallelism must be written.

3. Ease of operation

Ease of operation is an important factor. Some tools provide a graphical interface that guides the user through tasks semi-automatically, while others use scripting languages. Some tools also offer data mining APIs that can be embedded in programming languages such as C, Visual Basic and PowerBuilder.

Patterns can be applied to existing or newly added data. Some tools provide a graphical interface for this, while others allow patterns to be exported to programs or databases, for example as rule sets in C or as SQL.

4. Data access capability

A good data mining tool can read data from a DBMS using SQL statements. This simplifies data preparation and makes it possible to take advantage of the database's capabilities (such as parallel reading). No tool supports every DBMS, but the most popular DBMSs can be connected through a common interface, such as Microsoft's ODBC. A minimal sketch of pulling mining input through SQL is given below.
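The sketch below illustrates reading mining input through a SQL query, using Python's built-in sqlite3 module as a stand-in for any DBMS reachable through ODBC or similar drivers; the table and column names are assumptions for illustration.

```python
# Data-access sketch: read mining input from a DBMS with a SQL query.
# sqlite3 stands in here for any DBMS that could be reached via ODBC drivers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 120.0, "north"), (2, 75.5, "south"), (1, 300.0, "north"), (3, 42.0, "east")],
)

# Let the database do the filtering/aggregation before the mining step.
rows = conn.execute(
    "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id"
).fetchall()
print("per-customer totals ready for mining:", rows)
conn.close()
```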

5. Interfaces with other products

Many other tools can help users understand the data and the results, such as traditional query tools, visualization tools and OLAP tools. Does the data mining tool provide a simple way to integrate with them?

Many industries, such as telecommunications, credit card companies, banks and stock exchanges, insurance companies, advertising companies and retailers, already use data mining tools to support their business activities. In China such applications are still in their infancy, so for researchers and developers of data mining technology and tools, China is a market with great potential.

Outlook

The "Science and Technology Comments" magazine of the Massachusetts Institute proposed 10 new technologies that have a significant impact on humans in the next five years, and "data mining" is ranked third. A recent Gartner report lists five key technologies, KDD and artificial intelligence ranking for industries within 3 to 5 years. At the same time, this report will be parallel computer architecture research and KDD in the 10 new technical fields that companies should invest in the next five years. It can be seen that the research and application of data mining has received more and more emphasis on academia and industry, and thus be the most promising cross-discipline in the information industry. Its development direction is: database data warehouse system integration, integration with predictive model system, excavation of various complex types of data and application, developing and developing data mining standards, supporting mobile environments.
