Abstract This paper introduces the basic concepts and classification of association rules, surveys a number of association rule mining algorithms, briefly analyzes the typical ones, and looks ahead to future research directions for association rules.
Key words: data mining, association rules, frequent itemset, Apriori algorithm, FP-tree
1 Introduction
Association rule mining discovers interesting associations or correlations among itemsets in large volumes of data. It is an important problem in data mining and has been studied extensively in recent years.
A typical example of association rule mining is market basket analysis. Association rule research helps discover connections between different commodities (items) in a transaction database and reveals patterns in customer purchasing behavior, such as which other commodities tend to be bought when a given commodity is purchased. The results of the analysis can be applied to store shelf layout, stock arrangement, and classifying users by purchase patterns.
Agrawal et al. first posed the problem of mining association rules in customer transaction databases in 1993 [AIS93b], and many researchers have since studied association rule mining extensively. Their work includes optimizing the original algorithm, for example by introducing random sampling and parallelism to improve the efficiency of rule mining.
More recently, there has also been work [HPY00] that departs from Agrawal's frequent-itemset framework to avoid some of its defects and explores new methods for mining association rules. Other work [KPR98] focuses on the value of the mined patterns, and the models it proposes suggest some research directions worth considering.
2 Basic Concepts
Let I = {i1, i2, ..., im} be a set of items, where each ik (k = 1, 2, ..., m) may be, for example, an item in a shopping basket or a customer of an insurance company. The task-relevant data D is a collection of transactions, where each transaction T is a set of items such that T ⊆ I. Let A be a set of items with A ⊆ T.
An association rule is a logical implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. Association rules have two important properties:
Support: P(A ∪ B), the probability that a transaction in the transaction set D contains both A and B.
Confidence: P(B | A), i.e., the probability that a transaction in D that contains itemset A also contains itemset B.
Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong rules. Given a transaction set D, the problem of mining association rules is to generate all rules whose support and confidence exceed the user-specified minimum support and minimum confidence, i.e., the problem of finding strong rules.
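As a concrete illustration of these two measures, here is a minimal Python sketch that computes the support and confidence of a candidate rule A ⇒ B; the toy transactions and item names are invented for the example:

```python
# Toy transaction database: each transaction is a set of items (illustrative).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`, i.e. P(A ∪ B)."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(a, b, db):
    """P(B | A): support of A ∪ B divided by support of A."""
    return support(a | b, db) / support(a, db)

a, b = {"bread"}, {"milk"}
print(support(a | b, transactions))    # 0.6 -> support of the rule A => B
print(confidence(a, b, transactions))  # 0.75 -> confidence of A => B
```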
3 Classification of Association Rules
1) Based on the categories of the variables handled in a rule, association rules can be divided into Boolean and quantitative rules.
The variables in Boolean association rules take discrete, categorical values; such rules show the relationships between these variables.
Quantitative association rules can be combined with multidimensional or multilevel association rules: the quantitative fields are either dynamically discretized or the raw data is processed directly. Of course, quantitative association rules may also contain categorical variables.
2) Based on the abstraction levels of the data in a rule, association rules can be divided into single-level and multilevel association rules.
In single-level association rules, none of the variables take into account the fact that real data may have multiple levels of abstraction.
In multilevel association rules, the multiple levels of the data are fully considered.
3) Based on the dimensions of the data involved, association rules can be divided into single-dimensional and multidimensional rules.
A single-dimensional association rule involves only one dimension of the data, such as the items purchased by users.
A multidimensional association rule involves multiple dimensions of the data.
4 Algorithm Review
4.1 The Classic Frequent-Itemset Algorithm
Agrawal et al. proposed an important method for mining association rules in customer transaction databases [AS94a, AS94b]: a recursive algorithm whose core is the two-phase generation of frequent itemsets. In terms of the classification above, the rules it mines are single-dimensional, single-level, Boolean association rules.
Any itemset whose support is no less than the minimum support is called a frequent itemset.
4.1.1 Basic idea of the algorithm
The algorithm first identifies all frequent itemsets, i.e., itemsets that occur at least as frequently as a predefined minimum support. Strong association rules are then generated from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence.
The overall performance of mining association rules is determined by the first step; the second step is comparatively easy.
4.1.2 Analysis of the Apriori Core Algorithm
To generate all frequent itemsets, a recursive, level-wise method is used. Its core idea is briefly described as follows:
(1) L1 = {large 1-itemsets};
(2) for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)     Ck = apriori-gen(Lk-1);    // new candidate sets
(4)     for all transactions T ∈ D do begin
(5)         CT = subset(Ck, T);    // candidate sets contained in transaction T
(6)         for all candidates c ∈ CT do
(7)             c.count++;
(8)     end
(9)     Lk = {c ∈ Ck | c.count ≥ minsup}
(10) end
(11) Answer = ∪k Lk;
The algorithm first produces the frequent 1-itemsets L1, then the frequent 2-itemsets L2, and so on, until some value of k makes Lk empty, at which point the algorithm stops. In the k-th pass, the algorithm first generates the set Ck of candidate k-itemsets, where each itemset in Ck is produced by joining two itemsets of Lk-1 that share exactly (k-2) items. The itemsets in Ck are the candidates for frequent itemsets, and the final frequent itemset collection Lk must be a subset of Ck. Every element of Ck must be verified against the transaction database to decide whether it joins Lk, and this verification is the performance bottleneck of the algorithm. The method may require many scans of a possibly very large transaction database: if the longest frequent itemset contains 10 items, the database must be scanned 10 times, which imposes a heavy I/O load.
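The level-wise loop can be made concrete with a short, deliberately unoptimized Python sketch. It follows the pseudocode above but omits the pruning discussed in Section 4.1.3; taking `minsup` as an absolute support count is an assumption of this sketch:

```python
def apriori(db, minsup):
    """Minimal Apriori sketch: db is a list of sets, minsup an absolute count.
    Example: apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], 2)"""
    # One scan to find the frequent 1-itemsets L1.
    items = {i for t in db for i in t}
    lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= minsup}
    answer = set(lk)
    k = 2
    while lk:
        # Join step: the union of two (k-1)-itemsets sharing k-2 items
        # is a candidate k-itemset.
        ck = {a | b for a in lk for b in lk if len(a | b) == k}
        # Count step: one full database scan per level (the bottleneck).
        lk = {c for c in ck if sum(1 for t in db if c <= t) >= minsup}
        answer |= lk
        k += 1
    return answer
```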
The potentially huge candidate sets that may be generated, and the need to scan the database repeatedly, are the two shortcomings of the Apriori algorithm.
4.1.3 Optimization of the Algorithm
To improve the efficiency of the algorithm, Mannila et al. introduced a pruning technique to reduce the size of the candidate set Ck [MTV94], which can significantly improve the performance of any frequent-itemset algorithm. The pruning strategy is based on the following property: every subset of a frequent itemset must itself be frequent. Hence, if some (k-1)-subset of a candidate in Ck does not belong to Lk-1, that candidate can be pruned away; this pruning step reduces the cost of counting the supports of all candidate sets.
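A minimal sketch of this pruning step, written against the `apriori` sketch above (itemsets are assumed to be represented as `frozenset`s):

```python
from itertools import combinations

def apriori_gen(lk_prev, k):
    """Candidate generation with subset pruning: any candidate k-itemset
    that has an infrequent (k-1)-subset is discarded before counting."""
    joined = {a | b for a in lk_prev for b in lk_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in lk_prev for s in combinations(c, k - 1))}
```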
4.2 Improved Frequent-Itemset Algorithms
4.2.1 Hashing
This algorithm was proposed by Park et al. in 1995 [PCY95a]. Experiments show that the dominant cost in finding frequent itemsets lies in generating the frequent 2-itemsets L2, so Park et al. introduced a hashing technique to improve the generation of frequent 2-itemsets.
Its basic idea is as follows: during the scan that counts the candidate 1-itemsets in C1 to produce the frequent 1-itemsets L1, all 2-itemsets of each transaction are also generated and hashed into the buckets of a hash table, incrementing the corresponding bucket counts. A 2-itemset whose bucket count falls below the minimum support cannot be frequent and can be deleted from the candidate 2-itemsets, which greatly compresses the set of candidate 2-itemsets that must be considered.
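The following sketch illustrates the bucket-counting idea; the number of buckets and the use of Python's built-in `hash` are illustrative choices, not part of the original algorithm:

```python
from itertools import combinations

def pair_bucket_filter(db, minsup, n_buckets=101):
    """During the first scan, hash every 2-itemset of each transaction into
    a bucket; a pair whose bucket total is below minsup cannot be frequent."""
    buckets = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1

    def may_be_frequent(pair):
        # Bucket counts over-approximate pair counts, so this filter is safe:
        # it never discards a truly frequent pair.
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= minsup
    return may_be_frequent
```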
4.2.2 Transaction compression
Agrawal et al. proposed a method that compresses the number of transactions scanned in later iterations [AS94b, HF95]: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so such transactions can be marked for deletion and skipped in subsequent database scans.
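A one-line sketch of this compression, assuming the current frequent k-itemsets `lk` are available as a set of `frozenset`s:

```python
def compress(db, lk):
    """Keep only transactions that contain at least one frequent k-itemset;
    the others cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in db if any(s <= t for s in lk)]
```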
4.2.3 Partition
Savasere et al. designed a partition-based algorithm [SON95]. The algorithm first divides the database logically into blocks that do not intersect each other, considers one block at a time, and generates all frequent itemsets local to that block; the local frequent itemsets are then merged to produce the set of all possible global frequent itemsets, and finally the supports of these itemsets are counted. The block size is chosen so that each block fits in main memory, and each phase scans the database only once. The correctness of the algorithm is guaranteed by the fact that every possible frequent itemset must be locally frequent in at least one block. The algorithm is highly parallelizable: each block can be assigned to a processor that generates its frequent itemsets. After each pass of frequent-itemset generation, the processors communicate to produce the global candidate k-itemsets. Usually this communication is the main bottleneck of the execution time; on the other hand, the time each independent processor takes to generate its frequent itemsets is also a bottleneck. Other methods share the work of generating itemsets among multiple processors. More parallel methods for generating frequent itemsets can be found in [AS96].
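A sketch of the two sequential phases (ignoring the parallel variant), reusing the `apriori` sketch from Section 4.1.2; the block count and thresholds are illustrative:

```python
def partition_mine(db, minsup_ratio, n_blocks):
    """Phase 1: mine each memory-resident block for locally frequent
    itemsets. Phase 2: one full scan to verify the union of local results."""
    size = (len(db) + n_blocks - 1) // n_blocks
    blocks = [db[i:i + size] for i in range(0, len(db), size)]
    candidates = set()
    for block in blocks:  # phase 1: one scan per block
        candidates |= apriori(block, max(1, int(minsup_ratio * len(block))))
    global_minsup = minsup_ratio * len(db)  # phase 2: single verification scan
    return {c for c in candidates
            if sum(1 for t in db if c <= t) >= global_minsup}
```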
4.2.4 Sampling
The basic idea is to mine a subset of the given data. By carefully combining and analyzing the information obtained in a previous scan, one can obtain an improved algorithm. Mannila et al. first considered this point [MTV94], arguing that sampling is an effective way to discover rules. Toivonen subsequently developed the idea further [Toi96]: first use a sample drawn from the database to obtain rules that probably hold over the entire database, and then verify these results on the remainder of the database. Toivonen's algorithm is quite simple and significantly reduces I/O cost, but a major drawback is that the results can be inaccurate, i.e., the problem of so-called data skew. Data located on the same page tend to be highly correlated and may not represent the distribution of patterns in the whole database; as a result, the cost of sampling, say, 5% of the transaction data may come close to that of processing the full data.
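A simplified sketch of the sampling approach; the `slack` factor that lowers the threshold is an illustrative choice, and Toivonen's negative-border check, which detects possible misses, is omitted for brevity:

```python
import random

def sample_mine(db, minsup_ratio, sample_size, slack=0.8):
    """Mine a random sample at a lowered threshold (slack < 1 reduces the
    chance of missing true frequent itemsets), then verify candidates on
    the full database. Reuses the `apriori` sketch above."""
    sample = random.sample(db, sample_size)
    lowered = max(1, int(slack * minsup_ratio * len(sample)))
    candidates = apriori(sample, lowered)      # cheap, in-memory pass
    return {c for c in candidates              # one full verification scan
            if sum(1 for t in db if c <= t) >= minsup_ratio * len(db)}
```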
4.2.5 Dynamic Itemset Counting
Brin et al. gave this algorithm [BMUT97]. The dynamic itemset counting technique divides the database into blocks marked by start points. Unlike Apriori, which determines new candidates only before each complete database scan, this variant can add new candidate itemsets at any start point. The technique dynamically evaluates the support of all itemsets counted so far; if all subsets of an itemset have been confirmed frequent, the itemset is added as a new candidate. The resulting algorithm requires fewer database scans than Apriori.
4.3 The FP-tree Frequent-Itemset Algorithm
Addressing the inherent defects of the Apriori algorithm, J. Han et al. proposed a method that mines frequent patterns without generating candidate sets [HPY00]. Adopting a divide-and-conquer strategy, after the first scan it compresses the frequent itemsets of the database into a frequent-pattern tree (FP-tree) while retaining the association information; the FP-tree is then divided into a set of conditional pattern bases, each associated with a frequent itemset of length 1, and these conditional bases are mined separately. When the amount of raw data is very large, this can be combined with partitioning so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to frequent itemsets of different lengths, while improving enormously on the efficiency of the Apriori algorithm.
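A minimal sketch of the FP-tree construction step only; the header-table links and the recursive mining of conditional pattern bases that make up the rest of FP-growth are omitted for brevity:

```python
from collections import defaultdict

class FPNode:
    """One node of the frequent-pattern tree: an item, a count, a parent
    link, and children keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(db, minsup):
    # Scan 1: count items and keep only the frequent ones.
    freq = defaultdict(int)
    for t in db:
        for i in t:
            freq[i] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    # Scan 2: insert each transaction with items ordered by descending
    # frequency, so that common prefixes share paths (the compression).
    root = FPNode(None)
    for t in db:
        node = root
        for i in sorted((i for i in t if i in freq),
                        key=lambda i: (-freq[i], i)):
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1
    return root
```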
4.4 Multilevel Association Rules
For many applications, it is difficult to find strong association rules at the primitive level of the data because the data are sparsely distributed there. Once concept hierarchies are introduced, mining can be carried out at higher levels [HF95, SA95]. Although rules obtained at higher levels may represent more general information, what is commonplace for one user may be new to another, so data mining should provide the ability to mine at multiple abstraction levels. Classification of multilevel association rules: according to the levels involved in a rule, multilevel association rules can be divided into same-level association rules and cross-level association rules.
Multilevel association rules can basically follow the same "support-confidence" framework, but some issues must be considered when setting the support thresholds.
Same-level association rules can adopt two support strategies:
1) Uniform minimum support. The same minimum support is used at every level. This is simple for both users and the algorithm, but its drawbacks are also obvious.
2) Reduced minimum support. Each level has its own minimum support, with lower levels given smaller values. Information obtained at the upper levels can also be used to filter the candidates at lower levels. A small sketch contrasting the two strategies is given below.
When setting the minimum support for cross-level association rules, it should be determined according to the minimum support of the lower of the levels involved.
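A tiny sketch contrasting the two strategies; the level numbers and threshold values are invented for illustration:

```python
# Uniform support uses one value at every level; reduced support shrinks
# the threshold at deeper (more specific) levels of the concept hierarchy.
uniform_minsup = {1: 0.05, 2: 0.05, 3: 0.05}
reduced_minsup = {1: 0.05, 2: 0.03, 3: 0.01}

def passes(level, support, thresholds):
    """True if an itemset's support clears the minimum for its level."""
    return support >= thresholds[level]

print(passes(3, 0.02, uniform_minsup))  # False under uniform support
print(passes(3, 0.02, reduced_minsup))  # True under reduced support
```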
4.5 Multidimensional Association Rule Mining
For multidimensional databases, besides intra-dimensional association rules there is another class of rules: multidimensional association rules. For example:
Age (X, "20 ... 30") occupation (X, "Student") ==> Purchase (X, "Laptop")
Three dimensions are involved here: age, occupation, and purchase.
Depending on whether the same dimensional predicate may appear more than once, these rules can be subdivided into inter-dimensional association rules (repeated predicates are not allowed) and hybrid-dimensional association rules (the same predicate may appear more than once in a rule). For example:
Age (X, "20 ... 30") Buy (X, "Laptop") ==> Purchase (X, "Printer")
This rule is a hybrid-dimensional association rule.
When mining inter-dimensional and hybrid-dimensional association rules, the different field types must be considered: categorical and quantitative.
Categorical fields can be handled by the original algorithm, while quantitative fields require special treatment [KHC97]. The main methods for handling quantitative fields are the following (a sketch of the first approach is given after the list):
1) Partition the quantitative field into predefined hierarchical intervals. The intervals are predefined by the user, and the resulting rules are called static quantitative association rules.
2) Partition the quantitative field into Boolean fields according to the distribution of the data. Each Boolean field represents an interval of the quantitative field; its value is 1 if the value falls inside the interval and 0 otherwise. The partitioning is dynamic, and the derived rules are called quantitative association rules.
3) Partition the quantitative field into intervals that reflect its meaning, taking the distances between data points into account. The resulting rules are called distance-based association rules.
4) Analyze the raw values of the quantitative field directly with statistical methods and combine this with the idea of multilevel association rules, comparing across abstraction levels to derive useful rules. The resulting rules are called multilevel quantitative association rules.
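A sketch of the first (static) approach; the age boundaries below are invented for illustration:

```python
def static_bins(value, boundaries):
    """Map a numeric value into one of a set of user-predefined intervals;
    the interval label then acts as a categorical item for Boolean mining."""
    for low, high in boundaries:
        if low <= value < high:
            return f"[{low},{high})"
    return None

age_bins = [(0, 20), (20, 30), (30, 40), (40, 120)]
print(static_bins(25, age_bins))  # -> "[20,30)"
```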
5 Outlook
Regarding the development of the association rule field, the author believes that in-depth research can be carried out in several directions: how to improve algorithm efficiency when handling extremely large amounts of data; further study of mining algorithms for rapidly updated data; providing ways for the user to interact with the mining process so that the user's domain knowledge can be incorporated; and the visual presentation of the mining process and of the resulting rules, among others.
References
[AIS93b] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conf. Management of Data, pp. 207-216, May 1993.
[AS94a] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, Jun. 1994.
[AS94b] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), Sep. 1994.
[AS96] R. Agrawal and J. Shafer. Parallel mining of association rules: design, implementation, and experience. IEEE Trans. Knowledge and Data Engineering, 8:962-969, Jan. 1996.
[BMUT97] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. ACM SIGMOD Int. Conf. Management of Data, pp. 255-264, May 1997.
[HF95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. 21st Int. Conf. Very Large Data Bases (VLDB'95), pp. 402-431, Sep. 1995.
[HPY00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pp. 1-12, May 2000.
[KHC97] M. Kamber, J. Han, and J. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pp. 207-210, Aug. 1997.
[KPR98] J. Kleinberg, C. Papadimitriou, and P. Raghavan. Segmentation problems. In Proc. 30th Annual ACM Symposium on Theory of Computing, 1998.
[MTV94] H. Mannila, H. Toivonen, and A. Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, Jul. 1994.
[PCY95a] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proc. ACM SIGMOD Int. Conf. Management of Data, pp. 175-186, May 1995.
[PCY95b] J. S. Park, M. S. Chen, and P. S. Yu. Efficient parallel data mining of association rules. In 4th Int. Conf. Information and Knowledge Management, Baltimore, Maryland, Nov. 1995.
[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 21st Int. Conf. Very Large Data Bases, pp. 407-419, Sep. 1995.
[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st Int. Conf. Very Large Data Bases, pp. 432-443, Sep. 1995.
[Toi96] H. Toivonen. Sampling large databases for association rules. In Proc. 22nd Int. Conf. Very Large Data Bases, Bombay, India, pp. 134-145, Sep. 1996.
First published: 2003-09-23. Last revised: 2003-09-23.