Data Mining - Mehmed Kantardzic [160]
TABLE 10.1. A Model of A Simple Transaction Database
Database DB:
TID Items
001 A C D
002 B C E
003 A B C E
004 B E
Let X be a set of items. A transaction T is said to contain X if and only if X ⊆ T. An association rule implies the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩Y = Ø. The rule X ⇒ Y holds in the transaction set DB with confidence c if c% of the transactions in DB that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set DB if s% of the transactions in DB contain X ∪ Y. Confidence denotes the strength of implication and support indicates the frequency of the patterns occurring in the rule. It is often desirable to pay attention to only those rules that may have a reasonably large support. Such rules with high confidence and strong support are referred to as strong rules. The task of mining association rules is essentially to discover strong association rules in large databases. The problem of mining association rules may be decomposed into two phases:
1. Discover the large itemsets, that is, the sets of items that have transaction support s above a predetermined minimum threshold.
2. Use the large itemsets to generate the association rules for the database that have confidence c above a predetermined minimum threshold.
The overall performance of mining association rules is determined primarily by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Efficient counting of large itemsets is thus the focus of most mining algorithms, and many efficient solutions have been designed to address previous criteria. The Apriori algorithm provided one early solution to the problem, and it will be explained in greater detail in this chapter. Other subsequent algorithms built upon the Apriori algorithm represent refinements of a basic solution and they are explained in a wide spectrum of articles including texts recommended in Section 10.10.
10.2 ALGORITHM APRIORI
The algorithm Apriori computes the frequent itemsets in the database through several iterations. Iteration i computes all frequent i-itemsets (itemsets with i elements). Each iteration has two steps: candidate generation and candidate counting and selection.
In the first phase of the first iteration, the generated set of candidate itemsets contains all 1-itemsets (i.e., all items in the database). In the counting phase, the algorithm counts their support by searching again through the whole database. Finally, only 1-itemsets (items) with s above required threshold will be selected as frequent. Thus, after the first iteration, all frequent 1-itemsets will be known.
What are the itemsets generated in the second iteration? In other words, how does one generate 2-itemset candidates? Basically, all pairs of items are candidates. Based on knowledge about infrequent itemsets obtained from previous iterations, the Apriori algorithm reduces the set of candidate itemsets by pruning—a priori—those candidate itemsets that cannot be frequent. The pruning is based on the observation that if an itemset is frequent all its subsets could be frequent as well. Therefore, before entering the candidate-counting step, the algorithm discards every candidate itemset that has an infrequent subset.
Consider the database in Table 10.1. Assume that the minimum support s = 50%, so an itemset is frequent if it is contained in at least 50% of the transactions—in our example, in two out of every four transactions in the database. In each iteration, the Apriori algorithm constructs a candidate set of large itemsets, counts the number of occurrences of each candidate, and then determines large itemsets based on the predetermined minimum support s = 50%.
In the first step of the first iteration, all single items are candidates. Apriori