6 The Apriori property has many applications. For example, it can also be used to prune the search during data cube computation (Chapter 5).
“How is the Apriori property used in the algorithm?” To understand this, let us look at how Lk − 1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk − 1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk − 1. The notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second to the last item in l1). For efficient implementation, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k − 1)-itemset li, this means that the items are sorted such that li[1] < li[2] < ⋯ < li[k − 1]. The join, Lk − 1 ⋈ Lk − 1, is performed, where members of Lk − 1 are joinable if their first (k − 2) items are in common. That is, members l1 and l2 of Lk − 1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ⋯ ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]). The condition l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], …, l1[k − 2], l1[k − 1], l2[k − 1]} (see the code sketch after this list).
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk − 1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
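To make these two steps concrete, the following minimal Python sketch implements candidate generation (the name apriori_gen echoes the book's pseudocode; representing each itemset as a lexicographically sorted tuple, and using a plain set in place of the hash tree mentioned above, are simplifications of our own):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate the candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev.

    Itemsets are lexicographically sorted tuples, so the join condition
    reduces to: the first k-2 items agree, and the last item of l1 is
    smaller than the last item of l2.
    """
    L_prev = sorted(L_prev)      # lexicographic order, as Apriori assumes
    frequent = set(L_prev)       # stand-in for the hash tree of frequent itemsets
    C_k = []
    for i, l1 in enumerate(L_prev):
        for l2 in L_prev[i + 1:]:
            # Join step: the first k-2 items must agree; because L_prev is
            # sorted and l2 comes after l1, the condition l1[k-1] < l2[k-1]
            # on the last items holds automatically, so no duplicates arise.
            if l1[:k - 2] != l2[:k - 2]:
                break            # sorted order: later l2's cannot match either
            candidate = l1 + (l2[-1],)
            # Prune step: keep the candidate only if every (k-1)-subset
            # is itself frequent (the Apriori property).
            if all(sub in frequent for sub in combinations(candidate, k - 1)):
                C_k.append(candidate)
    return C_k

For instance, apriori_gen([('I1', 'I2'), ('I1', 'I3'), ('I2', 'I3')], 3) returns [('I1', 'I2', 'I3')]: the pairs ('I1', 'I2') and ('I1', 'I3') join on their common first item, and all three 2-subsets of the result are frequent, so nothing is pruned.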
Example (Apriori). Let's look at a concrete example, based on the AllElectronics transaction database, D, of Table 6.1. There are nine transactions in this database, that is, |D| = 9. We use Figure 6.2 to illustrate the Apriori algorithm for finding frequent itemsets in D.
Figure 6.2 Generation of the candidate itemsets and frequent itemsets, where the minimum support count is 2.
Table 6.1 Transactional Data for an AllElectronics Branch
TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
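For readers who want to trace the following steps in code, the database D of Table 6.1 can be written out directly (a minimal sketch; the variable names D and min_sup are our own choices):

# Transaction database D from Table 6.1, keyed by TID.
D = {
    'T100': {'I1', 'I2', 'I5'},
    'T200': {'I2', 'I4'},
    'T300': {'I2', 'I3'},
    'T400': {'I1', 'I2', 'I4'},
    'T500': {'I1', 'I3'},
    'T600': {'I2', 'I3'},
    'T700': {'I1', 'I3'},
    'T800': {'I1', 'I2', 'I3', 'I5'},
    'T900': {'I1', 'I2', 'I3'},
}
min_sup = 2  # minimum support count used throughout Figure 6.2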
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. (Here, we are referring to absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.7 C2 consists of (|L1| choose 2) = (5 choose 2) = 10 2-itemsets. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
7 L1 ⋈ L1 is equivalent to L1 × L1, since the definition of Lk ⋈ Lk requires the two joining itemsets to share k − 1 = 0 items.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in Figure 6.2.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, works in the same way: the join step yields L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}, and the prune step then removes every candidate with an infrequent 2-item subset (e.g., {I1, I3, I5} is removed because {I3, I5} is not in L2), leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
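Putting the pieces together, a compact sketch of the full level-wise loop follows; it reuses the D and apriori_gen defined above, and the helper names support_counts and apriori are ours rather than the book's:

def support_counts(D, candidates):
    """Scan D once, counting how many transactions contain each candidate."""
    counts = {c: 0 for c in candidates}
    for items in D.values():
        for c in counts:
            if items.issuperset(c):
                counts[c] += 1
    return counts

def apriori(D, min_sup):
    """Return every frequent itemset in D together with its support count."""
    # C1: every item that occurs in D is a candidate 1-itemset.
    items = sorted({i for t in D.values() for i in t})
    counts = support_counts(D, [(i,) for i in items])
    L = {c for c, n in counts.items() if n >= min_sup}   # L1
    frequent = {}
    k = 2
    while L:
        frequent.update({c: counts[c] for c in L})       # record L(k-1)
        C_k = apriori_gen(L, k)                          # join + prune
        counts = support_counts(D, C_k)                  # one scan of D per level
        L = {c for c, n in counts.items() if n >= min_sup}
        k += 1
    return frequent

On the Table 6.1 data, apriori(D, min_sup) reproduces Figure 6.2: all five 1-itemsets are frequent, six of the ten candidate pairs form L2, both candidates in C3 = {{I1, I2, I3}, {I1, I2, I5}} turn out to be frequent, and C4 then comes out empty, so the algorithm terminates.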