Data Mining - Mehmed Kantardzic [161]
Figure 10.2. First iteration of the Apriori algorithm for a database DB. (a) Generate phase; (b1) count phase; (b2) select phase.
To discover the set of large 2-itemsets, because every subset of a large itemset must also have minimum support, the Apriori algorithm uses L1 * L1 to generate the candidates. The operation * is defined in general as

Lk * Lk = {X ∪ Y, where X, Y ∈ Lk and |X ∩ Y| = k − 1}
For k = 1 the operation represents a simple concatenation. Therefore, C2 consists of |L1| · (|L1| − 1)/2 2-itemsets generated by this operation as candidates in the second iteration. In our example, this number is 4 · 3/2 = 6. Scanning the database DB with this list, the algorithm counts the support for every candidate and, in the end, selects the large 2-itemsets L2 for which s ≥ 50%. All these steps and the corresponding results of the second iteration are given in Figure 10.3.
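As a minimal sketch, the second iteration can be reproduced in a few lines of Python. The four-transaction database below is an assumed reconstruction of DB (Table 10.1 is not reproduced here); it is chosen to be consistent with |L1| = 4 and with the negative border {D}, {A, B}, {A, E} discussed later:

```python
from itertools import combinations

# Assumed reconstruction of the example database DB (Table 10.1);
# consistent with |L1| = 4 and the negative border {D}, {A,B}, {A,E}.
DB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
MIN_COUNT = 2  # s >= 50% of four transactions

def support(itemset):
    """Number of transactions in DB that contain the itemset."""
    return sum(itemset <= t for t in DB)

# Large 1-itemsets from the first iteration
L1 = [frozenset([i]) for i in sorted(set().union(*DB))
      if support(frozenset([i])) >= MIN_COUNT]

# Generate phase: C2 = L1 * L1, i.e. all |L1|(|L1| - 1)/2 = 6 pairs
C2 = [a | b for a, b in combinations(L1, 2)]

# Count and select phases: keep candidates with s >= 50%
L2 = [c for c in C2 if support(c) >= MIN_COUNT]
```

With these transactions, C2 contains six candidates and L2 keeps four of them ({A, C}, {B, C}, {B, E}, {C, E}), in agreement with Figure 10.3.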
Figure 10.3. Second iteration of the Apriori algorithm for a database DB. (a) Generate phase; (b1) count phase; (b2) select phase.
The set of candidate 3-itemsets C3 is generated from L2 using the previously defined operation L2 * L2. In practice, two large 2-itemsets with the same first item, such as {B, C} and {B, E}, are identified in L2 first. Then, Apriori tests whether the 2-itemset {C, E}, which consists of the second items of the sets {B, C} and {B, E}, is itself a large 2-itemset. Because {C, E} is a large itemset, we know that all the subsets of {B, C, E} are large, and therefore {B, C, E} becomes a candidate 3-itemset. There is no other candidate 3-itemset generated from L2 in our database DB. Apriori then scans all the transactions and discovers the large 3-itemsets L3, as shown in Figure 10.4.
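The generate-and-prune step for C3 can be sketched as follows; `apriori_gen` is an illustrative name, and L2 restates the result of the second iteration (Figure 10.3):

```python
from itertools import combinations

# Large 2-itemsets from the second iteration (Figure 10.3)
L2 = {frozenset(p) for p in (("A", "C"), ("B", "C"), ("B", "E"), ("C", "E"))}

def apriori_gen(Lk, k):
    """Join Lk with itself (pairs sharing k - 1 items), then prune
    every candidate that has a k-subset outside Lk."""
    candidates = set()
    for a, b in combinations(Lk, 2):
        union = a | b
        if len(union) == k + 1:                        # join step
            if all(frozenset(s) in Lk                  # prune step
                   for s in combinations(union, k)):
                candidates.add(union)
    return candidates

C3 = apriori_gen(L2, 2)   # only {B, C, E} survives the prune
```

Note that {A, C} also joins with {B, C} and {C, E}, but the resulting candidates {A, B, C} and {A, C, E} are pruned because their subsets {A, B} and {A, E} are not large.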
Figure 10.4. Third iteration of the Apriori algorithm for a database DB. (a) Generate phase; (b1) count phase; (b2) select phase.
In our example, because no candidate 4-itemset can be constructed from L3, Apriori ends the iterative process.
Apriori counts not only the support of all frequent itemsets, but also the support of those infrequent candidate itemsets that could not be eliminated during the pruning phase. The set of all candidate itemsets that are infrequent but whose support is counted by Apriori is called the negative border. Thus, an itemset is in the negative border if it is infrequent, but all its subsets are frequent. In our example, analyzing Figures 10.2 and 10.3, we can see that the negative border consists of itemsets {D}, {A, B}, and {A, E}. The negative border is especially important for some improvements in the Apriori algorithm such as increased efficiency in the generation of large itemsets.
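A short sketch of computing the negative border from the discovered itemsets (the two sets below simply restate the results of Figures 10.2 through 10.4):

```python
from itertools import combinations

# All frequent itemsets found by Apriori in the example
frequent = {frozenset(s) for s in (
    ("A",), ("B",), ("C",), ("E",),
    ("A", "C"), ("B", "C"), ("B", "E"), ("C", "E"),
    ("B", "C", "E"))}

# All candidate itemsets whose support Apriori counted (C1, C2, C3)
counted = frequent | {frozenset(s) for s in (
    ("D",), ("A", "B"), ("A", "E"))}

def proper_subsets(itemset):
    """All nonempty proper subsets of an itemset."""
    return [frozenset(s) for k in range(1, len(itemset))
            for s in combinations(itemset, k)]

# Negative border: infrequent itemsets all of whose subsets are frequent
negative_border = {c for c in counted
                   if c not in frequent
                   and all(s in frequent for s in proper_subsets(c))}
```

For the singleton {D} the subset condition holds vacuously, so it belongs to the border as well, giving exactly {D}, {A, B}, and {A, E}.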
10.3 FROM FREQUENT ITEMSETS TO ASSOCIATION RULES
The second phase in discovering association rules, based on all frequent itemsets found in the first phase using the Apriori or some other similar algorithm, is relatively simple and straightforward. For a rule of the form {x1, x2, x3} → x4, it is necessary that both itemsets {x1, x2, x3, x4} and {x1, x2, x3} are frequent. Then, the confidence c of the rule is computed as the quotient of the supports for these itemsets: c = s(x1, x2, x3, x4)/s(x1, x2, x3). Strong association rules are rules with a confidence value c above a given threshold.
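The confidence computation is a one-line quotient. A minimal sketch, using support counts assumed to be consistent with the figures (both itemsets occur in 2 of the 4 transactions):

```python
# Support counts assumed to be consistent with Figures 10.3 and 10.4
support = {
    frozenset(("B", "C")): 2,
    frozenset(("B", "C", "E")): 2,
}

def confidence(antecedent, consequent):
    """c = s(antecedent + consequent) / s(antecedent)."""
    antecedent = frozenset(antecedent)
    union = antecedent | frozenset(consequent)
    return support[union] / support[antecedent]

c = confidence({"B", "C"}, {"E"})   # 2/2 = 1.0, i.e. 100%
```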
For our example of database DB in Table 10.1, if we want to check whether the association rule {B, C} → E is a strong rule, first we select the corresponding supports from tables L2 and L3:

s(B, C) = 2 (50%), s(B, C, E) = 2 (50%)

and using these supports we compute the confidence of the rule:

c = s(B, C, E)/s(B, C) = 2/2 = 1 (or 100%)
Whatever the selected threshold for strong association rules is (e.g., cT = 0.8 or 80%), this rule will pass because its confidence is maximal, that is, if a transaction contains items B and C, it will also contain item E. Other