Data Mining_ Concepts and Techniques - Jiawei Han [164]
Similar to Example 6.2, suppose a transaction database has only two transactions: {, }, and the minimum support count is min_sup = 2. The projection on the first item, , derives the frequent itemset, {}, based on the itemset merging optimization. Because support ({}) = support ({}) = 2, and {} is a proper subset of {}, there is no need to examine and its projected database. Similar pruning can be done for as well. Thus, the mining of closed frequent itemsets in this data set terminates after mining 's projected database.
Item skipping: In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.
Consider, for example, the previous transaction database having only two transactions: {, }, where . Because in 's projected database has the same support as in the global header table, can be pruned from the global header table. Similar pruning can be done for . There is no need to mine anything more after mining 's projected database.
Besides pruning the search space in the closed itemset mining process, another important optimization is to perform efficient checking of each newly derived frequent itemset to see whether it is closed. This is because the mining process cannot ensure that every generated frequent itemset is closed.
When a new frequent itemset is derived, it is necessary to perform two kinds of closure checking: (1) superset checking, which checks if this new frequent itemset is a superset of some already found closed itemsets with the same support, and (2) subset checking, which checks whether the newly found itemset is a subset of an already found closed itemset with the same support.
If we adopt the item merging pruning method under a divide-and-conquer framework, then the superset checking is actually built-in and there is no need to explicitly perform superset checking. This is because if a frequent itemset is found later than itemset X, and carries the same support as X, it must be in X's projected database and must have been generated during itemset merging.
To assist in subset checking, a compressed pattern-tree can be constructed to maintain the set of closed itemsets mined so far. The pattern-tree is similar in structure to the FP-tree except that all the closed itemsets found are stored explicitly in the corresponding tree branches. For efficient subset checking, we can use the following property: If the current itemset can be subsumed by another already found closed itemset , then (1) and have the same support, (2) the length of is smaller than that of , and (3) all of the items in are contained in .
Based on this property, a two-level hash index structure can be built for fast accessing of the pattern-tree: The first level uses the identifier of the last item in as a hash key (since this identifier must be within the branch of ), and the second level uses the support of as a hash key (since and have the same support). This will substantially speed up the subset checking process.
This discussion illustrates methods for efficient mining of closed frequent itemsets. “Can we extend these methods for efficient mining of maximal frequent itemsets?” Because maximal frequent itemsets share many similarities with closed frequent itemsets, many of the optimization techniques developed here can be extended to mining maximal frequent itemsets. However, we leave this method as an exercise for interested readers.
6.3. Which Patterns Are Interesting?—Pattern Evaluation Methods
Most association rule mining algorithms employ a support–confidence framework. Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many of the rules generated are still not interesting to the users. Unfortunately, this is especially true when mining at low support thresholds or mining