Data Mining_ Concepts and Techniques - Jiawei Han [171]
02 T200 {Best-Cheese, Dairyland-Milk, Goldenfarm-Apple, Tasty-Pie, Wonder-Bread}
01 T300 {Westcoast-Apple, Dairyland-Milk, Wonder-Bread, Tasty-Pie}
03 T400 {Wonder-Bread, Sunset-Milk, Dairyland-Cheese}
(a) At the granularity of item_category (e.g., could be “Milk”), for the rule template,
list the frequent k-itemset for the largest k, and all the strong association rules (with their support s and confidence c) containing the frequent k-itemset for the largest k.
(b) At the granularity of brand-item_category (e.g., could be “Sunset-Milk”), for the rule template,
list the frequent k-itemset for the largest k (but do not print any rules).
6.9 Suppose that a large store has a transactional database that is distributed among four locations. Transactions in each component database have the same format, namely , where Tj is a transaction identifier, and ik () is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules. You may present your algorithm in the form of an outline. Your algorithm should not require shipping all the data to one site and should not cause excessive network communication overhead.
6.10 Suppose that frequent itemsets are saved for a large transactional database, . Discuss how to efficiently mine the (global) association rules under the same minimum support threshold, if a set of new transactions, denoted as , is (incrementally) added in?
6.11 Most frequent pattern mining algorithms consider only distinct items in a transaction. However, multiple occurrences of an item in the same shopping basket, such as four cakes and three jugs of milk, can be important in transactional data analysis. How can one mine frequent itemsets efficiently considering multiple occurrences of items? Propose modifications to the well-known algorithms, such as Apriori and FP-growth, to adapt to such a situation.
6.12 (Implementation project) Many techniques have been proposed to further improve the performance of frequent itemset mining algorithms. Taking FP-tree–based frequent pattern growth algorithms (e.g., FP-growth) as an example, implement one of the following optimization techniques. Compare the performance of your new implementation with the unoptimized version.
(a) The frequent pattern mining method of Section 6.2.4 uses an FP-tree to generate conditional pattern bases using a bottom-up projection technique (i.e., project onto the prefix path of an item p). However, one can develop a top-down projection technique, that is, project onto the suffix path of an item p in the generation of a conditional pattern base. Design and implement such a top-down FP-tree mining method. Compare its performance with the bottom-up projection method.
(b) Nodes and pointers are used uniformly in an FP-tree in the FP-growth algorithm design. However, such a structure may consume a lot of space when the data are sparse. One possible alternative design is to explore array- and pointer-based hybrid implementation, where a node may store multiple items when it contains no splitting point to multiple sub-branches. Develop such an implementation and compare it with the original one.
(c) It is time and space consuming to generate numerous conditional pattern bases during pattern-growth mining. An interesting alternative is to push right the branches that have been mined for a particular item p, that is, to push them to the remaining branch(es) of the FP-tree. This is done so that fewer conditional pattern bases have to be generated and additional sharing can be explored when mining the remaining FP-tree branches. Design and implement such a method and conduct a performance study on it.
6.13 Give a short example to show that items in a strong association rule actually may be negatively correlated.
6.14 The following contingency table summarizes supermarket transaction data, where hot dogs refers to the transactions containing hot dogs, refers to the transactions that do not contain hot dogs, hamburgers refers to the transactions