■ Not all strong association rules are interesting. Therefore, the support–confidence framework should be augmented with a pattern evaluation measure, which promotes the mining of interesting rules. A measure is null-invariant if its value is free from the influence of null-transactions (i.e., the transactions that do not contain any of the itemsets being examined). Among many pattern evaluation measures, we examined lift, χ², all_confidence, max_confidence, Kulczynski, and cosine, and showed that only the latter four are null-invariant. We suggest using the Kulczynski measure, together with the imbalance ratio, to present pattern relationships among itemsets.
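For concreteness, both suggested measures can be computed directly from support counts. The following Python sketch (function names and the example counts are illustrative, not from the text) computes Kulczynski and the imbalance ratio; note that neither formula involves null-transactions, which is exactly the null-invariance property.

    # A minimal sketch of the Kulczynski measure and the imbalance ratio (IR),
    # computed from raw support counts. Names and values are illustrative.

    def kulczynski(sup_a, sup_b, sup_ab):
        # Kulc(A, B) = (P(A|B) + P(B|A)) / 2
        return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

    def imbalance_ratio(sup_a, sup_b, sup_ab):
        # IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))
        return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

    # Hypothetical example: 1000 transactions contain A, 1000 contain B,
    # and 100 contain both. Kulc = 0.1 (negative correlation) and IR = 0
    # (perfectly balanced); adding null-transactions changes neither value.
    print(kulczynski(1000, 1000, 100), imbalance_ratio(1000, 1000, 100))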
6.5 Exercises
6.1 Suppose you have the set of all frequent closed itemsets on a data set D, as well as the support count for each frequent closed itemset. Describe an algorithm to determine whether a given itemset X is frequent or not, and the support of X if it is frequent.
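As a hint, one standard approach is sketched below in Python (this is one possible answer, not the only one): X is frequent if and only if it is contained in some frequent closed itemset, and in that case sup(X) equals the maximum support among X's closed supersets. The dictionary closed is a hypothetical map from each frequent closed itemset to its support count.

    def support_of(x, closed):
        # closed: dict mapping each frequent closed itemset (a frozenset)
        # to its support count.
        # sup(X) = max{ sup(C) : C frequent closed itemset, X ⊆ C };
        # if X has no frequent closed superset, X is not frequent.
        counts = [count for c, count in closed.items() if x <= c]
        return max(counts) if counts else None   # None means "not frequent"

    # Hypothetical data: {A, B, C} with support 3 and {A, B} with support 5.
    closed = {frozenset("ABC"): 3, frozenset("AB"): 5}
    print(support_of(frozenset("A"), closed))    # 5
    print(support_of(frozenset("D"), closed))    # None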
6.2 An itemset X is called a generator on a data set D if there does not exist a proper sub-itemset Y ⊂ X such that support(Y) = support(X). A generator X is a frequent generator if support(X) passes the minimum support threshold. Let 𝒢 be the set of all frequent generators on a data set D.
(a) Can you determine whether an itemset A is frequent, and the support of A if it is frequent, using only 𝒢 and the support counts of all frequent generators? If yes, present your algorithm. Otherwise, what other information is needed? Can you give an algorithm assuming the information needed is available?
(b) What is the relationship between closed itemsets and generators?
6.3 The Apriori algorithm makes use of prior knowledge of subset support properties.
(a) Prove that all nonempty subsets of a frequent itemset must also be frequent.
(b) Prove that the support of any nonempty subset s′ of an itemset s must be at least as great as the support of s.
(c) Given frequent itemset l and subset s of l, prove that the confidence of the rule “s′ ⇒ (l − s′)” cannot be more than the confidence of “s ⇒ (l − s),” where s′ is a subset of s.
(d) A partitioning variation of Apriori subdivides the transactions of a database D into n nonoverlapping partitions. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.
6.4 Let c be a candidate itemset in Ck generated by the Apriori algorithm. How many length-(k − 1) subsets do we need to check in the prune step? Per your previous answer, can you give an improved version of procedure has_infrequent_subset in Figure 6.4?
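As a hint toward one possible improvement, consider the sketch below, written under the assumption that each candidate c is formed by joining two frequent (k − 1)-itemsets l1 and l2, as Apriori's join step does: two of c's k length-(k − 1) subsets are then already known to be frequent, so only the remaining k − 2 need to be looked up.

    def has_infrequent_subset(c, l1, l2, L_prev):
        # c = l1 ⋈ l2, so the subsets l1 and l2 are known frequent already;
        # only the other k - 2 length-(k-1) subsets require a lookup in L_{k-1}.
        # c, l1, l2 are frozensets; L_prev is a set of frozensets.
        for item in c:
            s = c - {item}
            if s == l1 or s == l2:
                continue             # known frequent from the join step
            if s not in L_prev:
                return True          # c has an infrequent subset: prune c
        return False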
6.5 Section 6.2.2 describes a method for generating association rules from frequent itemsets. Propose a more efficient method. Explain why it is more efficient than the one proposed there. (Hint: Consider incorporating the properties of Exercise 6.3(b) and (c) into your design.)
6.6 A database has five transactions. Let min_sup = 60% and min_conf = 80%.
TID     items_bought
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes. (A brute-force checker for verifying the result is sketched after this exercise.)
(b) List all the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and itemᵢ denotes variables representing items (e.g., “A,” “B”):

∀X ∈ transaction, buys(X, item₁) ∧ buys(X, item₂) ⇒ buys(X, item₃)   [s, c]
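As a sanity check for part (a), the following brute-force enumerator (a sketch; deliberately neither Apriori nor FP-growth) counts the support of every candidate itemset over the five transactions, stopping early via the Apriori property.

    from itertools import combinations

    db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
    min_count = 3   # min_sup = 60% of 5 transactions

    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        # Count how many transactions contain each candidate k-itemset.
        level = {cand: sum(set(cand) <= t for t in db)
                 for cand in combinations(items, k)}
        level = {cand: n for cand, n in level.items() if n >= min_count}
        if not level:
            break   # Apriori property: no larger itemset can be frequent
        frequent.update(level)

    for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
        print("".join(itemset), n)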
6.7 (Implementation project) Using a programming language that you are familiar with, such as C++ or Java, implement three frequent itemset mining algorithms introduced in this chapter: (1) Apriori [AS94b], (2) FP-growth [HPY00] and (3) Eclat [Zak00] (mining using the vertical data format). Compare the performance of each algorithm with various kinds of large data sets. Write a report to analyze the situations (e.g., data size, data distribution, minimal support threshold setting, and pattern density) where one algorithm may perform better than the others, and state why.
6.8 A database has four transactions. Let min_sup = 60% and min_conf = 80%.
cust_ID   TID     items_bought (in the form of brand-item_category)
01        T100    {King's-Crab,