Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [155]

By Root 1422 0

That is,

(6.2)

(6.3)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.

1Notice that the notation indicates the probability that a transaction contains the union of sets A and B (i.e., it contains every item in A and B). This should not be confused with , which indicates the probability that a transaction contains either A or B.

A set of items is referred to as an itemset. 2 An itemset that contains k items is a k-itemset. The set {computer, antivirus_software } is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. Note that the itemset support defined in Eq. (6.2) is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support. If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. 3 The set of frequent k-itemsets is commonly denoted by Lk. 4

2In the data mining research literature, “itemset” is more commonly used than “item set.”

3In early work, itemsets satisfying minimum support were referred to as large. This term, however, is somewhat confusing as it has connotations of the number of items in an itemset rather than the frequency of occurrence of the set. Hence, we use the more recent term frequent.

4Although the term frequent is preferred over large, for historic reasons frequent k-itemsets are still denoted as Lk.

From Eq. (6.3), we have

(6.4)

Equation (6.4) shows that the confidence of rule can be easily derived from the support counts of A and . That is, once the support counts of A, B, and are found, it is straightforward to derive the corresponding association rules and and check whether they are strong. Thus, the problem of mining association rules can be reduced to that of mining frequent itemsets.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

Additional interestingness measures can be applied for the discovery of correlation relationships between associated items, as will be discussed in Section 6.3. Because the second step is much less costly than the first, the overall performance of mining association rules is determined by the first step.

A major challenge in mining frequent itemsets from a large data set is the fact that such mining often generates a huge number of itemsets satisfying the minimum support (min_sup) threshold, especially when min_sup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length 100, such as {}, contains frequent 1-itemsets: , , …, ; frequent 2-itemsets: , ,…,; and so on. The total number of frequent itemsets that it contains is thus

(6.5)

This is too huge a number of itemsets for any computer to compute or store. To overcome this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent itemset.

An itemset X is closed in a data set D if there exists no proper super-itemset Y5 such that Y has the same support count as X in D. An itemset X is a closed frequent itemset in set D if X is both closed and frequent in D. An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X is frequent, and there exists no super-itemset Y such that and Y is frequent in D.

5Y is a proper super-itemset of X if

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [155]

®Online Book Reader