Data Mining: Concepts and Techniques - Jiawei Han
Let C be the set of closed frequent itemsets for a data set D satisfying a minimum support threshold, min_sup. Let M be the set of maximal frequent itemsets for D satisfying min_sup. Suppose that we have the support count of each itemset in C and M. Notice that C and its count information can be used to derive the whole set of frequent itemsets. Thus, we say that C contains complete information regarding its corresponding frequent itemsets. On the other hand, M registers only the support of the maximal itemsets. It usually does not contain the complete support information regarding its corresponding frequent itemsets. We illustrate these concepts with Example 6.2.
Example 6.2: Closed and maximal frequent itemsets
Suppose that a transaction database has only two transactions: {a1, a2, …, a100} and {a1, a2, …, a50}. Let the minimum support count threshold be min_sup = 1. We find two closed frequent itemsets and their support counts, that is, C = {{a1, a2, …, a100}: 1; {a1, a2, …, a50}: 2}. There is only one maximal frequent itemset: M = {{a1, a2, …, a100}: 1}. Notice that we cannot include {a1, a2, …, a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, …, a100}. Compare this to the preceding, where we determined that there are 2^100 - 1 frequent itemsets, which are too many to be enumerated!
The set of closed frequent itemsets contains complete information regarding the frequent itemsets. For example, from C, we can derive, say, (1) {a2, a45: 2} since {a2, a45} is a sub-itemset of the itemset {a1, a2, …, a50}; and (2) {a8, a55: 1} since {a8, a55} is not a sub-itemset of the previous itemset but of the itemset {a1, a2, …, a100}. However, from the maximal frequent itemset, we can only assert that both itemsets ({a2, a45} and {a8, a55}) are frequent, but we cannot assert their actual support counts.
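The derivation above can be sketched in code: the support count of any frequent itemset equals the maximum count among the closed frequent itemsets that contain it. This is a minimal illustration, not code from the text; the helper name is mine, and the data assumes the two-transaction example with items a1 through a100.

```python
def support_from_closed(itemset, closed_itemsets):
    """Derive the support count of `itemset` from the closed frequent
    itemsets: it is the maximum count over all closed supersets of
    `itemset` (None if no closed superset exists, i.e., not frequent)."""
    counts = [count for closed, count in closed_itemsets
              if itemset <= closed]
    return max(counts) if counts else None

# Closed frequent itemsets from the example: {a1..a100}: 1 and {a1..a50}: 2
C = [
    (frozenset(f"a{i}" for i in range(1, 101)), 1),
    (frozenset(f"a{i}" for i in range(1, 51)), 2),
]

support_from_closed({"a2", "a45"}, C)  # 2: contained in the closed set {a1..a50}
support_from_closed({"a8", "a55"}, C)  # 1: contained only in {a1..a100}
```

Any itemset with no closed superset is simply not frequent, which is why the closed set alone suffices to recover all support counts.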
6.2. Frequent Itemset Mining Methods
In this section, you will learn methods for mining the simplest form of frequent patterns such as those discussed for market basket analysis in Section 6.1.1. We begin by presenting Apriori, the basic algorithm for finding frequent itemsets (Section 6.2.1). In Section 6.2.2, we look at how to generate strong association rules from frequent itemsets. Section 6.2.3 describes several variations to the Apriori algorithm for improved efficiency and scalability. Section 6.2.4 presents pattern-growth methods for mining frequent itemsets that confine the subsequent search space to only the data sets containing the current frequent itemsets. Section 6.2.5 presents methods for mining frequent itemsets that take advantage of the vertical data format.
6.2.1. Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see later. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
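The level-wise search just described can be sketched as follows. This is a deliberately minimal, unoptimized illustration of the idea (the function and the tiny market-basket data set are my own, not the book's pseudocode), showing how each Lk requires one full scan of the transactions:

```python
def apriori(transactions, min_sup):
    """Level-wise search: frequent k-itemsets (Lk) are used to generate
    candidate (k+1)-itemsets; each level needs one full database scan."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: one scan to count each item, keeping those meeting min_sup
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = set(Lk)
    k = 1
    while Lk:
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # One full scan of the database to count each candidate's support
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_sup}
        frequent |= Lk
    return frequent

freq = apriori([{"beer", "nuts"}, {"beer", "diaper"},
                {"beer", "nuts", "diaper"}], min_sup=2)
# {beer}, {nuts}, {diaper}, {beer, nuts}, and {beer, diaper} are frequent;
# {nuts, diaper} appears in only one transaction, so it is pruned at level 2.
```

The loop terminates as soon as some level produces no frequent itemsets, since no larger itemset could then be frequent.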
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup.
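The contrapositive of this observation is what enables pruning: if any (k-1)-subset of a candidate k-itemset is not frequent, the candidate itself cannot be frequent and can be discarded without ever counting its support. A minimal sketch (the helper name and the hypothetical frequent 2-itemsets are mine):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori-property prune: return True if any (k-1)-subset of the
    k-itemset `candidate` is missing from the frequent (k-1)-itemsets,
    in which case `candidate` cannot possibly be frequent."""
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, len(candidate) - 1))

# Hypothetical frequent 2-itemsets: {A,B}, {A,C}, {B,C}, {B,D}
L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

has_infrequent_subset(frozenset("ABC"), L2)  # False: all 2-subsets frequent
has_infrequent_subset(frozenset("ABD"), L2)  # True: {A, D} is not frequent
```

Candidates that fail this check are removed before the database scan, which is where the level-wise algorithm saves most of its work.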
This property belongs to a special category of properties