Data Mining_ Concepts and Techniques - Jiawei Han [192]
Shortcomings of closed itemsets and maximal itemsets for compression
Table 7.3 shows a subset of frequent itemsets on a large data set, where a, b, c, d, e, f represent individual items. There are no closed itemsets here; therefore, we cannot use closed frequent itemsets to compress the data. The only maximal frequent itemset is P3. However, we observe that itemsets P2, P3, and P4 are significantly different with respect to their support counts. If we were to use P3 to represent a compressed version of the data, we would lose this support count information entirely. From visual inspection, consider the two pairs (P1, P2) and (P4, P5). The patterns within each pair are very similar with respect to their support and expression. Therefore, intuitively, P2, P3, and P4, collectively, should serve as a better compressed version of the data.
Table 7.3 Subset of Frequent Itemsets
IDItemsetsSupport
P1 {b, c, d, e} 205, 227
P2 {b, c, d, e, f} 205, 211
P3 {a, b, c, d, e, f} 101, 758
P4 {a, c, d, e, f} 161, 563
P5 {a, c, d, e} 161, 576
So, let's see if we can find a way of clustering frequent patterns as a means of obtaining a compressed representation of them. We will need to define a good similarity measure, cluster patterns according to this measure, and then select and output only a representative pattern for each cluster. Since the set of closed frequent patterns is a lossless compression over the original frequent patterns set, it is a good idea to discover representative patterns over the collection of closed patterns.
We can use the following distance measure between closed patterns. Let P1 and P2 be two closed patterns. Their supporting transaction sets are T(P1) and T(P2), respectively. The pattern distance of P1 and P2, Pat_Dist(P1, P2), is defined as
(7.14)
Pattern distance is a valid distance metric defined on the set of transactions. Note that it incorporates the support information of patterns, as desired previously.
Pattern distance
Suppose P1 and P2 are two patterns such that and , where ti is a transaction in the database. The distance between P1 and P2 is .
Now, let's consider the expression of patterns. Given two patterns A and B, we say B can be expressed by A if O(B) ⊂ O(A), where O(A) is the corresponding itemset of pattern A. Following this definition, assume patterns are in the same cluster. The representative pattern Pr of the cluster should be able to express all the other patterns in the cluster. Clearly, we have .
Using the distance measure, we can simply apply a clustering method, such as k-means (Section 10.2), on the collection of frequent patterns. However, this introduces two problems. First, the quality of the clusters cannot be guaranteed; second, it may not be able to find a representative pattern for each cluster (i.e., the pattern Pr may not belong to the same cluster). To overcome these problems, this is where the concept of δ-cluster comes in, where δ (0 ≤ δ ≤ 1) measures the tightness of a cluster.
A pattern P is δ-covered by another pattern P′ if and Pat_Dist(P, P′) ≤ δ. A set of patterns form a δ-cluster if there exists a representative pattern Pr such that for each pattern P in the set, P is δ-covered by Pr.
Note that according to the concept of δ-cluster, a pattern can belong to multiple clusters. Also, using δ-cluster, we only need to compute the distance between each pattern and the representative pattern of the cluster. Because a pattern P is δ-covered by a representative pattern Pr only if , we can simplify the distance calculation by considering only the supports of the patterns:
(7.15)
If we restrict the representative pattern to be frequent, then the number of representative patterns (i.e., clusters) is no less than the number of maximal frequent patterns. This is because a maximal frequent pattern can only be covered by itself. To achieve more succinct compression,