Data Mining_ Concepts and Techniques - Jiawei Han [189]
Figure 7.6 A simple colossal patterns example: The data set contains an exponential number of midsize patterns of size 20 but only one that is colossal, namely (41, 42, …, 79).
Figure 7.7 Synthetic data that contain some colossal patterns but exponentially many midsize patterns.
All of the pattern mining strategies we have studied so far, such as Apriori and FP-growth, use an incremental growth strategy by nature, that is, they increase the length of candidate patterns by one at a time. Breadth-first search methods like Apriori cannot bypass the generation of an explosive number of midsize patterns generated, making it impossible to reach colossal patterns. Even depth-first search methods like FP-growth can be easily trapped in a huge amount of subtrees before reaching colossal patterns. Clearly, a completely new mining methodology is needed to overcome such a hurdle.
A new mining strategy called Pattern-Fusion was developed, which fuses a small number of shorter frequent patterns into colossal pattern candidates. It thereby takes leaps in the pattern search space and avoids the pitfalls of both breadth-first and depth-first searches. This method finds a good approximation to the complete set of colossal frequent patterns.
The Pattern-Fusion method has the following major characteristics. First, it traverses the tree in a bounded-breadth way. Only a fixed number of patterns in a bounded-size candidate pool are used as starting nodes to search downward in the pattern tree. As such, it avoids the problem of exponential search space.
Second, Pattern-Fusion has the capability to identify “shortcuts” whenever possible. Each pattern's growth is not performed with one-item addition, but with an agglomeration of multiple patterns in the pool. These shortcuts direct Pattern-Fusion much more rapidly down the search tree toward the colossal patterns. Figure 7.8 conceptualizes this mining model.
Figure 7.8 Pattern tree traversal: Candidates are taken froma pool of patterns, which results in shortcuts through pattern space to the colossal patterns.
As Pattern-Fusion is designed to give an approximation to the colossal patterns, a quality evaluation model is introduced to assess the patterns returned by the algorithm. An empirical study verifies that Pattern-Fusion is able to efficiently return high-quality results.
Let's examine the Pattern-Fusion method in more detail. First, we introduce the concept of core pattern. For a pattern α, an itemset β ⊆ α is said to be a τ-core pattern of α if , 0 < τ ≤ 1, where |Dα| is the number of patterns containing α in database D. τ is called the core ratio. A pattern α is (d, τ)-robust if d is the maximum number of items that can be removed from α for the resulting pattern to remain a τ-core pattern of α, that is,
Core patterns
Figure 7.9 shows a simple transaction database of four distinct transactions, each with 100 duplicates: . If we set τ = 0.5, then (ab) is a core pattern of α1 because (ab) is contained only by α1 and α4. Therefore, . α1 is (2, 0.5)-robust while α4 is (4, 0.5)-robust. The table also shows that larger patterns (e.g., (abcfe)) have far more core patterns than smaller ones (e.g., (bcf)).
Figure 7.9 A transaction database, which contains duplicates, and core patterns for each distinct transaction.
From Example 7.11, we can deduce that large or colossal patterns have far more core patterns than smaller patterns do. Thus, a colossal pattern is more robust in the sense that if a small number of items are removed from the pattern, the resulting