Data Mining: Concepts and Techniques, Jiawei Han
Figure 5.7 illustrates how the input is partitioned first according to the different attribute values of dimension A, and then B, C, and D. To do so, BUC scans the input, aggregating the tuples to obtain a count for all, corresponding to the cell (∗, ∗, ∗, ∗). Dimension A is used to split the input into four partitions, one for each distinct value of A. The number of tuples (counts) for each distinct value of A is recorded in dataCount.
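As an illustrative sketch (the data values and variable names here are hypothetical, not the book's pseudocode), the first scan can be pictured as counting all tuples for the (∗, ∗, ∗, ∗) cell and then tallying each distinct A value into dataCount:

```python
from collections import Counter

# Hypothetical 4-D data set; tuples are (A, B, C, D).
data = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c1", "d2"),
        ("a2", "b1", "c2", "d1"), ("a1", "b1", "c2", "d2")]

total = len(data)                        # count for the cell (*, *, *, *)
dataCount = Counter(t[0] for t in data)  # count per distinct value of A
# total == 4; dataCount == {"a1": 3, "a2": 1}
```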
Figure 5.7 BUC partitioning snapshot given an example 4-D data set.
BUC uses the Apriori property to save time while searching for tuples that satisfy the iceberg condition. Starting with dimension A's first value, a1, the a1 partition is aggregated, creating one tuple for the A group-by, corresponding to the cell (a1, ∗, ∗, ∗). Suppose (a1, ∗, ∗, ∗) satisfies the minimum support; in that case, a recursive call is made on the a1 partition, which BUC partitions on dimension B. It checks the count of (a1, b1, ∗, ∗) to see whether it satisfies the minimum support. If it does, it outputs the aggregated tuple to the AB group-by and recurses on (a1, b1, ∗, ∗) to partition on C, starting with c1. Suppose the count for cell (a1, b1, c1, ∗) is 2, which does not satisfy the minimum support. By the Apriori property, if a cell does not satisfy the minimum support, then neither can any of its descendants. Therefore, BUC prunes any further exploration of (a1, b1, c1, ∗); that is, it avoids partitioning this cell on dimension D. It backtracks to the (a1, b1) partition and recurses on (a1, b1, c2, ∗), and so on. By checking the iceberg condition before each recursive call, BUC saves a great deal of processing time whenever a cell's count does not satisfy the minimum support.
The partition process is facilitated by a linear sorting method, CountingSort. CountingSort is fast because it does not perform any key comparisons to find partition boundaries. In addition, the counts computed during the sort can be reused to compute the group-by's in BUC. Line 2 is an optimization for partitions having a count of 1 such as (a1, b2, ∗, ∗) in our example. To save on partitioning costs, the count is written to each of the tuple's descendant group-by's. This is particularly useful since, in practice, many partitions have a single tuple.
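A comparison-free partitioning pass in the spirit of CountingSort can be sketched as below. The function and parameter names are illustrative assumptions; it presumes each dimension value maps to a small integer code. One pass counts each value's frequency, a prefix sum yields the partition boundaries, and a second pass places the tuples, with the counts reusable as group-by counts.

```python
def counting_sort_partition(tuples, dim, cardinality, code):
    """Partition `tuples` on `dim` without key comparisons.
    Returns the reordered tuples, the per-value counts, and the
    start offset of each value's partition."""
    counts = [0] * cardinality
    for t in tuples:
        counts[code(t[dim])] += 1
    # Prefix sums mark where each value's partition begins.
    starts = [0] * cardinality
    for v in range(1, cardinality):
        starts[v] = starts[v - 1] + counts[v - 1]
    out = [None] * len(tuples)
    pos = starts[:]
    for t in tuples:
        v = code(t[dim])
        out[pos[v]] = t
        pos[v] += 1
    return out, counts, starts

data = [("a2", "b1"), ("a1", "b2"), ("a3", "b1"), ("a1", "b1")]
sorted_tuples, counts, starts = counting_sort_partition(
    data, dim=0, cardinality=3, code=lambda v: int(v[1:]) - 1)
# counts == [2, 1, 1]: a1's partition is sorted_tuples[0:2], and so on.
```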
BUC's performance is sensitive to the order of the dimensions and to skew in the data. Ideally, the most discriminating dimensions should be processed first, that is, in order of decreasing cardinality. The higher the cardinality, the smaller the partitions and the more of them there will be, giving BUC a greater opportunity for pruning. Similarly, the more uniform a dimension (i.e., the less skew it has), the better it is for pruning.
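The ordering heuristic can be sketched as a small helper (a hypothetical name, not part of BUC itself) that ranks dimensions by decreasing cardinality:

```python
def dimension_order(tuples, num_dims):
    """Rank dimensions by decreasing cardinality, so the most
    discriminating dimensions are partitioned first."""
    cards = [len({t[d] for t in tuples}) for d in range(num_dims)]
    return sorted(range(num_dims), key=lambda d: -cards[d])

# Dimension 0 has 4 distinct values; dimensions 1 and 2 have 2 each.
data = [("a1", "b1", "c1"), ("a2", "b1", "c2"),
        ("a3", "b2", "c1"), ("a4", "b2", "c2")]
order = dimension_order(data, 3)  # dimension 0 comes first
```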
BUC's major contribution is the idea of sharing partitioning costs. However, unlike MultiWay, it does not share the computation of aggregates between parent and child group-by's. For example, the computation of cuboid AB does not help that of ABC. The latter needs to be computed essentially from scratch.
5.2.3. Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
In this section, we describe the Star-Cubing algorithm for computing iceberg cubes. Star-Cubing combines the strengths of the other methods we have studied up to this point. It integrates top-down and bottom-up cube computation and explores both multidimensional aggregation (similar to MultiWay) and Apriori-like pruning (similar to BUC). It operates from a data structure called a star-tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.
The Star-Cubing algorithm explores both the bottom-up and top-down computation models as follows: On the global computation order, it uses the bottom-up