Data Mining_ Concepts and Techniques - Jiawei Han [130]
Star-tree construction
A base cuboid table is shown in Table 5.1. There are five tuples and four dimensions. The cardinalities for dimensions A, B, C, D are 2, 4, 4, 4, respectively. The one-dimensional aggregates for all attributes are shown in Table 5.2. Suppose min_sup = 2 in the iceberg condition. Clearly, only attribute values a1, a2, b1, c3, d4 satisfy the condition. All other values are below the threshold and thus become star-nodes. By collapsing star-nodes, the reduced base table is Table 5.3. Notice that the table contains two fewer rows and also fewer distinct values than Table 5.1.
Table 5.1 Base (Cuboid) Table: Before Star Reduction
ABCDcount
a1 b1 c1 d1 1
a1 b1 c4 d3 1
a1 b2 c2 d2 1
a2 b3 c3 d4 1
a2 b4 c3 d4 1
Table 5.2 One-Dimensional Aggregates
Dimensioncount = 1count ≥ 2
A — a1(3), a2(2)
B b2, b3, b4 b1(2)
C c1, c2, c4 c3(2)
D d1, d2, d3 d4(2)
Table 5.3 Compressed Base Table: After Star Reduction
ABCDcount
a1 b1 ∗ ∗ 2
a1 ∗ ∗ ∗ 1
a2 ∗ c3 d4 2
We use the reduced base table to construct the cuboid tree because it is smaller. The resultant star-tree is shown in Figure 5.10.
Figure 5.10 Compressed base table star-tree.
Now, let's see how the Star-Cubing algorithm uses star-trees to compute an iceberg cube. The algorithm is given later in Figure 5.13.
Figure 5.13 Star-Cubing algorithm.
Star-Cubing
Using the star-tree generated in Example 5.7 (Figure 5.10), we start the aggregation process by traversing in a bottom-up fashion. Traversal is depth-first. The first stage (i.e., the processing of the first branch of the tree) is shown in Figure 5.11. The leftmost tree in the figure is the base star-tree. Each attribute value is shown with its corresponding aggregate value. In addition, subscripts by the nodes in the tree show the traversal order. The remaining four trees are BCD, ACD/A, ABD/AB, and ABC/ABC. They are the child trees of the base star-tree, and correspond to the level of 3-D cuboids above the base cuboid in Figure 5.8. The subscripts in them correspond to the same subscripts in the base tree—they denote the step or order in which they are created during the tree traversal. For example, when the algorithm is at step 1, the BCD child tree root is created. At step 2, the ACD/A child tree root is created. At step 3, the ABD/AB tree root and the b∗ node in BCD are created.
Figure 5.11 Aggregation stage one: processing the leftmost branch of the base tree.
When the algorithm has reached step 5, the trees in memory are exactly as shown in Figure 5.11. Because depth-first traversal has reached a leaf at this point, it starts backtracking. Before traversing back, the algorithm notices that all possible nodes in the base dimension (ABC) have been visited. This means the ABC/ABC tree is complete, so the count is output and the tree is destroyed. Similarly, upon moving back from d∗ to c∗ and seeing that c∗ has no siblings, the count in ABD/AB is also output and the tree is destroyed.
When the algorithm is at b∗ during the backtraversal, it notices that there exists a sibling in b1. Therefore, it will keep ACD/A in memory and perform a depth-first search on b1 just as it did on b∗. This traversal and the resultant trees are shown in Figure 5.12. The child trees ACD/A and ABD/AB are created again but now with the new values from the b1 subtree. For example, notice that the aggregate count of c∗ in the ACD/A tree has increased from 1 to 3. The trees that remained intact during the last traversal are reused and the new aggregate values are added on. For instance, another branch is added to the BCD tree.
Figure 5.12 Aggregation stage two: processing the second branch of the base tree.
Just like before, the algorithm will reach a leaf node at d∗ and traverse back. This time, it will reach a1 and notice that there exists a sibling in a2. In this case, all child trees except