Data Mining: Concepts and Techniques - Jiawei Han [133]
Figure 5.14 Shell fragment computation algorithm.
The benefit of computing local cubes for each shell fragment, instead of computing the complete cube shell, can be seen by a simple calculation. For a base cuboid of 60 dimensions partitioned into size-3 fragments, each fragment yields 2^3 − 1 = 7 local cuboids, and 60 dimensions form 20 fragments, so only 7 × 20 = 140 cuboids need to be computed. This is in contrast to the 36,050 cuboids computed for the cube shell of size 3 described earlier! Notice that the above fragment partitioning is based simply on the grouping of consecutive dimensions. A more desirable approach would be to partition based on popular dimension groupings. This information can be obtained from domain experts or the past history of OLAP queries.
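The two counts can be checked directly. The sketch below (function names are my own) counts the local cuboids per fragment (all nonempty subsets of a size-k fragment) versus the cuboids in a cube shell of size k (all dimension subsets of size at most k):

```python
from math import comb

def fragment_cuboids(n_dims, frag_size):
    """Cuboids computed under shell-fragment partitioning:
    each size-k fragment contributes 2^k - 1 local cuboids.
    Assumes frag_size divides n_dims evenly."""
    n_fragments = n_dims // frag_size
    return n_fragments * (2**frag_size - 1)

def shell_cuboids(n_dims, shell_size):
    """Cuboids in a cube shell of size k: all subsets of <= k dimensions."""
    return sum(comb(n_dims, i) for i in range(1, shell_size + 1))

print(fragment_cuboids(60, 3))  # 20 fragments x 7 cuboids = 140
print(shell_cuboids(60, 3))     # 60 + 1770 + 34220 = 36050
```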
Let's return to our running example to see how shell fragments are computed.
Compute shell fragments
Suppose we are to compute the shell fragments of size 3. We first divide the five dimensions into two fragments, namely (A, B, C) and (D, E). For each fragment, we compute the full local data cube by intersecting the TID lists in Table 5.5 in a top-down depth-first order in the cuboid lattice. For example, to compute the cell (a1, b2, ∗), we intersect the TID lists of a1 and b2 to obtain a new list of {2, 3}. Cuboid AB is shown in Table 5.6.
Table 5.6 Cuboid AB
Cell       Intersection             TID List   List Size
(a1, b1)   {1, 2, 3} ∩ {1, 4, 5}    {1}        1
(a1, b2)   {1, 2, 3} ∩ {2, 3}       {2, 3}     2
(a2, b1)   {4, 5} ∩ {1, 4, 5}       {4, 5}     2
(a2, b2)   {4, 5} ∩ {2, 3}          {}         0
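The intersections in Table 5.6 can be sketched in a few lines, using Python sets as TID lists (variable names are illustrative, not from the algorithm):

```python
# TID lists for the A and B attribute values, as given in Table 5.5.
tid = {
    "a1": {1, 2, 3}, "a2": {4, 5},
    "b1": {1, 4, 5}, "b2": {2, 3},
}

# Build cuboid AB by intersecting every (A-value, B-value) pair of TID lists.
cuboid_ab = {}
for a in ("a1", "a2"):
    for b in ("b1", "b2"):
        cell = tid[a] & tid[b]        # TID-list intersection
        if cell:                      # empty cells such as (a2, b2) are dropped
            cuboid_ab[(a, b)] = sorted(cell)

print(cuboid_ab)
# {('a1', 'b1'): [1], ('a1', 'b2'): [2, 3], ('a2', 'b1'): [4, 5]}
```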
After computing cuboid AB, we can then compute cuboid ABC by intersecting all pairwise combinations between Table 5.6 and the row c1 in Table 5.5. Notice that because cell (a2, b2) is empty, it can be effectively discarded in subsequent computations, based on the Apriori property. The same process can be applied to compute fragment (D, E), which is completely independent of the computation of (A, B, C). Cuboid DE is shown in Table 5.7.
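Continuing the sketch, cuboid ABC is obtained by intersecting each surviving AB cell with the TID list of c1; the pruned cell (a2, b2) never participates. (The dictionaries below restate the tables above; names are illustrative.)

```python
# Surviving AB cells from Table 5.6; the empty cell (a2, b2) is already pruned.
cuboid_ab = {("a1", "b1"): {1}, ("a1", "b2"): {2, 3}, ("a2", "b1"): {4, 5}}

# Dimension C has a single value c1 in the running example, covering all TIDs.
tid_c = {"c1": {1, 2, 3, 4, 5}}

cuboid_abc = {}
for (a, b), tids in cuboid_ab.items():
    for c, c_tids in tid_c.items():
        cell = tids & c_tids
        if cell:                      # Apriori pruning again: skip empty cells
            cuboid_abc[(a, b, c)] = sorted(cell)

print(cuboid_abc)
# {('a1', 'b1', 'c1'): [1], ('a1', 'b2', 'c1'): [2, 3], ('a2', 'b1', 'c1'): [4, 5]}
```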
Table 5.7 Cuboid DE
Cell       Intersection             TID List   List Size
(d1, e1)   {1, 3, 4, 5} ∩ {1, 2}    {1}        1
(d1, e2)   {1, 3, 4, 5} ∩ {3, 4}    {3, 4}     2
(d1, e3)   {1, 3, 4, 5} ∩ {5}       {5}        1
(d2, e1)   {2} ∩ {1, 2}             {2}        1
If the measure in the iceberg condition is count() (as in tuple counting), there is no need to reference the original database: the length of a cell's TID list equals its tuple count. "Do we need to reference the original database when computing other measures, such as average()?" Actually, we can build and reference an ID_measure array instead, which stores what we need to compute other measures. For example, to compute average(), we let the ID_measure array hold three elements, namely (TID, item_count, sum), for each cell (line 3 of the shell fragment computation algorithm in Figure 5.14). The average() measure for each aggregate cell can then be computed as sum()/item_count(), accessing only this ID_measure array. Consider a database with 10^6 tuples, where TID, item_count, and sum each take 4 bytes: the ID_measure array requires only 12 MB, whereas the corresponding database of 60 dimensions requires (60 + 3) × 4 × 10^6 bytes = 252 MB (assuming each attribute value also takes 4 bytes). Obviously, the ID_measure array is a more compact data structure.
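A minimal sketch of this idea follows; the measure values are made up for illustration, and the function name is my own. Each TID maps to an (item_count, sum) record, and average() for any aggregate cell is answered from this array alone, given the cell's TID list:

```python
# Hypothetical ID_measure array: one (item_count, sum) record per TID.
# For a raw fact table item_count is 1 per tuple; it generalizes to
# pre-aggregated input. The measure values here are invented for the sketch.
id_measure = {
    1: (1, 70), 2: (1, 10), 3: (1, 20), 4: (1, 40), 5: (1, 30),
}

def avg_of_cell(tid_list):
    """Compute average() for an aggregate cell from its TID list,
    touching only the ID_measure array, never the base table."""
    count = sum(id_measure[t][0] for t in tid_list)
    total = sum(id_measure[t][1] for t in tid_list)
    return total / count

print(avg_of_cell([2, 3]))  # cell (a1, b2): (10 + 20) / 2 = 15.0
```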