Data Mining: Concepts and Techniques - Jiawei Han [133]
Figure 5.14 Shell fragment computation algorithm.
The benefit of computing local cubes for each shell fragment, instead of computing the complete cube shell, can be seen by a simple calculation. For a base cuboid of 60 dimensions partitioned into size-3 fragments, each fragment yields 2^3 − 1 = 7 local cuboids, and 60 dimensions form 20 fragments, so only 7 × 20 = 140 cuboids need to be computed. This is in contrast to the 36,050 cuboids computed for the cube shell of size 3 described earlier! Notice that the above fragment partitioning is based simply on the grouping of consecutive dimensions. A more desirable approach would be to partition based on popular dimension groupings. This information can be obtained from domain experts or the past history of OLAP queries.
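The two counts can be checked directly. The sketch below (function names are my own) counts the local cuboids per fragment (all nonempty subsets of a size-k fragment) versus the cuboids in a cube shell of size k (all dimension subsets of size at most k):

```python
from math import comb

def fragment_cuboids(n_dims, frag_size):
    """Cuboids computed under shell-fragment partitioning:
    each size-k fragment contributes 2^k - 1 local cuboids.
    Assumes frag_size divides n_dims evenly."""
    n_fragments = n_dims // frag_size
    return n_fragments * (2**frag_size - 1)

def shell_cuboids(n_dims, shell_size):
    """Cuboids in a cube shell of size k: all subsets of <= k dimensions."""
    return sum(comb(n_dims, i) for i in range(1, shell_size + 1))

print(fragment_cuboids(60, 3))  # 20 fragments x 7 cuboids = 140
print(shell_cuboids(60, 3))     # 60 + 1770 + 34220 = 36050
```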
Let's return to our running example to see how shell fragments are computed.
Compute shell fragments
Suppose we are to compute the shell fragments of size 3. We first divide the five dimensions into two fragments, namely (A, B, C) and (D, E). For each fragment, we compute the full local data cube by intersecting the TID lists in Table 5.5 in a top-down depth-first order in the cuboid lattice. For example, to compute the cell (a1, b2, ∗), we intersect the TID lists of a1 and b2 to obtain a new list of {2, 3}. Cuboid AB is shown in Table 5.6.
Table 5.6 Cuboid AB
Cell       Intersection             TID List   List Size
(a1, b1)   {1, 2, 3} ∩ {1, 4, 5}    {1}        1
(a1, b2)   {1, 2, 3} ∩ {2, 3}       {2, 3}     2
(a2, b1)   {4, 5} ∩ {1, 4, 5}       {4, 5}     2
(a2, b2)   {4, 5} ∩ {2, 3}          {}         0
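The intersections in Table 5.6 can be sketched in a few lines, using Python sets as TID lists (variable names are illustrative, not from the algorithm):

```python
# TID lists for the A and B attribute values, as given in Table 5.5.
tid = {
    "a1": {1, 2, 3}, "a2": {4, 5},
    "b1": {1, 4, 5}, "b2": {2, 3},
}

# Build cuboid AB by intersecting every (A-value, B-value) pair of TID lists.
cuboid_ab = {}
for a in ("a1", "a2"):
    for b in ("b1", "b2"):
        cell = tid[a] & tid[b]        # TID-list intersection
        if cell:                      # empty cells such as (a2, b2) are dropped
            cuboid_ab[(a, b)] = sorted(cell)

print(cuboid_ab)
# {('a1', 'b1'): [1], ('a1', 'b2'): [2, 3], ('a2', 'b1'): [4, 5]}
```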
After computing cuboid AB, we can then compute cuboid ABC by intersecting all pairwise combinations between Table 5.6 and the row c1 in Table 5.5. Notice that because cell (a2, b2) is empty, it can be effectively discarded in subsequent computations, based on the Apriori property. The same process can be applied to compute fragment (D, E), which is completely independent of the computation of (A, B, C). Cuboid DE is shown in Table 5.7.
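Continuing the sketch, cuboid ABC is obtained by intersecting each surviving AB cell with the TID list of c1; the pruned cell (a2, b2) never participates. (The dictionaries below restate the tables above; names are illustrative.)

```python
# Surviving AB cells from Table 5.6; the empty cell (a2, b2) is already pruned.
cuboid_ab = {("a1", "b1"): {1}, ("a1", "b2"): {2, 3}, ("a2", "b1"): {4, 5}}

# Dimension C has a single value c1 in the running example, covering all TIDs.
tid_c = {"c1": {1, 2, 3, 4, 5}}

cuboid_abc = {}
for (a, b), tids in cuboid_ab.items():
    for c, c_tids in tid_c.items():
        cell = tids & c_tids
        if cell:                      # Apriori pruning again: skip empty cells
            cuboid_abc[(a, b, c)] = sorted(cell)

print(cuboid_abc)
# {('a1', 'b1', 'c1'): [1], ('a1', 'b2', 'c1'): [2, 3], ('a2', 'b1', 'c1'): [4, 5]}
```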
Table 5.7 Cuboid DE
Cell       Intersection             TID List   List Size
(d1, e1)   {1, 3, 4, 5} ∩ {1, 2}    {1}        1
(d1, e2)   {1, 3, 4, 5} ∩ {3, 4}    {3, 4}     2
(d1, e3)   {1, 3, 4, 5} ∩ {5}       {5}        1
(d2, e1)   {2} ∩ {1, 2}             {2}        1
If the measure in the iceberg condition is count() (as in tuple counting), there is no need to reference the original database: the length of a cell's TID list equals its tuple count. "Do we need to reference the original database when computing other measures, such as average()?" Actually, we can build and reference an ID_measure array instead, which stores what we need to compute other measures. For example, to compute average(), we let the ID_measure array hold three elements, namely (TID, item_count, sum), for each cell (line 3 of the shell fragment computation algorithm in Figure 5.14). The average() measure for each aggregate cell can then be computed as sum()/item_count(), accessing only this ID_measure array. Consider a database with 10^6 tuples, where TID, item_count, and sum each take 4 bytes: the ID_measure array requires only 12 MB, whereas the corresponding database of 60 dimensions requires (60 + 3) × 4 × 10^6 bytes = 252 MB (assuming each attribute value also takes 4 bytes). Obviously, the ID_measure array is a more compact data structure.
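A minimal sketch of this idea follows; the measure values are made up for illustration, and the function name is my own. Each TID maps to an (item_count, sum) record, and average() for any aggregate cell is answered from this array alone, given the cell's TID list:

```python
# Hypothetical ID_measure array: one (item_count, sum) record per TID.
# For a raw fact table item_count is 1 per tuple; it generalizes to
# pre-aggregated input. The measure values here are invented for the sketch.
id_measure = {
    1: (1, 70), 2: (1, 10), 3: (1, 20), 4: (1, 40), 5: (1, 30),
}

def avg_of_cell(tid_list):
    """Compute average() for an aggregate cell from its TID list,
    touching only the ID_measure array, never the base table."""
    count = sum(id_measure[t][0] for t in tid_list)
    total = sum(id_measure[t][1] for t in tid_list)
    return total / count

print(avg_of_cell([2, 3]))  # cell (a1, b2): (10 + 20) / 2 = 15.0
```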