Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [122]

By Root 1637 0
uninteresting cells to compute. For example, suppose that there are 2 base cells for a database of 100 dimensions, denoted as {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10}, where each has a cell count of 10. If the minimum support is set to 10, there will still be an impermissible number of cells to compute and store, although most of them are not interesting. For example, there are 2101 − 6 distinct aggregate cells, 2 like {(a1, a2, a3, a4, …, a99, ∗) : 10, …, (a1, a2, ∗, a4, …, a99, a100) : 10, …, (a1, a2, a3, ∗, …, ∗, ∗) : 10}, but most of them do not contain much new information. If we ignore all the aggregate cells that can be obtained by replacing some constants by ∗'s while keeping the same measure value, there are only three distinct cells left: {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10, (a1, a2, ∗, …, ∗) : 20}. That is, out of 2101 − 4 distinct base and aggregate cells, only three really offer valuable information.

2The proof is left as an exercise for the reader.

To systematically compress a data cube, we need to introduce the concept of closed coverage. A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization (descendant) of cell c (i.e., where d is obtained by replacing ∗ in c with a non-∗ value), and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. For example, the three cells derived in the preceding paragraph are the three closed cells of the data cube for the data set {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10}. They form the lattice of a closed cube as shown in Figure 5.2. Other nonclosed cells can be derived from their corresponding closed cells in this lattice. For example, “(a1, ∗, ∗, …, ∗) : 20” can be derived from “(a1, a2, ∗, …, ∗) : 20” because the former is a generalized nonclosed cell of the latter. Similarly, we have “(a1, a2, b3, ∗, …, ∗) : 10.”

Figure 5.2 Three closed cells forming the lattice of a closed cube.

Another strategy for partial materialization is to precompute only the cuboids involving a small number of dimensions such as three to five. These cuboids form a cube shell for the corresponding data cube. Queries on additional combinations of the dimensions will have to be computed on-the-fly. For example, we could compute all cuboids with three dimensions or less in an n-dimensional data cube, resulting in a cube shell of size 3. This, however, can still result in a large number of cuboids to compute, particularly when n is large. Alternatively, we can choose to precompute only portions or fragments of the cube shell based on cuboids of interest. Section 5.2.4 discusses a method for computing shell fragments and explores how they can be used for efficient OLAP query processing.

5.1.2. General Strategies for Data Cube Computation

There are several methods for efficient data cube computation, based on the various kinds of cubes described in Section 5.1.1. In general, there are two basic data structures used for storing cuboids. The implementation of relational OLAP (ROLAP) uses relational tables, whereas multidimensional arrays are used in multidimensional OLAP (MOLAP). Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization “tricks” can be shared among the different data representations. The following are general optimization techniques for efficient computation of data cubes.

Optimization Technique 1: Sorting, hashing, and grouping. Sorting, hashing, and grouping operations should be applied to the dimension attributes to reorder and cluster related tuples.

In cube computation, aggregation is performed on the tuples (or cells) that share the same set of dimension values. Thus, it is important to explore sorting, hashing, and grouping operations to access and group such data together to facilitate computation of such aggregates.

To compute total sales by branch, day, and item, for example, it can be more efficient to sort tuples or cells by branch, and then by day, and then group them according to the item name.

Return Main Page Previous Page Next Page

®Online Book Reader