Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [120]

By Root 1703 0
a group-by. ABC is the base cuboid, containing all three of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the three dimensions. The base cuboid is the least generalized of all the cuboids in the data cube. The most generalized cuboid is the apex cuboid, commonly represented as all. It contains one value—it aggregates measure M for all the tuples stored in the base cuboid. To drill down in the data cube, we move from the apex cuboid downward in the lattice. To roll up, we move from the base cuboid upward. For the purposes of our discussion in this chapter, we will always use the term data cube to refer to a lattice of cuboids rather than an individual cuboid.

Figure 5.1 Lattice of cuboids making up a 3-D data cube with the dimensions A, B, and C for some aggregate measure, M.

A cell in the base cuboid is a base cell. A cell from a nonbase cuboid is an aggregate cell. An aggregate cell aggregates over one or more dimensions, where each aggregated dimension is indicated by a ∗ in the cell notation. Suppose we have an n-dimensional data cube. Let a = (a1, a2, …, an, measures) be a cell from one of the cuboids making up the data cube. We say that a is an m-dimensional cell (i.e., from an m-dimensional cuboid) if exactly m (m ≤ n) values among {a1, a2, …, an} are not ∗. If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n).

Base and aggregate cells

Consider a data cube with the dimensions month, city, and customer_group, and the measure sales. (Jan, ∗, ∗, 2800) and (∗, Chicago, ∗, 1200) are 1-D cells; (Jan, ∗, Business, 150) is a 2-D cell; and (Jan, Chicago, Business, 45) is a 3-D cell. Here, all base cells are 3-D, whereas 1-D and 2-D cells are aggregate cells.


An ancestor–descendant relationship may exist between cells. In an n-dimensional data cube, an i-D cell a = (a1, a2, …, an, measuresa) is an ancestor of a j-D cell b = (b1, b2, …, bn, measuresb), and b is a descendant of a, if and only if (1) i < j, and (2) for 1 ≤ k ≤ n, ak = bk whenever ak ≠ ∗. In particular, cell a is called a parent of cell b, and b is a child of a, if and only if j = i + 1.

Ancestor and descendant cells

Referring to Example 5.1, 1-D cell a = (Jan, ∗, ∗, 2800) and 2-D cell b = (Jan, ∗, Business, 150) are ancestors of 3-D cell c = (Jan, Chicago, Business, 45); c is a descendant of both a and b; b is a parent of c; and c is a child of b.


To ensure fast OLAP, it is sometimes desirable to precompute the full cube (i.e., all the cells of all the cuboids for a given data cube). A method of full cube computation is given in Section 5.2.1. Full cube computation, however, is exponential to the number of dimensions. That is, a data cube of n dimensions contains 2n cuboids. There are even more cuboids if we consider concept hierarchies for each dimension. 1 In addition, the size of each cuboid depends on the cardinality of its dimensions. Thus, precomputation of the full cube can require huge and often excessive amounts of memory.

1Eq. (4.1) of Section 4.4.1 gives the total number of cuboids in a data cube where each dimension has an associated concept hierarchy.

Nonetheless, full cube computation algorithms are important. Individual cuboids may be stored on secondary storage and accessed when necessary. Alternatively, we can use such algorithms to compute smaller cubes, consisting of a subset of the given set of dimensions, or a smaller range of possible values for some of the dimensions. In these cases, the smaller cube is a full cube for the given subset of dimensions and/or dimension values. A thorough understanding of full cube computation methods will help us develop efficient methods for computing partial cubes. Hence, it is important to explore scalable methods for computing all the cuboids making up a data cube, that is, for full materialization. These methods must take into consideration the limited amount of main memory available for cuboid computation, the total size of the computed data cube, as well as the time required for such computation.

Partial

Return Main Page Previous Page Next Page

®Online Book Reader