Data Mining_ Concepts and Techniques - Jiawei Han [148]
■ Techniques for processing advanced queries have been proposed that take advantage of cube technology. These include sampling cubes for multidimensional analysis on sampling data, and ranking cubes for efficient top-k (ranking) query processing in large relational data sets.
■ This chapter highlighted three approaches to multidimensional data analysis with data cubes. Prediction cubes compute prediction models in multidimensional cube space. They help users identify interesting data subsets at varying degrees of granularity for effective prediction. Multifeature cubes compute complex queries involving multiple dependent aggregates at multiple granularities. Exception-based, discovery-driven exploration of cube space displays visual cues to indicate discovered data exceptions at all aggregation levels, thereby guiding the user in the data analysis process.
5.6. Exercises
5.1 Assume that a 10-D base cuboid contains only three base cells: (1) (a1, d2, d3, d4, …, d9, d10), (2) (d1, b2, d3, d4, …, d9, d10), and (3) (d1, d2, c3, d4, …, d9, d10), where a1 ≠ d1, b2 ≠ d2, and c3 ≠ d3. The measure of the cube is count().
(a) How many nonempty cuboids will a full data cube contain?
(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is “count ≥ 2”?
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is obtained by replacing a ∗ in c by a non-∗ value) and d has the same measure value as c. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
5.2 There are several typical cube computation methods, such as MultiWay[ZDN97], BUC[BR99] and Star-Cubing[XHLW03]. Briefly describe these three methods (i.e., use one or two lines to outline the key points), and compare their feasibility and performance under the following conditions:
(a) Computing a dense full cube of low dimensionality (e.g., less than eight dimensions).
(b) Computing an iceberg cube of around 10 dimensions with a highly skewed data distribution.
(c) Computing a sparse iceberg cube of high dimensionality (e.g., over 100 dimensions).
5.3 Suppose a data cube, C, has D dimensions, and the base cuboid contains k distinct tuples.
(a) Present a formula to calculate the minimum number of cells that the cube, C, may contain.
(b) Present a formula to calculate the maximum number of cells that C may contain.
(c) Answer parts (a) and (b) as if the count in each cube cell must be no less than a threshold, v.
(d) Answer parts (a) and (b) as if only closed cells are considered (with the minimum count threshold, v).
5.4 Suppose that a base cuboid has three dimensions, A, B, C, with the following number of cells: |A| = 1,000,000, |B| = 100, and |C| = 1000. Suppose that each dimension is evenly partitioned into 10 portions for chunking.
(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
(b) If each cube cell stores one measure with four bytes, what is the total size of the computed cube if the cube is dense?
(c) State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.
5.5 Often, the aggregate count value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data