Data Mining_ Concepts and Techniques - Jiawei Han [150]
(b) Based on your implementation, discuss the following:
i. What challenging computation problems are encountered as the number of dimensions grows large?
ii. How can iceberg cubing solve the problems of part (a) for some data sets (and characterize such data sets)?
iii. Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
(c) Instead of computing a high-dimensionality data cube, we may choose to materialize the cuboids that have only a small number of dimension combinations. For example, for a 30-D data cube, we may only compute the 5-D cuboids for every possible 5-D combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.
5.12 The sampling cube was proposed for multidimensional analysis of sampling data (e.g., survey data). In many real applications, sampling data can be of high dimensionality (e.g., it is not unusual to have more than 50 dimensions in a survey data set).
(a) How can we construct an efficient and scalable high-dimensional sampling cube in large sampling data sets?
(b) Design an efficient incremental update algorithm for such a high-dimensional sampling cube.
(c) Discuss how to support quality drill-down given that some low-level cells may be empty or contain too few data for reliable analysis.
5.13 The ranking cube was proposed for efficient computation of top-k (ranking) queries in relational databases. Recently, researchers have proposed another kind of query, called a skyline query. A skyline query returns all the objects pi such that pi is not dominated by any other object pj, where dominance is defined as follows. Let the value of pi on dimension d be v(pi, d). We say pi is dominated by pj if and only if for each preference dimension d, v(pj, d) ≤ v(pi, d), and there is at least one d where the equality does not hold.
(a) Design a ranking cube so that skyline queries can be processed efficiently.
(b) Skyline queries are sometimes too strict to be desirable to some users. One may generalize the concept of skyline into generalized skyline as follows: Given a d-dimensional database and a query q, the generalized skyline is the set of the following objects: (1) the skyline objects and (2) the nonskyline objects that are ϵ-neighbors of a skyline object, where r is an ϵ-neighbor of an object p if the distance between p and r is no more than ϵ. Design a ranking cube to process generalized skyline queries efficiently.
5.14 The ranking cube was designed to support top-k (ranking) queries in relational database systems. However, ranking queries are also posed to data warehouses, where ranking is on multidimensional aggregates instead of on measures of base facts. For example, consider a product manager who is analyzing a sales database that stores the nationwide sales history, organized by location and time. To make investment decisions, the manager may pose the following query: “What are the top-10 (state, year) cells having the largest total product sales?” He may further drill down and ask, “What are the top-10 (city, month) cells?” Suppose the system can perform such partial materialization to derive two types of materialized cuboids: a guiding cuboid and a supporting cuboid, where the former contains a number of guiding cells that provide concise, high-level data statistics to guide the ranking query processing, whereas the latter provides inverted indices for efficient online aggregation.
(a) Derive an efficient method for computing such aggregate ranking cubes.
(b) Extend your framework to handle more advanced measures. One such example could be as follows. Consider an organization donation database, where donors are grouped by “age,” “income,” and other attributes. Interesting questions include: “Which age and income groups have made the top-k average amount of donation