Data Mining_ Concepts and Techniques - Jiawei Han [142]
Table 5.11 Cuboid of a Ranking Cube for Used-Car Sales
producertypepricemileagecountTIDs
Ford sedan <5K 30–40K 7 t6, …, t68
Ford sedan 5–10K 30–40K 50 t15, …, t152
Honda sedan 10–15K 30–40K 20 t8, …, t32
… … … … … …
Query Q1 can be answered by using a selection condition to select the appropriate selection dimension values (i.e., producer = “Ford” and type = “sedan”) in cuboid CMT. In addition, the ranking function “(price − 10 K)2 + (mileage − 30 K)2” is used to find the tuples that most closely match the criteria. If there are not enough matching tuples found in the closest matching cells, the next closest matching cells will need to be accessed. We may even drill down to the corresponding lower-level cells to see the count distributions of cells that match the ranking function and additional criteria regarding, say, model, maintenance situation, or other loaded features. Only users who really want to see more detailed information, such as interior photos, will need to access the physical records stored in the database.
Most real-life top-k queries are likely to involve only a small subset of selection attributes. To support high-dimensional ranking cubes, we can carefully select the cuboids that need to be materialized. For example, we could choose to materialize only the cuboids that contain single-selection dimensions. This will achieve low space overhead and still have high performance when the number of selection dimensions is large. In some cases, there may exist many ranking dimensions to support multiple users with rather different preferences. For example, buyers may search for houses by considering various factors like price, distance to school or shopping, number of years old, floor space, and tax. In this case, a possible solution is to create multiple data partitions, each of which consists of a subset of the ranking dimensions. The query processing may need to search over a joint space involving multiple data partitions.
In summary, the general philosophy of ranking cubes is to materialize such cubes on the set of selection dimensions. Use of the interval-based partitioning in ranking dimensions makes the ranking cube efficient and flexible at supporting ad hoc user queries. Various implementation techniques and query optimization methods have been developed for efficient computation and query processing based on this framework.
5.4. Multidimensional Data Analysis in Cube Space
Data cubes create a flexible and powerful means to group and aggregate data subsets. They allow data to be explored in multiple dimensional combinations and at varying aggregate granularities. This capability greatly increases the analysis bandwidth and helps effective discovery of interesting patterns and knowledge from data. The use of cube space makes the data space both meaningful and tractable.
This section presents methods of multidimensional data analysis that make use of data cubes to organize data into intuitive regions of interest at varying granularities. Section 5.4.1 presents prediction cubes, a technique for multidimensional data mining that facilitates predictive modeling in multidimensional space. Section 5.4.2 describes how to construct multifeature cubes. These support complex analytical queries involving multiple dependent aggregates at multiple granularities. Finally, Section 5.4.3 describes an interactive method for users to systematically explore cube space. In such exception-based, discovery-driven exploration, interesting exceptions or anomalies in the data are automatically detected and marked for users with visual cues.
5.4.1. Prediction Cubes: Prediction Mining in Cube Space
Recently, researchers have turned their attention toward multidimensional data mining to uncover knowledge at varying dimensional combinations and granularities. Such mining is also known as exploratory multidimensional data mining and online analytical data mining (OLAM). Multidimensional data space is huge. In preparing the data, how can we identify the interesting subspaces