Data Mining_ Concepts and Techniques - Jiawei Han [119]
Although the data cube concept was originally intended for OLAP, it is also useful for data mining. Multidimensional data mining is an approach to data mining that integrates OLAP-based data analysis with knowledge discovery techniques. It is also known as exploratory multidimensional data mining and online analytical mining (OLAM). It searches for interesting patterns by exploring the data in multidimensional space. This gives users the freedom to dynamically focus on any subset of interesting dimensions. Users can interactively drill down or roll up to varying abstraction levels to find classification models, clusters, predictive rules, and outliers.
This chapter focuses on data cube technology. In particular, we study methods for data cube computation and methods for multidimensional data analysis. Precomputing a data cube (or parts of a data cube) allows for fast accessing of summarized data. Given the high dimensionality of most data, multidimensional analysis can run into performance bottlenecks. Therefore, it is important to study data cube computation techniques. Luckily, data cube technology provides many effective and scalable methods for cube computation. Studying these methods will also help in our understanding and further development of scalable methods for other data mining tasks such as the discovery of frequent patterns (Chapter 6 and Chapter 7).
We begin in Section 5.1 with preliminary concepts for cube computation. These summarize the data cube notion as a lattice of cuboids, and describe basic forms of cube materialization. General strategies for cube computation are given. Section 5.2 follows with an in-depth look at specific methods for data cube computation. We study both full materialization (i.e., where all the cuboids representing a data cube are precomputed and thereby ready for use) and partial cuboid materialization (where, say, only the more “useful” parts of the data cube are precomputed). The multiway array aggregation method is detailed for full cube computation. Methods for partial cube computation, including BUC, Star-Cubing, and the use of cube shell fragments, are discussed.
In Section 5.3, we study cube-based query processing. The techniques described build on the standard methods of cube computation presented in Section 5.2. You will learn about sampling cubes for OLAP query answering on sampling data (e.g., survey data, which represent a sample or subset of a target data population of interest). In addition, you will learn how to compute ranking cubes for efficient top-k (ranking) query processing in large relational data sets.
In Section 5.4, we describe various ways to perform multidimensional data analysis using data cubes. Prediction cubes are introduced, which facilitate predictive modeling in multidimensional space. We discuss multifeature cubes, which compute complex queries involving multiple dependent aggregates at multiple granularities. You will also learn about the exception-based discovery-driven exploration of cube space, where visual cues are displayed to indicate discovered data exceptions at all aggregation levels, thereby guiding the user in the data analysis process.
5.1. Data Cube Computation: Preliminary Concepts
Data cubes facilitate the online analytical processing of multidimensional data. “But how can we compute data cubes in advance, so that they are handy and readily available for query processing?” This section contrasts full cube materialization (i.e., precomputation) versus various strategies for partial cube materialization. For completeness, we begin with a review of the basic terminology involving data cubes. We also introduce a cube cell notation that is useful for describing data cube computation methods.
5.1.1. Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell
Figure 5.1 shows a 3-D data cube for the dimensions A, B, and C, and an aggregate measure, M. Commonly used measures include count(), sum(), min(), max(), and total_sales(). A data cube is a lattice of cuboids. Each cuboid represents