There are at least four ways in which OLAP-style analysis can be fused with data mining techniques:
1. Use cube space to define the data space for mining. Each region in cube space represents a subset of data over which we wish to find interesting patterns. Cube space is defined by a set of expert-designed, informative dimension hierarchies, not just arbitrary subsets of data. Therefore, the use of cube space makes the data space both meaningful and tractable.
2. Use OLAP queries to generate features and targets for mining. The features and even the targets (that we wish to learn to predict) can sometimes be naturally defined as OLAP aggregate queries over regions in cube space (see the sketch following this list).
3. Use data mining models as building blocks in a multistep mining process. Multidimensional data mining in cube space may consist of multiple steps, where data mining models can be viewed as building blocks that are used to describe the behavior of interesting data sets, rather than as the end results.
4. Use data cube computation techniques to speed up repeated model construction. Multidimensional data mining in cube space may require building a model for each candidate data space, which is usually too expensive to be feasible. However, by carefully sharing computation across model construction for different candidates based on data cube computation techniques, efficient mining is achievable.
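To make the second point concrete, here is a minimal sketch in Python, assuming pandas; the table and its contents are hypothetical, echoing the customer example below. OLAP-style aggregate queries over regions of cube space yield candidate features and a target for mining:

import pandas as pd

# Hypothetical customer records; the values are illustrative only.
df = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "state": ["IL", "IL", "NY", "NY"],
    "salary": [40000, 85000, 60000, 120000],
    "valued_customer": [0, 1, 0, 1],
})

# Each (year, state) group is one region of cube space; aggregates over a
# region serve as a feature and a target for mining.
region_stats = df.groupby(["year", "state"]).agg(
    avg_salary=("salary", "mean"),            # feature: aggregate over the region
    valued_rate=("valued_customer", "mean"),  # target: fraction of valued customers
)
print(region_stats)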
In this subsection we study prediction cubes, an example of multidimensional data mining where the cube space is explored for prediction tasks. A prediction cube is a cube structure that stores prediction models in multidimensional data space and supports prediction in an OLAP manner. Recall that in a data cube, each cell value is an aggregate number (e.g., count) computed over the data subset in that cell. However, each cell value in a prediction cube is computed by evaluating a predictive model built on the data subset in that cell, thereby representing that subset's predictive behavior.
Instead of seeing prediction models as the end result, prediction cubes use prediction models as building blocks to define the interestingness of data subsets, that is, they identify data subsets that indicate more accurate prediction. This is best explained with an example.
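As a minimal sketch of how the cells of a prediction cube might be filled, assuming pandas and scikit-learn (the text does not prescribe a learner; logistic regression and the helper name prediction_cube are illustrative), each cell at a chosen granularity stores the score of a model built and evaluated on that cell's data subset:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def prediction_cube(df, dims, features, label, cv=3):
    """Map each cell of cube space (one group per combination of dims,
    e.g., ["year", "state"]) to the cross-validated accuracy of a model
    built on that cell's data subset."""
    cube = {}
    for cell, subset in df.groupby(dims):
        # Skip cells with too little data to train and cross-validate a model.
        if subset[label].nunique() < 2 or subset[label].value_counts().min() < cv:
            continue
        X = pd.get_dummies(subset[features])  # one-hot encode categorical attributes
        cube[cell] = cross_val_score(LogisticRegression(), X, subset[label], cv=cv).mean()
    return cube

Rolling up or drilling down then amounts to calling the same helper with a coarser or finer dims list (e.g., ["year"] versus ["year", "state"]).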
Prediction cube for identification of interesting cube subspaces
Suppose a company has a customer table with the attributes time (with two granularity levels: month and year), location (with two granularity levels: state and country), gender, salary, and one class-label attribute: valued_customer. A manager wants to analyze the decision process of whether a customer is highly valued with respect to time and location. In particular, he is interested in the question "Are there times and locations in which the value of a customer depended greatly on the customer's gender?" Notice that he believes time and location play a role in predicting valued customers, but at what granularity levels does the prediction depend on gender? For example, is performing the analysis using {month, country} better than {year, state}?
Consider a data table D (e.g., the customer table). Let X be the set of attributes for which no concept hierarchy has been defined (e.g., gender, salary). Let Y be the class-label attribute (e.g., valued_customer), and let Z be the set of multilevel attributes, that is, attributes for which concept hierarchies have been defined (e.g., time, location). Let V be the set of attributes for which we would like to define their predictiveness. In our example, this set is {gender}. The predictiveness of V on a data subset can be quantified by the difference in accuracy between the model built on that subset using X to predict Y and the model built on that subset using X − V (e.g., {salary}) to predict Y. The intuition is that, if the difference is large, V must play an important role in the prediction of class label Y.
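The predictiveness measure translates directly into code. Below is a minimal sketch under the same illustrative assumptions as above (scikit-learn, with logistic regression as an arbitrary choice of learner): the score is the accuracy of a model built with all of X minus the accuracy of a model built with X − V:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def predictiveness(subset, X_attrs, V, label, cv=3):
    """Accuracy of the model built on subset using all of X_attrs to predict
    label, minus the accuracy of the model built using X_attrs - V.
    A large positive difference suggests V matters for predicting the label."""
    def accuracy(attrs):
        X = pd.get_dummies(subset[attrs])  # one-hot encode categorical attributes
        return cross_val_score(LogisticRegression(), X, subset[label], cv=cv).mean()
    reduced = [a for a in X_attrs if a not in V]
    return accuracy(X_attrs) - accuracy(reduced)

# e.g., predictiveness(cell_data, ["gender", "salary"], ["gender"], "valued_customer")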
Given a set of attributes, V, and a learning