Data Mining_ Concepts and Techniques - Jiawei Han [140]
Method 2. Intercuboid query expansion. In this case, the expansion occurs by looking to a more general cell, as shown in Figure 5.15(b). For example, the cell in the 2-D cuboid age-occupation can use its parent in either of the 1-D cuboids, age or occupation. Think of intercuboid expansion as just an extreme case of intracuboid expansion, where all the cells within a dimension are used in the expansion. This essentially sets the dimension to ∗ and thus generalizes to a higher-level cuboid.
A k-dimensional cell has k direct parents in the cuboid lattice, where each parent is (k − 1)-dimensional. There are many more ancestor cells in the data cube (e.g., if multiple dimensions are rolled up simultaneously). However, we choose only one parent here to make the search space tractable and to limit the change in the query's semantics. As with intracuboid query expansion, correlated dimensions are not allowed in intercuboid expansions. Within the uncorrelated dimensions, the two-sample t-test can be performed to confirm that the parent and the query cell share the same sample mean. If multiple parent cells pass the test, the test's confidence level can be adjusted progressively higher until only one passes. Alternatively, multiple parent cells can be used to boost the confidence simultaneously. The choice is application dependent.
Intercuboid expansion to answer a query on sample data
Given the input relation in Table 5.10, let the query on income be “occupation = teacher ∧ gender = male.” There is only one sample in Table 5.10 that matches the query, and it has an income of $80,000. Suppose the corresponding confidence interval is larger than a preset threshold. We use intercuboid expansion to find a more reliable answer. There are two parent cells in the data cube: “gender = male” and “occupation = teacher.” By moving up to “gender = male” (and thus setting occupation to ∗), the mean income is $101,000. A two sample t-test reveals that this parent's sample mean, so it is ignored. Next, “occupation = teacher” is considered. It has a mean income of $85,000 and passes the two-sample t-test. As a result, the query is expanded to “occupation = teacher” and an income value of $85,000 is returned with acceptable reliability.
“How can we determine which method to choose—intracuboid expansion or intercuboid expansion?” This is difficult to answer without knowing the data and the application. A strategy for choosing between the two is to consider what the tolerance is for change in the query's semantics. This depends on the specific dimensions chosen in the query. For instance, the user might tolerate a bigger change in semantics for the age dimension than education. The difference in tolerance could be so large that the user is willing to set age to ∗ (i.e., intercuboid expansion) rather than letting education change at all. Domain knowledge is helpful here.
So far, our discussion has only focused on full materialization of the sampling cube. In many real-world problems, this is often impossible, especially for high-dimensional cases. Real-world survey data, for example, can easily contain over 50 variables (i.e., dimensions). The sampling cube size would grow exponentially with the number of dimensions. To handle high-dimensional data, a sampling cube method called Sampling Cube Shell was developed. It integrates the Frag-Shell method of Section 5.2.4 with the query expansion approach. The shell computes only a subset of the full sampling cube. The subset should consist of relatively low-dimensional cuboids (that are commonly queried) and cuboids that offer the most benefit to the user. The details are left to interested readers as an exercise. The method was tested on both real and synthetic data and found to be efficient and effective in answering queries.
5.3.2. Ranking Cubes: Efficient Computation of Top-k Queries
The data cube helps not only online analytical processing of multidimensional queries