Data Mining_ Concepts and Techniques - Jiawei Han [138]
“What can we do to boost the reliability of query answers?” Consider what affects the confidence interval size. There are two main factors: the variance of the sample data and the sample size. First, a rather large variance in the cell may indicate that the chosen cube cell is poor for prediction. A better solution is probably to drill down on the query cell to a more specific one (i.e., asking more specific queries). Second, a small sample size can cause a large confidence interval. When there are very few samples, the corresponding critical value t_c is large because there are few degrees of freedom. This in turn widens the confidence interval. Intuitively, this makes sense. Suppose one is trying to figure out the average income of people in the United States. Just asking two or three people does not give much confidence in the returned response.
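The effect of sample size on the interval can be illustrated with a small Python sketch. It assumes the usual interval for a mean, the sample mean plus or minus t_c times the standard error s/√n. The income figures and the helper name `confidence_interval` are hypothetical, and only a few standard two-sided 95% t values are tabulated.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Standard table values; only the entries needed here are listed.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262, 29: 2.045}

def confidence_interval(samples):
    """Return (mean, half_width) of a 95% confidence interval for the mean."""
    n = len(samples)
    df = n - 1  # degrees of freedom
    if df not in T_CRIT_95:
        raise ValueError(f"no tabulated t value for {df} degrees of freedom")
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_95[df] * std_err

# Hypothetical incomes: three respondents versus thirty.
few = [42_000, 58_000, 51_000]
many = [40_000 + 700 * i for i in range(30)]

mean_few, half_few = confidence_interval(few)
mean_many, half_many = confidence_interval(many)
print(f"n=3:  {mean_few:.0f} +/- {half_few:.0f}")
print(f"n=30: {mean_many:.0f} +/- {half_many:.0f}")
```

With three samples, both the large t_c (4.303) and the small n in the standard error inflate the half-width; with thirty samples of similar spread, the interval is far narrower.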
The best way to solve this small sample size problem is to get more data. Fortunately, there is usually an abundance of additional data available in the cube. The data do not match the query cell exactly; however, we can consider data from cells that are “close by.” There are two ways to incorporate such data to enhance the reliability of the query answer: (1) intracuboid query expansion, where we consider nearby cells within the same cuboid, and (2) intercuboid query expansion, where we consider more general versions (from parent cuboids) of the query cell. Let's see how this works, starting with intracuboid query expansion.
Method 1. Intracuboid query expansion. Here, we expand the sample size by including nearby cells in the same cuboid as the queried cell, as shown in Figure 5.15(a). We just have to be careful that the new samples serve to increase the confidence in the answer without changing the query's semantics.
Figure 5.15 Query expansion within sampling cube: Given small data samples, both methods use strategies to boost the reliability of query answers by considering additional data cell values. (a) Intracuboid expansion considers nearby cells in the same cuboid as the queried cell. (b) Intercuboid expansion considers more general cells from parent cuboids.
So, the first question is “Which dimensions should be expanded?” The best candidates should be the dimensions that are uncorrelated or weakly correlated with the measure value (i.e., the value to be predicted). Expanding within these dimensions will likely increase the sample size and not shift the query's answer. Consider an example of a 2-D query specifying education = “college” and birth_month = “July.” Let the cube measure be average income. Intuitively, education has a high correlation to income while birth month does not. It would be harmful to expand the education dimension to include values such as “graduate” or “high school.” They are likely to alter the final result. However, expansion in the birth_month dimension to include other month values could be helpful, because it is unlikely to change the result but will increase the sample size.
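The education and birth_month example can be sketched in Python as follows. The sample records and the helper `select` are hypothetical and serve only to illustrate the idea: expanding along birth_month enlarges the sample while keeping the mean stable, whereas expanding along education shifts it.

```python
import statistics

# Hypothetical sample records: (education, birth_month, income).
samples = [
    ("college", "July", 52_000),
    ("college", "July", 48_000),
    ("college", "June", 50_000),
    ("college", "August", 51_000),
    ("college", "March", 49_000),
    ("graduate", "July", 90_000),
    ("high school", "July", 33_000),
]

def select(records, educations=None, months=None):
    """Return incomes matching the given value sets; None means 'any value'."""
    return [
        income
        for education, month, income in records
        if (educations is None or education in educations)
        and (months is None or month in months)
    ]

# The original query cell: education = "college", birth_month = "July".
exact = select(samples, {"college"}, {"July"})
# Expansion along birth_month (weakly correlated with income).
by_month = select(samples, {"college"}, None)
# Expansion along education (strongly correlated with income): harmful.
by_education = select(samples, None, {"July"})

for name, cell in [("exact", exact), ("by_month", by_month),
                   ("by_education", by_education)]:
    print(name, len(cell), statistics.mean(cell))
```

In this toy data, the birth_month expansion grows the sample from two to five records and leaves the average unchanged, while the education expansion pulls in records that change the answer.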
To mathematically measure the correlation of a dimension to the cube value, the correlation between the dimension's values and their aggregated cube measures is computed. Pearson's correlation coefficient for numeric data and the χ² correlation test for nominal data are popularly used correlation measures, although many other measures, such as covariance, can be used. (These measures were presented in Section 3.3.2.) A dimension that is strongly correlated with the value to be predicted should not be a candidate for expansion. Notice that since the correlation of a dimension with the cube measure is independent of a particular query, it should be precomputed and stored with the cube measure to facilitate efficient online analysis.
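As a minimal sketch of this precomputation for numeric dimensions, the Python code below computes Pearson's correlation coefficient between each dimension's values and the aggregated measure of the corresponding cells, then keeps the weakly correlated dimensions as expansion candidates. The dimension names, the aggregated values, and the 0.3 threshold are all assumptions made for illustration; a nominal dimension would call for the χ² test instead.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical per-cell aggregates: (dimension values, average income per value).
aggregates = {
    "years_of_education": ([10, 12, 14, 16, 18],
                           [30_000, 36_000, 45_000, 58_000, 72_000]),
    "birth_quarter": ([1, 2, 3, 4],
                      [50_100, 49_900, 49_900, 50_100]),
}

# Computed once and stored alongside the cube, independent of any query.
correlation = {dim: pearson(xs, ys) for dim, (xs, ys) in aggregates.items()}

THRESHOLD = 0.3  # assumed cutoff for "weakly correlated"
candidates = [dim for dim, r in correlation.items() if abs(r) < THRESHOLD]
print(correlation)
print(candidates)
```

Because the stored coefficients do not depend on the query, selecting expansion dimensions at query time reduces to a lookup against the chosen threshold.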
After selecting dimensions for expansion, the next question is “Which values within these dimensions should the expansion use?” This relies on the semantic knowledge of the dimensions in question. The goal should