Data Mining_ Concepts and Techniques - Jiawei Han [139]
By considering additional data during query expansion, we are aiming for a more accurate and reliable answer. As mentioned before, strongly correlated dimensions are precluded from expansion for this purpose. An additional strategy is to ensure that new samples share the “same” cube measure value (e.g., mean income) as the existing samples in the query cell. The two-sample t-test is a relatively simple statistical method that can be used to determine whether two samples have the same mean (or any other point estimate), where “same” means that they do not differ significantly. (It is described in greater detail in Section 8.5.5 on model selection using statistical tests of significance.)
The test determines whether two samples have the same mean (the null hypothesis) with the only assumption being that they are both normally distributed. The test fails if there is evidence that the two samples do not share the same mean. Furthermore, the test can be performed with a confidence level as an input. This allows the user to control how strict or loose the query expansion will be.
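The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it uses Welch's form of the two-sample t-test (which does not assume equal variances), because the text does not say which variant is intended. The names `welch_t`, `same_mean`, and `critical_value` are hypothetical, and the sample values are made up. The critical value is assumed to be looked up from a t table for the chosen confidence level, which is how the user controls how strict the expansion is.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Each sample needs at least two values and the samples must not both
    have zero variance, otherwise the statistic is undefined.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def same_mean(a, b, critical_value):
    """True if the test finds no significant difference between the means."""
    t, _ = welch_t(a, b)
    return abs(t) <= critical_value


# Made-up samples, for illustration only.
query_cell = [1, 2, 3, 4, 5]
candidate_cell = [2, 3, 4, 5, 6]
t, df = welch_t(query_cell, candidate_cell)
# 2.306 is the two-sided 95% critical value for 8 degrees of freedom.
passes = same_mean(query_cell, candidate_cell, critical_value=2.306)
print(t, df, passes)
```

A higher confidence level gives a larger critical value, so more candidate cells pass and the expansion becomes looser. A lower confidence level has the opposite effect.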
Example 5.14 shows how the intracuboid expansion strategies just described can be used to answer a query on sample data.
Example 5.14 Intracuboid query expansion to answer a query on sample data
Consider a book retailer trying to learn more about its customers' annual income levels. Table 5.10 shows a sample of the survey data collected.⁶ In the survey, customers are segmented by four attributes: gender, age, education, and occupation.
⁶ For the sake of illustration, ignore the fact that the sample size is too small to be statistically significant.
Table 5.10 Sample Customer Survey Data
gender   age   education     occupation   income
female   23    college       teacher      $85,000
female   40    college       programmer   $50,000
female   31    college       programmer   $52,000
female   50    graduate      teacher      $90,000
female   62    graduate      CEO          $500,000
male     25    high school   programmer   $50,000
male     28    high school   CEO          $250,000
male     40    college       teacher      $80,000
male     50    college       programmer   $45,000
male     57    graduate      programmer   $80,000
Let a query on customer income be “age = 25,” where the user specifies a 95% confidence level. Suppose this returns an income value of $50,000 with a rather large confidence interval.⁷ Suppose also that this confidence interval is larger than a preset threshold and that the age dimension was found to have little correlation with income in this data set. Therefore, intracuboid expansion starts within the age dimension. The nearest cell is “age = 23,” which returns an income of $85,000. The two-sample t-test at the 95% confidence level passes, so the query expands; it is now “age = {23, 25}” with a smaller confidence interval than initially. However, the interval is still larger than the threshold, so expansion continues to the next nearest cell: “age = 28,” which returns an income of $250,000. The two-sample t-test between this cell and the original query cell fails; as a result, this cell is ignored. Next, “age = 31” is checked and it passes the test.
⁷ For the sake of the example, suppose this is true even though there is only one sample. In practice, more points are needed to calculate a legitimate value.
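The walk-through above can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the book's algorithm. The t-test outcomes are copied from the text instead of being computed, since each cell holds a single sample. The confidence-interval threshold check is replaced by a hypothetical stand-in, `cells_needed`, because the threshold value is not given. The names `SURVEY`, `T_TEST_PASSES`, `expansion_order`, and `expand` are invented for this example.

```python
# Rows of Table 5.10 reduced to (age, income); other attributes are omitted.
SURVEY = [
    (23, 85_000), (40, 50_000), (31, 52_000), (50, 90_000), (62, 500_000),
    (25, 50_000), (28, 250_000), (40, 80_000), (50, 45_000), (57, 80_000),
]

# Two-sample t-test outcomes as stated in the walk-through (not computed here).
T_TEST_PASSES = {23: True, 28: False, 31: True}


def expansion_order(query_age, ages):
    """Distinct candidate ages, nearest to the query cell first."""
    candidates = set(ages) - {query_age}
    return sorted(candidates, key=lambda a: (abs(a - query_age), a))


def expand(query_age, rows, passes, cells_needed):
    """Grow the query cell along age until `cells_needed` cells are accepted.

    `cells_needed` is a hypothetical stand-in for the check that the
    confidence interval has dropped below the preset threshold.
    """
    accepted = [query_age]
    for age in expansion_order(query_age, [a for a, _ in rows]):
        if len(accepted) >= cells_needed:
            break
        if passes.get(age, False):  # cells that fail the test are skipped
            accepted.append(age)
    incomes = [income for a, income in rows if a in accepted]
    return sorted(accepted), sum(incomes) / len(incomes)


cells, answer = expand(25, SURVEY, T_TEST_PASSES, cells_needed=3)
print(cells, round(answer))  # prints: [23, 25, 31] 62333
```

Sorting candidates by their distance from the query value reproduces the order in which the text visits the cells: 23, then 28, then 31.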
The confidence interval of the three cells combined is now below the threshold and the expansion finishes at “age = {23, 25, 31}.” The mean of the income values at these three cells is (85,000 + 50,000 + 52,000)/3 ≈ $62,333, which is returned as the query answer. It has a smaller confidence interval, and thus