Online Book Reader


Data Mining: Concepts and Techniques - Jiawei Han [139]

be to select semantically similar values to minimize the risk of altering the final result. Consider the age dimension—similarity of values in this dimension is clear. There is a definite (numeric) order to the values. Dimensions with numeric or ordinal (ranked) data (like education) have a definite ordering among data values. Therefore, we can select values that are close to the instantiated query value. For nominal data of a dimension that is organized in a multilevel hierarchy in a data cube (e.g., location), we should select those values located in the same branch of the tree (e.g., the same district or city).
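As a minimal sketch of the two selection rules above, the following Python picks the closest values for a numeric dimension and the sibling values for a hierarchical one. The location hierarchy, the function names, and the parameter `k` are hypothetical and serve only as illustration; they do not come from the text.

```python
# Hypothetical location hierarchy (child -> parent); the names are illustrative only.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
}


def nearest_numeric(values, query, k):
    """Return the k distinct values closest to `query`, excluding `query` itself."""
    candidates = sorted(set(values) - {query}, key=lambda v: (abs(v - query), v))
    return candidates[:k]


def same_branch(value, parent=PARENT):
    """Return the siblings of `value`: the other values that share its parent node."""
    p = parent.get(value)
    if p is None:  # root or unknown value: no siblings to offer
        return []
    return sorted(v for v, q in parent.items() if q == p and v != value)


print(nearest_numeric([23, 40, 31, 50, 62, 25, 28, 40, 50, 57], 25, 3))
print(same_branch("Vancouver"))
```

For an ordinal dimension such as education, the same distance-based rule could be applied to the rank of each value rather than to the value itself.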

By considering additional data during query expansion, we are aiming for a more accurate and reliable answer. As mentioned before, strongly correlated dimensions are precluded from expansion for this purpose. An additional strategy is to ensure that new samples share the “same” cube measure value (e.g., mean income) as the existing samples in the query cell. The two-sample t-test is a relatively simple statistical method that can be used to determine whether two samples have the same mean (or any other point estimate), where “same” means that they do not differ significantly. (It is described in greater detail in Section 8.5.5 on model selection using statistical tests of significance.)

The test determines whether two samples have the same mean (the null hypothesis) with the only assumption being that they are both normally distributed. The test fails if there is evidence that the two samples do not share the same mean. Furthermore, the test can be performed with a confidence level as an input. This allows the user to control how strict or loose the query expansion will be.
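To make the test concrete, here is a sketch using only the Python standard library. It assumes Welch's form of the two-sample t-test (unequal variances), which is one common variant and not necessarily the one the text intends. The sample values are invented, and the critical value is supplied by the caller from a t table because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's test."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def same_mean(sample_a, sample_b, critical_value):
    """True if the test does not reject equal means at the chosen level."""
    t, _ = welch_t(sample_a, sample_b)
    return abs(t) <= critical_value


# Hypothetical income samples (in $1,000s) for a query cell and two candidate cells.
query_cell = [48, 50, 52, 50]
near_cell = [49, 53, 51, 47]
far_cell = [240, 250, 260, 250]

# 3.182 is the two-sided 95% critical value for 3 degrees of freedom,
# used here as a conservative assumption for both comparisons.
print(same_mean(query_cell, near_cell, 3.182))
print(same_mean(query_cell, far_cell, 3.182))
```

Raising the confidence level raises the critical value, so more candidate cells pass and the expansion becomes looser. Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic along with a p-value.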

Example 5.14 shows how the intracuboid expansion strategies just described can be used to answer a query on sample data.

Example 5.14 Intracuboid query expansion to answer a query on sample data

Consider a book retailer trying to learn more about its customers' annual income levels. In Table 5.10, a sample of the survey data collected is shown.[6] In the survey, customers are segmented by four attributes, namely gender, age, education, and occupation.

[6] For the sake of illustration, ignore the fact that the sample size is too small to be statistically significant.

Table 5.10 Sample Customer Survey Data

gender  age  education    occupation  income
female  23   college      teacher     $85,000
female  40   college      programmer  $50,000
female  31   college      programmer  $52,000
female  50   graduate     teacher     $90,000
female  62   graduate     CEO         $500,000
male    25   high school  programmer  $50,000
male    28   high school  CEO         $250,000
male    40   college      teacher     $80,000
male    50   college      programmer  $45,000
male    57   graduate     programmer  $80,000

Let a query on customer income be “age = 25,” where the user specifies a 95% confidence level. Suppose this returns an income value of $50,000 with a rather large confidence interval.[7] Suppose also that this confidence interval is larger than a preset threshold and that the age dimension was found to have little correlation with income in this data set. Therefore, intracuboid expansion starts within the age dimension. The nearest cell is “age = 23,” which returns an income of $85,000. The two-sample t-test at the 95% confidence level passes, so the query expands; it is now “age = {23, 25}” with a smaller confidence interval than initially. However, it is still larger than the threshold, so expansion continues to the next nearest cell: “age = 28,” which returns an income of $250,000. The two-sample t-test between this cell and the original query cell fails; as a result, it is ignored. Next, “age = 31” is checked and it passes the test.

[7] For the sake of the example, suppose this is true even though there is only one sample. In practice, more points are needed to calculate a legitimate value.
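The walk-through of this example, from the initial query to the combined answer, can be traced with a short Python sketch over the age and income columns of Table 5.10. Since each age cell holds a single sample, no real t-test or confidence interval can be computed here. The two callbacks are therefore stand-ins: `passes_test` simply replays the test outcomes stated in the example, and `is_precise_enough` stops after three accepted cells as a placeholder for the confidence-interval threshold. All function and parameter names are hypothetical.

```python
from statistics import mean

# Ages and incomes from Table 5.10 (one row per surveyed customer).
ROWS = [
    (23, 85_000), (40, 50_000), (31, 52_000), (50, 90_000), (62, 500_000),
    (25, 50_000), (28, 250_000), (40, 80_000), (50, 45_000), (57, 80_000),
]


def expand_query(rows, query_age, passes_test, is_precise_enough):
    """Grow the set of accepted ages outward from query_age, nearest first."""
    accepted = [query_age]
    candidates = sorted({a for a, _ in rows} - {query_age},
                        key=lambda a: abs(a - query_age))
    for age in candidates:
        if is_precise_enough(accepted):
            break  # the answer is already precise enough; stop expanding
        if passes_test(age):
            accepted.append(age)  # similar cell: merge it into the query
        # otherwise the cell is ignored and the next nearest one is tried
    incomes = [inc for a, inc in rows if a in accepted]
    return sorted(accepted), mean(incomes)


# Stand-in for the t-test: outcomes as stated in the example, not computed.
T_TEST_OUTCOME = {23: True, 28: False, 31: True}

ages, answer = expand_query(
    ROWS, 25,
    passes_test=lambda age: T_TEST_OUTCOME.get(age, False),
    # Placeholder for "confidence interval below the threshold".
    is_precise_enough=lambda accepted: len(accepted) >= 3,
)
print(ages, round(answer))
```

In a real setting, each cell would hold several samples, `passes_test` would run a two-sample t-test against the query cell, and `is_precise_enough` would compare the width of the combined confidence interval with the threshold.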

The confidence interval of the three cells combined is now below the threshold, and the expansion finishes at “age = {23, 25, 31}.” The mean of the income values at these three cells is ($85,000 + $50,000 + $52,000)/3 ≈ $62,333, which is returned as the query answer. It has a smaller confidence interval, and thus

