Data Mining_ Concepts and Techniques - Jiawei Han [137]
A sampling cube framework was introduced to tackle each of the preceding challenges.
Sampling Cube Framework
The sampling cube is a data cube structure that stores the sample data and their multidimensional aggregates. It supports OLAP on sample data. It calculates confidence intervals as a quality measure for any multidimensional query. Given a sample data relation (i.e., base cuboid) R, the sampling cube CR typically computes the sample mean, sample standard deviation, and other task-specific measures.
In statistics, a confidence interval is used to indicate the reliability of an estimate. Suppose we want to estimate the mean age of all viewers of a given TV show. We have sample data (a subset) of this data population. Let's say our sample mean is 35 years. This becomes our estimate for the entire population of viewers as well, but how confident can we be that 35 is also the mean of the true population? It is unlikely that the sample mean will be exactly equal to the true population mean because of sampling error. Therefore, we need to qualify our estimate in some way to indicate the general magnitude of this error. This is typically done by computing a confidence interval, which is an estimated value range with a given high probability of covering the true population value. A confidence interval for our example could be “the actual mean will not vary by +/− two standard deviations 95% of the time.” (Recall that the standard deviation is just a number, which can be computed as shown in Section 2.2.2.) A confidence interval is always qualified by a particular confidence level. In our example, it is 95%.
The confidence interval is calculated as follows. Let x be a set of samples. The mean of the samples is denoted by , and the number of samples in x is denoted by l. Assuming that the standard deviation of the population is unknown, the sample standard deviation of x is denoted by s. Given a desired confidence level, the confidence interval for is
(5.1)
where tc is the critical t-value associated with the confidence level and is the estimated standard error of the mean. To find the appropriate tc, specify the desired confidence level (e.g., 95%) and also the degree of freedom, which is just l − 1.
The important thing to note is that the computation involved in computing a confidence interval is algebraic. Let's look at the three terms involved in Eq. (5.1). The first is the mean of the sample set, , which is algebraic; the second is the critical t-value, which is calculated by a lookup, and with respect to x, it depends on l, a distributive measure; and the third is , which also turns out to be algebraic if one records the linear sum () and squared sum (). Because the terms involved are either algebraic or distributive, the confidence interval computation is algebraic. Actually, since both the mean and confidence interval are algebraic, at every cell, exactly three values are sufficient to calculate them—all of which are either distributive or algebraic:
1. l
2.
3.
There are many efficient techniques for computing algebraic and distributive measures (Section 4.2.4). Therefore, any of the previously developed cubing algorithms can be used to efficiently construct a sampling cube.
Now that we have established that sampling cubes can be computed efficiently, our next step is to find a way of boosting the confidence of results obtained for queries on sample data.
Query Processing: Boosting Confidences for Small Samples
A query posed against a data cube can be either a point query or a range query. Without loss of generality, consider the case of a point query. Here, it corresponds to a cell in sampling cube CR. The goal is to provide an accurate point estimate for the samples in that cell. Because the cube also reports the confidence interval associated with the sample mean, there is some measure of “reliability” to the returned answer. If