Data Mining_ Concepts and Techniques - Jiawei Han [136]
5That is, the intersection of the TID lists for (a2, b1), (c1), and (e2) is {4}.
For large data sets, a fragment size of 2 or 3 typically results in reasonable storage requirements for the shell fragments and for fast query response time. Querying with shell fragments is substantially faster than answering queries using precomputed data cubes that are stored on disk. In comparison to full cube computation, Frag-Shells is recommended if there are less than four inquired dimensions. Otherwise, more efficient algorithms, such as Star-Cubing, can be used for fast online cube computation. Frag-Shells can be easily extended to allow incremental updates, the details of which are left as an exercise.
5.3. Processing Advanced Kinds of Queries by Exploring Cube Technology
Data cubes are not confined to the simple multidimensional structure illustrated in the last section for typical business data warehouse applications. The methods described in this section further develop data cube technology for effective processing of advanced kinds of queries. Section 5.3.1 explores sampling cubes. This extension of data cube technology can be used to answer queries on sample data, such as survey data, which represent a sample or subset of a target data population of interest. Section 5.3.2 explains how ranking cubes can be computed to answer top-k queries, such as “find the top 5 cars,” according to some user-specified criteria.
The basic data cube structure has been further extended for various sophisticated data types and new applications. Here we list some examples, such as spatial data cubes for the design and implementation of geospatial data warehouses, and multimedia data cubes for the multidimensional analysis of multimedia data (those containing images and videos). RFID data cubes handle the compression and multidimensional analysis of RFID (i.e., radio-frequency identification) data. Text cubes and topic cubes were developed for the application of vector-space models and generative language models, respectively, in the analysis of multidimensional text databases (which contain both structure attributes and narrative text attributes).
5.3.1. Sampling Cubes: OLAP-Based Mining on Sampling Data
When collecting data, we often collect only a subset of the data we would ideally like to gather. In statistics, this is known as collecting a sample of the data population. The resulting data are called sample data. Data are often sampled to save on costs, manpower, time, and materials. In many applications, the collection of the entire data population of interest is unrealistic. In the study of TV ratings or pre-election polls, for example, it is impossible to gather the opinion of everyone in the population. Most published ratings or polls rely on a data sample for analysis. The results are extrapolated for the entire population, and associated with certain statistical measures such as a confidence interval. The confidence interval tells us how reliable a result is. Statistical surveys based on sampling are a common tool in many fields like politics, healthcare, market research, and social and natural sciences.
“How effective is OLAP on sample data?” OLAP traditionally has the full data population on hand, yet with sample data, we have only a small subset. If we try to apply traditional OLAP tools to sample data, we encounter three challenges. First, sample data are often sparse in the multidimensional sense. When a user drills down on the data, it is easy to reach a point with very few or no samples even when the overall sample size is large. Traditional OLAP simply uses whatever data are available to compute a query answer. To extrapolate such an answer for a population based on a small sample could be misleading: A single outlier or a slight bias in the sampling can distort the answer significantly. Second, with sample data, statistical methods are used to provide a measure