Data Mining_ Concepts and Techniques - Jiawei Han [132]
Instead of computing a cube shell, we can compute only portions or fragments of it. This section discusses the shell fragment approach for OLAP query processing. It is based on the following key observation about OLAP in high-dimensional space. Although a data cube may contain many dimensions, most OLAP operations are performed on only a small number of dimensions at a time. In other words, an OLAP query is likely to ignore many dimensions (i.e., treating them as irrelevant), fix some dimensions (e.g., using query constants as instantiations), and leave only a few to be manipulated (for drilling, pivoting, etc.). This is because it is neither realistic nor fruitful for anyone to comprehend the changes of thousands of cells involving tens of dimensions simultaneously in a high-dimensional space at the same time.
Instead, it is more natural to first locate some cuboids of interest and then drill along one or two dimensions to examine the changes of a few related dimensions. Most analysts will only need to examine, at any one moment, the combinations of a small number of dimensions. This implies that if multidimensional aggregates can be computed quickly on a small number of dimensions inside a high-dimensional space, we may still achieve fast OLAP without materializing the original high-dimensional data cube. Computing the full cube (or, often, even an iceberg cube or cube shell) can be excessive. Instead, a semi-online computation model with certain preprocessing may offer a more feasible solution. Given a base cuboid, some quick preparation computation can be done first (i.e., offline). After that, a query can then be computed online using the preprocessed data.
The shell fragment approach follows such a semi-online computation strategy. It involves two algorithms: one for computing cube shell fragments and the other for query processing with the cube fragments. The shell fragment approach can handle databases of high dimensionality and can quickly compute small local cubes online. It explores the inverted index data structure, which is popular in information retrieval and Web-based information systems.
The basic idea is as follows. Given a high-dimensional data set, we partition the dimensions into a set of disjoint dimension fragments, convert each fragment into its corresponding inverted index representation, and then construct cube shell fragments while keeping the inverted indices associated with the cube cells. Using the precomputed cubes' shell fragments, we can dynamically assemble and compute cuboid cells of the required data cube online. This is made efficient by set intersection operations on the inverted indices.
To illustrate the shell fragment approach, we use the tiny database of Table 5.4 as a running example. Let the cube measure be count(). Other measures will be discussed later. We first look at how to construct the inverted index for the given database.
Table 5.4 Original Database
TIDABCDE
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
Construct the inverted index
For each attribute value in each dimension, list the tuple identifiers (TIDs) of all the tuples that have that value. For example, attribute value a2 appears in tuples 4 and 5. The TID list for a2 then contains exactly two items, namely 4 and 5. The resulting inverted index table is shown in Table 5.5. It retains all the original database's information. If each table entry takes one unit of memory, Table 5.4 and Table 5.5 each takes 25 units, that is, the inverted index table uses the same amount of memory as the original database.
Table 5.5 Inverted Index
Attribute ValueTID ListList Size
a1 {1, 2, 3} 3
a2 {4, 5} 2
b1 {1, 4, 5} 3
b2 {2, 3} 2
c1 {1, 2, 3, 4, 5} 5
d1 {1, 3, 4, 5} 4
d2 {2} 1
e1 {1, 2} 2
e2 {3, 4} 2
e3 {5} 1
“How do we compute shell fragments of a data cube?” The shell fragment computation algorithm, Frag-Shells, is summarized in Figure 5.14. We first partition all the