Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [134]

By Root 1603 0

compact data structure and is more likely to fit in memory than the corresponding high-dimensional database.

To illustrate the design of the ID_measure array, let's look at Example 5.11.

Computing cubes with the average() measure

Table 5.8 shows an example sales database where each tuple has two associated values, such as item_count and sum, where item_count is the count of items sold.

Table 5.8 Database with Two Measure Values

TIDABCDEitem_countsum

1 a1 b1 c1 d1 e1 5 70

2 a1 b2 c1 d2 e1 3 10

3 a1 b2 c1 d1 e2 8 20

4 a2 b1 c1 d1 e2 5 40

5 a2 b1 c1 d1 e3 2 30

To compute a data cube for this database with the measure average(), we need to have a TID list for each cell: {TID1, …, TIDn}. Because each TID is uniquely associated with a particular set of measure values, all future computation just needs to fetch the measure values associated with the tuples in the list. In other words, by keeping an ID_measure array in memory for online processing, we can handle complex algebraic measures, such as average, variance, and standard deviation. Table 5.9 shows what exactly should be kept for our example, which is substantially smaller than the database itself.

Table 5.9 Table 5.8ID_ measure Array

TIDitem_countsum

1 5 70

2 3 10

3 8 20

4 5 40

5 2 30

The shell fragments are negligible in both storage space and computation time in comparison with the full data cube. Note that we can also use the Frag-Shells algorithm to compute the full data cube by including all the dimensions as a single fragment. Because the order of computation with respect to the cuboid lattice is top-down and depth-first (similar to that of BUC), the algorithm can perform Apriori pruning if applied to the construction of iceberg cubes.

“Once we have computed the shell fragments, how can they be used to answer OLAP queries?” Given the precomputed shell fragments, we can view the cube space as a virtual cube and perform OLAP queries related to the cube online. In general, two types of queries are possible: (1) point query and (2) subcube query.

In a point query, all of the relevant dimensions in the cube have been instantiated (i.e., there are no inquired dimensions in the relevant dimensions set). For example, in an n-dimensional data cube, A1A2…An, a point query could be in the form of , where A1 = {a11, a18}, A5 = {a52, a55, a59}, A9 = a94, and M is the inquired measure for each corresponding cube cell. For a cube with a small number of dimensions, we can use ∗ to represent a “don't care” position where the corresponding dimension is irrelevant, that is, neither inquired nor instantiated. For example, in the query for the database in Table 5.4, the first four dimension values are instantiated to a2, b1, c1, and d1, respectively, while the last dimension is irrelevant, and count() (which is the tuple count by context) is the inquired measure.

In a subcube query, at least one of the relevant dimensions in the cube is inquired. For example, in an n-dimensional data cube A1A2 … An, a subcube query could be in the form , where A1 = {a11, a18} and A9 = a94, A5 and A21 are the inquired dimensions, and M is the inquired measure. For a cube with a small number of dimensions, we can use ∗ for an irrelevant dimension and ? for an inquired one. For example, in the query we see that the first and third dimension values are instantiated to a2 and c1, respectively, while the fourth is irrelevant, and the second and the fifth are inquired. A subcube query computes all possible value combinations of the inquired dimensions. It essentially returns a local data cube consisting of the inquired dimensions.

“How can we use shell fragments to answer a point query?” Because a point query explicitly provides the instantiated variables set on the relevant dimensions set, we can make maximal use of the precomputed shell fragments by finding the best fitting (i.e., dimension-wise completely matching) fragments to fetch and intersect the associated TID lists.

Let the point query be of the form , where αi represents a set of instantiated values of dimension Ai, and so

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [134]

®Online Book Reader