Data Mining_ Concepts and Techniques - Jiawei Han [121]
Many cells in a cuboid may actually be of little or no interest to the data analyst. Recall that each cell in a full cube records an aggregate value such as count or sum. For many cells in a cuboid, the measure value will be zero. When the product of the cardinalities for the dimensions in a cuboid is large relative to the number of nonzero-valued tuples that are stored in the cuboid, then we say that the cuboid is sparse. If a cube contains many sparse cuboids, we say that the cube is sparse.
In many cases, a substantial amount of the cube's space could be taken up by a large number of cells with very low measure values. This is because the cube cells are often quite sparsely distributed within a multidimensional space. For example, a customer may only buy a few items in a store at a time. Such an event will generate only a few nonempty cells, leaving most other cube cells empty. In such situations, it is useful to materialize only those cells in a cuboid (group-by) with a measure value above some minimum threshold. In a data cube for sales, say, we may wish to materialize only those cells for which count ≥ 10 (i.e., where at least 10 tuples exist for the cell's given combination of dimensions), or only those cells representing sales ≥ $100. This not only saves processing time and disk space, but also leads to a more focused analysis. The cells that cannot pass the threshold are likely to be too trivial to warrant further analysis.
Such partially materialized cubes are known as iceberg cubes. The minimum threshold is called the minimum support threshold, or minimum support (min_sup), for short. By materializing only a fraction of the cells in a data cube, the result is seen as the “tip of the iceberg,” where the “iceberg” is the potential full cube including all cells. An iceberg cube can be specified with an SQL query, as shown in Example 5.3.
Iceberg cube
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) > = min_sup
The compute cube statement specifies the precomputation of the iceberg cube, sales_iceberg, with the dimensions month, city, and customer_group, and the aggregate measure count(). The input tuples are in the salesInfo relation. The cube by clause specifies that aggregates (group-by's) are to be formed for each of the possible subsets of the given dimensions. If we were computing the full cube, each group-by would correspond to a cuboid in the data cube lattice. The constraint specified in the having clause is known as the iceberg condition. Here, the iceberg measure is count(). Note that the iceberg cube computed here could be used to answer group-by queries on any combination of the specified dimensions of the form having count(*) >= v, where v ≥ (min_sup). Instead of count(), the iceberg condition could specify more complex measures such as average().
If we were to omit the having clause, we would end up with the full cube. Let's call this cube sales_cube. The iceberg cube, sales_iceberg, excludes all the cells of sales_cube with a count that is less than min_sup. Obviously, if we were to set the minimum support to 1 in sales_iceberg, the resulting cube would be the full cube, sales_cube.
A naïve approach to computing an iceberg cube would be to first compute the full cube and then prune the cells that do not satisfy the iceberg condition. However, this is still prohibitively expensive. An efficient approach is to compute only the iceberg cube directly without computing the full cube. 5.2.2 and 5.2.3 discuss methods for efficient iceberg cube computation.
Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data cube. However, we could still end up with a large number of