Data Mining_ Concepts and Techniques - Jiawei Han [104]
2. Determine to which materialized cuboid(s) the relevant operations should be applied: This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the set using knowledge of “dominance” relationships among the cuboids, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
OLAP query processing
Suppose that we define a data cube for AllElectronics of the form “sales_cube [time, item, location ]: sum (sales_in_dollars).” The dimension hierarchies used are “day < month < quarter < year” for time; “item_name < brand < type” for item; and “street < city < province_or_state < country” for location.
Suppose that the query to be processed is on {brand, province_or_state }, with the selection constant “year = 2010.” Also, suppose that there are four materialized cuboids available, as follows:
■ cuboid 1: {year, item_name, city }
■ cuboid 2: {year, brand, country }
■ cuboid 3: {year, brand, province_or_state }
■ cuboid 4: {item_name, province_or_state }, where year = 2010
“Which of these four cuboids should be selected to process the query?” Finer-granularity data cannot be generated from coarser-granularity data. Therefore, cuboid 2 cannot be used because country is a more general concept than province_or_state. Cuboids 1, 3, and 4 can be used to process the query because (1) they have the same set or a superset of the dimensions in the query, (2) the selection clause in the query can imply the selection in the cuboid, and (3) the abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province_or_state, respectively.
“How would the costs of each cuboid compare if used to process the query?” It is likely that using cuboid 1 would cost the most because both item_name and city are at a lower level than the brand and province_or_state concepts specified in the query. If there are not many year values associated with items in the cube, but there are several item_names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query. However, if efficient indices are available for cuboid 4, then cuboid 4 may be a better choice. Therefore, some cost-based estimation is required to decide which set of cuboids should be selected for query processing.
4.4.4. OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
Logically, OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored. However, the physical architecture and implementation of OLAP servers must consider data storage issues. Implementations of a warehouse server for OLAP processing include the following:
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 5).
Many MOLAP servers adopt a two-level storage representation to handle dense