Data Mining_ Concepts and Techniques - Jiawei Han [85]
Other features that distinguish between OLTP and OLAP systems include database size, frequency of operations, and performance metrics. These are summarized in Table 4.1.
Table 4.1 Comparison of OLTP and OLAP Systems
Note: Table is partially based on Chaudhuri and Dayal [CD97].
FeatureOLTPOLAP
Characteristic operational processing informational processing
Orientation transaction analysis
User clerk, DBA, database professional knowledge worker (e.g., manager, executive, analyst)
Function day-to-day operations long-term informational requirements decision support
DB design ER-based, application-oriented star/snowflake, subject-oriented
Data current, guaranteed up-to-date historic, accuracy maintainedover time
Summarization primitive, highly detailed summarized, consolidated
View detailed, flat relational summarized, multidimensional
Unit of work short, simple transaction complex query
Access read/write mostly read
Focus data in information out
Operations index/hash on primary key lots of scans
Number of records accessed tens millions
Number of users thousands hundreds
DB size GB to high-order GB ≥ TB
Priority high performance, high availability high flexibility, end-user autonomy
Metric transaction throughput query throughput, response time
4.1.3. But, Why Have a Separate Data Warehouse?
Because operational databases store huge amounts of data, you may wonder, “Why not perform online analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?” A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned from known tasks and workloads like indexing and hashing using primary keys, searching for particular records, and optimizing “canned” queries. On the other hand, data warehouse queries are often complex. They involve the computation of large data groups at summarized levels, and may require the use of special data organization, access, and implementation methods based on multidimensional views. Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access of data records for summarization and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based on the different structures, contents, and uses of the data in these two systems. Decision support requires historic data, whereas operational databases do not typically maintain historic data. In this context, the data in operational databases, though abundant, are usually far from complete for decision making. Decision support requires consolidation (e.g., aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean, integrated data. In contrast, operational databases contain only detailed raw data, such as transactions, which need to be consolidated before analysis. Because the two systems provide quite different functionalities and require different kinds of data, it is presently necessary to maintain separate databases. However, many vendors of operational relational database management systems are beginning to optimize such systems to support OLAP queries. As this trend continues, the separation between OLTP and OLAP systems is expected to decrease.
4.1.4. Data Warehousing: A Multitiered Architecture
Data warehouses often adopt a three-tier architecture, as presented in Figure 4.1.
Figure 4.1 A three-tier data warehousing architecture.
1. The bottom tier is