Data Mining_ Concepts and Techniques - Jiawei Han [111]
4.5.2. Efficient Implementation of Attribute-Oriented Induction
“How is attribute-oriented induction actually implemented?”Section 4.5.1 provided an introduction to attribute-oriented induction. The general procedure is summarized in Figure 4.18. The efficiency of this algorithm is analyzed as follows:
■ Step 1 of the algorithm is essentially a relational query to collect the task-relevant data into the working relation, W. Its processing efficiency depends on the query processing methods used. Given the successful implementation and commercialization of database systems, this step is expected to have good performance.
■ Step 2 collects statistics on the working relation. This requires scanning the relation at most once. The cost for computing the minimum desired level and determining the mapping pairs, , for each attribute is dependent on the number of distinct values for each attribute and is smaller than |W|, the number of tuples in the working relation. Notice that it may not be necessary to scan the working relation once, since if the working relation is large, a sample of such a relation will be sufficient to get statistics and determine which attributes should be generalized to a certain high level and which attributes should be removed. Moreover, such statistics may also be obtained in the process of extracting and generating a working relation in Step 1.
■ Step 3 derives the prime relation, P. This is performed by scanning each tuple in the working relation and inserting generalized tuples into P. There are a total of |W| tuples in W and p tuples in P. For each tuple, t, in W, we substitute its attribute values based on the derived mapping pairs. This results in a generalized tuple, t′. If variation (a) in Figure 4.18 is adopted, each t′ takes O(log p) to find the location for the count increment or tuple insertion. Thus, the total time complexity is O(|W| × log p) for all of the generalized tuples. If variation (b) is adopted, each t′ takes O(1) to find the tuple for the count increment. Thus, the overall time complexity is O(N) for all of the generalized tuples.
Figure 4.18 Basic algorithm for attribute-oriented induction.
Many data analysis tasks need to examine a good number of dimensions or attributes. This may involve dynamically introducing and testing additional attributes rather than just those specified in the mining query. Moreover, a user with little knowledge of the truly relevant data set may simply specify "in relevance to ∗ " in the mining query, which includes all of the attributes in the analysis. Therefore, an advanced–concept description mining process needs to perform attribute relevance analysis on large sets of attributes to select the most relevant ones. This analysis may employ correlation measures or tests of statistical significance, as described in Chapter 3 on data preprocessing.
Presentation of generalization results
Suppose that attribute-oriented induction was performed on a sales relation of the AllElectronics database, resulting in the generalized description of Table 4.7 for sales last year. The description is shown in the form of a generalized relation. Table 4.6 is another generalized relation example.
Table 4.7 Generalized Relation for Last Year's Sales
locationitemsales (in million dollars)count (in thousands)
Asia TV 15 300
Europe TV 12 250
North_America TV 28 450
Asia computer 120 1000
Europe computer 150 1200
North_America computer 200 1800
Such generalized relations can also be presented in the form of cross-tabulation forms, various kinds of graphic presentation (e.g., pie charts and bar charts), and quantitative characteristics rules (i.e., showing how different value combinations are distributed in the generalized relation).
4.5.3. Attribute-Oriented Induction for Class Comparisons
In many applications, users may not be interested in having a single class (or concept) described or characterized, but prefer to mine a description that compares or distinguishes one class (or concept) from other comparable classes (or