Online Book Reader


Data Mining_ Concepts and Techniques - Jiawei Han [107]

offline precomputation of multidimensional space can speed up attribute-oriented induction as well.

The general idea of attribute-oriented induction is to first collect the task-relevant data using a database query and then perform generalization based on the examination of the number of each attribute's distinct values in the relevant data set. The generalization is performed by either attribute removal or attribute generalization. Aggregation is performed by merging identical generalized tuples and accumulating their respective counts. This reduces the size of the generalized data set. The resulting generalized relation can be mapped into different forms (e.g., charts or rules) for presentation to the user.
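The steps just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's algorithm: the function name, the `threshold` parameter (the largest number of distinct values an attribute may keep), and the sample tuples are assumptions made for the example. An attribute with too many distinct values is generalized by climbing its concept hierarchy when one is available, and removed otherwise; identical generalized tuples are then merged and their counts accumulated.

```python
from collections import Counter

def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize `rows` (a list of dicts) and merge identical results.

    hierarchies: attribute name -> {value: more_general_value} (assumed acyclic)
    threshold:   assumed cutoff on the number of distinct values per attribute
    """
    if not rows:
        return [], Counter()
    attrs = list(rows[0].keys())
    rows = [dict(r) for r in rows]  # work on copies, leave the input untouched
    for attr in attrs:
        while len({r[attr] for r in rows}) > threshold:
            parent = hierarchies.get(attr)
            if not parent or not any(r[attr] in parent for r in rows):
                # Attribute removal: nothing left to generalize to.
                for r in rows:
                    del r[attr]
                break
            # Attribute generalization: climb one level in the hierarchy.
            for r in rows:
                r[attr] = parent.get(r[attr], r[attr])
    kept = [a for a in attrs if a in rows[0]]
    # Aggregation: merge identical generalized tuples, accumulating counts.
    counts = Counter(tuple(r[a] for a in kept) for r in rows)
    return kept, counts

# Made-up sample data, for illustration only.
sample = [
    {"name": "Jim", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Scott", "major": "CS", "birth_place": "Montreal"},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle"},
]
place_hierarchy = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "USA"}

kept, counts = attribute_oriented_induction(
    sample, {"birth_place": place_hierarchy}, threshold=2
)
print(kept)          # ['major', 'birth_place']
print(dict(counts))  # {('CS', 'Canada'): 2, ('Physics', 'USA'): 1}
```

In this run, name is removed (three distinct values, no hierarchy), major is kept as is, and birth_place is generalized from city to country, so the three input tuples shrink to two generalized tuples with counts.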

The following illustrates the process of attribute-oriented induction. We first discuss its use for characterization. The method is extended for the mining of class comparisons in Section 4.5.3.

A data mining query for characterization

Suppose that a user wants to describe the general characteristics of graduate students in the Big University database, given the attributes name, gender, major, birth_place, birth_date, residence, phone# (telephone number), and gpa (grade_point_average). A data mining query for this characterization can be expressed in the data mining query language, DMQL, as follows:

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

We will see how attribute-oriented induction can be applied to this typical data mining query to mine characteristic descriptions.

First, data focusing should be performed before attribute-oriented induction. This step corresponds to the specification of the task-relevant data (i.e., data for analysis). The data are collected based on the information provided in the data mining query. Because a data mining query is usually relevant to only a portion of the database, selecting the relevant data set not only makes mining more efficient, but also derives more meaningful results than mining the entire database.
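As a rough illustration of data focusing for the query above, the following Python sketch selects the graduate students and projects the relevant attributes. The function name and the sample rows are invented for the example, and the mapping from degree values to "graduate" and "undergraduate" is an assumed concept hierarchy for status; a real system would evaluate the query inside the database.

```python
# Assumed concept hierarchy for the status attribute.
STATUS_HIERARCHY = {
    "M.Sc.": "graduate", "M.A.": "graduate", "M.B.A.": "graduate",
    "Ph.D.": "graduate", "B.Sc.": "undergraduate", "B.A.": "undergraduate",
}

# Attributes named in the query's "in relevance to" clause.
RELEVANT = ["name", "gender", "major", "birth_place",
            "birth_date", "residence", "phone#", "gpa"]

def collect_task_relevant(student, relevant, status_level):
    """Keep rows whose status generalizes to `status_level`,
    projected onto the relevant attributes only."""
    return [
        {attr: row[attr] for attr in relevant}
        for row in student
        if STATUS_HIERARCHY.get(row["status"]) == status_level
    ]

# Made-up rows standing in for the student relation.
student = [
    {"name": "A. Example", "gender": "F", "major": "CS",
     "birth_place": "Vancouver", "birth_date": "1990-01-01",
     "residence": "Burnaby", "phone#": "555-0100", "gpa": 3.7,
     "status": "M.Sc."},
    {"name": "B. Example", "gender": "M", "major": "Physics",
     "birth_place": "Seattle", "birth_date": "1995-06-15",
     "residence": "Vancouver", "phone#": "555-0101", "gpa": 3.2,
     "status": "B.Sc."},
    {"name": "C. Example", "gender": "F", "major": "Biology",
     "birth_place": "Montreal", "birth_date": "1988-03-09",
     "residence": "Richmond", "phone#": "555-0102", "gpa": 3.9,
     "status": "Ph.D."},
]

relevant_data = collect_task_relevant(student, RELEVANT, "graduate")
print(len(relevant_data))                  # 2
print([r["name"] for r in relevant_data])  # ['A. Example', 'C. Example']
```

Only the two graduate rows survive, and the status attribute itself is dropped because it is not listed among the relevant attributes.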

Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in DMQL with the in relevance to clause) may be difficult for the user. A user may select only a few attributes that he or she feels are important, while missing others that could also play a role in the description. For example, suppose that the dimension birth_place is defined by the attributes city, province_or_state, and country. Of these attributes, let's say that the user has only thought to specify city. In order to allow generalization on the birth_place dimension, the other attributes defining this dimension should also be included. In other words, having the system automatically include province_or_state and country as relevant attributes allows city to be generalized to these higher conceptual levels during the induction process.
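One way a system might add the missing attributes automatically is sketched below in Python. The `DIMENSIONS` table and the `expand_relevant` function are hypothetical names introduced for illustration; the sketch assumes the system already knows which attributes define each dimension.

```python
# Assumed metadata: each dimension lists its attributes from low to high level.
DIMENSIONS = {
    "birth_place": ["city", "province_or_state", "country"],
}

def expand_relevant(selected, dimensions):
    """Return `selected` plus every other attribute that defines the same
    dimension as a selected attribute, preserving order without duplicates."""
    expanded = []
    for attr in selected:
        # Find the dimension this attribute belongs to, if any.
        group = next(
            (levels for levels in dimensions.values() if attr in levels),
            [attr],
        )
        for a in group:
            if a not in expanded:
                expanded.append(a)
    return expanded

print(expand_relevant(["gpa", "city"], DIMENSIONS))
# ['gpa', 'city', 'province_or_state', 'country']
```

With province_or_state and country included, city values can later be generalized to either of those levels during induction.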

At the other extreme, suppose that the user has introduced too many attributes by specifying all of the possible attributes with the clause in relevance to *. In this case, all of the attributes in the relation specified by the from clause would be included in the analysis. Many of these attributes are unlikely to contribute to an interesting description. A correlation-based analysis method (Section 3.3.2) can be used to perform attribute relevance analysis and filter out statistically irrelevant or weakly relevant attributes from the descriptive mining process. Other approaches, such as attribute subset selection, are also described in Chapter 3.

“What does the ‘where status in “graduate”’ clause mean?” The where clause implies that a concept hierarchy exists for the attribute status. Such a concept hierarchy organizes primitive-level data values for status (e.g., “M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.,” “B.Sc.,” and “B.A.”) into higher conceptual levels (e.g., “graduate” and “undergraduate”). This use of concept hierarchies does not appear in traditional relational query
