Data Mining_ Concepts and Techniques - Jiawei Han [113]
phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Let's see how this typical example of a data mining query for mining comparison descriptions can be processed.
First, the query is transformed into two relational queries that collect two sets of task-relevant data: one for the initial target-class working relation and the other for the initial contrasting-class working relation, as shown in Table 4.8 and Table 4.9. This can also be viewed as the construction of a data cube, where the status {graduate, undergraduate} serves as one dimension, and the other attributes form the remaining dimensions.
Table 4.8 Initial Working Relations: The Target Class (Graduate Students)
name | gender | major | birth_place | birth_date | residence | phone# | gpa
Jim Woodman | M | CS | Vancouver, BC, Canada | 12-8-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 7-28-75 | 345 1st Ave., Vancouver | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 8-25-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …
Table 4.9 Initial Working Relations: The Contrasting Class (Undergraduate Students)
name | gender | major | birth_place | birth_date | residence | phone# | gpa
Bob Schumann | M | Chemistry | Calgary, Alt, Canada | 1-10-78 | 2642 Halifax St., Burnaby | 294-4291 | 2.96
Amy Eau | F | Biology | Golden, BC, Canada | 3-30-76 | 463 Sunset Cres., Vancouver | 681-5417 | 3.52
… | … | … | … | … | … | … | …
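The first step above can be sketched in Python as a selection on status followed by a projection onto the task-relevant attributes. This is a minimal illustration, not the book's implementation: the `working_relation` helper, the `status` field, and the abbreviated records are assumptions introduced here, with only a few values echoing Tables 4.8 and 4.9.

```python
from typing import Dict, List

# Hypothetical student records (assumption); a few values echo Tables 4.8 and 4.9.
students: List[Dict[str, object]] = [
    {"name": "Jim Woodman", "major": "CS", "gpa": 3.67, "status": "graduate"},
    {"name": "Laura Lee", "major": "Physics", "gpa": 3.83, "status": "graduate"},
    {"name": "Bob Schumann", "major": "Chemistry", "gpa": 2.96, "status": "undergraduate"},
    {"name": "Amy Eau", "major": "Biology", "gpa": 3.52, "status": "undergraduate"},
]


def working_relation(rows, status, attributes):
    """Select the rows of one class and project them onto the given attributes."""
    return [{a: r[a] for a in attributes} for r in rows if r["status"] == status]


relevant = ["name", "major", "gpa"]
target = working_relation(students, "graduate", relevant)            # target class
contrasting = working_relation(students, "undergraduate", relevant)  # contrasting class
```

Each call plays the role of one of the two relational queries: the same attribute list is used for both classes so that the two working relations stay comparable.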
Second, dimension relevance analysis can be performed, when necessary, on the two classes of data. After this analysis, irrelevant or weakly relevant dimensions (e.g., name, gender, birth_place, residence, and phone#) are removed from the resulting classes. Only the highly relevant attributes are included in the subsequent analysis.
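The passage does not say which relevance measure is used. As one possible illustration, the sketch below scores each attribute by information gain with respect to the class label and keeps those above a threshold. The choice of information gain, the threshold of 0.1, the `age_range` values, and the toy rows are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, class_attr="status"):
    """Reduction in class entropy obtained by partitioning rows on one attribute."""
    base = entropy([r[class_attr] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[class_attr])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Toy rows (assumption): gender is unrelated to status, age_range determines it.
rows = [
    {"gender": "M", "age_range": "over_25", "status": "graduate"},
    {"gender": "F", "age_range": "over_25", "status": "graduate"},
    {"gender": "M", "age_range": "16_25", "status": "undergraduate"},
    {"gender": "F", "age_range": "16_25", "status": "undergraduate"},
]
gains = {a: information_gain(rows, a) for a in ("gender", "age_range")}
relevant = [a for a, g in gains.items() if g > 0.1]  # threshold is an assumption
```

In this toy data `gender` carries no information about the class and is dropped, while `age_range` is kept. Note that an attribute with a distinct value per tuple, such as name, would score misleadingly high under plain information gain, so a measure like this is better suited to attributes that have already been generalized.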
Third, synchronous generalization is performed on the target class to the levels controlled by user- or expert-specified dimension thresholds, forming the prime target class relation. The contrasting class is generalized to the same levels as those in the prime target class relation, forming the prime contrasting class(es) relation, as presented in Table 4.10 and Table 4.11. In comparison with undergraduate students, graduate students tend to be older and have a higher GPA in general.
Table 4.10 Prime Generalized Relation for the Target Class (Graduate Students)
major | age_range | gpa | count%
Science | 21…25 | good | 5.53
Science | 26…30 | good | 5.02
Science | over_30 | very good | 5.86
… | … | … | …
Business | over_30 | excellent | 4.68
Table 4.11 Prime Generalized Relation for the Contrasting Class (Undergraduate Students)
major | age_range | gpa | count%
Science | 16…20 | fair | 5.53
Science | 16…20 | good | 4.53
… | … | … | …
Science | 26…30 | good | 2.32
… | … | … | …
Business | over_30 | excellent | 0.68
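The generalization step can be sketched as follows: each tuple is mapped to a higher concept level, identical generalized tuples are merged, and count% is taken relative to the class total. The concept hierarchy for major, the GPA cut points, and the sample records are assumptions for illustration; only the range labels are taken from Tables 4.10 and 4.11.

```python
from collections import Counter

# Hypothetical concept hierarchy for `major` (assumption, not from the text).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Accounting": "Business"}


def age_range(age):
    """Map an age to the range labels used in Tables 4.10 and 4.11."""
    if age <= 20:
        return "16…20"
    if age <= 25:
        return "21…25"
    if age <= 30:
        return "26…30"
    return "over_30"


def gpa_level(gpa):
    """Map a numeric GPA to a label; the cut points are assumed."""
    if gpa >= 3.75:
        return "excellent"
    if gpa >= 3.5:
        return "very good"
    if gpa >= 3.0:
        return "good"
    return "fair"


def prime_relation(rows):
    """Generalize each row, merge identical tuples, and compute count% per class."""
    counts = Counter(
        (MAJOR_TO_FIELD[r["major"]], age_range(r["age"]), gpa_level(r["gpa"]))
        for r in rows
    )
    total = sum(counts.values())
    return {t: round(100.0 * n / total, 2) for t, n in counts.items()}


# Sample records (assumption).
graduates = [
    {"major": "CS", "age": 24, "gpa": 3.2},
    {"major": "Physics", "age": 23, "gpa": 3.4},
    {"major": "CS", "age": 28, "gpa": 3.1},
    {"major": "Accounting", "age": 33, "gpa": 3.9},
]
undergraduates = [
    {"major": "CS", "age": 19, "gpa": 2.8},
    {"major": "Physics", "age": 19, "gpa": 3.2},
]
prime_target = prime_relation(graduates)
prime_contrasting = prime_relation(undergraduates)
```

Using the same mapping functions for both classes is what keeps the generalization synchronous: the target and contrasting relations end up at identical abstraction levels, so their tuples can be compared directly.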
Finally, the resulting class comparison is presented in the form of tables, graphs, and/or rules. This visualization includes a contrasting measure (e.g., count%) that compares the target class and the contrasting class. For example, 5.02% of the graduate students are science majors between 26 and 30 years old with a “good” GPA, while only 2.32% of the undergraduates have these same characteristics. Drilling and other OLAP operations may be performed on the target and contrasting classes as deemed necessary by the user in order to adjust the abstraction levels of the final description.
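A side-by-side presentation of the contrasting measure can be sketched by pairing the count% values of generalized tuples that occur in both prime relations. The values below are copied from the rows shown in Tables 4.10 and 4.11; the `compare` helper is a hypothetical name introduced here. Rows elided in the tables are not available, so the sketch compares only tuples visible in both.

```python
# count% values copied from the rows shown in Tables 4.10 and 4.11.
prime_target = {
    ("Science", "21…25", "good"): 5.53,
    ("Science", "26…30", "good"): 5.02,
    ("Science", "over_30", "very good"): 5.86,
    ("Business", "over_30", "excellent"): 4.68,
}
prime_contrasting = {
    ("Science", "16…20", "fair"): 5.53,
    ("Science", "16…20", "good"): 4.53,
    ("Science", "26…30", "good"): 2.32,
    ("Business", "over_30", "excellent"): 0.68,
}


def compare(target, contrasting):
    """Pair the count% values of generalized tuples present in both classes."""
    shared = sorted(set(target) & set(contrasting))
    return [(t, target[t], contrasting[t]) for t in shared]


comparison = compare(prime_target, prime_contrasting)
for tup, grad_pct, undergrad_pct in comparison:
    print(tup, f"graduate={grad_pct}%", f"undergraduate={undergrad_pct}%")
```

With the table values, the two shared tuples are (Business, over_30, excellent) at 4.68% versus 0.68% and (Science, 26…30, good) at 5.02% versus 2.32%, matching the example in the text.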
In summary, attribute-oriented induction for data characterization and generalization provides an alternative data generalization method in comparison to the data cube approach. It is not confined to relational data because such an induction can be performed on spatial, multimedia, sequence, and other kinds of data sets. In addition, there is no need to precompute a data cube because generalization can be performed online upon receiving a user's query.
Moreover, automated analysis can be added to such an induction process to automatically filter out irrelevant or unimportant attributes. However, because attribute-oriented induction automatically generalizes data to a higher level, it cannot efficiently support