Data Mining_ Concepts and Techniques - Jiawei Han [112]
Our discussions on class characterization in the previous sections handle multilevel data summarization and characterization in a single class. The techniques developed can be extended to handle class comparison across several comparable classes. For example, the attribute generalization process described for class characterization can be modified so that the generalization is performed synchronously among all the classes compared. This allows the attributes in all of the classes to be generalized to the same abstraction levels.
Suppose, for instance, that we are given the AllElectronics data for sales in 2009 and in 2010 and want to compare these two classes. Consider the dimension location with abstractions at the city, province_or_state, and country levels. Data in each class should be generalized to the same location level. That is, they are all synchronously generalized to either the city level, the province_or_state level, or the country level. This is generally more useful than comparing, say, the sales in Vancouver in 2009 with the sales in the United States in 2010 (i.e., where each set of sales data is generalized to a different level). Users, however, should have the option to override such an automated, synchronous comparison with their own choices when preferred.
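The synchronous generalization described above can be sketched in a few lines of Python. This is only an illustration: the concept hierarchy, the sales figures, and the function names (generalize_location, generalize_classes) are hypothetical and do not come from the AllElectronics database. The point it shows is that a single level is chosen once and then applied to every class being compared.

```python
# Hypothetical concept hierarchy for the location dimension (illustrative values only).
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
    "Seattle": "Washington",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
    "Washington": "USA",
}
LEVELS = ("city", "province_or_state", "country")


def generalize_location(city, level):
    """Climb the location hierarchy from a city up to the requested level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "province_or_state":
        return state
    return STATE_TO_COUNTRY[state]


def generalize_classes(classes, level):
    """Generalize every class to the SAME level and sum sales per location.

    `classes` maps a class name to a list of (city, sales_amount) pairs.
    """
    result = {}
    for name, rows in classes.items():
        totals = {}
        for city, amount in rows:
            key = generalize_location(city, level)
            totals[key] = totals.get(key, 0) + amount
        result[name] = totals
    return result


# Made-up sales data for the two classes being compared.
sales = {
    "sales_2009": [("Vancouver", 100), ("Toronto", 150), ("Chicago", 200)],
    "sales_2010": [("Vancouver", 120), ("Seattle", 80), ("Chicago", 210)],
}

by_country = generalize_classes(sales, "country")
by_state = generalize_classes(sales, "province_or_state")
```

Because both classes pass through the same call with the same level argument, a comparison such as Vancouver in 2009 against the United States in 2010 cannot arise by accident. A user who wants mixed levels would have to request them explicitly.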
“How is class comparison performed?” In general, the procedure is as follows:
1. Data collection: The set of relevant data in the database is collected by query processing and is partitioned into a target class and one or more contrasting classes.
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis should be performed on these classes to select only the highly relevant dimensions for further analysis. Correlation or entropy-based measures can be used for this step (Chapter 3).
3. Synchronous generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation. The concepts in the contrasting class(es) are generalized to the same level as those in the prime target class relation, forming the prime contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs, and rules. This presentation usually includes a “contrasting” measure such as count% (percentage count) that reflects the comparison between the target and contrasting classes. The user can adjust the comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes, as desired.
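Step 2 names entropy-based measures without fixing a particular one. As one possible choice, the sketch below scores each dimension by its information gain with respect to the class label; dimensions with a gain near zero would be candidates for removal. The tuples and attribute values are invented for illustration, and the helper names (entropy, information_gain) are assumptions rather than anything defined in the text.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attr_index, label_index=0):
    """Reduction in class entropy obtained by partitioning rows on one attribute."""
    labels = [row[label_index] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[label_index])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


# Made-up tuples: (class_label, gender, age_range).
rows = [
    ("graduate", "F", "over_25"),
    ("graduate", "M", "over_25"),
    ("undergraduate", "F", "under_21"),
    ("undergraduate", "M", "under_21"),
]

gain_gender = information_gain(rows, attr_index=1)
gain_age = information_gain(rows, attr_index=2)
```

In this toy data, gender splits each class evenly and so carries no information about the class label, whereas age_range separates the two classes completely. A relevance threshold on the gain would therefore keep age_range and drop gender.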
The preceding discussion outlines a general algorithm for mining comparisons in databases. Unlike characterization, this algorithm generalizes the target class synchronously with the contrasting classes, so that all classes are compared at the same abstraction levels.
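A minimal Python sketch of this algorithm follows, covering data partitioning, synchronous generalization, and the count% measure (dimension relevance analysis is omitted for brevity). All records, the major grouping, and the age and gpa cut-offs are hypothetical and chosen only to make the example small; they are not taken from any real database.

```python
from collections import Counter

# Hypothetical raw tuples: (class_label, major, age, gpa).
RECORDS = [
    ("graduate", "CS", 27, 3.8),
    ("graduate", "Physics", 24, 3.6),
    ("graduate", "CS", 29, 3.9),
    ("graduate", "History", 23, 3.2),
    ("undergraduate", "CS", 19, 3.1),
    ("undergraduate", "Physics", 20, 3.7),
    ("undergraduate", "History", 22, 3.3),
    ("undergraduate", "CS", 18, 3.4),
]

# Assumed one-level concept hierarchy for the major dimension.
MAJOR_GROUP = {"CS": "Science", "Physics": "Science", "History": "Arts"}


def generalize(record):
    """Map one (major, age, gpa) tuple to its generalized form.

    The same function is applied to every class, which is what makes
    the generalization synchronous.
    """
    major, age, gpa = record
    age_range = "under_21" if age < 21 else "21_to_25" if age <= 25 else "over_25"
    gpa_level = "excellent" if gpa >= 3.5 else "good" if gpa >= 3.0 else "fair"
    return (MAJOR_GROUP[major], age_range, gpa_level)


def compare(records, target_label):
    """Return (generalized_tuple, target count%, contrasting count%) rows."""
    # Data collection: partition into target and contrasting classes.
    target = [r[1:] for r in records if r[0] == target_label]
    contrasting = [r[1:] for r in records if r[0] != target_label]
    if not target or not contrasting:
        raise ValueError("both classes must contain at least one tuple")

    # Synchronous generalization: identical mapping for both classes.
    target_counts = Counter(generalize(r) for r in target)
    contrast_counts = Counter(generalize(r) for r in contrasting)

    # Presentation: count% of each generalized tuple within its own class.
    rows = []
    for key in sorted(set(target_counts) | set(contrast_counts)):
        t_pct = 100.0 * target_counts[key] / len(target)
        c_pct = 100.0 * contrast_counts[key] / len(contrasting)
        rows.append((key, t_pct, c_pct))
    return rows


rows = compare(RECORDS, "graduate")
for key, t_pct, c_pct in rows:
    print(key, f"target={t_pct:.1f}%", f"contrasting={c_pct:.1f}%")
```

Each output row pairs a generalized tuple with its count% in the target class and in the contrasting class, so a tuple that is frequent in one class and rare or absent in the other stands out immediately. Roll-up or drill-down would correspond to swapping in a coarser or finer generalize function for both classes at once.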
Example 4.14 mines a class comparison describing the graduate and undergraduate students at Big University.
Example 4.14. Mining a class comparison
Suppose that you would like to compare the general properties of the graduate and undergraduate students at Big_University, given the attributes name, gender, major, birth_place, birth_date, residence, phone#, and gpa.
This data mining task can be expressed in DMQL as follows:
use Big_University_DB
mine comparison as "grad_vs_undergrad_students"
in relevance to name, gender, major, birth_place, birth_date, residence,