Data Mining_ Concepts and Techniques - Jiawei Han [108]
The data mining query presented in Example 4.11 is transformed into the following relational query for the collection of the task-relevant data set:
use Big_University_DB
select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"M.Sc.", "M.A.", "M.B.A.", "Ph.D."}
The transformed query is executed against the relational database, Big_University_DB, and returns the data shown earlier in Table 4.5. This table is called the (task-relevant) initial working relation. It is the data on which induction will be performed. Note that each tuple is, in fact, a conjunction of attribute–value pairs. Hence, we can think of a tuple within a relation as a rule of conjuncts, and of induction on the relation as the generalization of these rules.
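The relational query above can be tried against a toy in-memory database. The following is a minimal sketch, assuming Python's built-in sqlite3 module and standard SQL syntax (an IN list in parentheses, and the identifier phone# quoted). The status values assigned to the sample rows and the extra undergraduate row are invented for illustration; they do not come from the text.

```python
import sqlite3

# Hypothetical toy database. The column list follows the query in the text;
# the status column values and the undergraduate row are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    "name TEXT, gender TEXT, major TEXT, birth_place TEXT, birth_date TEXT, "
    'residence TEXT, "phone#" TEXT, gpa REAL, status TEXT)'
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Jim Woodman", "M", "CS", "Vancouver, BC, Canada", "12-8-76",
         "3511 Main St., Richmond", "687-4598", 3.67, "M.Sc."),   # status assumed
        ("Laura Lee", "F", "Physics", "Seattle, WA, USA", "8-25-70",
         "125 Austin Ave., Burnaby", "420-5232", 3.83, "Ph.D."),  # status assumed
        ("Pat Example", "F", "CS", "Burnaby, BC, Canada", "1-1-80",
         "1 Example Rd., Burnaby", "000-0000", 3.10, "junior"),   # invented row
    ],
)

# The graduate statuses named in the where clause of the query.
GRADUATE_STATUSES = ("M.Sc.", "M.A.", "M.B.A.", "Ph.D.")
placeholders = ", ".join("?" for _ in GRADUATE_STATUSES)
query = (
    'SELECT name, gender, major, birth_place, birth_date, residence, "phone#", gpa '
    f"FROM student WHERE status IN ({placeholders})"
)

# The result plays the role of the initial working relation.
working_relation = conn.execute(query, GRADUATE_STATUSES).fetchall()
for row in working_relation:
    print(row)
```

Only the two graduate rows are returned; the invented undergraduate row is filtered out by the status condition.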
Table 4.5 Initial Working Relation: A Collection of Task-Relevant Data
name           | gender | major   | birth_place           | birth_date | residence                | phone#   | gpa
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 12-8-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 7-28-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 8-25-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …
“Now that the data are ready for attribute-oriented induction, how is attribute-oriented induction performed?” The essential operation of attribute-oriented induction is data generalization, which can be performed in either of two ways on the initial working relation: attribute removal and attribute generalization.
Attribute removal is based on the following rule: If there is a large set of distinct values for an attribute of the initial working relation, but either (case 1) there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or (case 2) its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.
Let's examine the reasoning behind this rule. An attribute–value pair represents a conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a constraint and thus generalizes the rule. If, as in case 1, there is a large set of distinct values for an attribute but there is no generalization operator for it, the attribute should be removed because it cannot be generalized. Preserving it would imply keeping a large number of disjuncts, which contradicts the goal of generating concise rules. On the other hand, consider case 2, where the attribute's higher-level concepts are expressed in terms of other attributes. For example, suppose that the attribute in question is street, with higher-level concepts that are represented by the attributes 〈city, province_or_state, country〉. The removal of street is equivalent to the application of a generalization operator. This rule corresponds to the generalization rule known as dropping condition in the machine learning literature on learning from examples.
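The attribute removal rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's algorithm: the function name attributes_to_remove, the numeric large_threshold, and the two sets describing which attributes have generalization operators are all hypothetical. The street and city columns are obtained by splitting the residence values of Table 4.5, purely so that case 2 can be shown.

```python
def attributes_to_remove(relation, generalizable, expressed_by_others, large_threshold):
    """Return the attributes the removal rule would drop.

    relation            -- list of dicts, one per tuple
    generalizable       -- attributes that have some generalization operator
    expressed_by_others -- attributes whose higher-level concepts are stored
                           in other attributes (case 2)
    large_threshold     -- assumed cutoff for "a large set of distinct values"
    """
    removed = []
    for attr in relation[0]:
        distinct_values = {tup[attr] for tup in relation}
        if len(distinct_values) <= large_threshold:
            continue  # not a large set of distinct values; rule does not apply
        # Case 1: no generalization operator. Case 2: expressed by other attributes.
        if attr not in generalizable or attr in expressed_by_others:
            removed.append(attr)
    return removed


# Tuples adapted from Table 4.5 (residence split into street and city).
relation = [
    {"name": "Jim Woodman", "gender": "M", "street": "3511 Main St.",
     "city": "Richmond", "phone#": "687-4598", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "street": "345 1st Ave.",
     "city": "Richmond", "phone#": "253-9106", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "street": "125 Austin Ave.",
     "city": "Burnaby", "phone#": "420-5232", "gpa": 3.83},
]

removed = attributes_to_remove(
    relation,
    generalizable={"street", "city", "gpa"},  # assumed to have concept hierarchies
    expressed_by_others={"street"},           # higher levels live in city, etc.
    large_threshold=2,                        # assumed cutoff for this tiny sample
)

# Dropping the attributes removes conjuncts, which generalizes each rule.
generalized_relation = [
    {attr: value for attr, value in tup.items() if attr not in removed}
    for tup in relation
]
print(removed)
```

With these assumptions, name and phone# are removed under case 1 and street under case 2, while gpa is kept because a generalization operator is assumed to exist for it.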
Attribute generalization is based on the following rule: If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute. This rule is based on the following reasoning. Use of a generalization operator to generalize an attribute value within a tuple, or rule, in the working relation will make the rule cover more of the original data tuples, thus generalizing the concept it represents. This corresponds to the generalization rule known as climbing generalization trees in learning from examples, or concept tree ascension.
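As a rough illustration of climbing a generalization tree, the sketch below generalizes the birth_place values of Table 4.5 to the country level. The operator climb_to_country is hypothetical: it assumes that each value is written as "city, province_or_state, country" and simply keeps the last component. Counting how many original tuples each generalized value covers is added here only to make the widening of coverage visible.

```python
from collections import Counter


def climb_to_country(birth_place):
    """Assumed generalization operator: keep only the country component."""
    return birth_place.split(",")[-1].strip()


# birth_place values taken from Table 4.5.
birth_places = [
    "Vancouver, BC, Canada",
    "Montreal, Que, Canada",
    "Seattle, WA, USA",
]

# Apply the operator to every value of the attribute.
generalized = [climb_to_country(place) for place in birth_places]

# Each generalized value now covers more of the original tuples.
coverage = Counter(generalized)
print(generalized)
print(coverage)
```

Before generalization each birth_place value matches a single tuple; afterward the value Canada covers two of the three original tuples.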
Both rules, attribute removal and attribute generalization, claim that if there is a large set of distinct values for an attribute, further generalization should be applied. This raises the question: How large is “a large set of distinct values for an attribute” considered to be?
Depending on the attributes or application involved,