Online Book Reader


Data Mining_ Concepts and Techniques - Jiawei Han [110]

sum() and avg(). For a given generalized tuple, sum() contains the sum of the values of a given numeric attribute over the initial working relation tuples making up the generalized tuple. Suppose that tuple T contained sum(units_sold) as an aggregate function. The sum value for tuple T would then be set to the total number of units sold across the 52 tuples. The aggregate avg() (average) is computed according to the formula avg() = sum()/count().
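As a minimal sketch of how these aggregates relate, the following Python snippet assumes a hypothetical list of units_sold values for the working-relation tuples behind one generalized tuple. The helper name aggregate and the four sample values are invented for illustration and do not come from the text.

```python
# Hypothetical units_sold values for the base tuples that were merged
# into one generalized tuple T (illustrative data, not from the text).
units_sold = [3, 5, 2, 7]


def aggregate(values):
    """Return count, sum, and avg for the tuples behind a generalized tuple."""
    count = len(values)
    total = sum(values)
    # avg() = sum() / count(); undefined when no tuples were merged.
    avg = total / count if count else None
    return {"count": count, "sum": total, "avg": avg}


result = aggregate(units_sold)
print(result)
```

Storing count and sum with each generalized tuple is enough to derive avg on demand, so the average need not be stored separately.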

Attribute-oriented induction

Here we show how attribute-oriented induction is performed on the initial working relation of Table 4.5. For each attribute of the relation, the generalization proceeds as follows:

1. name: Since there are a large number of distinct values for name and there is no generalization operation defined on it, this attribute is removed.

2. gender: Since there are only two distinct values for gender, this attribute is retained and no generalization is performed on it.

3. major: Suppose that a concept hierarchy has been defined that allows the attribute major to be generalized to the values {arts&sciences, engineering, business}. Suppose also that the attribute generalization threshold is set to 5, and that there are more than 20 distinct values for major in the initial working relation. By attribute generalization and attribute generalization control, major is therefore generalized by climbing the given concept hierarchy.

4. birth_place: This attribute has a large number of distinct values; therefore, we would like to generalize it. Suppose that a concept hierarchy exists for birth_place, defined as “city < province_or_state < country.” If the number of distinct values for country in the initial working relation is greater than the attribute generalization threshold, then birth_place should be removed, because even though a generalization operator exists for it, the generalization threshold would not be satisfied. If, instead, the number of distinct values for country is less than the attribute generalization threshold, then birth_place should be generalized to birth_country.

5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age and age to age_range, and that the number of age ranges (or intervals) is small with respect to the attribute generalization threshold. Generalization of birth_date should therefore take place.

6. residence: Suppose that residence is defined by the attributes number, street, residence_city, residence_province_or_state, and residence_country. The number of distinct values for number and street will likely be very high, since these concepts are quite low level. The attributes number and street should therefore be removed so that residence is then generalized to residence_city, which contains fewer distinct values.

7. phone#: As with the name attribute, phone# contains too many distinct values and should therefore be removed in generalization.

8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade point average into numeric intervals like {3.75–4.0, 3.5–3.75, …}, which in turn are grouped into descriptive values such as {“excellent”, “very_good”, …}. The attribute can therefore be generalized.
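The decisions in the list above follow one rule, which can be sketched in Python as shown below. This is an illustrative reading of attribute generalization control, not the book's algorithm: the function name decide, its return format, and all distinct-value counts other than the threshold of 5 are assumptions, and an attribute is treated as acceptable when its distinct-value count is at most the threshold.

```python
def decide(distinct_count, threshold, hierarchy_level_counts=None):
    """Choose an action for one attribute (hypothetical helper).

    distinct_count: distinct values in the initial working relation.
    threshold: the attribute generalization threshold.
    hierarchy_level_counts: distinct-value counts at successively higher
        levels of the concept hierarchy, or None when no generalization
        operator is defined for the attribute.
    Returns (action, level), where level is the 1-based hierarchy level
    to climb to, or None when the attribute is retained or removed.
    """
    if distinct_count <= threshold:
        return ("retain", None)      # few values already, e.g. gender
    if not hierarchy_level_counts:
        return ("remove", None)      # many values, no hierarchy, e.g. name
    for level, count in enumerate(hierarchy_level_counts, start=1):
        if count <= threshold:
            return ("generalize", level)  # climb to the first level that fits
    return ("remove", None)          # even the top level has too many values


THRESHOLD = 5  # threshold value used in the major example

print(decide(1000, THRESHOLD))           # name: assumed 1000 distinct values
print(decide(2, THRESHOLD))              # gender
print(decide(25, THRESHOLD, [3]))        # major: 3 generalized values
print(decide(500, THRESHOLD, [30, 2]))   # birth_place: assumed 30 provinces, 2 countries
print(decide(500, THRESHOLD, [30, 8]))   # birth_place when country still exceeds the threshold
```

The last two calls mirror the two birth_place cases: the attribute is generalized to the country level only if that level satisfies the threshold, and is removed otherwise.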

The generalization process will result in groups of identical tuples. For example, the first two tuples of Table 4.5 both generalize to the same tuple (namely, the first tuple shown in Table 4.6). Such identical tuples are then merged into one, with their counts accumulated. This process leads to the generalized relation shown in Table 4.6.
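The merge step can be sketched with Python's collections.Counter. The three input tuples below are hypothetical stand-ins for already-generalized rows, so the resulting counts are illustrative and do not reproduce the counts in Table 4.6. The name merge_identical is likewise an assumption.

```python
from collections import Counter

# Hypothetical rows after generalization, in the column order
# (gender, major, birth_country, age_range, residence_city, gpa).
generalized = [
    ("M", "Science", "Canada", "20-25", "Richmond", "very_good"),
    ("M", "Science", "Canada", "20-25", "Richmond", "very_good"),
    ("F", "Science", "Foreign", "25-30", "Burnaby", "excellent"),
]


def merge_identical(tuples):
    """Merge identical generalized tuples, appending the accumulated count."""
    counts = Counter(tuples)  # keys keep first-seen order in Python 3.7+
    return [row + (n,) for row, n in counts.items()]


relation = merge_identical(generalized)
for row in relation:
    print(row)
```

Because tuples are hashable, identical rows collapse to a single key, and the count attached to each key plays the role of the count column in the generalized relation.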

Table 4.6 Generalized Relation Obtained by Attribute-Oriented Induction on Table 4.5's Data

gender major birth_country age_range residence_city gpa count

M Science Canada 20–25 Richmond very_good 16

F Science Foreign 25–30 Burnaby excellent 22

… … … … … … …

Based on the vocabulary used in OLAP, we may view count() as a measure, and the remaining attributes as dimensions. Note that aggregate functions, such as sum(), may be applied to numeric attributes (e.g., salary and sales). These attributes are referred to
