Data Mining_ Concepts and Techniques - Jiawei Han [223]
Rule Quality Measures
Learn_One_Rule needs a measure of rule quality. Every time it considers an attribute test, it must check to see if appending such a test to the current rule's condition will result in an improved rule. Accuracy may seem like an obvious choice at first, but consider Example 8.8.
Example 8.8 Choosing between two rules based on accuracy
Consider the two rules illustrated in Figure 8.12. Both are for the class loan_decision = accept. We use “a” to represent the tuples of class “accept” and “r” for the tuples of class “reject.” Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better rule because of its small coverage.
Figure 8.12 Rules for the class loan_decision = accept, showing accept (a) and reject (r) tuples.
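The arithmetic of Example 8.8 can be sketched in a few lines (the helper name is illustrative, not from the text):

```python
# Accuracy alone can mislead: compare the two rules of Example 8.8.
def accuracy(correct, covered):
    """Fraction of covered tuples that the rule classifies correctly."""
    return correct / covered

print(accuracy(38, 40))  # R1: 0.95
print(accuracy(2, 2))    # R2: 1.0, yet R1 is the better rule
```

R2 wins on accuracy only because its coverage is tiny, which is exactly the problem the following measures address.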
From this example, we see that accuracy on its own is not a reliable estimate of rule quality. Coverage on its own is not useful either: for a given class, we could have a rule that covers many tuples, most of which belong to other classes! Thus, we seek other measures for evaluating rule quality, which may integrate aspects of accuracy and coverage. Here we will look at a few, namely one based on entropy, another based on information gain, and a statistical test that considers coverage. For our discussion, suppose we are learning rules for the class c. Our current rule is R: IF condition THEN class = c. We want to see if logically ANDing a given attribute test to condition would result in a better rule. We call the new condition condition′, where R′: IF condition′ THEN class = c is our potential new rule. In other words, we want to see if R′ is any better than R.
We have already seen entropy in our discussion of the information gain measure used for attribute selection in decision tree induction (Section 8.2.2, Eq. 8.1). It is also known as the expected information needed to classify a tuple in data set D:

$$ entropy(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, $$

where D is the set of tuples covered by condition′ and $p_i$ is the probability of class $C_i$ in D. The lower the entropy, the better condition′ is. Entropy prefers conditions that cover a large number of tuples of a single class and few tuples of other classes.
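As a minimal sketch, entropy can be computed directly from the per-class counts of the tuples covered by a condition (the function name and count-based interface are assumptions for illustration):

```python
import math

def entropy(class_counts):
    """Entropy of the set D of covered tuples, given per-class counts."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:  # skip empty classes; 0 * log(0) is taken as 0
            p = count / total
            ent -= p * math.log2(p)
    return ent

# A condition covering 38 'accept' and 2 'reject' tuples is nearly pure:
print(round(entropy([38, 2]), 3))   # → 0.286
print(round(entropy([20, 20]), 3))  # → 1.0 (maximally impure)
```

Lower values indicate a purer covered set, which is why the measure prefers conditions dominated by a single class.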
Another measure is based on information gain and was proposed in FOIL (First Order Inductive Learner), a sequential covering algorithm that learns first-order logic rules. Learning first-order rules is more complex because such rules contain variables, whereas the rules we are concerned with in this section are propositional (i.e., variable-free).5 In machine learning, the tuples of the class for which we are learning rules are called positive tuples, while the remaining tuples are negative. Let pos (neg) be the number of positive (negative) tuples covered by R. Let pos′ (neg′) be the number of positive (negative) tuples covered by R′. FOIL assesses the information gained by extending condition′ as
$$ FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right) \qquad (8.18) $$
It favors rules that have high accuracy and cover many positive tuples.
5Incidentally, FOIL was also proposed by Quinlan, the father of ID3.
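The gain in Eq. (8.18) can be sketched directly from the four counts it uses (the function name is an assumption for illustration):

```python
import math

def foil_gain(pos, neg, pos_prime, neg_prime):
    """FOIL_Gain of extending rule R to R' (Eq. 8.18).
    pos/neg: positive/negative tuples covered by R;
    pos_prime/neg_prime: those covered by the extended rule R'."""
    return pos_prime * (math.log2(pos_prime / (pos_prime + neg_prime))
                        - math.log2(pos / (pos + neg)))

# Extending R (38 pos, 2 neg) to R' (30 pos, 1 neg) raises accuracy
# while keeping many positive tuples, so the gain is positive:
print(foil_gain(38, 2, 30, 1) > 0)  # True
# An extension that loses accuracy yields a negative gain:
print(foil_gain(38, 2, 10, 2) < 0)  # True
```

Multiplying by pos′ is what makes the measure favor extensions that keep coverage of positive tuples high, not just accurate ones.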
We can also use a statistical test of significance to determine whether the apparent effect of a rule is due to chance or instead reflects a genuine correlation between attribute values and classes. The test compares the observed class distribution among tuples covered by a rule with the expected distribution that would result if the rule made predictions at random. We want to assess whether any observed differences between these two distributions can be attributed to chance. For this, we can use the likelihood ratio statistic,
$$ Likelihood\_Ratio = 2 \sum_{i=1}^{m} f_i \log \frac{f_i}{e_i} \qquad (8.19) $$
where m is the number of classes.
For tuples satisfying the rule, $f_i$ is the observed frequency of each class i among those tuples, and $e_i$ is the frequency we would expect for class i if the rule made random predictions.
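As a sketch, the statistic in Eq. (8.19) can be computed from the observed and expected per-class frequencies; the function name and the uniform 50/50 expected distribution below are assumptions for illustration:

```python
import math

def likelihood_ratio(observed, expected):
    """Likelihood ratio statistic (Eq. 8.19): 2 * sum of f_i * log(f_i / e_i)
    over the m classes, for tuples satisfying the rule."""
    return 2 * sum(f * math.log(f / e)
                   for f, e in zip(observed, expected) if f > 0)

# Rule covering 38 'accept' and 2 'reject' tuples; if predictions were
# random over two equally likely classes, we would expect 20 and 20:
lr = likelihood_ratio([38, 2], [20, 20])
print(round(lr, 2))  # → 39.57
```

A large value, as here, indicates that the observed distribution differs substantially from what random prediction would produce; when the two distributions coincide, the statistic is 0.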