Data Mining: Concepts and Techniques - Jiawei Han [226]
Figure 8.15 Confusion matrix for the classes buys_computer = yes and buys_computer = no, where an entry in row i and column j shows the number of tuples of class i that were labeled by the classifier as class j. Ideally, the nondiagonal entries should be zero or close to zero.
For example, we see that it mislabeled 412 "no" tuples as "yes." Accuracy is most effective when the class distribution is relatively balanced.
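As a minimal sketch, accuracy can be computed directly from the entries of a 2 × 2 confusion matrix like the one in Figure 8.15. Apart from the 412 misclassified "no" tuples mentioned above, the counts below are illustrative assumptions, not values taken from the figure:

```python
# Confusion matrix entries for a two-class problem:
#   TP = "yes" tuples correctly labeled yes,  FN = "yes" tuples labeled no,
#   FP = "no" tuples labeled yes,             TN = "no" tuples correctly labeled no.
# Only FP = 412 comes from the text; the other counts are hypothetical.
TP, FN = 6954, 46   # actual class: buys_computer = yes
FP, TN = 412, 2588  # actual class: buys_computer = no

total = TP + FN + FP + TN          # all tuples in the test set
accuracy = (TP + TN) / total       # fraction of tuples labeled correctly
print(f"accuracy = {accuracy:.2%}")  # 95.42%
```

Note that accuracy counts correct labels of both classes equally, which is why it is most informative when the two classes are of similar size.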
We can also speak of the error rate or misclassification rate of a classifier, M, which is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This also can be computed as

error rate = (FP + FN)/(P + N). (8.22)
If we were to use the training set (instead of a test set) to estimate the error rate of a model, this quantity is known as the resubstitution error. This error estimate is an optimistic estimate of the true error rate (and similarly, the corresponding accuracy estimate is optimistic) because the model is not tested on any samples that it has not already seen.
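The two ways of computing the error rate agree, as a short sketch shows. The counts here are hypothetical, chosen only to illustrate Eq. (8.22):

```python
# Hypothetical confusion matrix counts.
TP, FN = 90, 10    # P = TP + FN = number of actual positive tuples
FP, TN = 40, 860   # N = FP + TN = number of actual negative tuples

P, N = TP + FN, FP + TN
accuracy = (TP + TN) / (P + N)
error_rate = (FP + FN) / (P + N)   # Eq. (8.22)

# The two definitions of error rate coincide:
assert abs(error_rate - (1 - accuracy)) < 1e-12
print(error_rate)  # 0.05
```

Measuring this on the training set instead of a held-out test set would give the (optimistic) resubstitution error.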
We now consider the class imbalance problem, where the main class of interest is rare. That is, the data set distribution reflects a significant majority of the negative class and a minority positive class. For example, in fraud detection applications, the class of interest (or positive class) is "fraud," which occurs much less frequently than the negative "nonfraudulent" class. In medical data, there may be a rare class, such as "cancer." Suppose that you have trained a classifier to classify medical data tuples, where the class label attribute is "cancer" and the possible class values are "yes" and "no." An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer? Clearly, an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling only the noncancer tuples, for instance, and misclassifying all the cancer tuples. Instead, we need other measures, which assess how well the classifier can recognize the positive tuples (cancer = yes) and how well it can recognize the negative tuples (cancer = no).
The sensitivity and specificity measures can be used, respectively, for this purpose. Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as
sensitivity = TP/P (8.23)

specificity = TN/N (8.24)
It can be shown that accuracy is a function of sensitivity and specificity:
accuracy = sensitivity × P/(P + N) + specificity × N/(P + N). (8.25)
Sensitivity and specificity
Figure 8.16 shows a confusion matrix for medical data where the class values are yes and no for a class label attribute, cancer. The sensitivity of the classifier is 90/300 = 30.00%. The specificity is 9560/9700 = 98.56%. The classifier's overall accuracy is 9650/10,000 = 96.50%. Thus, we note that although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor given its low sensitivity. It has high specificity, meaning that it can accurately recognize negative tuples. Techniques for handling class-imbalanced data are given in Section 8.6.5.
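As a sketch, the identity in Eq. (8.25) can be checked numerically. The counts below (TP = 90, FN = 210, FP = 140, TN = 9560) are assumed to match Figure 8.16; treat them as illustrative if they differ:

```python
# Assumed entries of the cancer confusion matrix (Figure 8.16).
TP, FN = 90, 210    # actual cancer = yes  (P = 300)
FP, TN = 140, 9560  # actual cancer = no   (N = 9700)

P, N = TP + FN, FP + TN
sensitivity = TP / P            # true positive rate, Eq. (8.23)
specificity = TN / N            # true negative rate, Eq. (8.24)
accuracy = (TP + TN) / (P + N)

# Eq. (8.25): accuracy is a weighted average of sensitivity and specificity,
# with weights given by the class proportions.
assert abs(accuracy - (sensitivity * P / (P + N) + specificity * N / (P + N))) < 1e-12

print(f"sensitivity = {sensitivity:.2%}")  # 30.00%
print(f"specificity = {specificity:.2%}")  # 98.56%
print(f"accuracy    = {accuracy:.2%}")     # 96.50%
```

The large weight N/(P + N) on specificity is exactly why a high overall accuracy can coexist with a very low sensitivity on the rare positive class.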
Figure 8.16 Confusion matrix for the classes cancer = yes and cancer = no.
The precision and recall measures are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of tuples labeled as positive are actually such), whereas recall is a measure of completeness (what percentage of positive tuples are labeled as such). If recall seems familiar, that's because it is the same as sensitivity (or