9.7.1. Multiclass Classification
Some classification algorithms, such as support vector machines, are designed for binary classification. How can we extend these algorithms to allow for multiclass classification (i.e., classification involving more than two classes)?
A simple approach is one-versus-all (OVA). Given m classes, we train m binary classifiers, one for each class. Classifier j is trained using tuples of class j as the positive class, and the remaining tuples as the negative class. It learns to return a positive value for class j and a negative value for the rest. To classify an unknown tuple, X, the set of classifiers vote as an ensemble. For example, if classifier j predicts the positive class for X, then class j gets one vote. If it predicts the negative class for X, then each of the classes except j gets one vote. The class with the most votes is assigned to X.
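The voting rule just described can be sketched in a few lines of code. The sketch below is illustrative only: it assumes scikit-learn's LinearSVC as the binary learner (any binary classifier would do) and hypothetical function names.

```python
# One-versus-all (OVA): a minimal sketch, assuming scikit-learn's LinearSVC
# as the binary learner; function names are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, classes):
    """Train one binary classifier per class: class j versus the rest."""
    classifiers = {}
    for j in classes:
        y_binary = (y == j).astype(int)      # 1 for tuples of class j, 0 for the rest
        classifiers[j] = LinearSVC().fit(X, y_binary)
    return classifiers

def predict_ova(classifiers, x, classes):
    """Each classifier votes; the class with the most votes is assigned to x."""
    votes = {c: 0 for c in classes}
    for j, clf in classifiers.items():
        if clf.predict(x.reshape(1, -1))[0] == 1:
            votes[j] += 1                    # positive prediction: one vote for class j
        else:
            for k in classes:                # negative prediction: one vote for every other class
                if k != j:
                    votes[k] += 1
    return max(votes, key=votes.get)
```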
All-versus-all (AVA) is an alternative approach that learns a classifier for each pair of classes. Given m classes, we construct m(m − 1)/2 binary classifiers. A classifier is trained using tuples of the two classes it should discriminate. To classify an unknown tuple, each classifier votes for one of its two classes. The tuple is assigned the class with the maximum number of votes. All-versus-all tends to be superior to one-versus-all.
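A corresponding sketch for all-versus-all, under the same assumptions as the OVA sketch above (LinearSVC as the pairwise learner, illustrative names):

```python
# All-versus-all (AVA): one binary classifier per pair of classes,
# m(m-1)/2 classifiers in total. Same assumptions as the OVA sketch.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_ava(X, y, classes):
    """Train one classifier per pair of classes, using only tuples of those two classes."""
    classifiers = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        classifiers[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return classifiers

def predict_ava(classifiers, x, classes):
    """Each pairwise classifier votes for one of its two classes; the majority wins."""
    votes = {c: 0 for c in classes}
    for (a, b), clf in classifiers.items():
        winner = a if clf.predict(x.reshape(1, -1))[0] == 1 else b
        votes[winner] += 1
    return max(votes, key=votes.get)
```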
A problem with the previous schemes is that they are sensitive to errors made by the individual binary classifiers: if any classifier makes an error, it can affect the vote count and, in turn, the final class assignment.
Error-correcting codes can be used to improve the accuracy of multiclass classification, not just in the previous situations, but for classification in general. Error-correcting codes were originally designed to correct errors during data transmission for communication tasks. For such tasks, the codes are used to add redundancy to the data being transmitted so that, even if some errors occur due to noise in the channel, the data can be correctly received at the other end. For multiclass classification, even if some of the individual binary classifiers make a prediction error for a given unknown tuple, we may still be able to correctly label the tuple.
An error-correcting code is assigned to each class, where each code is a bit vector. Figure 9.16 shows an example of 7-bit codewords assigned to classes C1, C2, C3, and C4. We train one classifier for each bit position. Therefore, in our example we train seven classifiers. If a classifier makes an error, there is a better chance that we may still be able to predict the right class for a given unknown tuple because of the redundancy gained by having additional bits. The technique uses a distance measurement called the Hamming distance to guess the “closest” class in case of errors, and is illustrated in Example 9.3.
Figure 9.16 Error-correcting codes for a multiclass classification problem involving four classes.
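The per-bit training and Hamming-distance decoding can be sketched as follows. The codewords mapping and the use of LinearSVC are assumptions for illustration, not part of the figure.

```python
# Error-correcting output codes: one binary classifier per bit position,
# decoded by nearest Hamming distance. A sketch under the same assumptions
# as above (LinearSVC as the per-bit learner; `codewords` supplied by the caller).
import numpy as np
from sklearn.svm import LinearSVC

def train_ecoc(X, y, codewords):
    """codewords maps each class label to its bit vector, e.g. {'C1': [1,1,1,1,1,1,1], ...}."""
    n_bits = len(next(iter(codewords.values())))
    classifiers = []
    for bit in range(n_bits):
        # Relabel every training tuple with the bit its class codeword has at this position.
        y_bit = np.array([codewords[label][bit] for label in y])
        classifiers.append(LinearSVC().fit(X, y_bit))
    return classifiers

def predict_ecoc(classifiers, x, codewords):
    """Collect one predicted bit per classifier, then pick the class whose codeword is closest."""
    output = [int(clf.predict(x.reshape(1, -1))[0]) for clf in classifiers]
    hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
    return min(codewords, key=lambda c: hamming(codewords[c], output))
```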
Example 9.3 Multiclass classification with error-correcting codes
Consider the 7-bit codewords associated with classes C1 to C4 in Figure 9.16. Suppose that, given an unknown tuple to label, the seven trained binary classifiers collectively output the codeword 0001010, which does not match the codeword for any of the four classes. A classification error has obviously occurred, but can we figure out what the classification most likely should be? We can try by using the Hamming distance, which is the number of bit positions in which two codewords differ. The Hamming distance between the output codeword and the codeword for C1 is 5 because five bits (namely, the first, second, third, fifth, and seventh) differ. Similarly, the Hamming distances between the output codeword and the codewords for C2 through C4 are 3, 3, and 1, respectively. Note that the output codeword is closest to the codeword for C4. That is, the smallest Hamming distance between the output and a class codeword is for class C4. Therefore, we assign C4 as the class label of the given tuple.
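The decoding step of this example can be checked directly. The codewords below are an assumption for the Figure 9.16 assignments (the figure is not reproduced in the text), chosen to be consistent with the distances 5, 3, 3, and 1 reported above:

```python
# Hamming-distance decoding for Example 9.3. The codewords are assumed
# Figure 9.16 values, consistent with the distances given in the example.
codewords = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}
output = "0001010"  # collective output of the seven binary classifiers

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(u != v for u, v in zip(a, b))

distances = {c: hamming(cw, output) for c, cw in codewords.items()}
print(distances)                          # {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(min(distances, key=distances.get))  # 'C4': the class label assigned to the tuple
```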
Error-correcting codes can correct up to ⌊(h − 1)/2⌋ single-bit errors, where h is the minimum Hamming distance between any pair of class codewords; codewords should therefore be designed so that this minimum distance is as large as possible.