9.7.2. Semi-Supervised Classification
Semi-supervised classification uses both labeled data and unlabeled data to build a classifier. Let X_l be the set of labeled data and X_u be the set of unlabeled data. Here we describe a few examples of this approach for learning.
Self-training is the simplest form of semi-supervised classification. It first builds a classifier using the labeled data. The classifier then tries to label the unlabeled data. The tuple with the most confident label prediction is added to the set of labeled data, and the process repeats (Figure 9.17). Although the method is easy to understand, a disadvantage is that it may reinforce errors.
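To make the loop concrete, here is a minimal self-training sketch in Python (assumptions not made by the text: scikit-learn's LogisticRegression as the base classifier and its predict_proba output as the confidence measure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, n_iter=10):
    """Repeatedly move the most confidently predicted tuple into the labeled set."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = LogisticRegression()
    for _ in range(n_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                      # build a classifier on the labeled data
        proba = clf.predict_proba(X_u)         # try to label the unlabeled data
        i = int(np.argmax(proba.max(axis=1)))  # tuple with the most confident prediction
        X_l = np.vstack([X_l, X_u[i:i + 1]])   # add it, with its predicted label, to X_l
        y_l = np.append(y_l, clf.classes_[proba[i].argmax()])
        X_u = np.delete(X_u, i, axis=0)
    return clf.fit(X_l, y_l)
```

Because the loop trusts its own predictions, a tuple mislabeled early on stays in the labeled set and biases every later iteration, which is exactly the error-reinforcement disadvantage noted above.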
Figure 9.17 Self-training and cotraining methods of semi-supervised classification.
Cotraining is another form of semi-supervised classification, where two or more classifiers teach each other. Each learner uses a different and ideally independent set of features for each tuple. Consider web page data, for example, where attributes relating to the images on the page may be used as one set of features, while attributes relating to the corresponding text constitute another set of features for the same data. Each set of features should be sufficient to train a good classifier. Suppose we split the feature set into two sets and train two classifiers, f1 and f2, where each classifier is trained on a different set. Then, f1 and f2 are used to predict the class labels for the unlabeled data, X_u. Each classifier then teaches the other in that the tuple having the most confident prediction from f1 is added to the set of labeled data for f2 (along with its label).
Similarly, the tuple having the most confident prediction from f2 is added to the set of labeled data for f1. The method is summarized in Figure 9.17. Cotraining is less sensitive to errors than self-training. A difficulty is that the assumptions for its usage may not hold true, that is, it may not be possible to split the features into mutually exclusive and class-conditionally independent sets.
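A minimal cotraining sketch under assumptions the text does not fix: GaussianNB as both learners, and the two feature views supplied as column-index lists view1 and view2:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_l, y_l, X_u, view1, view2, n_iter=10):
    """Two classifiers on disjoint feature views label tuples for each other."""
    X1, y1 = X_l.copy(), y_l.copy()   # training pool for f1
    X2, y2 = X_l.copy(), y_l.copy()   # training pool for f2
    f1, f2 = GaussianNB(), GaussianNB()
    X_u = X_u.copy()
    for _ in range(n_iter):
        if len(X_u) < 2:
            break
        f1.fit(X1[:, view1], y1)              # f1 sees only its feature view
        f2.fit(X2[:, view2], y2)              # f2 sees only its feature view
        p1 = f1.predict_proba(X_u[:, view1])
        p2 = f2.predict_proba(X_u[:, view2])
        i1 = int(np.argmax(p1.max(axis=1)))   # f1's most confident tuple
        i2 = int(np.argmax(p2.max(axis=1)))   # f2's most confident tuple
        # f1's pick (with f1's label) goes to f2's pool, and vice versa
        X2 = np.vstack([X2, X_u[i1:i1 + 1]])
        y2 = np.append(y2, f1.classes_[p1[i1].argmax()])
        X1 = np.vstack([X1, X_u[i2:i2 + 1]])
        y1 = np.append(y1, f2.classes_[p2[i2].argmax()])
        X_u = np.delete(X_u, sorted({i1, i2}, reverse=True), axis=0)
    return f1, f2
```

Keeping a separate labeled pool per classifier is what lets each learner correct the other: a tuple enters f2's pool only on f1's confidence, so f2 is never trained on its own most doubtful guesses.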
Alternate approaches to semi-supervised learning exist. For example, we can model the joint probability distribution of the features and the labels. For the unlabeled data, the labels can then be treated as missing data. The EM algorithm (Chapter 11) can be used to maximize the likelihood of the model. Methods using support vector machines have also been proposed.
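The EM idea can be sketched with a generative model of the joint distribution. The following assumes a Gaussian naive Bayes model and scikit-learn, neither of which the text specifies: the E-step computes soft labels (posterior class probabilities) for X_u, and the M-step refits the model using those responsibilities as sample weights:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_semi_supervised(X_l, y_l, X_u, n_iter=20):
    model = GaussianNB().fit(X_l, y_l)   # initialize from the labeled data alone
    classes = model.classes_
    for _ in range(n_iter):
        # E-step: treat the missing labels as soft posteriors P(class | x)
        resp = model.predict_proba(X_u)
        # M-step: refit on the labeled data plus one copy of X_u per class,
        # each copy weighted by its responsibility for that class
        X_all = np.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_l))] +
                               [resp[:, k] for k in range(len(classes))])
        model = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return model
```

The weighted duplication is just a convenient way to feed the E-step's responsibilities into the M-step; each iteration increases the likelihood of the labeled data together with the unlabeled data under the model.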
9.7.3. Active Learning
Active learning is an iterative type of supervised learning that is suitable for situations where data are abundant, yet the class labels are scarce or expensive to obtain. The learning algorithm is active in that it can purposefully query a user (e.g., a human oracle) for labels. The number of tuples used to learn a concept this way is often much smaller than the number required in typical supervised learning.
“How does active learning work to overcome the labeling bottleneck?” To keep costs down, the active learner aims to achieve high accuracy using as few labeled instances as possible. Let D be all the data under consideration. Various strategies exist for active learning on D. Figure 9.18 illustrates a pool-based approach. Suppose that a small subset of D is class-labeled; this set is denoted L. The remaining tuples of D form U, the set of unlabeled data, also referred to as a pool of unlabeled data. An active learner begins with L as the initial training set. It then uses a querying function to carefully select one or more data samples from U and requests their labels from an oracle (e.g., a human annotator).
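A minimal pool-based sketch (assumptions not fixed by the text: uncertainty sampling, i.e., querying the least confident tuple, as the querying function, and an oracle callable standing in for the human annotator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_l, y_l, X_u, oracle, budget=20):
    """Pool-based active learning: query the oracle for the least certain tuples."""
    clf = LogisticRegression()
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(budget):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                      # train on the current labeled set L
        proba = clf.predict_proba(X_u)
        i = int(np.argmin(proba.max(axis=1)))  # querying function: least confident tuple
        y_i = oracle(X_u[i])                   # request its label from the oracle
        X_l = np.vstack([X_l, X_u[i:i + 1]])   # move the newly labeled tuple into L
        y_l = np.append(y_l, y_i)
        X_u = np.delete(X_u, i, axis=0)
    return clf.fit(X_l, y_l)
```

The budget parameter caps the number of oracle queries, reflecting the premise that labels, not data, are the expensive resource.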