where $x_r$ and $x_s$ are any support vectors (SVs) from each class. The classifier can then be constructed as:

$$f(x) = \operatorname{sign}\!\left(\sum_{i \in \mathrm{SV}} \alpha_i^{0}\, y_i\,(x_i \cdot x) + b^{0}\right)$$
Only the points $x_i$ that have nonzero Lagrange multipliers $\alpha_i^{0}$ are termed SVs. If the data are linearly separable, all the SVs will lie on the margin, and hence the number of SVs can be very small, as represented in Figure 4.21. This “sparse” representation can be viewed as data compression in the construction of the classifier. The SVs are the “hard” cases: they are the training samples that are most difficult to classify correctly and that lie closest to the decision boundary.
Figure 4.21. A maximal margin hyperplane with its support vectors encircled.
The SVM learning algorithm is defined so that, in a typical case, the number of SVs is small compared with the total number of training examples. This property allows the SVM to classify new examples efficiently, since the majority of the training examples can be safely ignored. SVMs effectively remove the uninformative patterns from the data set by assigning them weights $\alpha_i = 0$. Consequently, if internal points that are not SVs are changed, the decision boundary is unaffected. The hyperplane is represented sparsely as a linear combination of the SV points. The SVM automatically identifies this subset of informative points and uses it to represent the solution.
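This sparseness is easy to verify empirically. The following is a minimal sketch, not taken from the text, using scikit-learn's SVC with a linear kernel and a very large C (to approximate the hard-margin, maximal-margin case) on a small synthetic two-class data set; the data and parameter values are illustrative assumptions. It prints the decision-function values on the SVs (approximately ±1, i.e., they lie on the margin) and shows that perturbing a point that is not an SV leaves the learned w and b essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Two linearly separable Gaussian clouds (synthetic data, for illustration only).
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) + [-2, -2]])
y = np.hstack([np.ones(50), -np.ones(50)])

# A very large C approximates the hard-margin (maximal margin) classifier.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("number of SVs:", len(svm.support_))                              # small subset of 100 points
print("f(x) on the SVs:", svm.decision_function(svm.support_vectors_))  # approx. +1 / -1

# Perturb a +1-class training point that is NOT a support vector,
# moving it farther away from the decision boundary.
w0, b0 = svm.coef_.copy(), svm.intercept_.copy()
idx = np.setdiff1d(np.arange(50), svm.support_)[0]   # a non-SV point from the +1 class
X_mod = X.copy()
X_mod[idx] += [0.5, 0.5]
svm_mod = SVC(kernel="linear", C=1e6).fit(X_mod, y)

# The separating hyperplane (w, b) is numerically unchanged: non-SVs carry zero weight.
print("max |delta w|:", np.abs(svm_mod.coef_ - w0).max())
print("|delta b|:", np.abs(svm_mod.intercept_ - b0))
```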
In real-world applications, SVMs must deal with (a) handling cases where the two subsets cannot be completely separated, (b) separating the points with nonlinear surfaces, and (c) handling classifications with more than two categories. Illustrative examples are given in Figure 4.22. What are the solutions in these cases? We will start with the problem of data that are not linearly separable. Points such as those shown in Figure 4.23a could be separated only by a nonlinear region. Is it possible to define a linear margin where some points are allowed to fall on the wrong side of the +1 and −1 hyperplanes?
Figure 4.22. Issues for an SVM in real-world applications. (a) Subset cannot be completely separated; (b) nonlinear separation; (c) three categories.
Figure 4.23. Soft margin SVM. (a) Soft separating hyperplane; (b) error points with their distances.
Obviously, there is no hyperplane that separates all of the samples in one class from all of the samples in the other class. In this case there is no combination of w and b that can satisfy the full set of constraints. This situation is depicted in Figure 4.23b, where it becomes apparent that we need to soften the constraint that the data lie on the correct side of the +1 and −1 hyperplanes. We need to allow some, but not too many, data points to violate these constraints, preferably by a small amount. This alternative approach turns out to be very useful not only for data sets that are not linearly separable but also, and perhaps more importantly, for improving generalization. We modify the optimization problem by including a cost-of-violation factor for the samples that violate the constraints:

$$\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\,\|w\|^{2} \;+\; C\sum_{i=1}^{n}\xi_i$$

under the new constraints:

$$y_i\,(w \cdot x_i + b)\ \geq\ 1 - \xi_i,\qquad \xi_i \geq 0,\qquad i = 1, \ldots, n$$
where C is a parameter representing the cost of violating the constraints, and the slack variables $\xi_i$ are the distances by which samples violate the constraints. To allow some flexibility in separating the categories, SVM models introduce a cost parameter, C, that controls the trade-off between allowing training errors