Data Mining - Mehmed Kantardzic [72]
Figure 4.24. Trade-off between allowing training errors and forcing rigid margins. (a) Parameters C and X for a soft margin; (b) soft classifier with a fat margin (C > 0).
This SVM model is a very similar case to the previous optimization problem for the linear separable data, except that there is an upper bound C on all αi parameters. The value of C trades between how large of a margin we would prefer, as opposed to how many of the training set examples violate this margin (and by how much). The process of optimization is going through the same steps: Lagrangian, optimization of αi parameters, and determining w and b values for classification hyperplane. The dual stay the same, but with additional constraints on α parameters: 0 ≤ αi ≤ C.
Most classification tasks require more complex models in order to make an optimal separation, that is, correctly classify new test samples on the basis of the trained SVM. The reason is that the given data set requires nonlinear separation of classes. One solution to the inseparability problem is to map the data into a higher dimensional space and define a separating hyperplane there. This higher dimensional space is called the feature space, as opposed to the input space occupied by the training samples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made linearly separable. However, translating the training set into a higher dimensional space incurs both computational and learning costs. Representing the feature vectors corresponding to the training set can be extremely expensive in terms of memory and time. Computation in the feature space can be costly because it is high-dimensional. Also, in general, there is the question of which function is appropriate for transformation. Do we have to select from infinite number of potential functions?
There is one characteristic of the SVM optimization process that helps in determining the steps in the methodology. The SVM decision function for classifying points with respect to the hyperplane only involves dot products between points. Furthermore, the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space. We are transforming training samples from one space into the other. But we are making computation only with scalar products of points in this new space. This product is computationally inexpensive because only a small subset of points is SVs involved in product computation. Thus, an SVM can locate a separating hyperplane in the feature space and classify points in that space without ever representing the space explicitly, simply by defining a function, called a kernel function. Kernel function K always plays the role of the dot product in the feature space:
This approach avoids the computational burden of explicitly representing all transformed source data and high-dimensional feature vectors. The two most widely used kernel functions are Polynomial Kernel
and Gaussian Kernel
The polynomial kernel is valid for all positive integers d ≥ 1. The Gaussian kernel is one of a group of kernel functions known as radial basis functions (RBFs). RBFs are kernel functions that depend only on the geometric distance between x and y, and the kernel is valid for all nonzero values of the kernel width σ. It is probably the most useful and commonly applied kernel function. The concept of a kernel mapping function is very powerful, such as in the example given in Figure 4.25. It allows SVM models to perform separations even with very complex boundaries. The relation between a kernel function and a feature space can analyze for