Data Mining - Mehmed Kantardzic [73]
defining a 3-D feature space. A similar analysis may be performed for other kernel functions. For example, through a similar process verify that for the “full” quadratic kernel (

Figure 4.25. An example of a mapping Φ to a feature space in which the data become linearly separable. (a) One-dimensional input space; (b) two-dimensional feature space.

In practical use of SVM, only the kernel function k (and not the transformation function Φ) is specified. The selection of an appropriate kernel function is important, since the kernel function defines the feature space in which the training set examples will be classified. As long as the kernel function is legitimate, an SVM will operate correctly even if the designer does not know exactly what features of the training data are being used in the kernel-induced feature space. The definition of a legitimate kernel function is given by Mercer’s theorem: The function must be continuous and positive-definite.

The modified and enhanced SVM constructs an optimal separating hyperplane in the higher dimensional space. In this case, the optimization problem becomes

where K(x, y) is the kernel function performing the nonlinear mapping into the feature space, and the constraints are unchanged. Using the kernel function, we perform minimization of the dual Lagrangian in the feature space and determine all margin parameters without explicitly representing points in this new space. Consequently, everything that has been derived concerning the linear case is also applicable to the nonlinear case by using a suitable kernel K instead of the dot product.

The approach with kernel functions gives a modular SVM methodology. One module is always the same: the Linear Learning Module. It finds the margin for linear separation of samples. If the problem of classification is more complex, requiring nonlinear separation, then we include a new preparatory module.
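The claim that a kernel can replace the dot product without representing points in the feature space can be checked numerically. The sketch below assumes the homogeneous quadratic kernel k(x, y) = (x · y)^2 on 2-D inputs, a common textbook choice whose feature space is 3-D; the function names and the two sample points are illustrative and do not come from the text.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, y):
    """Assumed kernel k(x, y) = (x . y)^2, evaluated in the input space."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit 3-D feature map for 2-D inputs that matches quadratic_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Two hypothetical 2-D samples.
x = (1.0, 2.0)
y = (3.0, -1.0)

k_value = quadratic_kernel(x, y)      # computed without visiting the feature space
feature_dot = dot(phi(x), phi(y))     # computed through the explicit mapping

# Both routes give the same number, up to floating-point rounding.
print(k_value, feature_dot, math.isclose(k_value, feature_dot))
```

A linear learner that only ever uses dot products between samples can therefore call `quadratic_kernel` instead of `dot` and never construct `phi(x)` at all.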
The preparatory module is based on a kernel function; it transforms the input space into a higher dimensional feature space where the same Linear Learning Module may be applied, and the final solution is a nonlinear classification model. An illustrative example is given in Figure 4.26. Combining different kernel functions with the standard SVM learning algorithm for linear separation gives the SVM methodology the flexibility needed for efficient application in nonlinear cases.

Figure 4.26. SVM performs nonlinear classification by kernel-based transformations. (a) 2-D input space; (b) 3-D feature space; (c) 2-D input space.

The idea of using a hyperplane to separate the feature vectors into two groups works well when there are only two target categories, but how does SVM handle the case where the target variable has more than two categories? Several approaches have been suggested, but two are the most popular: (a) “one against many,” where each category is split out and all of the other categories are merged; and (b) “one against one,” where k(k − 1)/2 models are constructed and k is the number of categories.

A preparation process for SVM applications is enormously important for the final results, and it includes preprocessing of raw data and setting model parameters. SVM requires that each data sample be represented as a vector of real numbers. If there are categorical attributes, we first have to convert them into numeric data. Multi-attribute coding is recommended in this case. For example, a three-category attribute such as red, green, and blue can be represented with three separate attributes and corresponding codes such as (0,0,1), (0,1,0), and (1,0,0). This approach is appropriate only if the number of values in an attribute is not too large. Second, scaling the values of all numerical attributes before applying SVM is very important for successful application of the technology. The main advantage is that attributes with greater numeric ranges do not dominate those with smaller ranges.
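The two preparation steps described above, multi-attribute coding and scaling, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `one_hot` and `min_max_scale`, the sample data, and the target range [−1, +1] are all chosen here for the example and are not prescribed by the text.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with one position per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]

def min_max_scale(column, low=-1.0, high=1.0):
    """Linearly rescale a list of numbers to the interval [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no information; map it to the midpoint.
        return [(low + high) / 2.0 for _ in column]
    return [low + (high - low) * (v - lo) / (hi - lo) for v in column]

# Hypothetical samples: (color, weight). The category order is chosen so that
# red, green, and blue receive the codes (0,0,1), (0,1,0), and (1,0,0).
samples = [("red", 120.0), ("green", 300.0), ("blue", 210.0)]
colors = ["blue", "green", "red"]

scaled_weights = min_max_scale([w for _, w in samples])

# Each sample becomes a vector of real numbers: three color codes plus one scaled weight.
vectors = [one_hot(c, colors) + [w] for (c, _), w in zip(samples, scaled_weights)]

for v in vectors:
    print(v)
```

In practice the minimum and maximum used for scaling would be taken from the training set and then reused unchanged on any test data, so that both sets are mapped with the same transformation.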
Normalization for each attribute may be applied to the range [−1; +1] or [0; 1]. Selection