Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [70]

By Root 636 0

error and model complexity. It recommends maximum margin, such as the one in Figure 4.18, as optimal separation criteria ensuring that SVM worst case generalization errors are minimized.

Figure 4.18. SRM principle expresses a trade-off between training error and model complexity. (a) “Fat” margin; (b) “skinny” margin.

Based on the vector equation of the line in 2-D we can define function f(x) = w x + b as a separation model. For all points above line f(x) > 0, and for the points below line f(x) < 0. The sign of this function h(x) = sign(f[x]) we define as a classification function because it has the value 1 for all points above the line, and the value -1 for all points below line. An example is given in Figure 4.19.

Figure 4.19. Classification function, sign(), on a 2-D space.

Before we continue, it is important to note that while the examples mentioned show a 2-D data set, which can be conveniently represented by points in a plane, in fact we will typically be dealing with higher dimensional data. The question is how to determine an optimal hyperplane in n-dimensional spaces based on a given set of training samples. Consider the problem of separating the set of training vectors D belonging to two classes (coded binary with −1 and 1).

with a hyperplane

The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vectors to the hyperplane is maximal. An illustration of the separation with a graphical interpretation of main parameters w and b is given in Figure 4.20a. In this way we have parameterized the function by the weight vector w and the scalar b. The notation denotes the inner or scalar product of vectors w and x, defined by

Figure 4.20. A separating hyperplane (w,b) for 2-D data. (a) Parameters w and b; (b) two parallel hyperplanes define margin.

In order for our hyperplane to correctly separate the two classes, we need to satisfy the following constraints:

The set of constraints that we have so far is equivalent to saying that these data must lie on the correct side (according to class label) of this decision surface. Next notice that we have also plotted as dotted lines two other hyperplanes represented in Figure 4.20b, which are the hyperplanes where the function + b is equal to −1 (on the lower left) and +1 (on the upper right). In order to find the maximum margin hyperplane, we can see intuitively that we should keep the dotted lines parallel and equidistant to the decision surface, and maximize their distance from one another, while satisfying the constraint that the data lie on the correct side of the dotted lines associated with that class. In mathematical form, the final clause of this sentence (the constraints) can be written as

The distance between these two margin hyperplanes may be formalized, because it is the parameter we want to maximize. We may obtain the distance between hyperplanes in an n-dimensional space using equations

where x1 and x2 are any points on corresponding hyperplanes. If we subtract these equations

and representing scalar product of vectors by definition,

we obtain

where || || represents Euclidean normor

Therefore, the problem of “maximum margin” is represented as a maximum of a distance parameter d, which is a function of parameters w. Maximizing d means maximizing 1/|w| or minimizing |w|. The learning problem may be reformulated as:

subject to the constraints of linear separability. So, the final problem for optimization is

The problem of optimization under constraints may be transformed using the Lagrangian L(w,b)

where αi are the Lagrange multipliers, one for each data point. The first term is the same as the original objective function, and the second term captures the inequality constraints. The Lagrangian has to be minimized with respect to w and b:

Substituting results of partial derivatives into L lead to the dual formulation of the optimization problem that has to be maximized with respect to the constraints αi ≥ 0.

The dual

Online Book Reader

Data Mining - Mehmed Kantardzic [70]

®Online Book Reader