
Data Mining - Mehmed Kantardzic [76]

This again is a convex polygon and the space is partitioned into convex polygons, within each of which the set of kNN is invariant.

Figure 4.28. Voronoi diagram in a 2-D space.

The parameter k in kNN is often chosen based on experience or knowledge about the classification problem at hand. It is desirable for k to be odd to make ties less likely. k = 3 and k = 5 are common choices, but much larger values, up to 100, are also used. An alternative way of setting the parameter is to select k through an iterative testing process, choosing the value of k that gives the best results on the test set.

The time complexity of the algorithm is linear in the size of the training set, since we must compute the distance of each training sample from the new test sample. Of course, the computing time goes up as k goes up, but the advantage is that higher values of k provide smoothing of the classification surface, which reduces vulnerability to noise in the training data. At the same time, a high value for k may destroy the locality of the estimation, since farther samples are taken into account, and a large k increases the computational burden. In practical applications, k is typically in the units or tens rather than in the hundreds or thousands. The nearest neighbor classifier is a simple algorithm, but it is very computationally intensive, especially in the testing phase. The choice of the distance measure is another important consideration. It is well known that the Euclidean distance measure becomes less discriminating as the number of attributes increases, and in some cases it may be better to use cosine or other measures rather than Euclidean distance.
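As an illustration of the alternative mentioned above, the Euclidean and cosine distance measures can be computed as follows (a minimal NumPy sketch; the function names are illustrative, not from the book):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: straight-line distance between two points."""
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    """1 - cosine similarity: depends on the angle between the vectors,
    not on their magnitudes, which can help in high-dimensional spaces."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that cosine distance treats two vectors pointing in the same direction as identical regardless of their lengths, which is why it behaves differently from Euclidean distance as dimensionality grows.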

The testing time of the algorithm is independent of the number of classes, so kNN has a potential advantage for classification problems with multiple classes. For the example in Figure 4.29 we have three classes (ω1, ω2, ω3) represented by a set of training samples, and the goal is to find a class label for the testing sample xu. In this case, we use the Euclidean distance and a value of k = 5 neighbors. Of the five closest neighbors, four belong to class ω1 and one belongs to class ω3, so xu is assigned to ω1 as the predominant class in the neighborhood.

Figure 4.29. Nearest neighbor classifier for k = 5.

In summary, the kNN classifier requires only a parameter k, a set of labeled training samples, and a metric for measuring distances in an n-dimensional space. The kNN classification process is usually based on the following steps:

Determine parameter k—number of nearest neighbors.

Calculate the distance between each testing sample and all the training samples.

Sort the distances and select the k nearest neighbors.

Determine the category (class) for each of the nearest neighbors.

Use the simple majority of the neighbors' categories as the predicted class of the testing sample.
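The five steps above can be sketched directly in code (a minimal NumPy sketch using Euclidean distance; the function name and data layout are illustrative, not from the book):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=5):
    """Classify one test sample by majority vote among its k nearest neighbors."""
    # Step 2: distance from the test sample to every training sample (Euclidean).
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Step 3: sort the distances and keep the k closest training samples.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: read the neighbors' class labels and take the simple majority.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]
```

For a configuration like the one in Figure 4.29, where four of the five nearest neighbors carry label ω1 and one carries ω3, the majority vote returns ω1.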

There are many techniques available for improving the performance and speed of nearest neighbor classification. One solution is to choose a subset of the training data for classification. The idea of the Condensed Nearest Neighbor (CNN) algorithm is to select the smallest subset Z of the training data X such that, when Z is used instead of X, the error in classifying new testing samples does not increase. 1NN is used as the nonparametric estimator for classification, approximating the classification function in a piecewise linear manner. Only the samples that define the classifier need to be kept; other samples, inside the regions, need not be stored because they belong to the same class. An example of a CNN classifier in a 2-D space is given in Figure 4.30. The greedy CNN algorithm is defined by the following steps:

1. Start with an empty set Z.

2. Pass samples from X one by one in a random order, and check whether they can be classified correctly by instances in Z.

3. If a sample is misclassified, it is added to Z; if it is correctly classified, Z is unchanged.

4. Repeat the passes over the training data set until Z is unchanged. The algorithm does not guarantee a minimal subset.
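The greedy procedure above can be sketched as follows (a minimal sketch assuming a 1NN rule with Euclidean distance; the function names and the choice to seed Z with the first sample are illustrative assumptions, not from the book):

```python
import numpy as np

def nn_label(Z_X, Z_y, x):
    """1NN rule: return the label of the single closest stored sample."""
    dists = np.linalg.norm(np.asarray(Z_X) - x, axis=1)
    return Z_y[int(np.argmin(dists))]

def condensed_nn(X, y, max_passes=10, seed=0):
    """Greedy CNN: keep only samples that 1NN on the current subset misclassifies."""
    rng = np.random.default_rng(seed)
    Z_X, Z_y = [X[0]], [y[0]]                  # assumption: seed Z with one sample
    for _ in range(max_passes):
        changed = False
        for i in rng.permutation(len(X)):      # step 2: random order over X
            if nn_label(Z_X, Z_y, X[i]) != y[i]:
                Z_X.append(X[i])               # step 3: add misclassified sample
                Z_y.append(y[i])
                changed = True
        if not changed:                        # step 4: stop when Z is unchanged
            break
    return np.array(Z_X), np.array(Z_y)
```

On well-separated classes, the condensed set Z is typically much smaller than X while still classifying all training samples correctly, which is the point of the condensation; as noted above, however, the greedy pass does not guarantee a minimal subset.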

