Data Mining - Mehmed Kantardzic [75]
4.6 KNN: NEAREST NEIGHBOR CLASSIFIER
Unlike SVM’s global classification model, k nearest neighbor or kNN classifier determines the decision boundary locally. For 1NN we assign each new sample to the class of its closest neighbor as shown in Figure 4.27a. Initially we have samples belonging to two classes (+ and −). The new sample “?” should be labeled with the class of its closest neighbor. 1NN classifier is not a very robust methodology. The classification decision of each test sample relies on the class of a single training sample, which may be incorrectly labeled or atypical. For larger k, kNN will assign new sample to the majority class of its k closest neighbors where k is a parameter of the methodology. An example for k = 4 is given in Figure 4.27b. kNN classifier for k > 1 is more robust. Larger k values help reduce the effects of noisy points within the training data set.
Figure 4.27. k nearest neighbor classifier. (a) k = 1; (b) k = 4.
The rationale of kNN classification is that we expect a test sample X to have the same label as the training sample located in the local region surrounding X. kNN classifiers are lazy learners, that is, models are not built explicitly unlike SVM and the other classification models given in the following chapters. Training a kNN classifier simply consists of determining k. In fact, if we preselect a value for k and do not preprocess given samples, then kNN approach requires no training at all. kNN simply memorizes all samples in the training set and then compares the test sample with them. For this reason, kNN is also called memory-based learning or instance-based learning. It is usually desirable to have as much training data as possible in machine learning. But in kNN, large training sets come with a severe efficiency penalty in classification of testing samples.
Building the kNN model is cheap (just store the training data), but classifying unknown sample is relatively expensive since it requires the computation of the kNN of the testing sample to be labeled. This, in general, requires computing the distance of the unlabeled object to all the objects in the labeled set, which can be expensive particularly for large training sets. Among the various methods of supervised learning, the nearest neighbor classifier achieves consistently high performance, without a priori assumptions about the distributions from which the training examples are drawn. The reader may have noticed the similarity between the problem of finding nearest neighbors for a test sample and ad hoc retrieval methodologies. In standard information retrieval systems such as digital libraries or web search, we search for the documents (samples) with the highest similarity to the query document represented by a set of key words. Problems are similar, and often the proposed solutions are applicable in both disciplines.
Decision boundaries in 1NN are concatenated segments of the Voronoi diagram as shown in Figure 4.28. The Voronoi diagram decomposes space into Voronoi cells, where each cell consists of all points that are closer to the sample than to other samples. Assume that we have X training samples in a 2-D space. The diagram then partitions the 2-D plane into |X| convex polygons, each containing its corresponding sample (and no other), where a convex polygon is a convex region in a 2-D space bounded by lines. For general k > 1 case, consider the region in the space for which the set of kNN is the same.