In pattern recognition, the kNearest Neighbors algorithm (or kNN for short) is a nonparametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether kNN is used for classification or regression:
1 In kNN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
2 In kNN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
kNN is a type of instancebased learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The kNN algorithm is among the simplest of all machine learning algorithms.
Both for classification and regression, it can be useful to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.[2]
The neighbors are taken from a set of objects for which the class (for kNN classification) or the object property value (for kNN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
A shortcoming of the kNN algorithm is that it is sensitive to the local structure of the data. The algorithm has nothing to do with and is not to be confused with kmeans, another popular machine learning technique.
Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a userdefined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, kNN has also been employed with correlation coefficients such as Pearson and Spearman.[3] Often, the classification accuracy of kNN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number.[4] One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example in a selforganizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. KNN can then be applied to the SOM.
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification,[5] but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular[citation needed] approach is the use of evolutionary algorithms to optimize feature scaling.[6] Another popular approach is to scale features by the mutual information of the training data with the training classes.[citation needed]
In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via bootstrap method.[7]
Properties
kNN is a special case of a variablebandwidth, kernel density "balloon" estimator with a uniform kernel.[8] [9]
The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but it is computationally intensive for large training sets. Using an appropriate nearest neighbor search algorithm makes kNN computationally tractable even for large data sets. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.
kNN has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).[10] kNN is guaranteed to approach the Bayes error rate for some value of k (where k increases as a function of the number of data points). Various improvements to kNN are possible by using proximity graphs.[11]
Metric Learning
The Knearest neighbor classification performance can often be significantly improved through (supervised) metric learning. Popular algorithms are Neighbourhood components analysis and Large margin nearest neighbor. Supervised metric learning algorithms use the label information to learn a new metric or pseudometric.
Feature extraction
When the input data to an algorithm is too large to be processed and it is suspected to be notoriously redundant (e.g. the same measurement in both feet and meters) then the input data will be transformed into a reduced representation set of features (also named features vector). Transforming the input data into the set of features is called feature extraction. If the features extracted are carefully chosen it is expected that the features set will extract the relevant information from the input data in order to perform the desired task using this reduced representation instead of the full size input. Feature extraction is performed on raw data prior to applying kNN algorithm on the transformed data in feature space.
An example of a typical computer vision computation pipeline for face recognition using kNN including feature extraction and dimension reduction preprocessing steps (usually implemented with OpenCV):
1. Haar face detection
2. Meanshift tracking analysis
3. PCA or Fisher LDA projection into feature space, followed by kNN classification
Dimension reduction
For highdimensional data (e.g., with number of dimensions more than 10) dimension reduction is usually performed prior to applying the kNN algorithm in order to avoid the effects of the curse of dimensionality. [12]
The curse of dimensionality in the kNN context basically means that Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector (imagine multiple points lying more or less on a circle with the query point at the center; the distance from the query to all data points in the search space is almost the same).
Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) techniques as a preprocessing step, followed by clustering by kNN on feature vectors in reduceddimension space. In machine learning this process is also called lowdimensional embedding.[13]
For veryhighdimensional datasets (e.g. when performing a similarity search on live video streams, DNA data or highdimensional time series) running a fast approximate kNN search using locality sensitive hashing, "random projections",[14] "sketches" [15] or other highdimensional similarity search techniques from VLDB toolbox might be the only feasible option.
Decision boundary
Nearest neighbor rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.[16]
Data reduction
Data reduction is one of the most important problems for work with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows:
1. Select the classoutliers, that is, training data that are classified incorrectly by kNN (for a given k)
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points that can be correctly classified by kNN using prototypes. The absorbed points can then be removed from the training set.
Selection of classoutliers
A training example surrounded by examples of other classes is called a class outlier. Causes of class outliers include:

random error

insufficient training examples of this class (an isolated example appears instead of a cluster)

missing important features (the classes are separated in other dimensions which we do not know)

too many training examples of other classes (unbalanced classes) that create a "hostile" background for the given small class
Class outliers with kNN produce noise. They can be detected and separated for future analysis. Given two natural numbers, k>r>0, a training example is called a (k,r)NN classoutlier if its k nearest neighbors include more than r examples of other classes.
CNN for data reduction
Condensed nearest neighbor (CNN, the Hart algorithm) is an algorithm designed to reduce the data set for kNN classification.[17] It selects the set of prototypes U from the training data, such that 1NN with U can classify the examples almost as accurately as 1NN does with the whole data set.
Given a training set X, CNN works iteratively:
1. Scan all elements of X, looking for an element x whose nearest prototype from U has a different label than x.
2. Remove x from X and add it to U
3. Repeat the scan until no more prototypes are added to U.
Use U instead of X for classification. The examples that are not prototypes are called "absorbed" points.
It is efficient to scan the training examples in order of decreasing border ratio.[18] The border ratio of a training example x is defined as
a(x) = x'y / xy
where xy is the distance to the closest example y having a different color than x, and x'y is the distance from y to its closest example x' with the same label as x.
The border ratio is in the interval [0,1] because x'y never exceeds xy. This ordering gives preference to the borders of the classes for inclusion in the set of prototypesU. A point of a different label than x is called external to x. The calculation of the border ratio is illustrated by the figure on the right. The data points are labeled by colors: the initial point is x and its label is red. External points are blue and green. The closest to x external point is y. The closest to y red point is x' . The border ratio a(x)=x'y/xy is the attribute of the initial point x.
Below is an illustration of CNN in a series of figures. There are three classes (red, green and blue). Fig. 1: initially there are 60 points in each class. Fig. 2 shows the 1NN classification map: each pixel is classified by 1NN using all the data. Fig. 3 shows the 5NN classification map. White areas correspond to the unclassified regions, where 5NN voting is tied (for example, if there are two green, two red and one blue points among 5 nearest neighbors). Fig. 4 shows the reduced data set. The crosses are the classoutliers selected by the (3,2)NN rule (all the three nearest neighbors of these instances belong to other classes); the squares are the prototypes, and the empty circles are the absorbed points. The left bottom corner shows the numbers of the classoutliers, prototypes and absorbed points for all three classes. The number of prototypes varies from 15% to 20% for different classes in this example. Fig. 5 shows that the 1NN classification map with the prototypes is very similar to that with the initial data set. The figures were produced using the Mirkes applet.[18]
CNN model reduction for kNN classifiers
Fig. 1. The dataset. Fig. 2. The 1NN Fig. 3. The 5NN Fig. 4. The CNN Fig. 5. The 1NN classification
classification map. classification map. reduced dataset. map based on the CNN
extracted prototypes.
kNN regression
In kNN regression, the kNN algorithm is used for estimating continuous variables. One such algorithm uses a weighted average of the k nearest neighbors, weighted by the inverse of their distance. This algorithm works as follows:
1. Compute the Euclidean or Mahalanobis distance from the query example to the labeled examples.
2. Order the labeled examples by increasing distance.
3. Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross validation.
4. Calculate an inverse distance weighted average with the knearest multivariate neighbors.
Validation of results
A confusion matrix or "matching matrix" is often used as a tool to validate the accuracy of kNN classification. More robust statistical methods such as likelihoodratio test can also be applied.
References

Altman, N. S. (1992). "An introduction to kernel and nearestneighbor nonparametric regression". The American Statistician 46 (3): 175–185. doi:10.1080/00031305.1992.10475879.

This scheme is a generalization of linear interpolation

Jaskowiak, P. A.; Campello, R. J. G. B. "Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data". http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.208.993. Brazilian Symposium on Bioinformatics (BSB 2011). pp. 1–8. Retrieved 16 October 2014.

D. Coomans; D.L. Massart (1982). "Alternative knearest neighbour rules in supervised pattern recognition : Part 1. kNearest neighbour classification by using alternative voting rules". Analytica Chimica Acta 136: 15–27. doi:10.1016/S00032670(01)953590

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011) Miscellaneous Clustering Methods, in Cluster Analysis, 5th Edition, John Wiley & Sons, Ltd, Chichester, UK.

Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JB (2006). "Melting point prediction employing knearest neighbor algorithms and genetic parameter optimization". Journal of Chemical Information and Modeling 46 (6): 2412–2422. doi:10.1021/ci060149f. PMID 17125183

Hall P, Park BU, Samworth RJ (2008). "Choice of neighbor order in nearestneighbor classification". Annals of Statistics 36 (5): 2135–2152. doi:10.1214/07AOS537

D. G. Terrell; D. W. Scott (1992). "Variable kernel density estimation". Annals of Statistics 20 (3): 1236–1265. doi:10.1214/aos/1176348768

Mills, Peter. "Efficient statistical classification of satellite measurements". International Journal of Remote Sensing.

Cover TM, Hart PE (1967). "Nearest neighbor pattern classification". IEEE Transactions on Information Theory 13 (1): 21–27. doi:10.1109/TIT.1967.1053964.

Toussaint GT (April 2005). "Geometric proximity graphs for improving nearest neighbor methods in instancebased learning and data mining". International Journal of Computational Geometry and Applications 15 (2): 101–150. doi:10.1142/S0218195905001622.

Beyer, Kevin, et al.. 'When is “nearest neighbor” meaningful?. Database Theory—ICDT’99, 217235year 1999

Shaw, Blake, and Tony Jebara. 'Structure preserving embedding. Proceedings of the 26th Annual International Conference on Machine Learning. ACM,2009

Bingham, Ella, and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM  year 2001

Shasha, D High Performance Discovery in Time Series.Berlin: Springer, 2004, ISBN 0387008578

Bremner D, Demaine E, Erickson J, Iacono J, Langerman S, Morin P, Toussaint G (2005). "Outputsensitive algorithms for computing nearestneighbor decision boundaries". Discrete and Computational Geometry 33 (4): 593–604. doi:10.1007/s0045400411520.

P. E. Hart, The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory 18 (1968) 515–516. doi: 10.1109/TIT.1968.1054155

E. M. Mirkes, KNN and Potential Energy: applet. University of Leicester, 2011.
* https://en.wikipedia.org/