Get to know your k-Nearest Neighbor

Νίκος Τσακίρης · Published in Analytics Vidhya · Jan 7, 2021

Simple and effective

The k-nearest neighbors algorithm (k-NN) is a supervised, instance-based, non-parametric algorithm that makes use of the k closest examples in the feature space. Supervised means that it must be fed training data together with its labels (target values) in order to learn a function that can generate outputs for future inputs based on those past pairings. Instance-based means that, instead of building a generalized model, it stores the training instances themselves and classifies new data by comparing it directly against them. Lastly, non-parametric indicates that the algorithm assumes no fixed functional form: its structure is derived from the existing data and its internal relationships rather than from a predetermined set of model parameters.

Where does our poor circle belong?
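To make the mechanics concrete, here is a minimal from-scratch sketch (the toy data and helper names are my own, purely illustrative): store the training instances, then classify a query point by a majority vote among its k closest neighbors.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest instances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors wins
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative toy data: two loose groups in a 2-D feature space
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.0, 1.0])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.0])))  # -> 1
```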

When to use it

It can be used for both regression and classification; a short sketch after the list below illustrates the two modes.

  • Classification: each object is assigned to the most common class among its k nearest neighbors.
  • Regression: instead of a class label, the output is a value equal to the average of the values of its k nearest neighbors.
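Both modes share the same neighbor search and differ only in how the neighbors are combined. A hedged sketch using scikit-learn (the toy arrays below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])   # toy 1-D features
y_class = np.array([0, 0, 0, 1, 1, 1])                         # class labels
y_value = np.array([1.1, 1.9, 3.2, 10.2, 11.1, 11.8])          # continuous targets

# Classification: majority class among the k nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))   # -> [0]

# Regression: average of the k nearest neighbors' values
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict([[2.5]]))   # -> average of 1.1, 1.9 and 3.2
```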

Why use it

It can be quite useful for problems with multiple classes because of its tendency to separate and regroup topologically close objects. Consequently, it can help structure a dataset by generating approximate subgroups, which in turn give the analyst greater insight into the data categories, while also offering useful density information about the clusters once they are visualized.

What’s ‘k’?

As stated above, k is a predefined parameter: how many close examples we take into consideration when deciding where to categorize our object. The next big question is then ‘how big should k be?’. Frankly speaking, a trial-and-error approach is (almost) always a strong ally. Keep in mind the underlying trade-off: a small ‘k’ keeps variance high and lets outliers sway the result, while a large ‘k’ can let one dominant group of objects take over the field. Balancing the two helps in identifying an optimal value.
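One common way to run that trial and error systematically is to cross-validate a range of candidate values and keep the best-scoring one. A minimal sketch, assuming scikit-learn and its bundled iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation
scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Pick the k with the highest mean accuracy
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```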

The distance

The algorithm can be carried out with several distance metrics, and the choice of metric is what differentiates these variants. Below are the three most commonly used (a short sketch further down computes each of them):

  • Manhattan: |x1−x2| + |y1−y2|, essentially the sum of absolute coordinate differences, written here for a plane with coordinates (x1, y1) for point a1 and (x2, y2) for point a2, the latter being the former’s nearest neighbor.
  • Minkowski: a generalized metric of which Manhattan and Euclidean are special cases; for an order p it is (Σ|xi−yi|^p)^(1/p), so p = 1 gives Manhattan and p = 2 gives Euclidean. Its underlying principle is that distance behaves like the length of a vector: as with distances between cities on a map, it can never be negative, length is the basis of measurement, and points can be chained together into longer routes. This macroscopic picture shrinks down to data points.
  • Euclidean: the classic straight-line distance: the square root of the sum of the squared coordinate differences of two points (here ‘p’ and ‘q’ respectively).

Euclidean variation for higher dimensions: d(p, q) = √((p1−q1)² + (p2−q2)² + … + (pn−qn)²).
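To tie the three metrics together, here is a small illustrative sketch in plain NumPy (toy points of my own); note how Minkowski collapses to Manhattan at p = 1 and to Euclidean at p = 2:

```python
import numpy as np

def manhattan(a, b):
    # Sum of absolute coordinate differences: |x1 - x2| + |y1 - y2| + ...
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def minkowski(a, b, p):
    # Generalized metric: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])   # toy points, purely illustrative
b = np.array([4.0, 6.0, 3.0])

print(manhattan(a, b))        # 7.0
print(euclidean(a, b))        # 5.0
print(minkowski(a, b, 1))     # 7.0 (matches Manhattan)
print(minkowski(a, b, 2))     # 5.0 (matches Euclidean)
```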

Conclusion

This effective algorithm was among the first taught in my university’s Data Science class, mainly because of its importance and secondly because of its easy-to-grasp nature. In summary, it is a useful addition to every data scientist’s toolkit and still plays a significant role in various projects, so it’s well worth getting to know.
