Nearest Neighbor

Idea: Similar examples tend to have the same label.

The barebones algorithm for Nearest Neighbor is as follows:

1. Given a new example to label, find the "most similar" example (i.e., the nearest neighbor) in the dataset.
2. Predict using the label of the nearest neighbor.
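
A minimal sketch of this procedure in Python, assuming the training data is given as parallel lists of feature vectors and labels; `distance` is a placeholder for whatever distance measure we choose (discussed next):

```python
def nearest_neighbor_predict(train_X, train_y, x, distance):
    """Predict the label of x using its single nearest neighbor.

    train_X : list of training feature vectors
    train_y : list of corresponding labels
    distance: function taking two examples and returning a number
    """
    # Find the index of the training example closest to x.
    best_i = min(range(len(train_X)), key=lambda i: distance(train_X[i], x))
    # Predict using the label of that nearest neighbor.
    return train_y[best_i]
```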

How do we compute the distance between examples? What is the distance measure?

For real-valued attributes, we can use the Euclidean Distance.
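
For two examples with $d$ real-valued attributes, the Euclidean distance is

$$d(\mathbf{x}, \mathbf{z}) = \sqrt{\sum_{j=1}^{d} (x_j - z_j)^2}$$

A minimal implementation that could serve as the `distance` function in the sketch above:

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```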

The 1-NN rule divides the feature space into polygonal cells, one around each training example, according to which example is closest. The resulting partition is the Voronoi diagram of the training examples.

Issues with NN

  • One issue with the NN algorithm is that we have to compute the distance from the test example to every example in the dataset, so each prediction takes time proportional to the size of the training set. This can be time consuming for large datasets.

  • NN is a lazy learner: there is no training phase. All of the computation happens at prediction time, when a test example needs to be labeled.

  • The scale of attributes: Attributes may be measured on very different scales. Attributes with large numeric ranges dominate the Euclidean distance, so attributes with smaller ranges are effectively ignored. It is therefore important to scale the attributes, i.e., to bring them to a comparable range.

  • The Curse of Dimensionality: As the number of attributes increases, the amount of training data needed for effective (i.e., accurate) classification increases exponentially. This hurts NN because examples move further and further apart as the number of dimensions (attributes) grows, so the nearest neighbor may no longer be meaningfully close. We therefore need far more examples to use NN effectively in high dimensions.

  • High Variance: Predictions are very sensitive to which examples happen to be in the training set, which also makes NN sensitive to outliers. To reduce this variance, we use the k-NN algorithm, which predicts the majority label among the k closest examples (see the sketch after this list).

    Note: The k-NN algorithm can also be used for regression! The average (or a distance-weighted average) of the outcomes of the k closest examples is used to predict the outcome of a test example.
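
A minimal sketch of k-NN for both classification (majority vote) and regression (average outcome), reusing the hypothetical `euclidean_distance`-style `distance` function from above; attributes are assumed to have been scaled beforehand:

```python
from collections import Counter

def knn_classify(train_X, train_y, x, k, distance):
    # Indices of the k training examples closest to x.
    nearest = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], x))[:k]
    # Majority vote over the labels of the k nearest neighbors.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def knn_regress(train_X, train_y, x, k, distance):
    # Average the outcomes of the k closest training examples.
    nearest = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / k
```

A distance-weighted variant would weight each neighbor's vote or outcome by, e.g., the inverse of its distance to the test example, so closer neighbors count more.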
