In this lecture we’ll introduce another classification technique, \(k\)-Nearest Neighbors.
First we’ll introduce the notion of parametric and nonparametric models.
Parametric vs. Nonparametric Models
There are many ways to define models (whether supervised or unsupervised).
However, a key distinction is this: does the model have a fixed number of parameters, or does the number of parameters grow with the training data?
Parametric: If the model has a fixed number of parameters, it is called parametric.
Nonparametric: If the number of parameters grows with the data, the model is called nonparametric.
Parametric models
have fixed parameters independent of the training data
have the advantage of (often) being faster to use
have the disadvantage of making strong assumptions about the nature of the underlying data distributions.
Nonparametric models
have parameters that grow with the size of the training data
are more flexible
can be computationally intractable for large datasets
Parametric Models
Linear Regression: Assumes a linear relationship between the input variables and the output.
Logistic Regression: Used for binary classification problems.
Polynomial Regression: Extends linear regression by considering polynomial relationships.
Support Vector Machines (SVM): With a linear kernel.
Naive Bayes: Based on Bayes’ theorem with strong independence assumptions.
Generalized Linear Models (GLM): Extends linear models to allow for response variables that have error distribution models other than a normal distribution.
Nonparametric Models
K-Nearest Neighbors (KNN): Classifies a data point based on how its neighbors are classified.
Decision Trees: Splits the data into subsets based on the value of input features.
Random Forest: An ensemble of decision trees.
Support Vector Machines (SVM): With non-linear kernels like RBF.
\(k\)-Nearest Neighbors
When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.
–James Whitcomb Riley (1849 - 1916)
Like any classifier, \(k\)-Nearest Neighbors is trained by providing it a set of labeled data.
However, at training time, the classifier does very little. It just stores away the training data.
\(k\)-Nearest Neighbors can do either hard or soft classification.
As a hard classifier, it returns the majority vote of the labels of the \(k\) nearest neighbors.
The vote may be indeterminate in the case of a tie, for example when \(k\) is even and the neighborhood is split evenly between two classes.
As a soft classifier, it returns the fraction of points in the neighborhood with label \(c\).
\[
p(y = c\,|\,\mathbf{x}, k) = \frac{\text{number of points in the neighborhood with label } c}{k}
\]
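To make this concrete, here is a minimal sketch using scikit-learn’s KNeighborsClassifier (the tiny labeled dataset is made up for illustration): predict returns the hard, majority-vote label, while predict_proba returns the soft label fractions.
Code
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# a tiny, made-up labeled dataset: two features, two classes
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)          # "training" just stores the data

x_test = np.array([[4, 4]])
print(knn.predict(x_test))         # hard label: majority vote of the 3 nearest neighbors
print(knn.predict_proba(x_test))   # soft labels: fraction of neighbors with each label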
Model Selection for \(k\)-NN
Each value of \(k\) results in a different “model”.
The complexity of the resulting model is therefore controlled by the hyperparameter \(k\).
We will want to select \(k\) using held-out data to avoid over-fitting.
Varying \(k\)
Consider this dataset where items fall into three classes:
Code
import sklearn.datasets as sk_data
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# color maps used for the plots in this lecture
# (the same values are defined again later, in the Iris example)
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

X, y = sk_data.make_blobs(n_samples=150,
                          centers=[[-2, 0], [1, 5], [2.5, 1.5]],
                          cluster_std=[2, 2, 3],
                          n_features=2,
                          center_box=(-10.0, 10.0),
                          random_state=0)
plt.figure(figsize=(5, 5))
plt.axis('equal')
plt.axis('off')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=80)
plt.show()
Let’s observe how the complexity of the resulting model changes as we vary \(k\).
We’ll do this by plotting the decision regions. These show how the method would classify each potential test point in the space.
Code
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
import numpy as np

h = .1  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Code
from sklearn.neighbors import KNeighborsClassifier

f, axs = plt.subplots(1, 3, figsize=(15, 5))
for i, k in enumerate([1, 5, 25]):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axs[i].pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
    axs[i].axis('equal')
    axs[i].axis('off')
    axs[i].set_title(f'Decision Regions for $k$ = {k}')
plt.show()
Notice how increasing \(k\) results in smoother decision boundaries.
These are more likely to show good generalization ability.
Challenges for \(k\)-NN
Working with a \(k\)-NN classifier can involve some challenges.
The computational cost of classification grows linearly with the size of the training data. Note that the training step is trivial, but the classification step can be prohibitively expensive.
Euclidean distance is the most common distance function used in \(k\)-NN, so data scaling is important. As previously discussed, features should be scaled so that distance measures are not dominated by a potentially small subset of features; a brief sketch follows below.
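As an illustration (a sketch, not code from this lecture), scikit-learn’s StandardScaler can be combined with \(k\)-NN in a pipeline so that every feature contributes comparably to the distance computation.
Code
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# scale each feature to zero mean and unit variance before computing distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# scaled_knn.fit(X_train, y_train) can then be used like any other classifier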
The curse of dimensionality. If the training data lives in a high dimensional space, Euclidean distance measures become less effective.
This last point is subtle but important, so we will now look at the curse of dimensionality more closely.
The Curse of Dimensionality
The Curse of Dimensionality
The curse of dimensionality is a somewhat tongue-in-cheek term for serious problems that arise when we use geometric algorithms in high dimensions.
There are various aspects of the curse that affect \(k\)-NN.
The phrase “curse of dimensionality” was coined by the mathematician Richard Bellman in 1957 in his book Dynamic Programming.
Points are far apart in high D
\(k\)-NN relies on there being one or more “close” points to the test point \(x\).
In other words, we need the training data to be relatively dense, so there are “close” points everywhere.
Unfortunately, the amount of space we work in grows exponentially with the dimension \(d\).
Imagine we have 10,000 (i.e., \(10^4\)) data points, each with 40 features (i.e., \(d = 40\)).
Consider quantizing each dimension into 10 bins.
So in 40-D we have \(10^{40}\) possible bins.
On average, we will only have 1 data point per \(10^{36}\) bins. Most will be empty!!
So the amount of data we need to maintain a given density also grows exponentially with dimension \(d\).
Important
Hence, points in high-dimensional spaces tend not to be close to one another at all.
One very intuitive way to think about it is this.
In order for two points to be close in \(\mathbb{R}^d\), they must be close in each of the \(d\) dimensions. As the number of dimensions grows, it becomes harder and harder for a pair of points to be close in every dimension: if two points are close in any single coordinate with probability \(p < 1\), then (assuming the coordinates are roughly independent) they are close in all \(d\) coordinates with probability \(p^d\), which shrinks exponentially with \(d\).
Distances concentrate in high dimensions
Let’s see how distances concentrate in high dimensions, that is, how the spread between the largest and smallest pairwise distances shrinks, by way of an example.
Let’s generate 1000 normally distributed points in 2D, 3D, 100D, and 1000D and look at the ratio of the largest to smallest distance between points.
First, let’s generate the data.
Code
import numpy as np

# Fix the random seed so we all have the same random numbers
np.random.seed(0)
n_data = 1000

# Create 1000 data examples (rows), each with 2 dimensions (columns)
n_dim = 2
x_2D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 3 dimensions (columns)
n_dim = 3
x_3D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 100 dimensions (columns)
n_dim = 100
x_100D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 1000 dimensions (columns)
n_dim = 1000
x_1000D = np.random.normal(size=(n_data, n_dim))
Scatter plot of 2D data
Code
import matplotlib.pyplot as plt

# scatter plot of the 2D data
plt.scatter(x_2D[:, 0], x_2D[:, 1])
plt.title('2D data')
plt.show()
Scatter plot of 3D data
Code
# make an interactive scatter plot of the 3D data
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_3D[:, 0], x_3D[:, 1], x_3D[:, 2])
plt.title('3D data')
plt.show()
Closest/farthest distance ratios
Define a function to calculate the ratio of the largest to smallest distance between points in a dataset.
Code
from scipy.spatial import distance

def distance_ratio(x, metric='euclidean'):
    if metric == 'euclidean':
        ord = 2
    elif metric == 'manhattan':
        ord = 1
    elif metric == 'cosine':
        pass
    else:
        raise ValueError(f"Metric {metric} not supported")

    smallest_dist = np.inf
    largest_dist = 0
    # loop over all pairs; start j from i+1 to avoid redundant calculations
    for i in range(x.shape[0]):
        for j in range(i + 1, x.shape[0]):
            if metric == 'euclidean' or metric == 'manhattan':
                dist = np.linalg.norm(x[i, :] - x[j, :], ord=ord)
            elif metric == 'cosine':
                dist = distance.cosine(x[i, :].flatten(), x[j, :].flatten())
            if dist < smallest_dist:
                smallest_dist = dist
            if dist > largest_dist:
                largest_dist = dist

    # Calculate the ratio and return
    dist_ratio = largest_dist / smallest_dist
    return dist_ratio
And then calculate the ratio for each dataset.
Code
dist_ratio_2d = distance_ratio(x_2D)
print('Ratio of largest to smallest distance 2D: %3.3f' % (dist_ratio_2d))
dist_ratio_3d = distance_ratio(x_3D)
print('Ratio of largest to smallest distance 3D: %3.3f' % (dist_ratio_3d))
dist_ratio_100d = distance_ratio(x_100D)
print('Ratio of largest to smallest distance 100D: %3.3f' % (dist_ratio_100d))
dist_ratio_1000d = distance_ratio(x_1000D)
print('Ratio of largest to smallest distance 1000D: %3.3f' % (dist_ratio_1000d))
Ratio of largest to smallest distance 2D: 2699.982
Ratio of largest to smallest distance 3D: 356.654
Ratio of largest to smallest distance 100D: 1.974
Ratio of largest to smallest distance 1000D: 1.228
You see that the ratio of the largest to smallest distance drops dramatically as the dimension increases.
Volume of a hypersphere
The above phenomenon is related to another strange fact: the way the volume of a hypersphere (or \(n\)-ball) behaves as the dimension increases.
For a fascinating exploration of the properties of unit spheres in high dimension, see An Adventure in the Nth Dimension by Brian Hayes in American Scientist.
Now, the volume of a hypersphere in 2D and 3D is:
For \(d = 2\), \(\pi r^2\), which is \(\pi\) for a unit radius.
For \(d = 3\), \(\frac{4\pi}{3} r^3\), which is \(\frac{4}{3} \pi\) for a unit radius.
The general equation for the volume of a hypersphere of radius \(R\) in \(d\) dimensions is:
\[
V_d(R) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} R^d
\]
where \(\Gamma\) is the gamma function.
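In code, this formula can be implemented directly. The helper volume_of_hypersphere used below is not shown in the notes; a minimal sketch is:
Code
import numpy as np
from scipy.special import gamma

def volume_of_hypersphere(radius, dimension):
    # V_d(R) = pi^(d/2) / Gamma(d/2 + 1) * R^d
    return (np.pi ** (dimension / 2)) / gamma(dimension / 2 + 1) * radius ** dimension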
And now let’s calculate the volume of a hypersphere of unit radius for dimensions 1 to 20 and store it in a list.
Code
radius = 1.0
vols = []
for c_dim in range(1, 21):
    vols.append(volume_of_hypersphere(radius, c_dim))
    # print("Volume of unit radius hypersphere in %d dimensions is %3.3f"
    #       % (c_dim, volume_of_hypersphere(radius, c_dim)))
Plot volumes of hyperspheres
And plot the results.
Code
# plot vols
plt.scatter(range(1, 21), vols)
plt.xlabel('Dimensions')
plt.ylabel('Volume')
plt.title('Volume of unit radius hypersphere')
plt.show()
The volume decreases to almost nothing in high dimensions.
Ratio of unit hypersphere volume and enclosing hypercube volume
Another strange phenomenon is illustrated by the following example.
Assume you have data uniformly distributed in a hypercube with side length 2, which exactly encloses a unit-radius hypersphere.
As the dimension grows, we want to know what fraction of the data is within unit distance of the center.
Code
side_length = 2
cube_volumes = [side_length ** dim for dim in range(1, 21)]
vol_ratios = [vols[i] / cube_volumes[i] for i in range(len(vols))]
Plot volume ratios
Code
plt.scatter(range(1, 21), vol_ratios)
plt.xlabel('Dimensions')
plt.ylabel('Volume ratio')
plt.title('Volume of unit radius hypersphere / Volume of enclosing hypercube')
plt.show()
The fraction quickly goes to zero as the dimension increases.
Perhaps an intuition is to think about the space in the “corners” of the hypercube: a \(d\)-dimensional hypercube has \(2^d\) corners, and as the dimension grows, more and more of its volume lies in those corners, outside the inscribed hypersphere.
Proportion of hypersphere in outer shell
Another strange thing about high dimensions is that most of the volume of a hypersphere is in the outer shell.
We can compare the volumes of hyperspheres of radius \(1\) and \(1-\epsilon\) for a small \(\epsilon\).
Code
ax = plt.figure(figsize=(4, 4)).add_subplot(projection='3d')

# coordinates of the sphere surface
u, v = np.mgrid[0:2*np.pi:50j, 0:np.pi:50j]
x = np.cos(u) * np.sin(v)
y = np.sin(u) * np.sin(v)
z = np.cos(v)

# outer (unit radius) sphere and its radius vector
ax.plot_surface(x, y, z, color='r', alpha=0.2)
s3 = 1 / np.sqrt(3)
ax.quiver(0, 0, 0, s3, s3, s3, color='b')
ax.text(s3/2, s3/2, s3/2 - 0.2, '1', size=14)

# inner sphere (radius 1 - epsilon, here 0.9) and its radius vector
eps = 0.9
ax.plot_surface(eps * x, eps * y, eps * z, color='b', alpha=0.2)
ax.quiver(0, 0, 0, eps, 0, 0, color='k')
ax.text(1/2 - 0.2, 0, -0.4, r'$1-\epsilon$', size=14)

ax.set_axis_off()
plt.title('Inner and Outer Hyperspheres')
plt.show()
Proportion of volume in outer shell
Let’s create a function to calculate the proportion of the volume of a hypersphere that is in the outer \(\epsilon\) of the radius.
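The function itself is not shown in the notes; a minimal sketch is below. Since the volume of a \(d\)-dimensional ball of radius \(r\) scales as \(r^d\), the proportion of the volume in the outer \(\epsilon\) of the radius is \(1 - (1-\epsilon)^d\).
Code
def get_prop_of_volume_in_outer_epsilon(dimension, epsilon):
    # fraction of the ball's volume that lies outside radius (1 - epsilon):
    # 1 - V_d(1 - epsilon) / V_d(1) = 1 - (1 - epsilon)**dimension
    return 1 - (1 - epsilon) ** dimension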
Let’s calculate the proportion of the volume in the outer 1% of the radius for dimensions 1 to 300.
Code
propvols = []
for c_dim in [1, 2, 10, 20, 50, 100, 150, 200, 250, 300]:
    propvols.append(get_prop_of_volume_in_outer_epsilon(c_dim, 0.01))
    print('Proportion of volume in outer 1 percent of radius in %d dimensions =%3.3f'
          % (c_dim, get_prop_of_volume_in_outer_epsilon(c_dim, 0.01)))
Proportion of volume in outer 1 percent of radius in 1 dimensions =0.010
Proportion of volume in outer 1 percent of radius in 2 dimensions =0.020
Proportion of volume in outer 1 percent of radius in 10 dimensions =0.096
Proportion of volume in outer 1 percent of radius in 20 dimensions =0.182
Proportion of volume in outer 1 percent of radius in 50 dimensions =0.395
Proportion of volume in outer 1 percent of radius in 100 dimensions =0.634
Proportion of volume in outer 1 percent of radius in 150 dimensions =0.779
Proportion of volume in outer 1 percent of radius in 200 dimensions =0.866
Proportion of volume in outer 1 percent of radius in 250 dimensions =0.919
Proportion of volume in outer 1 percent of radius in 300 dimensions =0.951
Plot outer shell proportion
Code
# plot propvols
plt.scatter([1, 2, 10, 20, 50, 100, 150, 200, 250, 300], propvols)
plt.xlabel('Dimensions')
plt.ylabel('Proportion of volume in outer 1%')
plt.title('Proportion of volume in outer 1% of radius of hypersphere')
plt.show()
By the time we get to 300 dimensions most of the volume is in the outer 1 percent.
Fraction of points in outer shell
What is the fraction \(f_d\) of all the points that are within a unit distance of the center, but not within a distance of \(1-\epsilon\)? For uniformly distributed data this fraction is proportional to the volume of the shell, so among the points inside the unit hypersphere it is \(1 - (1-\epsilon)^d\), which approaches 1 as \(d\) grows.
A related way to see distances concentrating is to compare the minimum and the average distance between randomly drawn points as the dimension grows.
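The computation that produces nsamples, min_dists, and mean_dists is not shown in the notes; a minimal sketch, assuming standard normal samples in each dimension, might look like this:
Code
import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)
nsamples = 100
min_dists, mean_dists = [], []
for d in range(1, 101):
    points = np.random.normal(size=(nsamples, d))
    dists = pdist(points)   # all pairwise Euclidean distances
    min_dists.append(dists.min())
    mean_dists.append(dists.mean())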
plt.figure(figsize=(6, 3))
plt.plot(min_dists, label="Minimum Distance")
plt.plot(mean_dists, label="Average Distance")
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Distance')
plt.legend(loc='best')
plt.title(f'Comparison of Minimum Versus Average Distance Between {nsamples} Points\nAs Dimension Grows')
plt.show()
The average distance between points grows. However, we also observe that the minimum distance between points grows at a similar rate.
Let’s look at the ratio of the minimum distance between points to the average distance between points.
Code
plt.figure(figsize=(6, 3))
plt.plot([a / b for a, b in zip(min_dists, mean_dists)])
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Ratio')
plt.title(f'Ratio of Minimum to Average Distance Between {nsamples} Points\nAs Dimension Grows')
plt.show()
This shows that, for any test point \(x\), the distance to the closest point to \(x\) gets closer and closer to the average distance between points.
If we used a point at the average distance for classifying \(x\) we’d get a very poor classifier.
Implications of the Curse
For \(k\)-NN, the Curse of Dimensionality means that in high dimensions most points are nearly the same distance from the test point.
This makes \(k\)-NN ineffective. It cannot reliably tell which are the \(k\) nearest neighbors and its performance degrades as \(d\) increases.
What Can be Done?
The problem is that you simply cannot have enough data to do a good job using \(k\)-NN in high dimensions.
There are alternative approaches to mitigate this issue:
Use a different dissimilarity metric.
For example, cosine distance behaves a little better, but you will have to decide whether it is a good distance function for your problem (see the sketch after this list).
Reduce the dimension of your data.
We will discuss dimensionality reduction techniques in lectures SVD I and SVD II.
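As a quick sketch (whether this is the right choice depends on your data), scikit-learn’s KNeighborsClassifier accepts other distance metrics directly:
Code
from sklearn.neighbors import KNeighborsClassifier

# k-NN using cosine distance instead of Euclidean distance
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')
# knn_cosine.fit(X_train, y_train) and knn_cosine.predict(...) work as before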
More on Model Selection
Recall that each value of \(k\) provides a different model.
To better understand model selection, we need a way to evaluate our different models.
How do we evaluate a classifier?
Binary classification
In the simple case of a binary classifier, we can call
one class the ‘Positive’ class or index 1
the other class the ‘Negative’ class or index 0.
The most basic measure of success for a classifier is accuracy:
Accuracy is the fraction of test points that are correctly classified.
Accuracy is important, however it may not convey enough useful information.
For example, let’s say we have a dataset showing class imbalance, e.g., 90% of the data are the Positive class and 10% are the Negative class.
For this dataset, consider a classifier that always predicts ‘Positive’.
Its accuracy is 90%, but it is not an effective classifier.
Precision and Recall
A better way to measure the classifier’s performance is using a Confusion Matrix.
Diagonal elements represent correct classifications and off-diagonal elements represent errors.
Using the confusion matrix we can define some more useful measures:
Recall - the fraction of actual positives that are correctly classified:
TP/(TP + FN)
Precision - the fraction of predicted positives that are actually positive:
TP/(TP + FP)
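These quantities can be computed with scikit-learn; here is a small sketch with made-up labels:
Code
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# made-up true and predicted labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # fraction classified correctly
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)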
Precision and Recall Illustration
Evaluating \(k\)-Nearest Neighbors
First we’ll generate some synthetic data to work with.
Code
from sklearn import datasets

X, y = datasets.make_circles(noise=.1, factor=.5, random_state=1)
print('Shape of data: {}'.format(X.shape))
print('Unique labels: {}'.format(np.unique(y)))
Shape of data: (100, 2)
Unique labels: [0 1]
Here is what the data looks like.
Code
plt.figure(figsize=(4, 4))
plt.prism()  # this sets a nice color map
plt.scatter(X[:, 0], X[:, 1], c=y, s=80)
plt.axis('off')
plt.axis('equal')
plt.show()
Recall that we always want to test on data separate from our training data.
For now, we will do something very simple: take the first 50 examples for training and the rest for testing.
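That split is not shown in the notes; a minimal sketch (the value of \(k\) here is an assumption) is:
Code
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X[:50], y[:50])                     # train on the first 50 examples
test_accuracy = knn.score(X[50:], y[50:])   # evaluate on the remaining examples
print(f'Test accuracy: {test_accuracy:.2f}')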
In the lecture’s run, this simple split gave a test accuracy of 92%. So how confident can we be that the accuracy is 92% in general?
What we really need to do is consider many different train/test splits.
Thus, the proper way to evaluate generalization ability (accuracy on the test data) is:
Form a random train/test split
Train the classifier on the training split
Test the classifier on the testing split
Accumulate statistics
Repeat from step 1 until enough statistics have been collected.
Code
import sklearn.model_selection as model_selection

nreps = 50
kvals = range(1, 10)
acc = []
np.random.seed(4)
for k in kvals:
    test_rep = []
    train_rep = []
    for i in range(nreps):
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        train_rep.append(knn.score(X_train, y_train))
        test_rep.append(knn.score(X_test, y_test))
    acc.append([np.mean(np.array(test_rep)), np.mean(np.array(train_rep))])
accy = np.array(acc)
Code
plt.plot(kvals, accy[:, 0], '.-', label='Accuracy on Test Data')
plt.plot(kvals, accy[:, 1], '.-', label='Accuracy on Training Data')
plt.xlabel(r'$k$')
plt.ylabel('Accuracy')
plt.title('Train/Test Comparison of $k$-NN')
plt.legend(loc='best')
plt.show()
Based on the generalization error, i.e., accuracy on the held-out test data, it looks like \(k = 2\) is the best choice.
Here is the decision boundary for \(k\)-NN with \(k = 2\).
To explore a few more issues, we’ll now turn to some famous datasets that have been extensively studied in the past.
The Iris Dataset
The Iris dataset is a famous dataset used by Ronald Fisher in a classic 1936 paper on classification.
Quoting from Wikipedia:
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
First, we’ll explore setting the hyperparameters. We start with \(k\)-NN.
To set the hyperparameter \(k\), we evaluate error on the test set for many train/test splits:
Code
from sklearn import datasets

# load the Iris dataset from scikit-learn
iris = datasets.load_iris()
X = iris.data
y = iris.target

kvals = range(2, 20)
nreps = 50
acc = []
std = []
np.random.seed(0)
for k in kvals:
    test_rep = []
    train_rep = []
    for i in range(nreps):
        X_train, X_test, y_train, y_test = model_selection.train_test_split(
            X, y, test_size=0.33)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        train_rep.append(knn.score(X_train, y_train))
        test_rep.append(knn.score(X_test, y_test))
    acc.append([np.mean(np.array(test_rep)), np.mean(np.array(train_rep))])
    std.append([np.std(np.array(test_rep)), np.std(np.array(train_rep))])
Code
plt.figure(figsize=(6, 4))
accy = np.array(acc)
stds = np.array(std) / np.sqrt(nreps)
print(f'Max Test Accuracy at k = {kvals[np.argmax(accy[:, 0])]} with accuracy {np.max(accy[:, 0]):.03f}')
plt.plot(kvals, accy[:, 0], '.-', label='Accuracy on Test Data')
plt.plot(kvals, accy[:, 1], '.-', label='Accuracy on Training Data')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.xticks(kvals)
plt.legend(loc='best')
plt.show()
Max Test Accuracy at k = 13 with accuracy 0.969
It looks like \(k\) = 13 is the best-performing value of the hyperparameter.
Can we be sure?
Be careful! Each point in the above plot is the mean of 50 random train/test splits!
If we are going to be sure that \(k\) = 13 is best, then it should be statistically distinguishable from the other values.
To make this call, let’s plot \(\pm 1 \sigma\) confidence intervals on the mean values.
plt.errorbar(kvals, accy[:, 0], stds[:, 0], label='Accuracy on Test Data')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend(loc='lower center')
plt.xticks(kvals)
plt.title(r'Test Accuracy with $\pm 1\sigma$ Errorbars')
plt.show()
It looks like \(k\) = 13 is a reasonable value, although a case can be made that 9 and 11 are not statistically distinguishable from 13.
To gain insight into the complexity of the model for \(k\) = 13, let’s look at the decision boundary.
We will re-run the classifier using only two (of four) features for visualization purposes.
Code
# Create color maps
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we will use only the first two (of four) features, so we can visualize
X = X_train[:, :2]

h = .02  # step size in the mesh
k = 13
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y_train)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(6, 4))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y_train, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title(f"3-Class $k$-NN classification ($k$ = {k})")
plt.show()
There are a few artifacts, though overall this looks like a reasonably smooth set of decision boundaries.
MNIST dataset
NIST used to be called the “National Bureau of Standards.” These are the folks who bring you the reference meter, reference kilogram, etc.
NIST constructed datasets for machine learning of handwritten digits. These were collected from Census Bureau employees and also from high-school students.
This dataset has been used repeatedly for many years to evaluate classifiers. For a peek at some of the work done with this dataset you can visit this link.
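The experiment that produces kvals, accy1, stds1, accy2, stds2, and the two test fractions below is not shown in the notes. A sketch along the lines of the earlier Iris experiment, using scikit-learn’s small 8x8 digits version of this dataset and assumed test fractions of 33% and 10%, might look like this:
Code
import numpy as np
import sklearn.model_selection as model_selection
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

kvals = range(1, 10)          # assumed range of k values
nreps = 20                    # assumed number of random train/test splits
test_fraction1, test_fraction2 = 0.33, 0.10

def mean_test_accuracy(test_fraction):
    means, std_errs = [], []
    for k in kvals:
        scores = []
        for _ in range(nreps):
            X_train, X_test, y_train, y_test = model_selection.train_test_split(
                X, y, test_size=test_fraction)
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(X_train, y_train)
            scores.append(knn.score(X_test, y_test))
        means.append(np.mean(scores))
        std_errs.append(np.std(scores) / np.sqrt(nreps))
    return np.array(means), np.array(std_errs)

accy1, stds1 = mean_test_accuracy(test_fraction1)
accy2, stds2 = mean_test_accuracy(test_fraction2)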
plt.figure(figsize=(6, 4))
plt.errorbar(kvals, accy1, stds1,
             label=f'{test_fraction1:.0%} Used for Testing; accuracy {np.max(accy1):.03f}')
plt.errorbar(kvals, accy2, stds2,
             label=f'{test_fraction2:.0%} Used for Testing; accuracy {np.max(accy2):.03f}')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.xticks(kvals)
best_k = kvals[np.argmax(accy1)]
plt.title(f'Test Accuracy with $\\pm 1\\sigma$ Error Bars')
plt.show()
These plots illustrate two important principles.
With more training data the classifier performs better.
With less testing data, the testing results are more variable, though testing is faster.
The key decision here is what value to choose for \(k\). So it makes sense to use the 33% test split, because the smaller error bars give us better confidence in our decision.
We can get a sense of why \(k\)-NN can succeed at this task by looking at the nearest neighbors of some points.
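The lookup that produces neighbors is not shown in the notes; a minimal sketch using scikit-learn’s NearestNeighbors (the indices returned are into X, and the query point itself is dropped) is:
Code
from sklearn.neighbors import NearestNeighbors

# find the 3 nearest neighbors of the first 3 digits, excluding the query itself
nbrs = NearestNeighbors(n_neighbors=4).fit(X)
_, indices = nbrs.kneighbors(X[:3])
neighbors = indices[:, 1:]   # column 0 is the query point itself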
plt.rc("image", cmap="binary") # this sets a black on white colormapfor t inrange(3): plt.figure(figsize=(8, 2)) plt.subplot(1, 4, 1) plt.imshow(X[t].reshape(8, 8)) plt.axis('off') plt.title("Query")# plot three nearest neighbors from the training setfor i in [0, 1, 2]: plt.subplot(1, 4, 2+ i) plt.title("neighbor {}".format(i)) plt.imshow(X[neighbors[t, i]].reshape(8, 8)) plt.axis('off')plt.show()