\(k\)-Nearest Neighbors and Model Selection

In this lecture we’ll introduce another classification technique, \(k\)-Nearest Neighbors.

First we’ll introduce the notion of parametric and nonparametric models.

Parametric vs. Nonparametric Models

There are many ways to define models (whether supervised or unsupervised).

However a key distinction is this:

  • Does the model have a fixed number of parameters, or
  • does the number of parameters grow with the training data?
  • Parametric: if the model has a fixed number of parameters, it is called parametric.
  • Nonparametric: if the number of parameters grows with the data, the model is called nonparametric.

Parametric models

  • have a fixed number of parameters, independent of the amount of training data
  • have the advantage of (often) being faster to use
  • have the disadvantage of making strong assumptions about the nature of the underlying data distributions.

Nonparametric models

  • have a number of parameters that grows with the size of the training data (see the sketch below)
  • are more flexible
  • can be computationally intractable for large datasets
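
To make the distinction concrete, here is a small sketch using scikit-learn. The synthetic data and the particular model choices (logistic regression vs. \(k\)-NN) are made up purely for illustration: the fitted logistic regression is just a handful of coefficients no matter how much data we use, while the fitted \(k\)-NN model has to keep every training point.

Code
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
for n in [100, 10_000]:
    X = rng.normal(size=(n, 5))                                    # n examples, 5 features
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)  # noisy binary labels

    logreg = LogisticRegression().fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    # parametric: 5 coefficients (plus an intercept), no matter how large n is
    # nonparametric: the fitted k-NN model stores all n training points
    print(n, logreg.coef_.shape, knn.n_samples_fit_)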

Parametric Models

  • Linear Regression: Assumes a linear relationship between the input variables and the output.
  • Logistic Regression: Used for binary classification problems.
  • Polynomial Regression: Extends linear regression by considering polynomial relationships.
  • Support Vector Machines (SVM): With a linear kernel.
  • Naive Bayes: Based on Bayes’ theorem with strong independence assumptions.
  • Generalized Linear Models (GLM): Extends linear models to allow for response variables that have error distribution models other than a normal distribution.

Nonparametric Models

  • \(k\)-Nearest Neighbors (\(k\)-NN): Classifies a data point based on how its neighbors are classified.
  • Decision Trees: Splits the data into subsets based on the value of input features.
  • Random Forest: An ensemble of decision trees.
  • Support Vector Machines (SVM): With non-linear kernels like RBF.

\(k\)-Nearest Neighbors

When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.

–James Whitcomb Riley (1849 - 1916)

Like any classifier, \(k\)-Nearest Neighbors is trained by providing it a set of labeled data.

However, at training time, the classifier does very little. It just stores away the training data.


Code
# setup assumed by the code cells in this lecture: numpy, matplotlib, and color maps for plotting
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
#
demo_y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
demo_X = np.array([[-3,1], [-2, 4], [-2, 2], [-1.5, 1], [-1, 3], [0, 0], [1, 1.5], [2, 0.5], [2, 3], [2, 0], [3, 1], [4, 4], [0, 1]])
test_X = [-0.3, 0.7]
#
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold, s=80)
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes')
plt.show()


At test time, simply

  • “look at” the \(k\) points in the training set that are nearest to the test input \(x\), and
  • make a decision based on the labels on those points, e.g. majority vote

By “nearest” we usually mean nearest in Euclidean distance.
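
To make the test-time procedure concrete, here is a minimal sketch in plain NumPy, using the demo_X, demo_y, and test_X arrays defined above (Euclidean distance, majority vote). This is only an illustration; later we will use scikit-learn's KNeighborsClassifier.

Code
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Euclidean distance from the test point x to every training point
    dists = np.linalg.norm(X_train - np.asarray(x), axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over the labels of those points
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

# e.g. knn_predict(demo_X, demo_y, test_X, k=3) returns 1, the class drawn in blue in the plots below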

Code
plt.scatter(demo_X[:, 0], demo_X[:, 1], c=demo_y, cmap=cmap_bold, s=80)
plt.plot(test_X[0], test_X[1], 'ok', markersize=10)
plt.annotate('Test Point', test_X, [75, 25], 
             textcoords = 'offset points', fontsize = 14, 
             arrowprops = {'arrowstyle': '->'})
plt.axis('equal')
plt.axis('off')
plt.title('Training Points: 2 Classes')
plt.show()

1-Nearest Neighbor

Code
plt.scatter(demo_X[:, 0], demo_X[:, 1], c=demo_y, cmap=cmap_bold, s=80)
plt.plot(test_X[0], test_X[1], 'ok', markersize=10)
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.5, facecolor = 'red', alpha = 0.2)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('1-Nearest-Neighbor: Classification: Red')
plt.show()

2-Nearest Neighbors

Code
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold, s=80)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok', markersize=10)
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 0.9, facecolor = 'gray', alpha = 0.3)
plt.axis('equal')
plt.axis('off')
ax.add_artist(circle)
plt.title('2-Nearest-Neighbor')
plt.show()

3-Nearest Neighbors

Code
import matplotlib.pyplot as plt
plt.figure(figsize=(7, 6))
ax=plt.gcf().gca()
circle = mp.patches.Circle(test_X, 1.4, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.scatter(demo_X[:,0], demo_X[:,1], c=demo_y, cmap=cmap_bold, s=80)
test_X = [-0.3, 0.7]
plt.plot(test_X[0], test_X[1], 'ok', markersize=10)
plt.axis('equal')
plt.axis('off')
plt.title('3-Nearest-Neighbor: Classification: Blue')
plt.tight_layout()
plt.show()

Hard vs. Soft Classification

\(k\)-Nearest Neighbors can do either hard or soft classification.

  • As a hard classifier, it returns the majority vote of the labels on the \(k\) nearest neighbors.

The vote may be indeterminate (a tie), as in the 2-NN example above.

  • As a soft classifier, it returns the fraction of points in the neighborhood with label \(c\):

\[ p(y = c\,|\,\mathbf{x}, k) = \frac{\text{number of points in neighborhood with label } c}{k} \]
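
In scikit-learn, this soft classification corresponds to predict_proba, which returns exactly these neighborhood fractions. A small sketch using the demo data above, with \(k = 3\) (the variable name knn_soft is just for this example):

Code
from sklearn.neighbors import KNeighborsClassifier

knn_soft = KNeighborsClassifier(n_neighbors=3).fit(demo_X, demo_y)
# soft classification: fraction of the 3 nearest neighbors with each label
print(knn_soft.predict_proba([test_X]))   # roughly [[0.33, 0.67]] given the 3-NN picture above
# hard classification: the majority label
print(knn_soft.predict([test_X]))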

Model Selection for \(k\)-NN

Each value of \(k\) results in a different “model”.

The complexity of the resulting model is therefore controlled by the hyperparameter \(k\).

We will want to select \(k\) using held-out data to avoid over-fitting.

Varying \(k\)

Consider this dataset where items fall into three classes:

Code
import sklearn.datasets as sk_data
X, y = sk_data.make_blobs(n_samples=150, 
                          centers=[[-2, 0],[1, 5], [2.5, 1.5]],
                          cluster_std = [2, 2, 3],
                          n_features=2,
                          center_box=(-10.0, 10.0),random_state=0)
plt.figure(figsize = (5, 5))
plt.axis('equal')
plt.axis('off')
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = cmap_bold, s = 80)
plt.show()

Let’s observe how the complexity of the resulting model changes as we vary \(k\).

We’ll do this by plotting the decision regions. These show how the method would classify each potential test point in the space.


Code
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .1  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))
Code
from sklearn.neighbors import KNeighborsClassifier

f, axs = plt.subplots(1, 3, figsize=(15, 5))
for i, k in enumerate([1, 5, 25]):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axs[i].pcolormesh(xx, yy, Z, cmap = cmap_light, shading = 'auto')
    axs[i].axis('equal')
    axs[i].axis('off')
    axs[i].set_title(f'Decision Regions for $k$ = {k}')
plt.show()

Notice how increasing \(k\) results in smoother decision boundaries.

Smoother boundaries are more likely to generalize well to new data.

Challenges for \(k\)-NN

Working with a \(k\)-NN classifier can involve some challenges.

  1. The computational cost of classification grows linearly with the size of the training data. Note that the training step is trivial, but the classification step can be prohibitively expensive.

  2. Euclidean distance is the most common distance function used in \(k\)-NN, so data scaling is important. As previously discussed, features should be scaled to prevent distance measures from being dominated by a small subset of features (see the sketch after this list).

  3. The curse of dimensionality. If the training data lives in a high-dimensional space, Euclidean distance measures become less effective.
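
A minimal sketch of how the feature scaling mentioned in item 2 is typically handled in scikit-learn; the choice of \(k = 5\) here is just a placeholder:

Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance before computing distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# usage: scaled_knn.fit(X_train, y_train); scaled_knn.score(X_test, y_test)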

The third challenge, the curse of dimensionality, is subtle but important, so we will now look at it more closely.

The Curse of Dimensionality

The curse of dimensionality is a somewhat tongue-in-cheek term for serious problems that arise when we use geometric algorithms in high dimensions.

There are various aspects of the curse that affect \(k\)-NN.

The phrase “curse of dimensionality” was coined by the mathematician Richard Bellman in 1957 in his book Dynamic Programming.

Points are far apart in high D

\(k\)-NN relies on there being one or more “close” points to the test point \(x\).

In other words, we need the training data to be relatively dense, so there are “close” points everywhere.

Unfortunately, the amount of space we work in grows exponentially with the dimension \(d\).

(Figures: a \(10 \times 10\) grid has \(10^2\) bins; a \(10 \times 10 \times 10\) grid has \(10^3\) bins.)

  • Imagine we have 10,000 (i.e., \(10^4\)) data points, each with 40 features (i.e., \(d = 40\)).
  • Consider quantizing each dimension into 10 bins.
  • Then in 40-D we have \(10^{40}\) possible bins.
  • On average, we will only have 1 data point per \(10^{36}\) bins. Most bins will be empty!
  • So the amount of data we need to maintain a given density also grows exponentially with the dimension \(d\).
Important

Hence, points in high-dimensional spaces tend not to be close to one another at all.

One very intuitive way to think about it is this.

In order for two points to be close in \(\mathbb{R}^d\), they must be close in each of the \(d\) dimensions. As the number of dimensions grows, it becomes harder and harder for a pair of points to be close in each dimension.
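
As a rough back-of-the-envelope calculation, suppose (purely for illustration) that closeness in each dimension is an independent event with probability \(p\). Then

\[ P(\text{close in all } d \text{ dimensions}) = p^d, \qquad \text{e.g., } 0.5^{40} \approx 9 \times 10^{-13}. \]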

Distance contrasts shrink in high dimensions

Let's see, by way of an example, how the contrast between distances shrinks in high dimensions: the gap between the nearest and farthest pairs of points becomes small relative to the distances themselves.

Let’s generate 1000 normally distributed points in 2D, 3D, 100D, and 1000D and look at the ratio of the largest to smallest distance between points.

First, let’s generate the data.

Code
import numpy as np

# Fix the random seed so we all have the same random numbers
np.random.seed(0)

n_data = 1000

# Create 1000 data examples (rows), each with 2 dimensions (columns)
n_dim = 2
x_2D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 3 dimensions (columns)
n_dim = 3
x_3D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 100 dimensions (columns)
n_dim = 100
x_100D = np.random.normal(size=(n_data, n_dim))

# Create 1000 data examples (rows), each with 1000 dimensions (columns)
n_dim = 1000
x_1000D = np.random.normal(size=(n_data, n_dim))

Scatter plot of 2D data

Code
import matplotlib.pyplot as plt

# scatter plot of the 2D data
plt.scatter(x_2D[:,0], x_2D[:,1])
plt.title('2D data')
plt.show()

Scatter plot of 3D data

Code
# make a 3D scatter plot of the data
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x_3D[:,0], x_3D[:,1], x_3D[:,2])
plt.title('3D data')
plt.show()

Closest/farthest distance ratios

Define a function to calculate the ratio of the largest to smallest distance between points in a dataset.

Code
from scipy.spatial import distance
def distance_ratio(x, metric='euclidean'):

    if metric == 'euclidean':
        ord = 2
    elif metric == 'manhattan':
        ord = 1
    elif metric == 'cosine':
        pass
    else:
        raise ValueError(f"Metric {metric} not supported")

    smallest_dist = np.inf
    largest_dist = 0
    for i in range(x.shape[0]):
        for j in range(i + 1, x.shape[0]):  # start from i+1 to avoid redundant calculations
            if metric == 'euclidean' or metric == 'manhattan':
                dist = np.linalg.norm(x[i, :] - x[j, :], ord=ord)
            elif metric == 'cosine':
                dist = distance.cosine(x[i, :].flatten(), x[j, :].flatten())
            if dist < smallest_dist:
                smallest_dist = dist
            if dist > largest_dist:
                largest_dist = dist

    # Return the ratio of the largest to the smallest pairwise distance
    dist_ratio = largest_dist / smallest_dist
    return dist_ratio

And then calculate the ratio for each dataset.

Code
dist_ratio_2d = distance_ratio(x_2D)
print('Ratio of largest to smallest distance 2D: %3.3f'%(dist_ratio_2d))

dist_ratio_3d = distance_ratio(x_3D)
print('Ratio of largest to smallest distance 3D: %3.3f'%(dist_ratio_3d))

dist_ratio_100d = distance_ratio(x_100D)
print('Ratio of largest to smallest distance 100D: %3.3f'%(dist_ratio_100d))

dist_ratio_1000d = distance_ratio(x_1000D)
print('Ratio of largest to smallest distance 1000D: %3.3f'%(dist_ratio_1000d))
Ratio of largest to smallest distance 2D: 2699.982
Ratio of largest to smallest distance 3D: 356.654
Ratio of largest to smallest distance 100D: 1.974
Ratio of largest to smallest distance 1000D: 1.228

Plot the closest/farthest distance ratios.

Code
plt.scatter([2,3,100,1000], [dist_ratio_2d, dist_ratio_3d, dist_ratio_100d, dist_ratio_1000d])
plt.plot([2,3,100,1000], [dist_ratio_2d, dist_ratio_3d, dist_ratio_100d, dist_ratio_1000d], '--', color='lightgray')
plt.xscale('log')
plt.title('Euclidean Distance ratio')
plt.xlabel('Dimension')
plt.ylabel('Distance ratio')
plt.show()

You see that the ratio of the largest to smallest distance drops dramatically as the dimension increases.

Volume of a hypersphere

The above phenomenon is related to the strange behavior of the volume of a hypersphere (or \(n\)-ball) as the dimension increases.

For a fascinating exploration of the properties of unit spheres in high dimension, see An Adventure in the Nth Dimension by Brian Hayes in American Scientist.

Now, the volume of a hypersphere in 2D and 3D is:

  • For \(d = 2\), \(\pi r^2\), which is \(\pi\) for a unit radius.
  • For \(d = 3\), \(\frac{4\pi}{3} r^3\), which is \(\frac{4}{3} \pi\) for a unit radius.

The general equation for a hypersphere of radius \(R\) in \(d\) dimensions is:

\[ V_d(R) = \frac{\pi^{d/2}}{\Gamma\bigl(\tfrac d2 + 1\bigr)}R^d, \]

where \(\Gamma\) is Euler’s gamma function, which satisfies \(\Gamma(d) = (d - 1)!\) for all positive integers \(d\).

Hypersphere volume in increasing dimensions

Here’s a function to calculate the volume of a hypersphere of a given radius and dimension.

Code
import scipy.special as sci

def volume_of_hypersphere(radius, dimensions):
    pi = np.pi
    volume = (pi ** (dimensions / 2)) / (sci.gamma(dimensions / 2 + 1)) * (radius ** dimensions)
    return volume

And now let’s calculate the volume of a hypersphere of unit radius for dimensions 1 to 20 and store it in a list.

Code
radius = 1.0
vols = []
for c_dim in range(1,21):
  vols.append(volume_of_hypersphere(radius, c_dim))
  # print("Volume of unit radius hypersphere in %d dimensions is %3.3f"%(c_dim, volume_of_hypersphere(radius, c_dim)))

Plot volumes of hyperspheres

And plot the results.

Code
# plot vols
plt.scatter(range(1,21), vols)
plt.xlabel('Dimensions')
plt.ylabel('Volume')
plt.title('Volume of unit radius hypersphere')
plt.show()

The volume of the unit hypersphere peaks around \(d = 5\) and then decreases to almost nothing in high dimensions.

Ratio of unit hypersphere volume and enclosing hypercube volume

Another strange phenomenon appears in the following example.

Assume you have data uniformly distributed in a hypercube with side length 2, which exactly encloses a unit hypersphere.

As the dimension grows, we want to know what fraction of the data is within unit distance of the center.

Code
side_length = 2
cube_volumes = [side_length ** dim for dim in range(1, 21)]

vol_ratios = [vols[i] / cube_volumes[i] for i in range(len(vols))]

Plot volume ratios

Code
plt.scatter(range(1,21), vol_ratios)
plt.xlabel('Dimensions')
plt.ylabel('Volume ratio')
plt.title('Volume of unit radius hypersphere / Volume of enclosing hypercube')
plt.show()

The fraction quickly goes to zero as the dimension increases.

Perhaps an intuition is to think about the space in the “corners” of the hypercube. As the dimension grows, there are more and more corners (\(2^d\) of them), and more of the cube’s volume lies in those corners, outside the sphere.

Proportion of hypersphere in outer shell

Another strange thing about high dimensions is that most of the volume of a hypersphere is in the outer shell.

We can compare the volumes of hyperspheres of radius \(1\) and \(1-\epsilon\) for a small \(\epsilon\).

Code
ax = plt.figure(figsize = (4, 4)).add_subplot(projection = '3d')
# coordinates of sphere surface
u, v = np.mgrid[0:2*np.pi:50j, 0:np.pi:50j]
x = np.cos(u)*np.sin(v)
y = np.sin(u)*np.sin(v)
z = np.cos(v)
#
ax.plot_surface(x, y, z, color='r', alpha = 0.2)
s3 = 1/np.sqrt(3)
ax.quiver(0, 0, 0, s3, s3, s3, color = 'b')
ax.text(s3/2, s3/2, s3/2-0.2, '1', size = 14)
#
eps = 0.9   # radius of the inner sphere, i.e. 1 - epsilon with epsilon = 0.1
#
ax.plot_surface(eps * x, eps * y, eps * z, color='b', alpha = 0.2)
ax.quiver(0, 0, 0, eps, 0, 0, color = 'k')
ax.text(1/2-0.2, 0, -0.4, r'$1-\epsilon$', size = 14)
ax.set_axis_off()
plt.title('Inner and Outer Hyperspheres')
plt.show()

Proportion of volume in outer shell

Let’s create a function to calculate the proportion of the volume of a hypersphere that is in the outer \(\epsilon\) of the radius.

Code
def get_prop_of_volume_in_outer_epsilon(dimension, epsilon):
  
  outer_radius = 1.0
  outer_volume = volume_of_hypersphere(outer_radius, dimension)

  inner_radius = 1 - epsilon
  inner_volume = volume_of_hypersphere(inner_radius, dimension)
  proportion = (outer_volume - inner_volume) / outer_volume

  # print(f"Outer volume: {outer_volume}, Inner volume: {inner_volume}")
  return proportion

Let’s calculate the proportion of the volume in the outer 1% of the radius for dimensions 1 to 300.

Code
propvols = []
for c_dim in [1,2,10,20,50,100,150,200,250,300]:
  propvols.append(get_prop_of_volume_in_outer_epsilon(c_dim, 0.01))
  print('Proportion of volume in outer 1 percent of radius in %d dimensions =%3.3f'%(c_dim, get_prop_of_volume_in_outer_epsilon(c_dim, 0.01)))
Proportion of volume in outer 1 percent of radius in 1 dimensions =0.010
Proportion of volume in outer 1 percent of radius in 2 dimensions =0.020
Proportion of volume in outer 1 percent of radius in 10 dimensions =0.096
Proportion of volume in outer 1 percent of radius in 20 dimensions =0.182
Proportion of volume in outer 1 percent of radius in 50 dimensions =0.395
Proportion of volume in outer 1 percent of radius in 100 dimensions =0.634
Proportion of volume in outer 1 percent of radius in 150 dimensions =0.779
Proportion of volume in outer 1 percent of radius in 200 dimensions =0.866
Proportion of volume in outer 1 percent of radius in 250 dimensions =0.919
Proportion of volume in outer 1 percent of radius in 300 dimensions =0.951

Plot outer shell proportion

Code
# plot propvols
plt.scatter([1,2,10,20,50,100,150,200,250,300], propvols)
plt.xlabel('Dimensions')
plt.ylabel('Proportion of volume in outer 1%')
plt.title('Proportion of volume in outer 1% of radius of hypersphere')
plt.show()

By the time we get to 300 dimensions most of the volume is in the outer 1 percent.

Fraction of points in outer shell

What is the fraction \(f_d\) of uniformly distributed points that are within a unit distance of the center, but not within a distance of \(1-\epsilon\)?

Let

\[ K_d = \frac{\pi^{d/2}}{\Gamma\bigl(\tfrac d2 + 1\bigr)} \]

so that the volume of a hypersphere of radius \(R\) is \(K_d \times R^d\).

Then for \(R=1\) and \(R=1-\epsilon\) we have:

\[ \begin{align*} f_d &= \frac{\text{Volume of Shell}}{\text{Volume of unit hypersphere}} \\ &= \frac{K_d\times(1)^d - K_d\times(1-\epsilon)^d}{K_d\times(1)^d} \\ &= 1 - (1-\epsilon)^d \end{align*} \]

Observe that \((1-\epsilon)^d\) goes to 0 as \(d \rightarrow \infty\), so \(f_d \rightarrow 1\) as \(d \rightarrow \infty\).

This means that as \(d\rightarrow \infty\), almost all of the points within 1 unit of our location are almost exactly 1 unit from our location.
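
As a quick sanity check, we can plug the numbers from the earlier computation into this closed form:

Code
# closed-form check: fraction of volume in the outer epsilon of the radius
epsilon, d = 0.01, 300
print(1 - (1 - epsilon) ** d)   # about 0.951, matching get_prop_of_volume_in_outer_epsilon(300, 0.01)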

Curse of Dimensionality Example

The following example is based on Data Science from Scratch, Joel Grus, Second Edition, Chapter 12.

Here’s another example.

We create 1000 points, scattered uniformly at random within a \(d\)-dimensional unit cube.

We will look at two quantities as we vary \(d\).

  • The minimum distance between any two points.
  • The average distance between any two points.

Code
import sklearn.metrics as metrics

nsamples = 1000
unif_X = np.random.default_rng().uniform(0, 1, nsamples).reshape(-1, 1)
euclidean_dists = metrics.euclidean_distances(unif_X)
# extract the values above the diagonal
dists = euclidean_dists[np.triu_indices(nsamples, 1)]
mean_dists = [np.mean(dists)]
min_dists = [np.min(dists)]
for d in range(2, 101):
    unif_X = np.column_stack([unif_X, np.random.default_rng().uniform(0, 1, nsamples)])
    euclidean_dists = metrics.euclidean_distances(unif_X)
    dists = euclidean_dists[np.triu_indices(nsamples, 1)]
    mean_dists.append(np.mean(dists))
    min_dists.append(np.min(dists))
Code
plt.figure(figsize = (6, 3))
plt.plot(min_dists, label = "Minimum Distance")
plt.plot(mean_dists, label = "Average Distance")
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Distance')
plt.legend(loc = 'best')
plt.title(f'Comparison of Minimum Versus Average Distance Between {nsamples} Points\nAs Dimension Grows')
plt.show()

The average distance between points grows. However we also observe that the minimum distance between points grows at a similar rate.


Let’s look at the ratio of the minimum distance between points to the average distance between points.

Code
plt.figure(figsize = (6, 3))
plt.plot([a/b for a, b in zip(min_dists, mean_dists)])
plt.xlabel(r'Number of dimensions ($d$)')
plt.ylabel('Ratio')
plt.title(f'Ratio of Minimum to Average Distance Between {nsamples} Points\nAs Dimension Grows')
plt.show()

This shows that, for any test point \(x\), the distance to the closest point to \(x\) gets closer and closer to the average distance between points.

If we used a point at the average distance for classifying \(x\) we’d get a very poor classifier.

Implications of the Curse

For \(k\)-NN, the Curse of Dimensionality means that in high dimensions most points are nearly the same distance from the test point.

This makes \(k\)-NN ineffective. It cannot reliably tell which are the \(k\) nearest neighbors and its performance degrades as \(d\) increases.

What Can be Done?

The problem is that you simply cannot have enough data to do a good job using \(k\)-NN in high dimensions.

Alternative approaches can mitigate this issue (see the sketch after this list):

  • Use a different dissimilarity metric.
    • For example, cosine distance behaves a little better, but you will have to decide whether it is a good distance function for your problem.
  • Reduce the dimension of your data.
    • We will discuss dimensionality reduction techniques in lectures SVD I and SVD II.
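
Here is a minimal sketch of both mitigations in scikit-learn. The number of neighbors and the number of components are illustrative placeholders, and PCA stands in for the SVD-based dimensionality reduction we will discuss later:

Code
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 1. k-NN with cosine distance instead of Euclidean distance
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine')

# 2. Reduce the dimension first (here, to 10 components), then apply k-NN
knn_reduced = make_pipeline(PCA(n_components=10),
                            KNeighborsClassifier(n_neighbors=5))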

More on Model Selection

Recall that each value of \(k\) provides a different model.

To better understand model selection, we need a way to evaluate our different models.

How do we evaluate a classifier?

Binary classification

In the simple case of a binary classifier, we can call

  • one class the ‘Positive’ class or index 1
  • the other class the ‘Negative’ class or index 0.

The most basic measure of success for a classifier is accuracy:

  • Accuracy is the fraction of test points that are correctly classified.

Accuracy is important, however it may not convey enough useful information.

For example, let’s say we have a dataset showing class imbalance, e.g., 90% of the data are the Positive class and 10% are the Negative class.

For this dataset, consider a classifier that always predicts ‘Positive’.

Its accuracy is 90%, but it is not an effective classifier.
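
To make this concrete, here is a small sketch; the 900/100 class split and the single dummy feature are made up purely for illustration:

Code
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# a toy imbalanced dataset: 900 Positive (1) and 100 Negative (0) examples, one dummy feature
y_imb = np.array([1] * 900 + [0] * 100)
X_imb = np.zeros((1000, 1))

# a "classifier" that always predicts Positive
always_pos = DummyClassifier(strategy='constant', constant=1).fit(X_imb, y_imb)
print(accuracy_score(y_imb, always_pos.predict(X_imb)))   # 0.9, yet it never detects a Negative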

Precision and Recall

A better way to measure the classifier’s performance is using a Confusion Matrix.

Diagonal elements represent successes and off diagonals represent errors.

Using the confusion matrix we can define some more useful measures (computed in the sketch below):

  • Recall - the fraction of actual positive points that are correctly classified
    • TP/(TP + FN)
  • Precision - the fraction of points classified as positive that are actually positive
    • TP/(TP + FP)
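
Here is a small sketch of these formulas on a made-up set of labels, checked against scikit-learn’s built-in functions:

Code
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 actual positives
y_hat  = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # 3 predicted positives (2 correct, 1 false positive)

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tp / (tp + fn), recall_score(y_true, y_hat))      # recall = 0.5
print(tp / (tp + fp), precision_score(y_true, y_hat))   # precision = 2/3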

Precision and Recall Illustration

Evaluating \(k\)-Nearest Neighbors

First we’ll generate some synthetic data to work with.

Code
from sklearn import datasets

X, y = datasets.make_circles(noise=.1, factor=.5, random_state=1)
print('Shape of data: {}'.format(X.shape))
print('Unique labels: {}'.format(np.unique(y)))
Shape of data: (100, 2)
Unique labels: [0 1]

Here is what the data looks like.

Code
plt.figure(figsize = (4, 4))
plt.prism()  # this sets a nice color map
plt.scatter(X[:, 0], X[:, 1], c=y, s = 80)
plt.axis('off')
plt.axis('equal')
plt.show()


Recall that we always want to test on data separate from our training data.

For now, we will do something very simple: take the first 50 examples for training and the rest for testing.

Code
X_train = X[:50]
y_train = y[:50]
X_test = X[50:]
y_test = y[50:]
Code
plt.figure(figsize = (10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.axis('equal')
plt.axis('off')
plt.title('Training Data')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_test, s = 80)
plt.title('Test Data')
plt.axis('equal')
plt.axis('off')
plt.show()


For our first example, we will classify the points (in the two classes) using a \(k\)-NN classifier.

We will specify that \(k=5\), i.e., we will classify based on the majority vote of the 5 nearest neighbors.

Code
k = 5
knn5 = KNeighborsClassifier(n_neighbors = k)    

As we have seen previously, the scikit-learn fit() function corresponds to training and the predict() function corresponds to testing.

Code
knn5.fit(X_train,y_train)
print(f'Accuracy on test data: {knn5.score(X_test, y_test)}')
Accuracy on test data: 0.72

Accuracy of 72% sounds good – but let’s dig deeper.

Confusion Matrix

We’ll call the red points the Positive class and the green points the Negative class.

Here is the confusion matrix:

Code
import pandas as pd

y_pred_test = knn5.predict(X_test)
# scikit-learn orders the confusion matrix rows and columns by sorted label: 0, then 1
pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_test), 
             columns = ['Predicted 0', 'Predicted 1'], 
             index = ['Actual 0', 'Actual 1'])
Predicted 0 Predicted 1
Actual 0 14 14
Actual 1 0 22

Looks like the classifier is getting all of the Negative class (green, label 1) correct, but only achieving accuracy of 50% on the Positive class (red, label 0).

That is, its precision is 100%, but its recall is only 50%.
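
We can check these numbers directly; a short sketch using the y_test and y_pred_test arrays above, treating the red class (label 0) as Positive:

Code
from sklearn.metrics import precision_score, recall_score

# treat the red points (label 0) as the Positive class
print(precision_score(y_test, y_pred_test, pos_label=0))   # should be 1.0, given the matrix above
print(recall_score(y_test, y_pred_test, pos_label=0))      # should be 0.5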

Let’s visualize the results.


Code
k = 5

plt.figure(figsize = (10, 4))
plt.subplot(1, 2, 1)

plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.axis('equal')
plt.title('Training')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn5.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal')
plt.show()

Indeed, the Positive (red) points in the upper half of the test data are all classified incorrectly.


Let’s look at one of the points that the classifier got wrong.

Code
k=5 
test_point = np.argmax(X_test[:, 1])
neighbors = knn5.kneighbors([X_test[test_point]])[1]

plt.figure(figsize = (10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 80)
plt.scatter(X_train[neighbors,0], X_train[neighbors,1],
            c = y_train[neighbors], marker='o', 
            facecolors='none', edgecolors='b', s = 80)
radius = np.max(metrics.euclidean_distances(X_test[test_point].reshape(1, -1), X_train[neighbors][0]))
ax = plt.gcf().gca()
circle = mp.patches.Circle(X_test[test_point], radius, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.axis('equal')
plt.axis('off')
plt.tight_layout()
plt.title(r'Training')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.scatter(X_test[test_point, 0], X_test[test_point, 1], marker='o', 
            facecolors='none', edgecolors='b', s = 80)
plt.title('Testing $k$={}\nAccuracy: {}'.format(k,knn5.score(X_test, y_test)))
plt.axis('equal')
plt.axis('off')
plt.show()


For comparison purposes, let’s try \(k\) = 3.

Code
k = 3
knn3 = KNeighborsClassifier(n_neighbors=k)    
knn3.fit(X_train,y_train)
y_pred_test = knn3.predict(X_test)

plt.figure(figsize = (10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s = 80)
plt.axis('equal')
plt.axis('off')
plt.title(r'Training')

plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_test, s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn3.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal')
plt.show()


And let’s look at the same individual point as before.

Code
k = 3
test_point = np.argmax(X_test[:,1])
neighbors = knn3.kneighbors([X_test[test_point]])[1]

plt.figure(figsize = (10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s = 80)
plt.scatter(X_train[neighbors, 0], X_train[neighbors, 1], marker = 'o', 
            facecolors = 'none', edgecolors = 'b', s = 80)
radius = np.max(metrics.euclidean_distances(X_test[test_point].reshape(1, -1), 
                                            X_train[neighbors][0]))
ax = plt.gcf().gca()
circle = mp.patches.Circle(X_test[test_point], radius, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
plt.axis('equal')
plt.axis('off')
plt.title(r'Training')
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_test, s = 80)
plt.scatter(X_test[test_point,0], X_test[test_point,1], marker = 'o', 
            facecolors = 'none', edgecolors = 'b', s = 80)
plt.title(f'Testing $k$={k}\nAccuracy: {knn3.score(X_test, y_test)}')
plt.axis('off')
plt.axis('equal')
plt.show()

Train-Test Splitting

So how confident can we be that the test accuracy of 92% will hold in general?

What we really need to do is consider many different train/test splits.

Thus, the proper way to evaluate generalization ability (accuracy on the test data) is:

  1. Form a random train/test split
  2. Train the classifier on the training split
  3. Test the classifier on the testing split
  4. Accumulate statistics
  5. Repeat from step 1 until enough statistics have been collected.

Code
import sklearn.model_selection as model_selection

nreps = 50
kvals = range(1, 10)
acc = []
np.random.seed(4)
for k in kvals:
    test_rep = []
    train_rep = []
    for i in range(nreps):
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                                            y, 
                                                                            test_size = 0.5)
        knn = KNeighborsClassifier(n_neighbors = k)    
        knn.fit(X_train, y_train)
        train_rep.append(knn.score(X_train, y_train))
        test_rep.append(knn.score(X_test, y_test))
    acc.append([np.mean(np.array(test_rep)), np.mean(np.array(train_rep))])
accy = np.array(acc)
Code
plt.plot(kvals, accy[:, 0], '.-', label = 'Accuracy on Test Data')
plt.plot(kvals, accy[:, 1], '.-', label = 'Accuracy on Training Data')
plt.xlabel(r'$k$')
plt.ylabel('Accuracy')
plt.title('Train/Test Comparison of $k$-NN')
plt.legend(loc = 'best')
plt.show()


Based on the generalization error, i.e., accuracy on the held-out test data, it looks like \(k = 2\) is the best choice.

Here is the decision boundary for \(k\)-NN with \(k = 2\).

Code
x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
plot_step = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
Code
np.random.seed(1)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.5)

k = 2
knn = KNeighborsClassifier(n_neighbors = k)  
knn.fit(X_train, y_train)
y_pred_train = knn.predict(X_train)
y_pred_test = knn.predict(X_test)

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize = (12, 5))
plt.subplot(1, 2, 1)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=30)
plt.axis('equal')
plt.axis('off')
plt.xlim((x_min, x_max))
plt.ylim((y_min, y_max))
plt.title(f'{k}-NN - Training Data\nAccuracy: {knn.score(X_train, y_train)}')

plt.subplot(1, 2, 2)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=30)
plt.axis('equal')
plt.axis('off')
plt.xlim((x_min, x_max))
plt.ylim((y_min, y_max))
plt.title(f'{k}-NN - Test Data\nAccuracy: {knn.score(X_test, y_test)}')
plt.show()

Real Data

To explore a few more issues, we’ll now turn to some famous datasets that have been extensively studied in the past.

The Iris Dataset

The Iris dataset is a famous dataset used by Ronald Fisher in a classic 1936 paper on classification.

R. A. Fisher

Quoting from Wikipedia:

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.


I. setosa

I. versicolor

I. virginica

Load Iris Dataset

Code
iris = datasets.load_iris()
X = iris.data
y = iris.target
ynames = iris.target_names
print(X.shape, y.shape)
print(X[1, :])
print(iris.target_names)
print(y)
(150, 4) (150,)
[4.9 3.  1.4 0.2]
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Hyperparameters: k-NN

First, we’ll explore setting the hyperparameters. We start with \(k\)-NN.

To set the hyperparameter \(k\), we evaluate error on the test set for many train/test splits:

Code
X = iris.data
y = iris.target

kvals = range(2, 20)
nreps = 50

acc = []
std = []
np.random.seed(0)
for k in kvals:
    test_rep = []
    train_rep = []
    for i in range(nreps):
        X_train, X_test, y_train, y_test = model_selection.train_test_split(
            X, y, test_size = 0.33)
        knn = KNeighborsClassifier(n_neighbors = k)    
        knn.fit(X_train, y_train)
        train_rep.append(knn.score(X_train, y_train))
        test_rep.append(knn.score(X_test, y_test))
    acc.append([np.mean(np.array(test_rep)), np.mean(np.array(train_rep))])
    std.append([np.std(np.array(test_rep)), np.std(np.array(train_rep))])
Code
plt.figure(figsize= (6, 4))
accy = np.array(acc)
stds = np.array(std)/np.sqrt(nreps)
print(f'Max Test Accuracy at k = {kvals[np.argmax(accy[:, 0])]} with accuracy {np.max(accy[:, 0]):.03f}')
plt.plot(kvals, accy[:, 0], '.-', label = 'Accuracy on Test Data')
plt.plot(kvals, accy[:, 1], '.-', label = 'Accuracy on Training Data')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.xticks(kvals)
plt.legend(loc = 'best')
plt.show()
Max Test Accuracy at k = 13 with accuracy 0.969


It looks like \(k\) = 13 is the best-performing value of the hyperparameter.

Can we be sure?

Be careful! Each point in the above plot is the mean of 50 random train/test splits!

If we are going to be sure that \(k\) = 13 is best, then it should be statistically distinguishable from the other values.

To make this call, let’s plot \(\pm 1 \sigma\) confidence intervals on the mean values.

See the Probability Refresher for details on the proper formula.
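
Concretely, the error bars plotted below show the standard error of the mean over the nreps = 50 random splits, which is what the line stds = np.array(std)/np.sqrt(nreps) above computes:

\[ \mathrm{SE} = \frac{\sigma}{\sqrt{n_\mathrm{reps}}} \]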


Code
plt.errorbar(kvals, accy[:, 0], stds[:, 0], label = 'Accuracy on Test Data')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend(loc = 'lower center')
plt.xticks(kvals)
plt.title(r'Test Accuracy with $\pm 1\sigma$ Errorbars')
plt.show()


It looks like \(k\) = 13 is a reasonable value, although a case can be made that 9 and 11 are not statistically distinguishable from 13.

To gain insight into the complexity of the model for \(k\) = 13, let’s look at the decision boundary.

We will re-run the classifier using only two (of four) features for visualization purposes.


Code
# Create color maps
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we will use only the first two (of four) features, so we can visualize
X = X_train[:, :2] 
h = .02  # step size in the mesh
k = 13
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y_train)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(6, 4))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y_train, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title(f"3-Class $k$-NN classification ($k$ = {k})")
plt.show()

There are a few artifacts, though overall this looks like a reasonably smooth set of decision boundaries.

MNIST dataset

NIST used to be called the “National Bureau of Standards.” These are the folks who bring you the reference meter, reference kilogram, etc.

NIST constructed datasets for machine learning of handwritten digits. These were collected from Census Bureau employees and also from high-school students.

This dataset has been used repeatedly for many years to evaluate classifiers. (Below we work with scikit-learn's digits dataset, a smaller 8 \(\times\) 8 version derived from the same NIST data.) For a peek at some of the work done with this dataset you can visit this link.

MNIST Digits

Code
import sklearn.utils as utils

digits = datasets.load_digits()
X, y = utils.shuffle(digits.data, digits.target, random_state = 1)

print ('Data shape: {}'.format(X.shape))
print ('Data labels: {}'.format(y))
print ('Unique labels: {}'.format(digits.target_names))
Data shape: (1797, 64)
Data labels: [1 5 0 ... 9 1 5]
Unique labels: [0 1 2 3 4 5 6 7 8 9]

An individual item is an \(8 \times 8\) image, encoded as a matrix:

Code
digits.images[3]
array([[ 0.,  0.,  7., 15., 13.,  1.,  0.,  0.],
       [ 0.,  8., 13.,  6., 15.,  4.,  0.,  0.],
       [ 0.,  2.,  1., 13., 13.,  0.,  0.,  0.],
       [ 0.,  0.,  2., 15., 11.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1., 12., 12.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1., 10.,  8.,  0.],
       [ 0.,  0.,  8.,  4.,  5., 14.,  9.,  0.],
       [ 0.,  0.,  7., 13., 13.,  9.,  0.,  0.]])

Here we show the matrix as an image.

Code
fig, ax = plt.subplots()
ax.matshow(digits.images[3], cmap='gray_r')
plt.show()


It is easier to visualize if we blur the pixels a little bit.

Code
plt.rc('image', cmap = 'binary', interpolation = 'bilinear')
plt.figure(figsize = (5, 5))
plt.axis('off')
plt.imshow(digits.images[3])
plt.show()


Here are some more samples from the dataset.

Code
for t in range(3):
    plt.figure(figsize = (8, 2))
    for j in range(4):
        plt.subplot(1, 4, 1 + j)
        plt.imshow(X[4*t + j].reshape(8, 8))
        plt.axis('off')
plt.show()


Each item is an 8 \(\times\) 8 image; to use it as a feature vector, we reshape it into a vector of length 64.
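
As a quick check, the flattened feature vectors are just the images unrolled; a one-line sanity check using the digits object loaded above:

Code
# the 64-long feature vector is the 8x8 image unrolled row by row
print(np.allclose(digits.images[3].reshape(64), digits.data[3]))   # True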

To do model selection, we will again average over many train/test splits.

However, since \(k\)-NN classification is slow on large datasets, we may decide to limit the size of our testing set to speed up testing.

How does the train/test split affect results?

Let’s consider two cases:

  1. Train: 90% of data, Test: 10% of data
  2. Train: 67% of data, Test: 33% of data

Code
def test_knn(kvals, test_fraction, nreps):
    acc = []
    std = []
    np.random.seed(0)
    #
    for k in kvals:
        test_rep = []
        train_rep = []
        for i in range(nreps):
            X_train, X_test, y_train, y_test = model_selection.train_test_split(
                X, y, test_size = test_fraction)
            knn = KNeighborsClassifier(n_neighbors = k)    
            knn.fit(X_train, y_train)
            test_rep.append(knn.score(X_test, y_test))
        acc.append(np.mean(np.array(test_rep)))
        std.append(np.std(np.array(test_rep)))
    return(np.array(acc), np.array(std)/np.sqrt(nreps))

test_fraction1 = 0.33
accy1, stds1 = test_knn(range(2, 20), test_fraction1, 50)
test_fraction2 = 0.10
accy2, stds2 = test_knn(range(2, 20), test_fraction2, 50)
Code
plt.figure(figsize = (6, 4))
plt.errorbar(kvals, accy1, stds1, 
             label = f'{test_fraction1:.0%} Used for Testing; accuracy {np.max(accy1):.03f}')
plt.errorbar(kvals, accy2, stds2, 
             label = f'{test_fraction2:.0%} Used for Testing; accuracy {np.max(accy2):.03f}')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend(loc = 'best')
plt.xticks(kvals)
best_k = kvals[np.argmax(accy1)]
plt.title(f'Test Accuracy with $\\pm 1\\sigma$ Error Bars')
plt.show()


These plots illustrate two important principles.

  • With more training data, the classifier performs better.
  • With less testing data, the test results are more variable, though testing is faster.

The key decision here is what value to choose for \(k\). So it makes sense to use the 33% test split, because the smaller error bars give us better confidence in our decision.


We can get a sense of why \(k\)-NN can succeed at this task by looking at the nearest neighbors of some points.

Code
knn = KNeighborsClassifier(n_neighbors = 3)    
knn.fit(X, y)
neighbors = knn.kneighbors(X[:3,:], n_neighbors=3, return_distance=False)
Code
plt.rc("image", cmap="binary")  # this sets a black on white colormap
for t in range(3):
    plt.figure(figsize=(8, 2))
    plt.subplot(1, 4, 1)
    plt.imshow(X[t].reshape(8, 8))
    plt.axis('off')
    plt.title("Query")
    # plot the three nearest neighbors from the fitted data
    # (the query point itself is in the fitted data, so it appears as its own nearest neighbor)
    for i in [0, 1, 2]:
        plt.subplot(1, 4, 2 + i)
        plt.title("neighbor {}".format(i))
        plt.imshow(X[neighbors[t, i]].reshape(8, 8))
        plt.axis('off')
plt.show()
