\(k\)-Means Clustering

Victorian England

1854 Cholera Outbreak

London cholera outbreak 1

  • There was a horrific cholera outbreak in Soho, London, in 1854.
  • Common wisdom at the time was that disease spread by breathing “foul air” (miasma).
  • The London sewer system had not yet reached Soho.
  • Most homes had cesspits under the floor, which often overflowed.
  • “Night Soil Men” would regularly collect the waste and sell it to farmers or dump it in the Thames.

John Snow

  • John Snow, a local physician, extensively studied the patterns of cholera illness across Soho.
  • In the course of his studies, his attention was drawn to one neighborhood around Broad Street.
  • In 10 days, 500 people in the area died.

John Snow

John Snow’s Map

London cholera outbreak
  • In uncovering the source of this outbreak, Snow prepared this map.
  • From this map he could clearly see the deaths were clustered around an area.
  • The neighborhood was all served by a water pump on Broad St.
  • The pump handle was removed and illnesses decreased dramatically.

Broad St. water pump

J. Snow Book
  • He later published his results2
  • Results from the cluster map.
  • Results from a double blind study of two neighborhoods drawing water upriver and downriver of the polluted portion of the Thames.
  • Other anecdotes of visitors to Soho, etc.

References and Further Reading

Images and information taken from Wikipedia, the National Library of Medicine, the Internet Archive, and MSU’s John Snow Archive.

John Snow’s original data is recreated here.

Clustering

Clustering

Clustering is a very important way of discovering structure in data.

It is so important because it is common for data to show clusters.

  • Locations where millionaires live
  • The number of hours people work each week
  • Demographics (“soccer parents”, “bored retirees”, “unemployed millennials”, etc)
  • We can often simplify or compress our data if we recognize the existence of clusters.
  • Further, we can often interpret clusters by assigning them labels.
  • However, note that these categories or “labels” are assigned after the fact.
  • And, we may not be able to interpret clusters or assign them labels in some cases.
  • That is, clustering represents the first example we will see of unsupervised learning.

Supervised vs Unsupervised

Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.

Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.

Applications of Clustering

  • Image Processing
    • Cluster images based on their visual content.
    • Compress images based on color clusters.
  • Web Mining
    • Cluster groups of users based on webpage access patterns.
    • Cluster web pages based on their content.
  • Bioinformatics
    • Cluster similar proteins together (by structure or function).
    • Cluster cell types (by gene activity).
  • And many more …

The Clustering Problem

When we apply clustering, what problem are we trying to solve?

We will answer this question informally at first.

(But soon we will look at formal criteria!)

Informally, a clustering is:

a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.

(keep in mind that if we use a distance function as a dissimilarity measure, then “far” implies “different”)

So we want our clustering algorithm to:

  • minimize intra-cluster distances.
  • maximize inter-cluster distances.

Here are the basic questions we need to ask about clustering:

  • What is the right kind of “similarity” to use?
  • What is a good partition of objects?
    • i.e., how is the quality of a solution measured?
  • How to find a good partition?
    • Are there efficient algorithms?
    • Are there algorithms that are guaranteed to find good clusters?

Now note that even with our more-formal discussion, the criteria for deciding on a “best” clustering can still be ambiguous.

To accommodate the ambiguity here, one approach is to seek a hierarchical clustering.

That is, a set of nested clusters organized as a tree.

We’ll discuss hierarchical clustering in an upcoming lecture.


For today, we’ll focus on partitional clustering.

In a partitional clustering, the points are divided into a set of non-overlapping groups.

In a partitional clustering:

  • Each object belongs to one, and only one, cluster.
  • The set of clusters covers all the objects.

We are going to assume for now that the number of clusters is given in advance.

We will denote the number of clusters as \(k\).

The \(k\)-means Algorithm

Assumptions

Now, we are ready to state our first formalization of the clustering problem.

We will assume that

  • data items are represented by points in \(d\)-dimensional space, \(\mathbb{R}^d\), i.e., each item has \(d\) features,
  • the number of points, \(N\), is given, and
  • the number of clusters \(K\) is given.

Minimizing a Cost Function

Find \(K\) disjoint clusters \(C_1, \dots, C_K\), each described by a point \(c_k\) (called the center, centroid, or mean of the cluster), that together minimize

\[ \sum_{k=1}^K \sum_{x\in C_k} \Vert x-c_k\Vert^2. \]

The literature sometimes calls this the Inertia of the clustering.

See Norms in Distances and Timeseries for a refresher.
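
As a concrete illustration, here is a minimal sketch (with made-up data and centers, not part of the lecture datasets) of computing this inertia for a fixed assignment of points to their nearest centers:

Code
import numpy as np

# Toy data: five 2-d points and two candidate cluster centers (made up for illustration)
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.4], [5.0, 5.0], [5.2, 4.8]])
centers = np.array([[0.2, 0.2], [5.1, 4.9]])

# Assign each point to its nearest center ...
labels = np.argmin(((X[:, np.newaxis, :] - centers) ** 2).sum(axis=2), axis=1)

# ... then sum the squared distances of each point to its assigned center
inertia = sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in range(len(centers)))
print(inertia)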


We now have a formal definition of a clustering.

This is not the only definition possible, but it is an intuitive and simple one.

How hard is it to solve this problem?

  • \(K=1\) and \(K=N\) are easy special cases. Why?
  • But, this problem is NP-hard if the dimension of the data is at least 2.
    • We don’t expect that there is any exact, efficient algorithm in general.

Nonetheless, there is a simple algorithm that works quite well in practice!

\(k\)-means as NP-hard

For context, NP-hard is a term used in the study of computational complexity.

Problems in P are those that can be solved in polynomial time.

Problems in NP are those that can be verified in polynomial time but not necessarily solved in polynomial time.

NP-hard problems are those that are at least as hard as the hardest problems in NP.

NP-complete problems are those that are both NP-hard and in NP.

You can prove that the \(k\)-means problem is NP-hard by finding a reduction from a known NP-hard problem such as the Partition Problem.

Note

The \(k\)-means problem is NP-hard.

The \(k\)-means Algorithm

  • There is a “classic” algorithm for this problem.
  • It was voted among the top-10 algorithms in data mining!3
  • It is such a good idea that it has been independently discovered multiple times.
  • It was first discovered by Lloyd in 1957, so it is often called Lloyd’s algorithm.
  • It is called the “\(k\)-means algorithm.”
  • Not to be confused with the \(k\)-means problem, for which this algorithm is a heuristic solution.

The others in the top 10 were C4.5, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
    • These can be randomly chosen data points, or chosen by some other method such as \(k\)-means++.
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
    • Nearer to \(c_j\) than to any other center.
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
    • In other words, \(c_j\) is the mean of the vectors in \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.
    • Convergence is declared when the cluster centers change by less than a threshold, the inertia changes by less than a threshold, or a maximum number of iterations is reached.

Let’s see this in practice with well-separated clusters and also look at the Within-Cluster Sum of Squares (WCSS).

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=0.60, random_state=42)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)
    
    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)
    
    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    
    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)

plt.tight_layout()
plt.show()


Let’s see this in practice with overlapping clusters and also look at the WCSS.

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=3.00, random_state=2)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)
    
    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)
    
    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    
    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)

plt.tight_layout()
plt.show()


Limitations of \(k\)-means

As you can see, \(k\)-means can work very well.

However, we don’t have any guarantees on the performance of \(k\)-means.

In particular, there are various settings in which \(k\)-means can fail to do a good job.


  1. \(k\)-means tries to find spherical clusters.

Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.

Code
# Author: Phil Roth <mr.phil.roth@gmail.com>
#         Arturo Amor <david-arturo.amor-quiroz@inria.fr>
# License: BSD 3 clause
#
# From https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html

import numpy as np
from sklearn.datasets import make_blobs, make_moons

n_samples = 1500
random_state = 170
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]

X, y = make_blobs(n_samples=n_samples, random_state=random_state)
X_aniso = np.dot(X, transformation)  # Anisotropic blobs
X_varied, y_varied = make_blobs(
    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state
    )  # Unequal variance
X_filtered = np.vstack(
    (X[y == 0][:500], X[y == 1][:100], X[y == 2][:10])
    )  # Unevenly sized blobs
y_filtered = [0] * 500 + [1] * 100 + [2] * 10

# Generate two half-moon clusters
X_moons, y_moons_true = make_moons(n_samples=200, noise=0.1, random_state=42)
Code
# Run above cell first

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Perform k-means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_moons_pred = kmeans.fit_predict(X_moons)

# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Plot ground truth
ax1.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_true, cmap='viridis')
ax1.set_title('Ground Truth')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# Plot k-means clustering result
ax2.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_pred, cmap='viridis')
ax2.set_title('K-means Clustering')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')

# Add cluster centers to the k-means plot
ax2.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

plt.tight_layout()
plt.show()

K-means clustering on a dataset with two non-convex (half-moon) clusters.

Code
# run previous two cells first

common_params = {
    "n_init": "auto",
    "random_state": random_state,
}

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_aniso)
axs[1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with anisotropic gaussian clusters.

  2. \(k\)-means tries to find equal-sized clusters.

For the same reason, \(k\)-means tends to try to balance the sizes of the clusters.

Code
# run previous three cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_filtered)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_filtered)
axs[1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with unequal-sized clusters.

  3. \(k\)-means is sensitive to the variance of the clusters.
Code
# run previous four cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_varied)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_varied)
axs[1].scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with unequal variance clusters.

  4. \(k\)-means is sensitive to the starting cluster centers.

If the initial guess (Step 1) is a bad one, \(k\)-means may get “stuck” in a bad solution.
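
The scikit-learn examples above don’t show this case directly, so here is a minimal sketch of the effect (reusing X, y, common_params, and KMeans from the cells above): we force a single run from a deliberately poor starting guess by passing an explicit init array with n_init=1, and compare its final inertia to a run with the default \(k\)-means++ initialization. Depending on the data, the poor start may converge to a noticeably worse local optimum.

Code
# run previous cells first (uses X, y, and common_params from above)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# A deliberately poor initialization: three starting centers taken from the same blob
bad_init = X[y == 0][:3]
kmeans_bad = KMeans(n_clusters=3, init=bad_init, n_init=1)
y_bad = kmeans_bad.fit_predict(X)

# Default (k-means++) initialization for comparison
kmeans_pp = KMeans(n_clusters=3, **common_params)
y_pp = kmeans_pp.fit_predict(X)

axs[0].scatter(X[:, 0], X[:, 1], c=y_bad)
axs[0].set_title(f"Poor initialization, inertia = {kmeans_bad.inertia_:.0f}")

axs[1].scatter(X[:, 0], X[:, 1], c=y_pp)
axs[1].set_title(f"k-means++ initialization, inertia = {kmeans_pp.inertia_:.0f}")

plt.show()

K-means results from a poor initialization (left) compared with the default initialization (right).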


  5. \(k\)-means is sensitive to the number of clusters.
Code
# run previous five cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=y)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=2, **common_params)
y_pred = kmeans.fit_predict(X)
axs[1].scatter(X[:, 0], X[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

plt.show()

K-means clustering on a dataset with an incorrect setting of the number-of-clusters parameter.

Choosing a Good Initialization

  • How can we avoid the kind of bad initialization we just saw?
  • A good strategy is to pick points that are distant from each other.
  • This strategy is called “\(k\)-means++”.
  • It works very well in practice, and the scikit-learn implementation uses it by default (a minimal usage sketch follows after this list).
  • We will explore it in more detail in the next lecture.
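
As a quick, minimal usage sketch (the synthetic data and parameter values here are only for illustration), scikit-learn exposes this choice through the init argument of KMeans:

Code
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# init="k-means++" (the default) spreads the initial centers far apart;
# init="random" picks them uniformly at random from the data points.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_demo)
km_rand = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X_demo)

print(km_pp.inertia_, km_rand.inertia_)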

Choosing the right \(k\)

Generally, we would say that, given some \(K\), the \(k\)-means algorithm “learns” the cluster centers – that is, the parameters of the model.

But we have not yet considered how to choose the right number of clusters.

That’s typically not something one knows in advance.

As an aside:

  • This parameter (\(K\)) is the first example we have seen of a hyperparameter.
  • A hyperparameter is a parameter that must be set before the model parameters can be learned.

Our basic strategy will be to:

  • Iterate through different \(K\) and use some criterion to decide which \(K\) is most appropriate (a minimal sketch of one such criterion follows after this list).
  • We will discuss this more in the next lecture.
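
As a preview, here is a minimal sketch of one commonly used criterion, the “elbow” heuristic based on inertia; the data is synthetic and only for illustration:

Code
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Fit k-means for a range of K values and record the inertia of each fit
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_demo).inertia_ for k in ks]

# Look for the "elbow": the K beyond which inertia stops decreasing sharply
plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.show()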

Feature Scales

Finally, given the tendency of \(k\)-means to look for spherical clusters, we should consider the scales of the various features.

In fact, in general when constructing or selecting a distance metric, one needs to think carefully about the scale of the features being used.

Unscaled Features

For example, consider the case where we are clustering people based on their age, income, and gender.

We might use age in years, income in dollars, and assign gender to the values \(\{0, 1\}\).

Thus, the following records:

  • Joe Smith, age 27, income USD 75,000, male
  • Eve Jones, age 45, income USD 42,000, female

Would be encoded in feature space as:

\[ \begin{bmatrix}27\\75000\\0\end{bmatrix},\begin{bmatrix}45\\42000\\1\end{bmatrix} \]

Unscaled Features, Continued

What would happen if we used Euclidean distance as our dissimilarity metric in this feature space?

(This is what \(k\)-means uses.)

Clearly, the influence of income would dominate the other two features. For example, a difference of gender is about as significant as a difference of one dollar of yearly income.

We are unlikely to expose gender-based differences if we cluster using this representation.
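
To make the problem concrete, here is a quick sketch of the Euclidean distance between the two (made-up) records above, with and without the income coordinate:

Code
import numpy as np

joe = np.array([27, 75_000, 0])   # age, income, gender
eve = np.array([45, 42_000, 1])

# Full Euclidean distance: dominated almost entirely by the income difference
print(np.linalg.norm(joe - eve))                            # roughly 33000

# Distance with the income coordinate zeroed out: age and gender only
print(np.linalg.norm((joe - eve) * np.array([1, 0, 1])))    # roughly 18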

The most common way to handle this is feature scaling.

Feature Scaling

The basic idea is to rescale each feature separately, so that its range of values is about the same as all other features.

For example, one may choose to:

  • shift each feature independently by subtracting the mean over all observed values.
    • This means that each feature is now centered on zero.
  • Then rescale each feature so that the standard deviation over all observed values is 1.
    • This means that the feature will have about the same range of values as all the others.

Feature Scaling Example

For example, let’s work with Bortkiewicz’s famous horse-kick data: the number of soldiers in the Prussian cavalry killed by horse kicks in each of the 20 years from 1875 to 1894, inclusive.

Ladislaus Bortkiewicz

Ladislaus Bortkiewicz4 Law of Small Numbers Book5

  • Ladislaus Bortkiewicz (1868 – 1931)
  • Wrote the book “Law of Small Numbers” in 1898
  • Showed how the horse kick data fits the Poisson model
  • More generally, showed that rare events in large populations can be statistically modeled using the Poisson distribution

Feature Scaling Example, Continued

Here is the horse kick data:

Code
# source: http://www.randomservices.org/random/data/HorseKicks.html
import pandas as pd
df = pd.read_table('data/HorseKicks.txt',index_col='Year',dtype='float')
counts = df.sum(axis=1)
counts
Year
1875.0     3.0
1876.0     5.0
1877.0     7.0
1878.0     9.0
1879.0    10.0
1880.0    18.0
1881.0     6.0
1882.0    14.0
1883.0    11.0
1884.0     9.0
1885.0     5.0
1886.0    11.0
1887.0    15.0
1888.0     6.0
1889.0    11.0
1890.0    17.0
1891.0    12.0
1892.0    15.0
1893.0     8.0
1894.0     4.0
dtype: float64

And here is a histogram of the yearly counts.

Code
counts.hist(bins=25,xlabelsize=16);
plt.xlabel('# of Kick Deaths')
plt.ylabel('Count')
plt.title('Histogram of Kick Deaths')
plt.show()

The average:

Code
counts.mean()
np.float64(9.8)

To standardize to zero mean and unit standard deviation, we can use pre-processing tools from the scikit-learn library.

The distribution after rescaling:

Code
from sklearn import preprocessing
counts_scaled = pd.DataFrame(preprocessing.scale(counts))
counts_scaled.hist(bins=25,xlabelsize=16);

With a new mean:

Code
counts_scaled.mean().values
array([-1.33226763e-16])

Notice that values that used to be positive have now become negative.

In some situations it may not be sensible to change non-negative values into something else. It may make more sense to map all values into a fixed range, for example \([0, 1]\).

Code
min_max_scaler = preprocessing.MinMaxScaler()
counts_minmax = min_max_scaler.fit_transform(counts.values.reshape(-1,1))
counts_minmax = pd.DataFrame(counts_minmax)
counts_minmax.hist(bins=25,xlabelsize=16);

Example Application of \(k\)-means

Here is a simple example of how \(k\)-means can be used to reduce color space and compress data.

Consider the following image.

  • Each color in the image is represented by an integer.
  • Typically we might use 24 bits for each integer (8 bits for R, G, and B).

Now find \(k=16\) clusters of pixels in three dimensional \((R, G, B)\) space and replace each pixel by its cluster center.

Because there are 16 centroids, we can represent each pixel by a 4-bit index, for a compression ratio of \(24/4=6\times\).

Here we cluster into 8 groups (3 bits) for a compression ratio \(24/3=8\times\).

Here we cluster into 4 groups (2 bits) for a compression ratio of \(24/2=12\times\).

Finally, we use 1 bit (2 color groups) for a compression ratio of \(24\times\).
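
The images themselves are not reproduced here, but a minimal sketch of the technique might look like the following; it uses one of scikit-learn's bundled sample images (an assumed stand-in; any RGB image would do) and \(k=16\) colors, fitting on a random subsample of pixels for speed:

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load an RGB image and flatten it into a list of (R, G, B) pixels in [0, 1]
img = load_sample_image("china.jpg") / 255.0
pixels = img.reshape(-1, 3)

# Fit k = 16 color clusters on a random subsample of pixels (for speed),
# then replace every pixel by its nearest cluster center
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), 10_000, replace=False)]
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(sample)
quantized = kmeans.cluster_centers_[kmeans.predict(pixels)].reshape(img.shape)

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].imshow(img)
axs[0].set_title('Original (24 bits per pixel)')
axs[1].imshow(quantized)
axs[1].set_title('16 colors (4 bits per pixel)')
for ax in axs:
    ax.axis('off')
plt.show()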

Recap and Next

Today we covered:

  • \(k\)-means clustering
  • strengths and weaknesses
  • Importance of initialization and cluster number
  • Feature scaling
  • Example application

Coming up next, we’ll look at some practical aspects of applying \(k\)-means.


Footnotes

  1. Public Domain, https://commons.wikimedia.org/w/index.php?curid=680455↩︎

  2. By John Snow - Published by C.F. Cheffins, Lith, Southampton Buildings, London, England, 1854 in Snow, John. On the Mode of Communication of Cholera, 2nd Ed, John Churchill, New Burlington Street, London, England, 1855.↩︎

  3. As determined at the 2006 IEEE International Conference on Data Mining↩︎

  4. By unknown author - MacTutor History of Mathematics: http://www-history.mcs.st-andrews.ac.uk/PictDisplay/Bortkiewicz.html, Public Domain, https://commons.wikimedia.org/w/index.php?curid=79219622↩︎

  5. Law of Small Numbers↩︎