\(k\)-Means Clustering

Victorian England

1854 Cholera Outbreak

London cholera outbreak¹

  • There was a horrific cholera outbreak in 1854 Soho, London.
  • Common wisdom at the time was that disease spread by breathing “foul air” (miasma).
  • The London sewer system had not yet reached Soho.
  • Most homes had cesspits under the floor, which often overflowed.
  • “Night Soil Men” would regularly collect the waste and sell it to farmers or dump it in the Thames.

John Snow

  • John Snow, a local physician, extensively studied the patterns of illness across Soho due to cholera.
  • In the course of his studies, his attention was drawn to one neighborhood around Broad Street.
  • In 10 days, 500 people in the area died.

John Snow

John Snow’s Map

London cholera outbreak
  • In uncovering the source of this outbreak, Snow prepared this map.
  • From this map he could clearly see the deaths were clustered around an area.
  • The neighborhood was all served by a water pump on Broad St.
  • The pump handle was removed and illnesses decreased dramatically.

Broad St. water pump

J. Snow Book
  • He later published his results.²
  • Results from the cluster map.
  • Results from a natural experiment comparing two neighborhoods drawing water upriver and downriver of the polluted portion of the Thames.
  • Other anecdotes of visitors to Soho, etc.

References and Further Reading

Images and information taken from Wikipedia, the National Library of Medicine, the Internet Archive, and MSU’s John Snow Archive.

John Snow’s original data is recreated here.

Clustering

Clustering

Clustering is a very important way of discovering structure in data.

It is so important because it is common for data to show clusters.

  • Locations where millionaires live
  • The number of hours people work each week
  • Demographics (“soccer parents”, “bored retirees”, “unemployed millennials”, etc.)
  • We can often simplify or compress our data if we recognize the existence of clusters.
  • Further, we can often interpret clusters by assigning them labels.
  • However, note that these categories or “labels” are assigned after the fact.
  • And, we may not be able to interpret clusters or assign them labels in some cases.
  • That is, clustering represents the first example we will see of unsupervised learning.

Supervised vs Unsupervised

Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.

Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.

Applications of Clustering

  • Image Processing
    • Cluster images based on their visual content.
    • Compress images based on color clusters.
  • Web Mining
    • Cluster groups of users based on webpage access patterns.
    • Cluster web pages based on their content.
  • Bioinformatics
    • Cluster similar proteins together (by structure or function).
    • Cluster cell types (by gene activity).
  • And many more …

The Clustering Problem

When we apply clustering, what problem are we trying to solve?

We will answer this question informally at first, then apply formal criteria.

Informally, a clustering is:

a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.

(keep in mind that if we use a distance function as a dissimilarity measure, then “far” implies “different”)

So we want our clustering algorithm to:

  • minimize intra-cluster distances.
  • maximize inter-cluster distances.

The Clustering Problem, continued

Here are the basic questions we need to ask about clustering:

  • What is the right kind of “similarity” to use?
  • What is a good partition of objects?
    • i.e., how is the quality of a solution measured?
  • How to find a good partition?
    • Are there efficient algorithms?
    • Are there algorithms that are guaranteed to find good clusters?

Now note that even with our more-formal discussion, the criteria for deciding on a “best” clustering can still be ambiguous.

Later, we will discuss hierarchical clustering, which lets us consider clustering at multiple scales.

Partitional Clustering

For now, we’ll focus on partitional clustering.

Points are divided into a set of non-overlapping groups:

  • Each object belongs to one, and only one, cluster. (mutually exclusive)
  • The set of clusters covers all the objects. (exhaustive)

We are going to assume for now that the number of clusters, \(k\), is given in advance.

The \(k\)-means Algorithm

Assumptions

Now, we are ready to state our first formalization of the clustering problem.

We will assume that

  • data items are represented by points in \(d\)-dimensional space, \(\mathbb{R}^d\), i.e., each item has \(d\) features,
  • the number of points, \(N\), is given, and
  • the number of clusters \(K\) is given.

Minimizing a Cost Function

Find \(K\) disjoint clusters \(C_1, \dots, C_K\), described by points \(c_1, \dots, c_K\) (called centers, centroids, or means), that minimize

\[ \underset{C_1, \dots, C_K}{\arg\min} \hspace{1em} \sum_{k=1}^K \hspace{1em} \sum_{x\in C_k} \Vert x-c_k\Vert^2. \]

  • This objective is called the Within-Cluster Sum of Squares (WCSS) or the inertia of the clustering.

See Norms in Distances and Timeseries for a refresher.
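
To make the objective concrete, here is a minimal sketch (not part of the lecture code) that evaluates the WCSS for a handful of made-up points, centers, and cluster assignments.

Code
import numpy as np

def wcss(X, centers, labels):
    # Sum of squared Euclidean distances from each point to its assigned center
    return np.sum((X - centers[labels]) ** 2)

# Tiny made-up example: four 2D points in two clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[0.05, 0.1], [5.1, 4.95]])
labels = np.array([0, 0, 1, 1])

print(f"WCSS = {wcss(X, centers, labels):.3f}")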


We now have a formal definition of a clustering.

This is not the only definition possible, but it is an intuitive and simple one.

How hard is it to solve this problem?

  • \(K=1\) and \(K=N\) are easy special cases. Why?
  • But, this problem is NP-hard if the dimension of the data is at least 2.
    • We don’t expect that there is any exact, efficient algorithm in general.

Nonetheless, there is a simple algorithm that works quite well in practice!

Aside: \(k\)-means as NP-hard

For context, NP-hard is a term used in the study of computational complexity.

Problems in P are those that can be solved in polynomial time, e.g. for \(n\) items, the time to solve the problem is \(O(n^2)\) or \(O(n^3)\).

Problems in NP are those that can be verified in polynomial time but not necessarily solved in polynomial time.

NP-hard problems are those that are at least as hard as the hardest problems in NP, not necessarily verifiable in polynomial time.

NP-complete problems are those that are both NP-hard and in NP, e.g. hardest problems in NP and verifiable in polynomial time.

You can prove that the \(k\)-means problem is NP-hard by finding a reduction from a known NP-hard problem such as the Partition Problem.

The \(k\)-means Algorithm

  • There is a “classic” algorithm for this problem.
  • It was voted among the top-10 algorithms in data mining!³
  • It is such a good idea that it has been independently discovered multiple times.
  • It was first discovered by Lloyd in 1957, so it is often called Lloyd’s algorithm.
  • It is called the “\(k\)-means algorithm.”
  • Not to be confused with the \(k\)-means problem, for which this algorithm is a heuristic solution.

The others from the top 10 were C4.5, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
    • These can be randomly chosen data points, or chosen by some other method such as \(k\)-means++.
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
    • Nearer to \(c_j\) than to any other center.
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
    • In other words, \(c_j\) is the mean of the vectors in \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.
    • Convergence: either the change in the cluster centers falls below a threshold, the change in inertia falls below a threshold, or a maximum number of iterations is reached.

Let’s see this in practice with well-separated clusters and also look at the Within-Cluster Sum of Squares (WCSS).

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=0.60, random_state=42)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)
    
    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)
    
    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    
    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)

plt.tight_layout()
plt.show()


Let’s see this in practice with overlapping clusters and also look at the WCSS.

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=3.00, random_state=2)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)
    
    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)
    
    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    
    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)

plt.tight_layout()
plt.show()


Limitations of \(k\)-means

As you can see, \(k\)-means can work very well.

However, we don’t have any guarantees on the performance of \(k\)-means.

In particular, there are various settings in which \(k\)-means can fail to do a good job.

Limitation: non-spherical clusters

Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.

Code
# Author: Phil Roth <mr.phil.roth@gmail.com>
#         Arturo Amor <david-arturo.amor-quiroz@inria.fr>
# License: BSD 3 clause
#
# From https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html

import numpy as np
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate two half-moon clusters
X_moons, y_moons_true = make_moons(n_samples=200, noise=0.1, random_state=42)

# Perform k-means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_moons_pred = kmeans.fit_predict(X_moons)

# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Plot ground truth
ax1.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_true, cmap='viridis')
ax1.set_title('Ground Truth')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# Plot k-means clustering result
ax2.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_pred, cmap='viridis')
ax2.set_title('K-means Clustering')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')

# Add cluster centers to the k-means plot
ax2.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

plt.tight_layout()
plt.show()

K-means clustering on a dataset with non-spherical (half-moon) clusters.

Limitation: anisotropic clusters

Code
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 1500
random_state = 170
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]

X, y = make_blobs(n_samples=n_samples, random_state=random_state)
X_aniso = np.dot(X, transformation)  # Anisotropic blobs
X_varied, y_varied = make_blobs(
    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state
    )  # Unequal variance
X_filtered = np.vstack(
    (X[y == 0][:500], X[y == 1][:100], X[y == 2][:10])
    )  # Unevenly sized blobs
y_filtered = [0] * 500 + [1] * 100 + [2] * 10

# run previous two cells first

common_params = {
    "n_init": "auto",
    "random_state": random_state,
}

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_aniso)
axs[1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with anisotropic Gaussian clusters.

Limitation: unequal-sized clusters

For the same reason, \(k\)-means tends to try to balance the sizes of the clusters.

Code
# run previous three cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_filtered)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_filtered)
axs[1].scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with unequal-sized clusters.

Limitation: unequal variance clusters

Code
# run previous four cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X_varied[:, 0], X_varied[:, 1], c=y_varied)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=3, **common_params)
y_pred = kmeans.fit_predict(X_varied)
axs[1].scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

# plt.suptitle("Unexpected KMeans clusters").set_y(0.95)
plt.show()

K-means clustering on a dataset with unequal variance clusters.

Limitation: sensitive to starting cluster centers

If the initial guess (Step 1) is a bad one, \(k\)-means may get “stuck” in a bad solution.
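
As a quick illustration, here is a minimal sketch (not the code used for the figures in this lecture) that generates blob data like the earlier example and compares a deliberately poor, hand-picked initialization against scikit-learn’s default k-means++ initialization.

Code
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Well-separated blobs, as in the earlier example
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)

# A deliberately poor initialization: the three points closest to the first sample
dists = np.linalg.norm(X - X[0], axis=1)
bad_init = X[np.argsort(dists)[:3]]

# With an explicit init array, only a single run is performed (n_init=1)
km_bad = KMeans(n_clusters=3, init=bad_init, n_init=1).fit(X)

# Default k-means++ initialization with several restarts, for comparison
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)

# A bad start often leaves Lloyd's algorithm in a worse local optimum (higher inertia)
print(f"Inertia with poor manual init: {km_bad.inertia_:.1f}")
print(f"Inertia with k-means++ init:   {km_pp.inertia_:.1f}")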

Limitation: sensitive to number of clusters

Code
# run previous five cells first

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=y)
axs[0].set_title("Ground Truth")

kmeans = KMeans(n_clusters=2, **common_params)
y_pred = kmeans.fit_predict(X)
axs[1].scatter(X[:, 0], X[:, 1], c=y_pred)
axs[1].set_title("K-means Clustering")

# Add cluster centers to the k-means plot
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='x', s=200, linewidths=3, color='r', zorder=10)

plt.show()

K-means clustering on a dataset with incorrect number of clusters parameter.

Choosing a Good Initialization

  • How can we avoid the kind of bad initialization we just saw?
  • A good strategy is to pick points that are distant to each other.
  • This strategy is called “\(k\)-means++”.
  • It works very well in practice, and the scikit-learn implementation uses it by default.
  • We will explore it in more detail in the next lecture.

Choosing the right \(k\)

Given some \(K\), the \(k\)-means algorithm “learns” the cluster centers.

How do we choose the right number of clusters, \(K\)?

As an aside:

  • This parameter (\(K\)) is the first example we have seen of a hyperparameter.
  • A hyperparameter is a parameter that must be set before the model parameters can be learned.

Our basic strategy will be to:

  • Iterate through different \(K\) and use some criterion to decide which \(K\) is most appropriate.
  • We will discuss this more in the next lecture.

Feature Scales

Finally, given the tendency of \(k\)-means to look for spherical clusters, we should consider the scales of the various features.

In fact, in general when constructing or selecting a distance metric, one needs to think carefully about the scale of the features being used.

Unscaled Features

For example, consider the case where we are clustering people based on their age, income, and gender.

We might use age in years, income in dollars, and assign gender to the values \(\{0, 1\}\).

Thus, the following records:

  • Joe Smith, age 27, income USD 75,000, male
  • Eve Jones, age 45, income USD 42,000, female

Would be encoded in feature space as:

\[ \begin{bmatrix}27\\75000\\0\end{bmatrix},\begin{bmatrix}45\\42000\\1\end{bmatrix} \]

Unscaled Features, Continued

What would happen if we used Euclidean distance as our dissimilarity metric in this feature space?

(This is what \(k\)-means uses.)

Clearly, the influence of income would dominate the other two features. For example, a difference of gender is about as significant as a difference of one dollar of yearly income.

We are unlikely to expose gender-based differences if we cluster using this representation.

The most common way to handle this is feature scaling.

Feature Scaling

The basic idea is to rescale each feature separately, so that its range of values is about the same as all other features.

For example, one may choose to:

  • shift each feature independently by subtracting the mean over all observed values.
    • This means that each feature is now centered on zero.
  • Then rescale each feature so that the standard deviation over all observed values is 1.
    • This means that the feature will have about the same range of values as all the others.
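
As a concrete illustration, here is a minimal sketch that applies exactly this procedure to the two example records above; with only two observations it is purely illustrative, not a realistic preprocessing step.

Code
import numpy as np

# The two example records: [age, income, gender]
X = np.array([[27.0, 75000.0, 0.0],
              [45.0, 42000.0, 1.0]])

# Before scaling, Euclidean distance is dominated by income
print(f"Distance before scaling: {np.linalg.norm(X[0] - X[1]):.1f}")

# Standardize: subtract the mean of each feature, then divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
print(f"Distance after scaling:  {np.linalg.norm(X_scaled[0] - X_scaled[1]):.1f}")

# Equivalently: sklearn.preprocessing.StandardScaler().fit_transform(X)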

\(k\)-means Clustering Examples

Example 1: Customer Segmentation

Customer Segmentation

Another powerful application of \(k\)-means clustering is customer segmentation in e-commerce.

  • Businesses want to understand their customers to provide better service and targeted marketing
  • Group customers with similar behaviors and characteristics \(\rightarrow\) personas
  • This enables personalized marketing campaigns, product recommendations, and customer service strategies

E-commerce Customer Data

Let’s synthesize e-commerce data to segment customers based on their purchasing behavior.

The data contains four underlying segments: high-value frequent buyers, moderate buyers, bargain hunters, and occasional buyers.

Segment Percentage Annual Spending Average Order Value Total Orders
High-value 20% \(120 \pm 20\) \(15 \pm 3\) \(450 \pm 80\)
Moderate 40% \(60 \pm 15\) \(8 \pm 2\) \(250 \pm 50\)
Bargain 25% \(25 \pm 8\) \(12 \pm 4\) \(150 \pm 30\)
Occasional 15% \(15 \pm 5\) \(3 \pm 1\) \(300 \pm 60\)

E-commerce Customer Data

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create realistic e-commerce customer data
np.random.seed(42)
n_customers = 1000

# Generate customer segments with different characteristics
# Segment 1: High-value, frequent buyers (20% of customers)
high_value = np.random.normal([120, 15, 450], [20, 3, 80], (200, 3))

# Segment 2: Moderate buyers (40% of customers)
moderate = np.random.normal([60, 8, 250], [15, 2, 50], (400, 3))

# Segment 3: Bargain hunters (25% of customers)
bargain = np.random.normal([25, 12, 150], [8, 4, 30], (250, 3))

# Segment 4: Occasional buyers (15% of customers)
occasional = np.random.normal([15, 3, 300], [5, 1, 60], (150, 3))

# Combine all segments
customer_data = np.vstack([high_value, moderate, bargain, occasional])

# Add some noise and shuffle
customer_data += np.random.normal(0, 5, customer_data.shape)
np.random.shuffle(customer_data)

# Create DataFrame
df = pd.DataFrame(customer_data, columns=['Annual_Spending', 'Avg_Order_Value', 'Total_Orders'])
df['Customer_ID'] = range(1, len(df) + 1)

print("E-commerce Customer Data Head:")
print(df.head().round(2))


print("\n\nE-commerce Customer Data Summary    :")
print(df.describe().round(2))
E-commerce Customer Data Head:
   Annual_Spending  Avg_Order_Value  Total_Orders  Customer_ID
0             0.27             1.67        325.07            1
1            27.74             4.78        127.63            2
2            43.86             6.75        262.73            3
3           132.78             9.23        345.36            4
4            83.19             9.52        277.84            5


E-commerce Customer Data Summary    :
       Annual_Spending  Avg_Order_Value  Total_Orders  Customer_ID
count          1000.00          1000.00       1000.00      1000.00
mean             57.12             9.39        275.20       500.50
std              39.33             6.80        116.85       288.82
min              -8.21           -10.90         63.74         1.00
25%              23.26             4.87        178.72       250.75
50%              50.96             9.26        257.75       500.50
75%              78.22            13.76        337.32       750.25
max             166.33            35.01        760.47      1000.00

Understanding Customer Features

Code
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Annual Spending
axes[0].hist(df['Annual_Spending'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Distribution of Annual Spending')
axes[0].set_xlabel('Annual Spending ($)')
axes[0].set_ylabel('Number of Customers')

# Average Order Value
axes[1].hist(df['Avg_Order_Value'], bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1].set_title('Distribution of Average Order Value')
axes[1].set_xlabel('Average Order Value ($)')
axes[1].set_ylabel('Number of Customers')

# Total Orders
axes[2].hist(df['Total_Orders'], bins=30, alpha=0.7, color='salmon', edgecolor='black')
axes[2].set_title('Distribution of Total Orders')
axes[2].set_xlabel('Total Orders')
axes[2].set_ylabel('Number of Customers')

plt.tight_layout()
plt.show()

Distribution of customer features

Can you see clusters?

Feature Scaling is Critical

Before clustering, we need to scale our features since they have very different ranges:

Code
# Prepare features for clustering
features = ['Annual_Spending', 'Avg_Order_Value', 'Total_Orders']
X = df[features].values

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original feature ranges:")
print(f"Annual Spending: ${X[:, 0].min():.0f} - ${X[:, 0].max():.0f}")
print(f"Avg Order Value: ${X[:, 1].min():.0f} - ${X[:, 1].max():.0f}")
print(f"Total Orders: {X[:, 2].min():.0f} - {X[:, 2].max():.0f}")

print("\nScaled feature ranges:")
print(f"Annual Spending: {X_scaled[:, 0].min():.2f} - {X_scaled[:, 0].max():.2f}")
print(f"Avg Order Value: {X_scaled[:, 1].min():.2f} - {X_scaled[:, 1].max():.2f}")
print(f"Total Orders: {X_scaled[:, 2].min():.2f} - {X_scaled[:, 2].max():.2f}")
Original feature ranges:
Annual Spending: $-8 - $166
Avg Order Value: $-11 - $35
Total Orders: 64 - 760

Scaled feature ranges:
Annual Spending: -1.66 - 2.78
Avg Order Value: -2.99 - 3.77
Total Orders: -1.81 - 4.16

Searching Cluster Numbers

What cluster number should we pick?

Code
# Calculate WCSS for different numbers of clusters
wcss = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.title('WCSS vs Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Elbow method for determining optimal number of clusters

Customer Segmentation Results

Visualize in 2D using PCA, which we’ll cover later.

Code
# Perform k-means clustering with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the dataframe
df['Cluster'] = cluster_labels

# Visualize clusters in 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(12, 5))

# Plot 1: PCA visualization
plt.subplot(1, 2, 1)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('Customer Segments (PCA View)')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.legend()

# Plot 2: Feature space visualization
plt.subplot(1, 2, 2)
plt.scatter(df['Annual_Spending'], df['Avg_Order_Value'], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.title('Customer Segments (Annual Spending vs Avg Order Value)')
plt.xlabel('Annual Spending ($)')
plt.ylabel('Average Order Value ($)')

plt.tight_layout()
plt.show()

K-means customer segmentation results

Customer Segment Profiles

Code
# Analyze each cluster
cluster_summary = df.groupby('Cluster').agg({
    'Annual_Spending': ['mean', 'std'],
    'Avg_Order_Value': ['mean', 'std'],
    'Total_Orders': ['mean', 'std'],
    'Customer_ID': 'count'
}).round(2)

cluster_summary.columns = ['Avg_Annual_Spending', 'Std_Annual_Spending', 
                          'Avg_Order_Value', 'Std_Order_Value',
                          'Avg_Total_Orders', 'Std_Total_Orders', 'Count']

print("Customer Segment Profiles:")
print("=" * 60)
for cluster in sorted(df['Cluster'].unique()):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\nSegment {cluster} ({len(cluster_data)} customers):")
    print(f"  • Average Annual Spending: ${cluster_data['Annual_Spending'].mean():.0f}")
    print(f"  • Average Order Value: ${cluster_data['Avg_Order_Value'].mean():.0f}")
    print(f"  • Average Total Orders: {cluster_data['Total_Orders'].mean():.1f}")
Customer Segment Profiles:
============================================================

Segment 0 (193 customers):
  • Average Annual Spending: $121
  • Average Order Value: $15
  • Average Total Orders: 458.9

Segment 1 (251 customers):
  • Average Annual Spending: $28
  • Average Order Value: $14
  • Average Total Orders: 161.1

Segment 2 (255 customers):
  • Average Annual Spending: $26
  • Average Order Value: $2
  • Average Total Orders: 268.5

Segment 3 (301 customers):
  • Average Annual Spending: $67
  • Average Order Value: $8
  • Average Total Orders: 258.3

For comparison, recall the ground-truth segment parameters used to generate the data:

Segment Percentage Annual Spending Average Order Value Total Orders
High-value 20% \(120 \pm 20\) \(15 \pm 3\) \(450 \pm 80\)
Moderate 40% \(60 \pm 15\) \(8 \pm 2\) \(250 \pm 50\)
Bargain 25% \(25 \pm 8\) \(12 \pm 4\) \(150 \pm 30\)
Occasional 15% \(15 \pm 5\) \(3 \pm 1\) \(300 \pm 60\)

Example 2: Plant Classification Based on Morphology

Plant Species Classification with Iris Data

Let’s explore another real-world application using the famous Iris dataset from scikit-learn.

  • The Iris dataset contains measurements of 150 iris flowers from 3 species
  • Each flower has 4 features: sepal length, sepal width, petal length, and petal width
  • This is a classic example where we know the true species but want to see if k-means can discover these natural groupings

Loading and Exploring the Iris Dataset

Code
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print("Iris Dataset Information:")
print("=" * 40)
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print(f"Target names: {target_names}")
print(f"Number of species: {len(target_names)}")

# Display first few samples
print(f"\nFirst 5 samples:")
print(f"{'Sepal Length':<12} {'Sepal Width':<11} {'Petal Length':<12} {'Petal Width':<11} {'Species':<8}")
print("-" * 60)
for i in range(5):
    print(f"{X[i, 0]:<12.1f} {X[i, 1]:<11.1f} {X[i, 2]:<12.1f} {X[i, 3]:<11.1f} {target_names[y[i]]:<8}")
Iris Dataset Information:
========================================
Number of samples: 150
Number of features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Number of species: 3

First 5 samples:
Sepal Length Sepal Width Petal Length Petal Width Species 
------------------------------------------------------------
5.1          3.5         1.4          0.2         setosa  
4.9          3.0         1.4          0.2         setosa  
4.7          3.2         1.3          0.2         setosa  
4.6          3.1         1.5          0.2         setosa  
5.0          3.6         1.4          0.2         setosa  

Visualizing Iris Data in 2D

Let’s compare features in pairs. Do you see clusters?

Code
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.ravel()

# Plot all pairwise combinations of features
feature_pairs = [(0, 1), (0, 2), (0, 3), (2, 3)]
titles = ['Sepal Length vs Width', 'Sepal Length vs Petal Length', 
          'Sepal Length vs Petal Width', 'Petal Length vs Width']

for i, (feat1, feat2) in enumerate(feature_pairs):
    axes[i].scatter(X[:, feat1], X[:, feat2], 
                   alpha=0.7, s=50)
    
    axes[i].set_xlabel(feature_names[feat1])
    axes[i].set_ylabel(feature_names[feat2])
    axes[i].set_title(titles[i])
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Visualizing Iris Data in 2D

Feature pairs with ground truth labels.

Code
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.ravel()

# Color mapping for species
colors = ['red', 'green', 'blue']

# Plot all pairwise combinations of features
feature_pairs = [(0, 1), (0, 2), (0, 3), (2, 3)]
titles = ['Sepal Length vs Width', 'Sepal Length vs Petal Length', 
          'Sepal Length vs Petal Width', 'Petal Length vs Width']

for i, (feat1, feat2) in enumerate(feature_pairs):
    for species in range(3):
        mask = y == species
        axes[i].scatter(X[mask, feat1], X[mask, feat2], 
                       c=colors[species], label=target_names[species], 
                       alpha=0.7, s=50)
    
    axes[i].set_xlabel(feature_names[feat1])
    axes[i].set_ylabel(feature_names[feat2])
    axes[i].set_title(titles[i])
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Question: What observations can you make?

Try different cluster numbers

Code
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Calculate WCSS for different numbers of clusters
wcss = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.title('Elbow Method for Iris Dataset')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

K-means Clustering on Iris Data

Results for \(k=3\) clusters.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix
import pandas as pd
import seaborn as sns

# Perform k-means clustering with k=3 (we know there are 3 species)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)

# Create confusion matrix
cm = confusion_matrix(y, cluster_labels)

# Convert to DataFrame for better visualization
cm_df = pd.DataFrame(cm, 
                    index=target_names, 
                    columns=[f'Cluster {i}' for i in range(3)])

print(f"\nConfusion Matrix:")
print("Rows: True species, Columns: Predicted clusters")
print(cm_df)

# Visualize clustering results in 2D
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: True species
scatter1 = axes[0].scatter(X[:, 2], X[:, 3], c=y, cmap='viridis', alpha=0.7, s=50)
axes[0].set_xlabel('Petal Length')
axes[0].set_ylabel('Petal Width')
axes[0].set_title('True Species')
axes[0].grid(True, alpha=0.3)

# Add species labels
for i, species in enumerate(target_names):
    mask = y == i
    centroid_x = X[mask, 2].mean()
    centroid_y = X[mask, 3].mean()
    axes[0].annotate(species, (centroid_x, centroid_y), 
                    xytext=(5, 5), textcoords='offset points',
                    fontsize=10, fontweight='bold')

# Plot 2: K-means clusters
scatter2 = axes[1].scatter(X[:, 2], X[:, 3], c=cluster_labels, cmap='viridis', alpha=0.7, s=50)
axes[1].scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3], 
               c='red', marker='x', s=200, linewidths=3, label='Centroids')
axes[1].set_xlabel('Petal Length')
axes[1].set_ylabel('Petal Width')
axes[1].set_title(f'K-means Clusters')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Confusion Matrix:
Rows: True species, Columns: Predicted clusters
            Cluster 0  Cluster 1  Cluster 2
setosa              0         50          0
versicolor         48          0          2
virginica          14          0         36

Analysis

Code
# Analyze cluster assignments
print("Cluster Analysis:")
print("=" * 50)

for cluster in range(3):
    mask = cluster_labels == cluster
    cluster_data = X[mask]
    true_species = y[mask]
    
    # Find the most common species in this cluster
    species_counts = np.bincount(true_species)
    dominant_species = np.argmax(species_counts)
    purity = species_counts[dominant_species] / len(true_species)
    
    print(f"\nCluster {cluster}:")
    print(f"  • Size: {len(cluster_data)} samples")
    print(f"  • Dominant species: {target_names[dominant_species]} ({purity:.1%})")
    print(f"  • Average petal length: {cluster_data[:, 2].mean():.1f} cm")
    print(f"  • Average petal width: {cluster_data[:, 3].mean():.1f} cm")
    print(f"  • Average sepal length: {cluster_data[:, 0].mean():.1f} cm")
    print(f"  • Average sepal width: {cluster_data[:, 1].mean():.1f} cm")
Cluster Analysis:
==================================================

Cluster 0:
  • Size: 62 samples
  • Dominant species: versicolor (77.4%)
  • Average petal length: 4.4 cm
  • Average petal width: 1.4 cm
  • Average sepal length: 5.9 cm
  • Average sepal width: 2.7 cm

Cluster 1:
  • Size: 50 samples
  • Dominant species: setosa (100.0%)
  • Average petal length: 1.5 cm
  • Average petal width: 0.2 cm
  • Average sepal length: 5.0 cm
  • Average sepal width: 3.4 cm

Cluster 2:
  • Size: 38 samples
  • Dominant species: virginica (94.7%)
  • Average petal length: 5.7 cm
  • Average petal width: 2.1 cm
  • Average sepal length: 6.9 cm
  • Average sepal width: 3.1 cm

Key Insights:

  • K-means successfully identified the three iris species
  • Setosa is perfectly separated from the other species
  • Versicolor and Virginica show some overlap, which is realistic
  • Petal measurements are more discriminative than sepal measurements
  • Overall, the clustering quality is pretty good!

Why K-means Works Well on Iris Data

  • Well-separated clusters: The three iris species have distinct characteristics
  • Spherical cluster shapes: Each species forms roughly spherical groups in feature space
  • Appropriate feature scaling: All measurements are in the same units (centimeters)
  • A priori knowledge: We know the true number of species (the cluster number)

Example 3: k-means image compression

k-means image compression

Here is a simple example of how \(k\)-means can be used to reduce color space and compress data.

Consider the following image.

  • Each pixel’s color is represented by an integer.
  • Typically we might use 24 bits for each integer (8 bits for R, G, and B).

Now find \(k=16\) clusters of pixels in three dimensional \((R, G, B)\) space and replace each pixel by its cluster center.

Because there are 16 centroids, we can represent each pixel with a 4-bit index into the centroid table, for a compression ratio of \(24/4=6\times\).

Here we cluster into 8 groups (3 bits) for a compression ratio \(24/3=8\times\).

Here we cluster into 4 groups (2 bits) for a compression ratio around \(24/2=12\times\).

Finally, we use 1 bit (2 color groups) for a compression ratio of \(24\times\).
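
The pipeline just described can be sketched in a few lines with scikit-learn. This is not the code used to produce the figures above; as an assumption for the example, it quantizes scikit-learn’s built-in sample image (china.jpg) down to \(k=16\) colors, fitting on a subsample of pixels for speed.

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load a sample image and flatten it to one row per pixel in (R, G, B) space
img = load_sample_image("china.jpg") / 255.0   # shape (H, W, 3), values in [0, 1]
pixels = img.reshape(-1, 3)

# Fit k-means on a random subsample of pixels for speed, with k = 16 color clusters
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), size=10_000, replace=False)]
k = 16
kmeans = KMeans(n_clusters=k, n_init=4, random_state=42).fit(sample)

# Replace every pixel by its nearest cluster center, i.e., its quantized color
labels = kmeans.predict(pixels)
quantized = kmeans.cluster_centers_[labels].reshape(img.shape)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(img)
axes[0].set_title('Original (24-bit color)')
axes[1].imshow(quantized)
axes[1].set_title(f'{k} colors ({int(np.log2(k))}-bit index)')
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()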

In-Class Activity: k-Means Clustering

Time: 10 minutes
Materials: Pencil and paper
Objective: Understand the k-means algorithm through 1D visualization

Activity: “Clustering 1D Data”

The Data

We have 8 data points on a number line that form 2 natural clusters:

Data Points: 1.2, 1.8, 2.1, 7.3, 7.9, 8.2, 9.1, 10.5

Code
# Data points
data_points = np.array([1.2, 1.8, 2.1, 7.3, 7.9, 8.2, 9.1, 10.5])

# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with numbers
for i, point in enumerate(data_points):
    ax.scatter(point, 0, s=100, c='blue', alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title('Step 1: Initial Data Points')
ax.grid(True, alpha=0.3)
ax.set_yticks([])

plt.tight_layout()
plt.show()

The Steps

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.


Step 1: Cluster Centers Initialization

Assume our randomly chosen initial centers are:

  • Center A = 4.0, Center B = 11.0
Code
# Data points
data_points = np.array([1.2, 1.8, 2.1, 7.3, 7.9, 8.2, 9.1, 10.5])
initial_centers = np.array([4.0, 11.0])

# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with numbers
for i, point in enumerate(data_points):
    ax.scatter(point, 0, s=100, c='blue', alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot initial centers
ax.scatter(initial_centers[0], 0, s=200, c='red', marker='x', linewidths=3, zorder=4, label='Center A')
ax.scatter(initial_centers[1], 0, s=200, c='green', marker='x', linewidths=3, zorder=4, label='Center B')

# Add labels for centers
ax.annotate('A', (initial_centers[0], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='red')
ax.annotate('B', (initial_centers[1], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='green')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title('Step 1: Initial Data Points and Centers')
ax.grid(True, alpha=0.3)
ax.set_yticks([])
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

The Steps

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.

Step 2: Assignment (2 minutes)

For each data point, calculate the distance to both centers and assign it to the closer one.

Complete the table:

Point Value Distance to A (4.0) Distance to B (11.0) Assigned to
1 1.2 2.8 9.8 A
2 1.8
3 2.1
4 7.3
5 7.9
6 8.2
7 9.1
8 10.5

Assignment Visualization

Code
# Calculate assignments
assignments = []
for point in data_points:
    dist_a = abs(point - initial_centers[0])
    dist_b = abs(point - initial_centers[1])
    if dist_a < dist_b:
        assignments.append('A')
    else:
        assignments.append('B')

# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with cluster colors
colors = ['red' if a == 'A' else 'green' for a in assignments]
for i, (point, color) in enumerate(zip(data_points, colors)):
    ax.scatter(point, 0, s=100, c=color, alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot initial centers
ax.scatter(initial_centers[0], 0, s=200, c='red', marker='x', linewidths=3, zorder=4, label='Center A')
ax.scatter(initial_centers[1], 0, s=200, c='green', marker='x', linewidths=3, zorder=4, label='Center B')

# Add labels for centers
ax.annotate('A', (initial_centers[0], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='red')
ax.annotate('B', (initial_centers[1], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='green')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title('Step 2: Assignment to Clusters')
ax.grid(True, alpha=0.3)
ax.set_yticks([])
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

Assignment visualization with cluster colors

The Steps

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.

Step 3: Update Centers (2 minutes)

Calculate the new centers based on the assigned points.

Cluster A (points: 1.2, 1.8, 2.1, 7.3):

  • New center A =

Cluster B (points: 7.9, 8.2, 9.1, 10.5):

  • New center B =

Step 3: Update Centers

Calculate the new centers based on the assigned points.

Cluster A (points: 1.2, 1.8, 2.1, 7.3):

  • New center A = (1.2 + 1.8 + 2.1 + 7.3) ÷ 4 = 3.1

Cluster B (points: 7.9, 8.2, 9.1, 10.5):

  • New center B = (7.9 + 8.2 + 9.1 + 10.5) ÷ 4 = 8.9

Step 3: Update Centers Visualization

Code
# Calculate new centers
cluster_a_points = [data_points[i] for i, a in enumerate(assignments) if a == 'A']
cluster_b_points = [data_points[i] for i, a in enumerate(assignments) if a == 'B']

new_center_a = np.mean(cluster_a_points)
new_center_b = np.mean(cluster_b_points)

new_centers = np.array([new_center_a, new_center_b])

# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with cluster colors
colors = ['red' if a == 'A' else 'green' for a in assignments]
for i, (point, color) in enumerate(zip(data_points, colors)):
    ax.scatter(point, 0, s=100, c=color, alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot old centers (faded)
ax.scatter(initial_centers[0], 0, s=150, c='red', marker='x', linewidths=2, 
           zorder=2, alpha=0.5, label='Old Center A')
ax.scatter(initial_centers[1], 0, s=150, c='green', marker='x', linewidths=2, 
           zorder=2, alpha=0.5, label='Old Center B')

# Plot new centers
ax.scatter(new_centers[0], 0, s=200, c='red', marker='*', linewidths=3, 
           zorder=4, label='New Center A')
ax.scatter(new_centers[1], 0, s=200, c='green', marker='*', linewidths=3, 
           zorder=4, label='New Center B')

# Add labels for new centers
ax.annotate('A\'', (new_centers[0], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='red')
ax.annotate('B\'', (new_centers[1], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='green')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title(f"Step 3: Updated Centers (A' = {new_center_a:.1f}, B' = {new_center_b:.1f})")
ax.grid(True, alpha=0.3)
ax.set_yticks([])
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

print(f"New Center A: {new_center_a:.1f}")
print(f"New Center B: {new_center_b:.1f}")

New Center A: 3.1
New Center B: 8.9

The Steps

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.


Step 4: Second Iteration Cluster Assignment (2 minutes)

Repeat the assignment with the new centers (\(A' = 3.1, B' = 8.9\)).

Complete the table:

Point Value Distance to \(A'\) Distance to \(B'\) Assigned to
1 1.2 1.9 7.7 A
2 1.8
3 2.1
4 7.3
5 7.9
6 8.2
7 9.1
8 10.5

Step 4: Second Iteration Cluster Assignment Visualization

Code
# Calculate second iteration assignments
assignments_2 = []
for point in data_points:
    dist_a = abs(point - new_centers[0])
    dist_b = abs(point - new_centers[1])
    if dist_a < dist_b:
        assignments_2.append('A')
    else:
        assignments_2.append('B')


cluster_a_points_2 = [data_points[i] for i, a in enumerate(assignments_2) if a == 'A']
cluster_b_points_2 = [data_points[i] for i, a in enumerate(assignments_2) if a == 'B']

# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with final cluster colors
colors = ['red' if a == 'A' else 'green' for a in assignments_2]
for i, (point, color) in enumerate(zip(data_points, colors)):
    ax.scatter(point, 0, s=100, c=color, alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot final centers
ax.scatter(new_centers[0], 0, s=200, c='red', marker='*', linewidths=3, 
           zorder=4, label='Final Center A')
ax.scatter(new_centers[1], 0, s=200, c='green', marker='*', linewidths=3, 
           zorder=4, label='Final Center B')

# Add labels for final centers
ax.annotate('A*', (new_centers[0], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='red')
ax.annotate('B*', (new_centers[1], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='green')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title('Step 4: Centers After Second Iteration ($A* = 3.1, B* = 8.9$)')
ax.grid(True, alpha=0.3)
ax.set_yticks([])
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

print(f"New Center A: {new_centers[0]:.1f}")
print(f"New Center B: {new_centers[1]:.1f}")

New Center A: 3.1
New Center B: 8.9

Convergence Check

One type of convergence is when the centers stop changing.

1st Iteration Center 2nd Iteration Center Abs Difference
A = 4 A = 3.1 0.9
B = 11 B = 8.9 2.1

Should we do another iteration?

The Steps

  1. Pick \(K\) cluster centers \(\{c_1, \dots, c_K\}\).
  2. For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\).
  3. For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\).
  4. Repeat (i.e., go to Step 2) until convergence.

2nd Iteration Update Centers

Cluster A (points: 1.2, 1.8, 2.1):

  • New center A = (1.2 + 1.8 + 2.1) ÷ 3 = 1.7

Cluster B (points: 7.3, 7.9, 8.2, 9.1, 10.5):

  • New center B = (7.3 + 7.9 + 8.2 + 9.1 + 10.5) ÷ 5 = 8.6

Step 4: Second Iteration Visualization

Code
iter3_center_a = np.mean(cluster_a_points_2)
iter3_center_b = np.mean(cluster_b_points_2)
iter3_centers = np.array([iter3_center_a, iter3_center_b])


# Create the plot
fig, ax = plt.subplots(figsize=(12, 4))

# Plot data points with final cluster colors
colors = ['red' if a == 'A' else 'green' for a in assignments_2]
for i, (point, color) in enumerate(zip(data_points, colors)):
    ax.scatter(point, 0, s=100, c=color, alpha=0.7, zorder=3)
    ax.annotate(f'{i+1}', (point, 0), xytext=(0, 15), 
                textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot final centers
ax.scatter(iter3_centers[0], 0, s=200, c='red', marker='*', linewidths=3, 
           zorder=4, label='Final Center A')
ax.scatter(iter3_centers[1], 0, s=200, c='green', marker='*', linewidths=3, 
           zorder=4, label='Final Center B')

# Add labels for final centers
ax.annotate('A*', (iter3_centers[0], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='red')
ax.annotate('B*', (iter3_centers[1], 0), xytext=(0, -25), 
            textcoords='offset points', ha='center', fontsize=12, fontweight='bold', color='green')

# Add horizontal x-axis line
ax.axhline(y=0, color='black', linewidth=2, zorder=1)

# Formatting
ax.set_xlim(0, 12)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Value')
ax.set_title(f"2nd Iteration: Updated Centers (A = {iter3_center_a:.1f}, B = {iter3_center_b:.1f})")
ax.grid(True, alpha=0.3)
ax.set_yticks([])
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

Step 5: Check for Convergence (1 minute)

Questions:

  • Would cluster assignments change if we continued?
  • Would the centers change if we continued?
  • Have we found the optimal clustering?

Key Learning Points

  1. Algorithm Steps: Initialize → Assign → Update → Repeat
  2. Distance Calculation: |point - center| in 1D
  3. Center Updates: Average of assigned points
  4. Convergence: Centers stop changing (the algorithm has reached a local optimum)
  5. Initialization Matters: Different starts can lead to different results
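
To check the hand-worked answers, here is a minimal sketch that runs Lloyd’s algorithm on the activity’s eight 1D points, starting from the same initial centers (4.0 and 11.0).

Code
import numpy as np

points = np.array([1.2, 1.8, 2.1, 7.3, 7.9, 8.2, 9.1, 10.5])
centers = np.array([4.0, 11.0])

for step in range(1, 10):
    # Assignment: each point goes to its nearest center (|point - center| in 1D)
    labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
    # Update: each center becomes the mean of its assigned points
    new_centers = np.array([points[labels == j].mean() for j in range(2)])
    print(f"Iteration {step}: centers = {np.round(new_centers, 1)}")
    if np.allclose(new_centers, centers):   # convergence: centers stopped changing
        break
    centers = new_centers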

Quick Assessment

Answer on your paper:

  1. What is the correct order of the 4 steps of k-means?
  2. How do you update a center?
    • Answer:
  3. What does “convergence” mean?
    • Answer:

Recap and Next

Today we covered:

  • \(k\)-means clustering
  • strengths and weaknesses
  • Importance of initialization and cluster number
  • Feature scaling
  • Example applications:
    • Customer segmentation
    • Plant species classification with Iris data
    • Image color compression

Coming up next, we’ll look at some practical aspects of applying \(k\)-means.


Footnotes

  1. Public Domain, https://commons.wikimedia.org/w/index.php?curid=680455

  2. By John Snow - Published by C.F. Cheffins, Lith, Southampton Buildings, London, England, 1854, in Snow, John. On the Mode of Communication of Cholera, 2nd Ed, John Churchill, New Burlington Street, London, England, 1855.

  3. As determined at the 2006 IEEE International Conference on Data Mining.