We can often simplify or compress our data if we recognize the existence of clusters.
Further, we can often interpret clusters by assigning them labels.
However, note that these categories or “labels” are assigned after the fact.
In some cases, we may not be able to interpret clusters or assign them labels at all.
Clustering is thus the first example we will see of unsupervised learning.
Supervised vs Unsupervised
Supervised methods: Data items have labels, and we want to learn a function that correctly assigns labels to new data items.
Unsupervised methods: Data items do not have labels, and we want to learn a function that extracts important patterns from the data.
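To make this contrast concrete, here is a minimal sketch using scikit-learn; the choice of make_blobs, LogisticRegression, and KMeans is purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: points X with (optionally available) labels y
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: use the labels y to learn a function that labels new items
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))    # predicted labels

# Unsupervised: ignore y and look for structure (clusters) in X alone
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # discovered cluster assignments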
Applications of Clustering
Image Processing
Cluster images based on their visual content.
Compress images based on color clusters.
Web Mining
Cluster groups of users based on webpage access patterns.
Cluster web pages based on their content.
Bioinformatics
Cluster similar proteins together (by structure or function).
Cluster cell types (by gene activity).
And many more …
The Clustering Problem
When we apply clustering, what problem are we trying to solve?
We will answer this question informally at first.
(But soon we will look at formal criteria!)
Informally, a clustering is:
a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.
(keep in mind that if we use a distance function as a dissimilarity measure, then “far” implies “different”)
So we want our clustering algorithm to:
minimize intra-cluster distances.
maximize inter-cluster distances.
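To make these two objectives concrete, here is a small NumPy sketch that computes average intra-cluster and inter-cluster pairwise Euclidean distances for a toy dataset; the points, labels, and helper name are all made up for illustration.

import numpy as np
from itertools import combinations

# Toy 2-D points with a made-up cluster assignment
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def average_distance(X, labels, same_cluster):
    # Average Euclidean distance over pairs in the same (or different) clusters
    pairs = [(i, j) for i, j in combinations(range(len(X)), 2)
             if (labels[i] == labels[j]) == same_cluster]
    return np.mean([np.linalg.norm(X[i] - X[j]) for i, j in pairs])

print(average_distance(X, labels, same_cluster=True))   # intra-cluster: small
print(average_distance(X, labels, same_cluster=False))  # inter-cluster: large

A good clustering drives the first number down and the second number up.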
Here are the basic questions we need to ask about clustering:
What is the right kind of “similarity” to use?
What is a good partition of objects?
i.e., how is the quality of a solution measured?
How to find a good partition?
Are there efficient algorithms?
Are there algorithms that are guaranteed to find good clusters?
Note that even once we formalize the problem, the criteria for deciding on a “best” clustering can still be ambiguous.
To accommodate the ambiguity here, one approach is to seek a hierarchical clustering.
That is, a set of nested clusters organized as a tree.
We’ll discuss hierarchical clustering in an upcoming lecture.
For today, we’ll focus on partitional clustering.
In a partitional clustering, the points are divided into a set of non-overlapping groups.
More precisely:
Each object belongs to one, and only one, cluster.
The set of clusters covers all the objects.
We are going to assume for now that the number of clusters is given in advance.
We will denote the number of clusters as \(k\).
The \(k\)-means Algorithm
Assumptions
Now, we are ready to state our first formalization of the clustering problem.
We will assume that
data items are represented as points in \(d\)-dimensional space, \(\mathbb{R}^d\) (i.e., each item has \(d\) features),
the number of points, \(N\), is given, and
the number of clusters \(K\) is given.
Minimizing a Cost Function
Find \(K\) disjoint clusters \(C_1, \dots, C_K\), described by centers \(c_1, \dots, c_K\) (also called centroids or means), that minimize the within-cluster sum of squares
\[
\sum_{j=1}^{K} \sum_{x \in C_j} \| x - c_j \|^2.
\]
The standard \(k\)-means algorithm proceeds as follows:
Step 1: Choose \(K\) initial centers \(c_1, \dots, c_K\). These can be randomly chosen data points, or picked by some other method such as \(k\)-means++.
Step 2: For each \(j\), define the cluster \(C_j\) as the set of points in \(X\) that are closest to center \(c_j\), i.e., nearer to \(c_j\) than to any other center.
Step 3: For each \(j\), update \(c_j\) to be the center of mass of cluster \(C_j\); in other words, \(c_j\) is the mean of the vectors in \(C_j\).
Step 4: Repeat (i.e., go to Step 2) until convergence: the cluster centers move less than a threshold, the inertia (WCSS) changes by less than a threshold, or a maximum number of iterations is reached.
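These stopping criteria appear as parameters in library implementations. For example, in scikit-learn’s KMeans, max_iter caps the number of iterations and tol sets the threshold on how much the centers may move between iterations; the data below is just a stand-in.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)

# tol: convergence threshold on the change in cluster centers
# max_iter: maximum number of iterations per run
km = KMeans(n_clusters=3, init='k-means++', n_init=10,
            max_iter=300, tol=1e-4, random_state=42).fit(X)

print(km.n_iter_)   # iterations actually performed
print(km.inertia_)  # final within-cluster sum of squares (WCSS)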
Let’s see this in practice with well-separated clusters and also look at the Within-Cluster Sum of Squares (WCSS).
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=0.60, random_state=42)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)

    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)

    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)

    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)
plt.tight_layout()
plt.show()
Let’s see this in practice with overlapping clusters and also look at the WCSS.
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample data
n_samples = 300
X, _ = make_blobs(n_samples=n_samples, centers=3, cluster_std=3.00, random_state=2)

# Initialize centroids randomly
k = 3
centroids = X[np.random.choice(n_samples, k, replace=False)]

# Function to assign points to clusters
def assign_clusters(X, centroids):
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Function to update centroids
def update_centroids(X, labels, k):
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# Function to calculate within-cluster sum of squares
def calculate_wcss(X, labels, centroids):
    return sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))

# Set up the clustering progress plot
fig1, axs = plt.subplots(2, 3, figsize=(10, 6))
axs = axs.ravel()

# Colors for each cluster
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

# List to store WCSS values
wcss_values = []

# Run k-means iterations and plot
for i in range(6):
    # Assign clusters
    labels = assign_clusters(X, centroids)

    # Calculate WCSS
    wcss = calculate_wcss(X, labels, centroids)
    wcss_values.append(wcss)

    # Plot the current clustering state
    axs[i].scatter(X[:, 0], X[:, 1], c=[colors[l] for l in labels], alpha=0.6)
    axs[i].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3)
    axs[i].set_title(f'Iteration {i if i < 5 else "Final"}, WCSS: {wcss:.2f}')
    axs[i].set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    axs[i].set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)

    # Update centroids (except for the last iteration)
    if i < 5:
        centroids = update_centroids(X, labels, k)

plt.tight_layout()

# Create a separate plot for WCSS progress
fig2, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(6), wcss_values, marker='o')
ax.set_title('WCSS Progress')
ax.set_xlabel('Iteration')
ax.set_ylabel('Within-Cluster Sum of Squares')
ax.grid(True)
plt.tight_layout()
plt.show()
Limitations of \(k\)-means
As you can see, \(k\)-means can work very well.
However, we don’t have any guarantees on the performance of \(k\)-means.
In particular, there are various settings in which \(k\)-means can fail to do a good job.
\(k\)-means tries to find spherical clusters.
Because each point is assigned to its closest center, the points in a cluster are implicitly assumed to be arranged in a sphere around the center.
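As a quick illustration (not part of the lecture’s dataset), here is a sketch using scikit-learn’s make_moons, whose two true clusters are crescent-shaped; because \(k\)-means can only carve the plane into regions around its two centers, it splits the crescents poorly.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved, crescent-shaped clusters -- decidedly non-spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y_true, alpha=0.6)
ax1.set_title('True (non-spherical) clusters')
ax2.scatter(X[:, 0], X[:, 1], c=km.labels_, alpha=0.6)
ax2.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x', s=200, linewidths=3)
ax2.set_title('k-means assignment')
plt.tight_layout()
plt.show()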
Another issue arises from the scale of the features. Consider, for example, representing people by features such as yearly income (in dollars), age (in years), and gender (coded as 0 or 1). What would happen if we used Euclidean distance as our dissimilarity metric in this feature space?
(This is what \(k\)-means uses.)
Clearly, the influence of income would dominate that of the other two features. For example, a difference in gender would be about as significant as a difference of one dollar of yearly income.
We are unlikely to expose gender-based differences if we cluster using this representation.
The most common way to handle this is feature scaling.
Feature Scaling
The basic idea is to rescale each feature separately, so that its range of values is about the same as all other features.
For example, one may choose to:
shift each feature independently by subtracting the mean over all observed values.
This means that each feature is now centered on zero.
Then rescale each feature so that the standard deviation over all observed values is 1.
This means that the feature will have about the same range of values as all the others.
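Here is a minimal sketch of this standardization using scikit-learn’s StandardScaler on a made-up table of incomes and genders; the numbers are purely illustrative.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [yearly income in dollars, gender coded 0/1]
X = np.array([[40_000.0, 0.0],
              [55_000.0, 1.0],
              [90_000.0, 0.0],
              [62_000.0, 1.0]])

# Subtract each column's mean and divide by its standard deviation
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature

After scaling, a one-standard-deviation difference in income carries the same weight as a one-standard-deviation difference in gender.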
Feature Scaling Example
For example, let’s work with Bortkiewicz’s famous horse-kick data: the number of soldiers in the Prussian cavalry killed by horse kicks over the 20 years from 1875 to 1894, inclusive.
# counts: a pandas Series of kick-death counts (defined earlier in the lecture)
counts.hist(bins=25, xlabelsize=16)
plt.xlabel('# of Kick Deaths')
plt.ylabel('Count')
plt.title('Histogram of Kick Deaths')
plt.show()
The average:
Code
counts.mean()
np.float64(9.8)
To standardize to zero mean and unit standard deviation, we can use pre-processing tools from the scikit-learn library.
The distribution after rescaling:
Code
import pandas as pd
from sklearn import preprocessing

# Standardize counts to zero mean and unit standard deviation
counts_scaled = pd.DataFrame(preprocessing.scale(counts))
counts_scaled.hist(bins=25, xlabelsize=16)
With a new mean:
Code
counts_scaled.mean().values
array([-1.33226763e-16])
Notice that values that used to be positive have now become negative.
In some situations it may not be sensible to change non-negative values into something else. It may make more sense to map all values into a fixed range, for example \([0, 1]\).
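Here is a minimal sketch of that alternative using scikit-learn’s MinMaxScaler, which maps each feature’s smallest observed value to 0 and its largest to 1; the toy counts are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy non-negative counts (illustrative)
counts = np.array([[3.0], [9.0], [12.0], [18.0]])

counts_01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(counts)
print(counts_01.ravel())  # smallest value -> 0.0, largest -> 1.0, all within [0, 1]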
Public Domain, https://commons.wikimedia.org/w/index.php?curid=680455↩︎
By John Snow - Published by C.F. Cheffins, Lith, Southhampton Buildings, London, England, 1854 in Snow, John. On the Mode of Communication of Cholera, 2nd Ed, John Churchill, New Burlington Street, London, England, 1855.↩︎
As determined at the 2006 IEEE International Conference on Data Mining↩︎
By unknown author - MacTutor History of Mathematics: http://www-history.mcs.st-andrews.ac.uk/PictDisplay/Bortkiewicz.html, Public Domain, https://commons.wikimedia.org/w/index.php?curid=79219622↩︎