import sklearn.datasets as sk_data

X, y, gt_centers = sk_data.make_blobs(n_samples=100, centers=3, n_features=30,
                                      center_box=(-10.0, 10.0), random_state=0,
                                      return_centers=True)
Clustering in Practice
…featuring \(k\)-means
Today we’ll work through an extended example showing \(k\)-means clustering in practice, in the context of the Python library scikit-learn.
scikit-learn is the main Python library for machine learning.
Our goals are to learn:
- How clustering is used in practice
- Tools for evaluating the quality of a clustering
- Tools for assigning meaning or labels to a cluster
- Important visualizations
- A little bit about feature extraction for text
Visualization
Training wheels: Synthetic data
Generally, when learning about or developing a new unsupervised method, it’s a good idea to try it out on a dataset in which you already know the “right” answer.
One way to do that is to generate synthetic data that has some known properties.
Among other things, scikit-learn contains tools for generating synthetic data for testing.
We’ll use datasets.make_blobs.
Let’s check the shapes of the returned values:
print("X.shape: ", X.shape)
print("y.shape: ", y.shape)
print("gt_centers: ", gt_centers.shape)
X.shape: (100, 30)
y.shape: (100,)
gt_centers: (3, 30)
datasets.make_blobs takes as arguments:
- n_samples: The number of samples to generate
- n_features: The number of features, i.e., the dimensionality of each sample
- center_box: The bounds of the cluster centers
- random_state: The random seed, for reproducibility
- return_centers: A boolean flag; set to True to return the centers so that we have ground truth
Visualize the Data
To get a sense of the raw data we can inspect it.
For statistical visualization, a good library is Seaborn.
Let’s plot the X data as a matrix heatmap, where every row is a data point and the columns are the features.
Code
import seaborn as sns
import matplotlib.pyplot as plt
# Set the figure size to make the plot smaller
plt.figure(figsize=(7, 5))  # Adjust the width and height as needed
sns.heatmap(X, xticklabels=False, yticklabels=False, linewidths=0, cbar=True)
plt.show()
Geometrically, these points live in a 30 dimensional space, so we cannot directly visualize their geometry.
This is a big problem that you will run into time and again!
We will discuss methods for visualizing high dimensional data later on.
For now, we will use a method that can turn a set of pairwise distances into an approximate 2-D representation in some cases.
So let’s compute the pairwise distances, in 30 dimensions, for visualization purposes.
We can compute all pairwise distances in a single step using the scikit-learn metrics.euclidean_distances function:
import sklearn.metrics as metrics
euclidean_dists = metrics.euclidean_distances(X)
# euclidean_dists
Matrix shape is: (100, 100)
Let’s look at the upper left and lower right corners of the distances matrix.
Upper left 3x3:
[[ 0. 47.74 45.19]
[47.74 0. 43.67]
[45.19 43.67 0. ]]
...
Lower right 3x3:
[[ 0. 8.19 41.82]
[ 8.19 0. 43.41]
[41.82 43.41 0. ]]
Note that this produces a \(100\times100\) symmetric matrix where the diagonal is all zeros (distance from itself).
Let’s look at a histogram of the distances.
Code
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
# Compute the pairwise distances
euclidean_dists = metrics.euclidean_distances(X)

# Extract the lower triangular part of the matrix, excluding the diagonal
lower_triangular_values = euclidean_dists[np.tril_indices_from(euclidean_dists, k=-1)]

# Plot the histogram
plt.figure(figsize=(10, 6))
plt.hist(lower_triangular_values, bins=30, edgecolor='black')
plt.title('Histogram of Lower Diagonal Values of Euclidean Distances')
plt.xlabel('Distance')
plt.ylabel('Frequency')
plt.show()
Remember that these are the pairwise distances in 30 dimensions. So, at least with this dataset, we see a clean separation, presumably between intra-cluster distances and inter-cluster distances.
How would the curse of dimensionality affect this? We discuss the curse of dimensionality later in the course.
Visualizing with MDS
The idea behind Multidimensional Scaling (MDS) is given a pairwise distance (or dissimilarity) matrix:
- Find a set of coordinates in 1, 2 or 3-D space that approximates those distances as well as possible.
- The points that are close (or far) in high dimensional space should be close (or far) in the reduced dimension space.
Note that there are two forms of MDS:
- Metric MDS, of which Classical MDS is a special case; Classical MDS has a closed-form solution based on the eigenvectors of the centered distance matrix.
- Non-Metric MDS, which tries to find a non-parametric monotonic relationship between the dissimilarities and the target coordinates through an iterative approach.
MDS may not always work well if, for example, the dissimilarities are not well modeled by a metric like Euclidean distance.
MDS Visualization Result
Code
import sklearn.manifold
import matplotlib.pyplot as plt
mds = sklearn.manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=0,
                           dissimilarity="precomputed", n_jobs=1)
fit = mds.fit(euclidean_dists)
pos = fit.embedding_
plt.scatter(pos[:, 0], pos[:, 1], s=8)
plt.axis('square');
So we can see that, although our data lives in 30 dimensions, we can get a sense of how the points are clustered by approximately placing the points into two dimensions.
Out of curiosity, we can visualize the pairwise distance matrix using a heatmap.
sns.heatmap(euclidean_dists, xticklabels=False, yticklabels=False, linewidths=0,
            square=True)
plt.show()
Applying \(k\)-Means
The Python package scikit-learn has a huge set of tools for unsupervised learning generally, and clustering specifically.
These are in sklearn.cluster.
There are three methods common to all the clustering classes:
- fit(): builds the model from the training data. For \(k\)-means, this function finds the centroids.
- predict(): assigns labels to the data after building the model. For \(k\)-means, this assigns the cluster number to a point.
- fit_predict(): calls fit() and predict() in a single step.
Let’s go back to the original 30-D synthetic dataset and apply \(k\)-means and show the cluster numbers.
Code
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=100, random_state=0)
y_prime = kmeans.fit_predict(X)
print(y_prime)
[0 1 2 2 1 0 1 1 0 1 2 0 2 0 1 2 2 0 2 1 2 1 2 0 1 1 0 2 2 2 2 0 2 0 2 1 2
1 2 1 1 1 0 2 0 2 0 1 0 2 2 0 0 0 0 2 0 1 1 2 0 2 1 0 0 1 2 0 0 1 1 1 0 0
2 0 1 0 1 0 1 1 1 0 2 2 1 0 0 2 0 1 1 2 0 2 2 1 1 2]
For comparison, here are the original cluster numbers.
Code
print(y)
[0 2 1 1 2 0 2 2 0 2 1 0 1 0 2 1 1 0 1 2 1 2 1 0 2 2 0 1 1 1 1 0 1 0 1 2 1
2 1 2 2 2 0 1 0 1 0 2 0 1 1 0 0 0 0 1 0 2 2 1 0 1 2 0 0 2 1 0 0 2 2 2 0 0
1 0 2 0 2 0 2 2 2 0 1 1 2 0 0 1 0 2 2 1 0 1 1 2 2 1]
Note that the \(k\)-means labels are different than the ground truth labels.
We can remap the values according to
\[ \begin{align*} 0 &\rightarrow 0 \\ 1 &\rightarrow 2 \\ 2 &\rightarrow 1. \end{align*} \]
Code
# Remap y_prime
remap = {0: 0, 1: 2, 2: 1}
y_prime_remapped = np.vectorize(remap.get)(y_prime)
print("Remapped y_prime:")
print(y_prime_remapped)
Remapped y_prime:
[0 2 1 1 2 0 2 2 0 2 1 0 1 0 2 1 1 0 1 2 1 2 1 0 2 2 0 1 1 1 1 0 1 0 1 2 1
2 1 2 2 2 0 1 0 1 0 2 0 1 1 0 0 0 0 1 0 2 2 1 0 1 2 0 0 2 1 0 0 2 2 2 0 0
1 0 2 0 2 0 2 2 2 0 1 1 2 0 0 1 0 2 2 1 0 1 1 2 2 1]
Code
# Calculate the number of mismatched entries
mismatches = np.sum(y != y_prime_remapped)

# Calculate the percentage of mismatched entries
total_entries = len(y)
percentage_mismatched = (mismatches / total_entries) * 100
print(f"Percentage of mismatched entries: {percentage_mismatched:.2f}%")
Percentage of mismatched entries: 0.00%
To reiterate, mismatches are 0.00%.
All the tools in scikit-learn are implemented as Python objects.
Recall that the general sequence for using a tool from scikit-learn is:
- create the object, probably with some hyperparameter settings or initialization,
- run the method, generally by using the fit() function, and
- examine the results, which are generally property variables of the object.
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
inertia = kmeans.inertia_
And we see the resulting cluster inertia.
Code
print(f'Clustering inertia: {inertia:0.1f}.')
Clustering inertia: 2733.8.
Visualizing the Results of Clustering
Let’s visualize the results. We’ll do that by reordering the data items according to their cluster.
Code
import numpy as np
idx = np.argsort(labels)
rX = X[idx, :]
sns.heatmap(rX, xticklabels=False, yticklabels=False, linewidths=0)
plt.show()
You can clearly see the feature similarities of the samples in the same cluster.
We can also sort the pairwise distance matrix.
Code
rearranged_dists = euclidean_dists[idx, :][:, idx]
sns.heatmap(rearranged_dists, xticklabels=False, yticklabels=False, linewidths=0, square=True)
plt.show()
Here again, you can see that the intra-cluster sample distances (the blocks along the diagonal) are small, while the inter-cluster distances are large.
Cluster Evaluation
How do we know whether the clusters we get represent “real” structure in our data?
Consider a dataset of 500 points uniformly distributed in the unit square.
Code
import pandas as pd
rng = np.random.default_rng(42)   # seed the generator for reproducibility
unif_X = rng.uniform(0, 1, 500)
unif_Y = rng.uniform(0, 1, 500)
df = pd.DataFrame(np.column_stack([unif_X, unif_Y]), columns = ['X', 'Y'])
df.plot('X', 'Y', kind = 'scatter',
        colorbar = False, xlim = (0, 1), ylim = (0, 1),
        figsize = (4, 4))
After running \(k\)-means on this data:
Code
kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 500, random_state=0)
df['label'] = kmeans.fit_predict(df[['X', 'Y']])
df.plot('X', 'Y', kind = 'scatter', c = 'label',
        colormap = 'viridis', colorbar = False,
        xlim = (0, 1), ylim = (0, 1),
        figsize = (4, 4))
The point is: clustering algorithms output some “clustering” of the data.
The question is, does the clustering reflect real structure?
Generally we encounter two problems:
- Are there real clusters in the data?
- If so, how many clusters are there?
There is often no definitive answer to either of these questions.
You will often need to use your judgment in answering them.
Clustering Metrics
Clustering metrics are broadly separated into two categories:
- Ground truth based metrics
We’ll look at (Adjusted) Rand Index.
Even with ground truth labels, we can’t apply accuracy measures like the ones we will see in supervised learning, since clustering algorithms don’t try to match exact labels, but rather group similar items together.
- Internal cluster metrics (no ground truth external labels)
We’ll look at Silhouette Coefficient.
Rand Index
Named after William M. Rand.
In some cases we’ll have ground truth cluster labels, call them \(T\).
In that case, what we need is a way to compare a proposed clustering, \(C\), with \(T\). To make this comparison we use the Rand Index.
The formula for the Rand Index is:
\[ \text{RI}(T,C) = \frac{a+b}{n \choose 2}, \]
- where \(a\) is the number of pairs of points that are in the same cluster in both \(T\) and \(C\), and
- \(b\) is the number of pairs of points that are in different clusters in both \(T\) and \(C\).
We normalize the sum by the combinatorial number of pairs of points, \(n \choose 2\).
The Rand Index is a value that falls in the range \([0, 1]\).
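To make the definition concrete, here is a small sketch (not from the original notes) that computes the Rand Index directly from the pair counts \(a\) and \(b\) on a tiny made-up pair of labelings, and checks the result against metrics.rand_score:
Code
from itertools import combinations
import numpy as np
import sklearn.metrics as metrics

def rand_index(T, C):
    # a: pairs in the same cluster in both T and C; b: pairs in different clusters in both
    a = b = 0
    for i, j in combinations(range(len(T)), 2):
        same_T = T[i] == T[j]
        same_C = C[i] == C[j]
        if same_T and same_C:
            a += 1
        elif (not same_T) and (not same_C):
            b += 1
    n_pairs = len(T) * (len(T) - 1) // 2   # n choose 2
    return (a + b) / n_pairs

T = np.array([0, 0, 1, 1, 2, 2])   # hypothetical ground truth labels
C = np.array([1, 1, 0, 0, 2, 2])   # a clustering with permuted label names
print(rand_index(T, C))            # 1.0 -- label names don't matter, only the grouping
print(metrics.rand_score(T, C))    # same value from scikit-learn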
Rand Index Example 1
Let’s proceed by way of example.
We’ll create another synthetic dataset of 3 clusters, so we have ground truth labels \(T\).
Code
X_rand, y_rand = sk_data.make_blobs(n_samples=[100, 250, 150],
                                    centers = [[1, 2], [1.5, 3], [2, 4]],
                                    n_features = 2,
                                    center_box = (-10.0, 10.0),
                                    cluster_std = [.2, .3, .2],
                                    random_state = 0)
df_rand_gt = pd.DataFrame(np.column_stack([X_rand[:, 0], X_rand[:, 1], y_rand]), columns = ['X', 'Y', 'label'])
df_rand_clust = df_rand_gt.copy()
kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100, random_state=0)
df_rand_clust['label'] = kmeans.fit_predict(df_rand_gt[['X', 'Y']])
df_rand_clust['label'] = df_rand_clust['label'].replace({0: 1, 1: 2, 2: 0})
We then run \(k\)-means with 3 clusters on this dataset to get labels \(C\), and plot the results:
Code
figs, axs = plt.subplots(1, 2, figsize = (12, 5))
df_rand_gt.plot('X', 'Y',
                kind = 'scatter',
                c = 'label',
                colormap = 'viridis',
                ax = axs[0],
                colorbar = False)
axs[0].set_title('Ground Truth (T)')
axs[0].set_axis_off()

df_rand_clust.plot('X', 'Y',
                   kind = 'scatter',
                   c = 'label',
                   colormap = 'viridis',
                   ax = axs[1],
                   colorbar = False)
axs[1].set_title('Clustering (C)')
axs[1].set_axis_off()
plt.show()
Code
print("The Rand index is: ", metrics.rand_score(df_rand_gt["label"], df_rand_clust["label"]))
The Rand index is: 0.9444889779559118
How do we know whether a particular Rand Index (RI) score is significant?
We might compare it to the RI for a random assignment of points to labels.
This leads to the Adjusted Rand Index.
The Adjusted Rand Index
To calibrate the Rand Index this way, we use the expected Rand Index of random labels, denoted \(E[\text{RI}]\).
The Expected Rand Index considers \(C\) to be a clustering that has the same cluster sizes as \(T\), but the labels are assigned at random.
Using that, we define the adjusted Rand index as a simple rescaling of RI:
\[ \text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]} \]
The values of \(E[\text{RI}]\) and \(\max(\text{RI})\) can be computed using combinatorics (we’ll omit the derivation).
The below code block computes and prints the adjusted Rand index for the example on the previous slide.
Code
print("The adjusted Rand index is: ", metrics.adjusted_rand_score(df_rand_gt["label"], df_rand_clust["label"]))
The adjusted Rand index is: 0.8808487125268196
Rand Index Example 2
Code
import matplotlib.pyplot as plt
plt.figure(figsize=(4, 3))  # Adjust the width and height as needed
sns.heatmap(X, xticklabels = False, yticklabels = False, linewidths = 0)
plt.show()
Let’s consider again our 3-cluster dataset with known labels y.
Here is the Rand Index and adjusted Rand Index, when using \(k\)-means to cluster this dataset for 1 to 10 clusters:
Code
def ri_evaluate_clusters(X, max_clusters, ground_truth):
    ri = np.zeros(max_clusters+1)
    ri[0] = 0
    ari = np.zeros(max_clusters+1)
    ari[0] = 0
    for k in range(1, max_clusters+1):
        kmeans = KMeans(init='k-means++', n_clusters=k, n_init=10)
        kmeans.fit_predict(X)
        ri[k] = metrics.rand_score(kmeans.labels_, ground_truth)
        ari[k] = metrics.adjusted_rand_score(kmeans.labels_, ground_truth)
    return ri, ari

ri, ari = ri_evaluate_clusters(X, 10, y)
plt.plot(range(1, len(ri)), ri[1:], 'ro-', range(1, len(ari)), ari[1:], 'b*-')
plt.xlabel('Number of clusters')
plt.title('$k$-means Clustering Compared to Known Labels')
plt.ylabel('Index value')
plt.legend(['RI', 'ARI'])
plt.show()
RI vs ARI
The ARI provides a more nuanced evaluation of clustering similarity compared to the RI by accounting for the chance grouping of elements. Here are the key differences:
Rand Index (RI)
- Measures Agreement: The RI measures the agreement between two clusterings by considering all pairs of elements and counting pairs that are either in the same or different clusters in both clusterings.
- Range: The RI ranges from 0 to 1, where 1 indicates perfect agreement and 0 indicates no agreement.
Adjusted Rand Index (ARI)
- Adjusts for Chance: The ARI adjusts the RI by considering the expected similarity that might occur by random chance. This adjustment makes the ARI a more reliable measure, especially when dealing with random or unbalanced clusterings.
- Range: The ARI can take values from -1 to 1. A value of 0 indicates that the clustering is no better than random, while a value of 1 indicates perfect agreement. Negative values indicate that the clustering is worse than random.
Information Provided by ARI
The ARI corrects for the fact that some level of agreement between clusterings can occur purely by chance. This makes it a more accurate measure of the true similarity between clusterings.
The ARI’s range allows for a clearer interpretation of clustering performance. An ARI of 0 means the clustering is no better than random, which is more informative than an RI of 0.5, which might still be influenced by chance.
The ARI allows for better comparison between different clustering results, as it normalizes the index to account for the chance agreement.
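To see the chance correction in action, here is a quick sketch (not part of the original notes) comparing the two indices on completely random labelings: the RI lands well above 0 even though there is no real agreement, while the ARI stays near 0.
Code
import numpy as np
import sklearn.metrics as metrics

rng = np.random.default_rng(0)
labels_a = rng.integers(0, 3, size=1000)   # two labelings assigned purely at random
labels_b = rng.integers(0, 3, size=1000)

print(metrics.rand_score(labels_a, labels_b))           # roughly 0.55 for 3 balanced clusters
print(metrics.adjusted_rand_score(labels_a, labels_b))  # approximately 0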
Deciding on the Number of Clusters
The second question we face in evaluating a clustering is how many clusters are present.
In practice, to use \(k\)-means or most other clustering methods, one must choose \(k\), the number of clusters, via some process.
Inspecting Clustering Error
The first thing you might do is to look at the \(k\)-means objective function and see if it levels off after a certain point.
Recall that the \(k\)-means objective can be considered the clustering “error”.
If the error stops going down, that would suggest that the clustering is not improving as the number of clusters is increased.
Let’s calculate the error for 1 to 10 clusters.
error = np.zeros(11)
for k in range(1, 11):
    kmeans = KMeans(init='k-means++', n_clusters = k, n_init = 10)
    kmeans.fit_predict(X)
    error[k] = kmeans.inertia_
For our synthetic data, here is the \(k\)-means objective, as a function of \(k\):
Code
plt.plot(range(1, len(error)), error[1:], 'o-')
plt.xlabel('Number of clusters')
plt.title(r'$k$-means clustering performance of synthetic data')
plt.ylabel('Error')
plt.show()
This synthetic data is not at all typical. You will almost never see such a sharp change in the error function as we see here.
Let’s create a function to evaluate clusters for later use.
def evaluate_clusters(X, max_clusters):
    error = np.zeros(max_clusters+1)
    error[0] = 0
    for k in range(1, max_clusters+1):
        kmeans = KMeans(init='k-means++', n_clusters=k, n_init=10)
        kmeans.fit_predict(X)
        error[k] = kmeans.inertia_
    return error
Silhouette Coefficient
Usually, the ground truth labels are not known.
In that case, evaluation must be performed using the model itself.
Recall our definition of clustering:
a grouping of data objects, such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups.
This suggests a metric that could evaluate a clustering: comparing the distances between points within a cluster, to the distances between points in different clusters.
The Silhouette Coefficient is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with “better defined” clusters.
We’ll use sklearn.metrics.silhouette_score
- Let \(a\) be the mean distance between a data point and all other points in the same cluster.
- Let \(b\) be the mean distance between a data point and all other points in the next nearest cluster.
Then the Silhouette Coefficient for a point is
\[ s = \frac{b-a}{\max(a, b)}. \]
The overall silhouette coefficient for a clustering is the average of the silhouette coefficients for each data point.
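As a sanity check on the definition, here is a small sketch (not in the original notes) that computes the silhouette value of a single point directly from \(a\) and \(b\), using the X and k-means labels from above, and compares it to sklearn.metrics.silhouette_samples:
Code
import numpy as np
import sklearn.metrics as metrics

def silhouette_of_point(X, labels, i):
    # Distances from point i to every point
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[own].mean()                                   # mean distance within own cluster
    b = min(d[labels == c].mean()                       # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

print(silhouette_of_point(X, labels, 0))
print(metrics.silhouette_samples(X, labels)[0])         # should agree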
We can calculate the Silhouette Score for our synthetic data:
Code
sc = metrics.silhouette_score(X, labels, metric='euclidean')
print('Silhouette Score:', sc)
Silhouette Score: 0.8319348841402535
We can also evaluate it for 2-10 clusters and plot the results:
Code
def sc_evaluate_clusters(X, max_clusters, n_init, seed):
    s = np.zeros(max_clusters+1)
    s[0] = 0
    s[1] = 0
    for k in range(2, max_clusters+1):
        kmeans = KMeans(init='k-means++', n_clusters = k, n_init = n_init, random_state = seed)
        kmeans.fit_predict(X)
        s[k] = metrics.silhouette_score(X, kmeans.labels_, metric = 'euclidean')
    return s

s = sc_evaluate_clusters(X, 10, 10, 1)
plt.plot(range(2, len(s)), s[2:], 'o-')
plt.xlabel('Number of Clusters')
plt.title('$k$-means clustering performance on synthetic data')
plt.ylabel('Silhouette Score')
plt.show()
Again, these results are more perfect than typical.
But the general idea is to look for a local maximum in the Silhouette Coefficient as the potential number of clusters.
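For instance, using the s array computed by sc_evaluate_clusters above, the candidate number of clusters is simply the index with the largest score (a one-line sketch):
Code
best_k = int(np.argmax(s))   # s[k] holds the silhouette score for k clusters
print(f'Best k by silhouette score: {best_k}')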
k-means++
k-means++ initialization
Proposed in 2007 by David Arthur and Sergei Vassilvitskii.
K-means++ improves the initialization of cluster centers to enhance the quality of the final clustering results. Here’s how it works.
- Choose an initial centroid \(c_1\) randomly.
- Choose the next centroid \(c_i\) from the data points, with probability proportional to each point’s squared distance to its nearest already-chosen centroid.
- This ensures that new centroids are spread out across the data space.
- Repeat this process until \(k\) centroids have been selected.
Once the initial centroids are chosen, the standard k-means algorithm is applied.
The main advantage of k-means++ over choosing the initial centroids uniformly at random is that it reduces the likelihood of poor clustering results due to suboptimal initial centroids. This often leads to faster convergence and better overall clustering quality.
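To make the selection step concrete, here is a minimal sketch (not from the original notes) of the k-means++ seeding rule in NumPy; in practice, scikit-learn’s KMeans(init='k-means++') handles this for you.
Code
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    # Pick k initial centroids from the rows of X using the k-means++ rule
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]   # first centroid chosen uniformly at random
    for _ in range(1, k):
        # squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()          # sample the next centroid proportionally to d2
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

init_centroids = kmeans_pp_init(X, k=3)   # e.g., seeds for the synthetic data above
print(init_centroids.shape)               # (3, 30)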
Real Data Clustering Example
Taking the Training Wheels Off: Real Data
As a classic “real world” example, we’ll use the “20 Newsgroups” data provided as example data in scikit-learn.
We borrow code from this sklearn example.
Let’s load the data and count the number of documents and categories.
Code
from sklearn.datasets import fetch_20newsgroups
# just use the following categories
categories = ['comp.os.ms-windows.misc', 'sci.space', 'rec.sport.baseball']

news_data = fetch_20newsgroups(
    remove = ('headers', 'footers', 'quotes'),
    subset = 'train',
    categories = categories,
    shuffle = True,
    random_state = 42)

labels = news_data.target
unique_labels, category_sizes = np.unique(labels, return_counts=True)
true_k = unique_labels.shape[0]
print(f"{len(news_data.data)} documents - {true_k} categories")
1781 documents - 3 categories
Here is an example of one of the documents:
Code
print(news_data.data[0])
Early to mid June.
If they think the public wants to see it they will carry it. Why not
write them and ask? You can reach them at:
F: NATIONAL NEWS MEDIA
ABC "World News Tonight" "Face the Nation"
7 West 66th Street CBS News
New York, NY 10023 2020 M Street, NW
212/887-4040 Washington, DC 20036
202/457-4321
Associated Press "Good Morning America"
50 Rockefeller Plaza ABC News
New York, NY 10020 1965 Broadway
National Desk (212/621-1600) New York, NY 10023
Foreign Desk (212/621-1663) 212/496-4800
Washington Bureau (202/828-6400)
Larry King Live TV
"CBS Evening News" CNN
524 W. 57th Street 111 Massachusetts Avenue, NW
New York, NY 10019 Washington, DC 20001
212/975-3693 202/898-7900
"CBS This Morning" Larry King Show--Radio
524 W. 57th Street Mutual Broadcasting
New York, NY 10019 1755 So. Jefferson Davis Highway
212/975-2824 Arlington, VA 22202
703/685-2175
"Christian Science Monitor"
CSM Publishing Society "Los Angeles Times"
One Norway Street Times-Mirror Square
Boston, MA 02115 Los Angeles, CA 90053
800/225-7090 800/528-4637
CNN "MacNeil/Lehrer NewsHour"
One CNN Center P.O. Box 2626
Box 105366 Washington, DC 20013
Atlanta, GA 30348 703/998-2870
404/827-1500
"MacNeil/Lehrer NewsHour"
CNN WNET-TV
Washington Bureau 356 W. 58th Street
111 Massachusetts Avenue, NW New York, NY 10019
Washington, DC 20001 212/560-3113
202/898-7900
"Crossfire" NBC News
CNN 4001 Nebraska Avenue, NW
111 Massachusetts Avenue, NW Washington, DC 20036
Washington, DC 20001 202/885-4200
202/898-7951 202/362-2009 (fax)
"Morning Edition/All Things Considered"
National Public Radio
2025 M Street, NW
Washington, DC 20036
202/822-2000
United Press International
1400 Eye Street, NW
Washington, DC 20006
202/898-8000
"New York Times" "U.S. News & World Report"
229 W. 43rd Street 2400 N Street, NW
New York, NY 10036 Washington, DC 20037
212/556-1234 202/955-2000
212/556-7415
"New York Times" "USA Today"
Washington Bureau 1000 Wilson Boulevard
1627 Eye Street, NW, 7th Floor Arlington, VA 22229
Washington, DC 20006 703/276-3400
202/862-0300
"Newsweek" "Wall Street Journal"
444 Madison Avenue 200 Liberty Street
New York, NY 10022 New York, NY 10281
212/350-4000 212/416-2000
"Nightline" "Washington Post"
ABC News 1150 15th Street, NW
47 W. 66th Street Washington, DC 20071
New York, NY 10023 202/344-6000
212/887-4995
"Nightline" "Washington Week In Review"
Ted Koppel WETA-TV
ABC News P.O. Box 2626
1717 DeSales, NW Washington, DC 20013
Washington, DC 20036 703/998-2626
202/887-7364
"This Week With David Brinkley"
ABC News
1717 DeSales, NW
Washington, DC 20036
202/887-7777
"Time" magazine
Time Warner, Inc.
Time & Life Building
Rockefeller Center
New York, NY 10020
212/522-1212
Feature Extraction
We’ve discussed a bit the challenges of feature engineering.
One of the most basic issues concerns how to encode categorical or text data in a form usable by algorithms that expect numeric input.
The starting point is to note that one can encode a set using a binary vector with one component for each potential set member.
The so-called bag of words encoding for a document is to treat the document as a multiset of words.
A multiset is like a set but allows for multiple instances of each element.
That is, we simply count how many times each word occurs. It is a “bag” because the order of the words in the document is lost.
Surprisingly, we can still tell a lot about the document even without knowing its word ordering.
Counting the number of times each word occurs in a document yields a vector of term frequencies.
However, simply using the “bag of words” directly has a number of drawbacks. First of all, large documents will have more words than small documents.
Hence it often makes sense to normalize the frequency vectors.
\(\ell_1\) or \(\ell_2\) normalization are common.
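As a quick illustration (on a tiny made-up corpus, not the newsgroup data), here is what the bag-of-words counts and an \(\ell_2\)-normalized version look like using scikit-learn’s CountVectorizer:
Code
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog sat"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)                 # bag-of-words term counts
print(vec.get_feature_names_out())
print(counts.toarray())                          # raw term frequencies per document
print(normalize(counts, norm='l2').toarray())    # length-normalized frequency vectors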
Next, as noted in scikit-learn:
In a large text corpus, some words will be very [frequent] (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document.
If we were to feed the direct count data directly to a classifier those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.
This is originally a term weighting scheme developed for information retrieval (as a ranking function for search engines results), that has also found good use in document classification and clustering.
The idea is that rare words are more informative than common words.
(This has connections to information theory).
Hence, the definition of tf-idf is as follows.
First:
\[ \text{tf}(t,d) = \text{Number of times term }t \text{ occurs in document } d~. \]
Next, if \(N\) is the total number of documents in the corpus \(D\) then:
\[ \text{idf}(t,D)=\log{\frac{N}{|\{d\in D : t\in d \}|}}, \]
where the denominator is the number of documents in which the term \(t\) appears.
And finally:
\[ \text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,D). \]
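To make these formulas concrete, here is a small sketch (with a made-up three-document corpus) that computes tf-idf exactly as defined above. Note that scikit-learn’s TfidfVectorizer, used next, applies a smoothed idf and normalizes the resulting vectors, so its values will differ slightly.
Code
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

def tf(t, d):
    # number of times term t occurs in document d
    return d.split().count(t)

def idf(t, D):
    # log of (number of documents) / (number of documents containing t)
    n_containing = sum(1 for d in D if t in d.split())
    return np.log(len(D) / n_containing)

def tf_idf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tf_idf("the", docs[0], docs))   # common word: tf=2, idf=log(3/2), weight ~0.81
print(tf_idf("cat", docs[0], docs))   # rarer word: tf=1, idf=log(3), weight ~1.10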
Code
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 4, max_df = 0.8)
data = vectorizer.fit_transform(news_data.data)
Getting to know the Data
Code
print(f"The data's type is {type(data)} and the shape is {data.shape}")
The data's type is <class 'scipy.sparse._csr.csr_matrix'> and the shape is (1781, 6613)
For what it’s worth we can look at the data feature matrix:
Code
fig, ax1 = plt.subplots(1, 1, figsize=(8, 4))
dum = sns.heatmap(data[1:100, 1:200].todense(), xticklabels=False, yticklabels=False,
                  linewidths=0, cbar=True, ax=ax1)
Code
print(news_data.target)
print(news_data.target_names)
[2 0 0 ... 2 1 2]
['comp.os.ms-windows.misc', 'rec.sport.baseball', 'sci.space']
Selecting the Number of Clusters
Now let’s look at the different cluster measures versus number of clusters.
Code
error = evaluate_clusters(data, 10)
plt.plot(range(1, len(error)), error[1:])
plt.title('$k$-means Clustering Performance on Newsgroup Articles')
plt.xlabel('Number of clusters')
plt.ylabel('Error')
plt.show()
Code
ri, ari = ri_evaluate_clusters(data, 10, news_data.target)
plt.plot(range(1, len(ari)), ari[1:], 'o-')
plt.xlabel('Number of clusters')
plt.title('$k$-means Clustering Compared to Known Labels\nNewsgroup Articles')
plt.ylabel('Adjusted Rand Index')
plt.show()
Code
s = sc_evaluate_clusters(data, 10, 100, 3)
plt.plot(range(2, len(s)), s[2:], 'o-')
plt.xlabel('Number of Clusters')
plt.title('$k$-means clustering performance on Newsgroup Articles')
plt.ylabel('Silhouette Score')
plt.show()
Looking into the clusters
Run \(k\)-means with 4 clusters:
Code
k = 4
kmeans = KMeans(n_clusters = k, init = 'k-means++', max_iter = 100, n_init = 25, random_state = 3)
kmeans.fit_predict(data)
array([2, 1, 1, ..., 3, 2, 3], dtype=int32)
Find the top 10 terms per cluster:
Code
asc_order_centroids = kmeans.cluster_centers_.argsort()  #[:, ::-1]
order_centroids = asc_order_centroids[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(k):
    print(f'Cluster {i}:')
    for ind in order_centroids[i, :10]:
        print(f' {terms[ind]}')
    print('')
Cluster 0:
ax
max
b8f
g9v
a86
145
1d9
pl
2di
0t
Cluster 1:
windows
file
dos
files
thanks
use
drivers
card
driver
problem
Cluster 2:
year
think
just
like
game
team
good
don
baseball
games
Cluster 3:
space
nasa
moon
launch
orbit
shuttle
earth
like
people
lunar
Pairwise Distances Matrix
Let’s calculate the pairwise distances matrix.
Code
euclidean_dists = metrics.euclidean_distances(data)
labels = kmeans.labels_
idx = np.argsort(labels)
clustered_dists = euclidean_dists[idx][:, idx]
fig, ax1 = plt.subplots(1, 1, figsize=(6, 6))
dum = sns.heatmap(clustered_dists, xticklabels=False, yticklabels=False, linewidths=0, square=True, cbar=True, ax=ax1)
MDS Embedding
Let’s visualize with MDS.
Note that MDS is a slow algorithm and we can’t do all 1700+ data points quickly, so we will take a random sample.
Code
import random
n_items = euclidean_dists.shape[0]
random.seed(42)
subset = random.sample(range(n_items), 500)

fit = mds.fit(euclidean_dists[subset][:, subset])
pos = fit.embedding_
We have the labels:
Code
labels
array([2, 1, 1, ..., 3, 2, 3], dtype=int32)
MDS Embedding
Code
colorblind_palette = ['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2']
cols = [colorblind_palette[l] for l in labels[subset]]
plt.scatter(pos[:, 0], pos[:, 1], s = 12, c = cols)
plt.title('MDS Embedding of Newsgroup Articles')

unique_labels = np.unique(labels[subset])
# Create legend handles and labels based on unique labels
handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colorblind_palette[l], markersize=10) for l in unique_labels]
legend_labels = [f'Cluster {l}' for l in unique_labels]
# Add the legend to the plot
plt.legend(handles, legend_labels, title='Clusters')
plt.show()
Recap and Next
We’ve covered:
- How clustering is used in practice
- Tools for evaluating the quality of a clustering
- Tools for assigning meaning or labels to a cluster
- Important visualizations
- A little bit about feature extraction for text
Next time, we’ll look at:
- Hierarchical clustering
- Gaussian mixture models