So far, we have seen how to cluster objects using \(k\)-means:
Start with an initial set of cluster centers,
Assign each object to its closest cluster center
Recompute the centers of the new clusters
Repeat steps 2 and 3 until convergence (a minimal sketch follows below)
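As a refresher, here is a minimal NumPy sketch of this iteration. The function name, the random initialization, and the fixed iteration cap are illustrative choices, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X has shape (N, d); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers with k randomly chosen points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```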
In \(k\)-means clustering every object is assigned to a single cluster.
This is called hard assignment.
However, there may be cases where we either cannot use hard assignments or we do not want to do it.
In particular, we might have reason to believe that the best description of the data is a set of overlapping clusters.
Overlapping Clusters
Rich or Poor Example
Let’s consider as an example a society that consists of just two kinds of individuals: poor or rich.
How might we model society as a mixture of poor and rich, when viewed in terms of the single feature age?
Rich or Poor Sampling
Let’s sample 20,000 rich individuals and 20,000 poor individuals. From this sample we get the following histograms.
Code
```python
# original inspiration for this example from
# https://www.cs.cmu.edu/~./awm/tutorials/gmm14.pdf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

np.random.seed(4)
df = pd.DataFrame(multivariate_normal.rvs(mean=np.array([37, 45]),
                                          cov=np.diag([196, 121]),
                                          size=20000),
                  columns=['poor', 'rich'])
df.hist(bins=range(80), sharex=True)
plt.show()
```
We find that ages of the poor set have mean \(\mu=37\) with standard deviation \(\sigma=14\), while the ages of the rich set have mean \(\mu=45\) with standard deviation \(\sigma=11\).
Rich or Poor by Age
Code
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(norm.ppf(0.001, loc=37, scale=14),
                norm.ppf(0.999, loc=37, scale=14), 100)
plt.plot(x, norm.pdf(x, loc=37, scale=14), 'b-', lw=5, alpha=0.6, label='poor')
x = np.linspace(norm.ppf(0.001, loc=45, scale=11),
                norm.ppf(0.999, loc=45, scale=11), 100)
plt.plot(x, norm.pdf(x, loc=45, scale=11), 'g-', lw=5, alpha=0.6, label='rich')
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc='best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()
```
Soft Clustering
Viewed along the age dimension we observe that there are two overlapping clusters.
Furthermore, given some particular individual at a given age, say 25, we cannot say for sure which cluster they belong to.
Rather, we will use probability to quantify our uncertainty about the cluster that any single individual belongs to.
Thus, we could say that a given individual (“John Smith”, age 25) belongs to the rich cluster with some probability and the poor cluster with a different probability.
Naturally we expect the probabilities for John Smith to sum up to 1.
This is called soft assignment, and a clustering using this principle is called soft clustering.
Conditional Probability
More formally, we say that an object can belong to each particular cluster with some probability, such that the sum of the probabilities adds up to 1 for each object.
For example, assuming that we have two clusters \(C_1\) and \(C_2\), we can have that an object \(x_1\) belongs to \(C_1\) with probability \(0.3\) and to \(C_2\) with probability \(0.7\).
Note that the distribution over \(C_1\) and \(C_2\) only refers to object \(x_1\).
Parameter estimation is the process of determining the parameters of a distribution based on a sample of observed data.
Maximum likelihood estimation is a method to estimate the parameters of an assumed probability distribution given a sample of observed data.
Likelihood Function
The likelihood function \(L(\boldsymbol{\theta}, x)\) represents the probability of observing the given data \(x\) as a function of the parameters \(\boldsymbol{\theta}\) of the distribution.
The primary purpose of the likelihood function is to estimate the parameters that make the observed data \(x\) most probable.
The likelihood function for a set of samples \(x_{n}~\text{for}~n=1, \ldots, N\) drawn independently from a Gaussian distribution with mean \(\mu\) and variance \(\sigma^{2}\) is
\[
L(\mu, \sigma^{2}; x_{1}, \ldots, x_{N}) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(x_{n}-\mu)^{2}}{2\sigma^{2}}\right).
\]
Using the log-likelihood we will be able to derive formulas for the maximum likelihood estimates.
Maximizing the log-likelihood
How do we maximize (optimize) a function of parameters?
To find the optimal parameters of a function, we compute the partial derivatives of the function with respect to each parameter and set them equal to zero. Solving these equations gives a locally optimal value for the parameters.
Given samples \(x_{1}, \ldots, x_{N}\) drawn from a Gaussian distribution, we compute the maximum likelihood estimates by maximizing the log-likelihood function
\[
\log L(\mu, \sigma^{2}; x_{1}, \ldots, x_{N}) = -\frac{N}{2}\log\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}.
\]
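Setting the partial derivatives of the log-likelihood with respect to \(\mu\) and \(\sigma^{2}\) to zero gives the standard closed-form estimates, namely the sample mean and the (biased) sample variance:
\[
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_{n},
\qquad
\hat{\sigma}^{2} = \frac{1}{N}\sum_{n=1}^{N} \left(x_{n} - \hat{\mu}\right)^{2}.
\]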
We use this information to help our understanding of Gaussian Mixture models.
Gaussian Mixture Models
Univariate Gaussians
In GMMs we assume that each cluster follows a Gaussian (normal) distribution with its own parameters.
Code
```python
import numpy as np
import matplotlib.pyplot as plt

# Define the parameters for the 4 Gaussians
params = [
    {"mean": 0,  "variance": 1,   "color": "#377eb8"},  # Blue
    {"mean": 2,  "variance": 0.5, "color": "#4daf4a"},  # Green
    {"mean": -2, "variance": 1.5, "color": "#e41a1c"},  # Red
    {"mean": 1,  "variance": 2,   "color": "#984ea3"},  # Purple
]

# Generate x values
x = np.linspace(-10, 10, 1000)

# Plot each Gaussian
for param in params:
    mean = param["mean"]
    variance = param["variance"]
    color = param["color"]
    # Calculate the Gaussian density
    y = (1 / np.sqrt(2 * np.pi * variance)) * np.exp(-(x - mean)**2 / (2 * variance))
    plt.plot(x, y, color=color, label=f"$\\mu$: {mean}, $\\sigma^2$: {variance}")

# Add legend, title, and labels
plt.legend()
plt.title("1D Gaussians with Different Means and Variances")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.show()
```
This is a similar situation to our previous example where we considered labeling a person as poor or rich based on the single feature age.
This could be a hypothetical situation with \(K=4\) normally distributed clusters.
Multivariate Gaussians
Given data points \(\mathbf{x}\in\mathbb{R}^{d}\), i.e., \(d\)-dimensional points, we use the multivariate Gaussian to describe the clusters. The formula for the multivariate Gaussian is
\[
f(\mathbf{x} \vert \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\vert\Sigma\vert^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),
\]
where \(\Sigma\in\mathbb{R}^{d\times d}\) is the covariance matrix, \(\vert \Sigma\vert\) denotes the determinant of \(\Sigma\), and \(\boldsymbol{\mu}\) is the \(d\)-dimensional vector of expected values.
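As a quick sanity check (not part of the original example), we can evaluate this formula directly and compare it against scipy.stats.multivariate_normal; the particular mean, covariance, and test point below are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D parameters (chosen for this check only)
mu = np.array([37.0, 45.0])
Sigma = np.array([[196.0, 30.0],
                  [30.0, 121.0]])
x = np.array([25.0, 40.0])

# Density computed from the formula above
d = len(mu)
diff = x - mu
pdf_formula = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
              np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Density computed by scipy
pdf_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

print(pdf_formula, pdf_scipy)  # the two values should agree
```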
Notation
The notation \(x \sim \mathcal{N}(\mu, \sigma^2)\) is used to denote a univariate random variable that is normally distributed with expected value \(\mu\) and variance \(\sigma^{2}.\)
The notation \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\) is used to denote a multivariate random variable that is normally distributed with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\).
Multivariate Example: Cars
To illustrate a particular model, let us consider the properties of cars produced in the US, Europe, and Asia.
For example, let’s say we are looking to classify cars based on their model year and miles per gallon (mpg).
It seems that the data can be described (roughly) as a mixture of three multivariate Gaussian distributions.
Intuition
Given \(K\) clusters, the goal with a GMM is to determine the probability for whether a point \(\mathbf{x}\in\mathbb{R}^{d}\) belongs to a particular cluster.
We assume that each cluster is distributed normally, according to some (unknown) parameters \(\boldsymbol{\mu}_i, \Sigma_i\). We also assume that the probability that a point belongs to a particular cluster is given by \(w_i\).
A Gaussian mixture model is defined by these parameters \(w_i, \boldsymbol{\mu}_i, \Sigma_i\) for \(i=1, \ldots, K\).
We can use these parameters to compute the probability that a point \(\mathbf{x}\) belongs to a particular cluster \(C_k\).
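Concretely (anticipating the formulas used later in the EM algorithm), Bayes' rule gives
\[
P(C_k \vert \mathbf{x}) = \frac{w_k \, f(\mathbf{x} \vert \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, f(\mathbf{x} \vert \boldsymbol{\mu}_j, \Sigma_j)},
\]
where \(f(\mathbf{x} \vert \boldsymbol{\mu}_k, \Sigma_k)\) denotes the multivariate Gaussian density of cluster \(C_k\).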
Similar to MLE, we must compute the parameters \(w_i, \boldsymbol{\mu}_i, \Sigma_i\).
Learning the Parameters of a GMM
Given some data, how do we estimate the
cluster probability \(w_i\),
cluster mean \(\boldsymbol{\mu}_i\), and
cluster covariance \(\Sigma_i\)
for each \(i=1, \ldots, K\)?
We will formulate a likelihood function for these parameters \(\boldsymbol{\theta}_i = (w_i, \boldsymbol{\mu}_i, \Sigma_i)\) and then optimize this function.
Latent Variables
We assume that each data point \(x\) is produced by a latent variable \(z\).
This variable is called latent because it is never actually observed. It is used to help indicate cluster membership and is helpful for the derivation of the likelihood function.
We can view \(z\) as a one-hot encoding that indicates membership in one of the \(K\) clusters.
We impose a distribution over \(z\) representing a soft assignment:
\[
P(z = i) = w_i, \qquad i = 1, \ldots, K, \qquad \sum_{i=1}^{K} w_i = 1.
\]
With our distribution weights \(w_i\) for the \(K\) clusters, we can compute the probability density of a GMM at any point \(\mathbf{x}\) via the formula
\[
f(\mathbf{x}) = \sum_{i=1}^{K} w_i \, f(\mathbf{x} \vert \boldsymbol{\mu}_i, \Sigma_i).
\]
If this is feeling a bit abstract, let’s see how this applies to the initial example where we were clustering a person as rich or poor based on their age.
Rich vs Poor Example
We have
our single data point \(x=25\), and
we have 2 Gaussians representing our clusters,
\(C_1\) with parameters \(\mu_1 = 37, \sigma_1 = 14\) and
\(C_2\) with parameters \(\mu_2 = 45, \sigma_2 = 11\),
the distribution over the latent variable \(z\) gives the prior probabilities of the two clusters, i.e., \[
w_1 = P(z = 1) = P(\text{poor})
\] and \[
w_2 = P(z = 2) = P(\text{rich}).
\]
Our interest in this problem is computing the posterior probability, which by Bayes' rule is
\[
P(z = k \vert x = 25) = \frac{w_k \, f(x = 25 \vert \mu_k, \sigma_k)}{w_1 \, f(x = 25 \vert \mu_1, \sigma_1) + w_2 \, f(x = 25 \vert \mu_2, \sigma_2)}, \qquad k = 1, 2,
\]
where \(f(x \vert \mu, \sigma)\) is the univariate Gaussian density.
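A minimal numerical sketch of this computation, assuming for illustration that the prior weights are \(w_1 = w_2 = 0.5\) (the notes do not fix these values):

```python
from scipy.stats import norm

# Cluster parameters from the rich/poor example
mu = {'poor': 37, 'rich': 45}
sigma = {'poor': 14, 'rich': 11}
w = {'poor': 0.5, 'rich': 0.5}   # assumed prior weights (illustrative)

x = 25  # John Smith's age

# Unnormalized posteriors: prior weight times Gaussian density at x
unnorm = {c: w[c] * norm.pdf(x, loc=mu[c], scale=sigma[c]) for c in w}
total = sum(unnorm.values())

# Normalized posterior probabilities sum to 1
posterior = {c: unnorm[c] / total for c in unnorm}
print(posterior)
```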
If we try to fit a GMM by maximizing the likelihood directly, the log-likelihood of the data contains the log of a sum,
\[
\log L = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} w_k \, f(\mathbf{x}_n \vert \boldsymbol{\mu}_k, \Sigma_k) \right).
\]
Unlike the log of a product, the log of a sum does not simplify nicely. As a result, the partial derivatives with respect to \(\boldsymbol{\mu}_k\) depend on the covariances and mixture weights (and similarly for the partial derivatives with respect to \(\Sigma_k\) and \(w_k\)), so there is no closed-form solution.
To solve this problem we turn to Expectation Maximization.
Expectation Maximization
This is another famous algorithm, in the same “super-algorithm” league as \(k\)-means.
EM is formulated using a probabilistic model for data. It can solve a problem like
Given a set of data points and a parameter \(K\), find the \((w_k, \mu_k, \Sigma_k)\) for \(k = 1,\dots,K\) that maximize the likelihood of the data, assuming a GMM with those parameters.
It can also solve lots of other problems involving maximizing the likelihood of data under a different model.
Similar to \(k\)-means, this problem is NP-hard.
Furthermore, EM only guarantees that we will find a local optimum of the objective function.
Expectation Maximization for GMM – The Algorithm
Here is the EM algorithm.
Step 1 Initialization
Initialize the parameters \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) for \(k=1, \dots, K\). The final result will be sensitive to this choice, so a good (and fast) initialization procedure is \(k\)-means.
Step 2 Expectation
Use the current values of \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) and, for each of the \(N\) data points \(\mathbf{x}_n\), compute the posterior probabilities
\[
r_{nk} = \frac{w_k \, f(\mathbf{x}_n \vert \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, f(\mathbf{x}_n \vert \boldsymbol{\mu}_j, \Sigma_j)},
\]
where \(f(\mathbf{x}_n \vert \boldsymbol{\mu}_{k}, \Sigma_k)\) is the multivariate Gaussian.
Step 3 Maximization
Using the values \(r_{nk}\) for \(n=1,\ldots, N\) and \(k=1, \ldots, K\), first compute \(N_{k} = \sum_{n=1}^{N} r_{nk}\). Then compute updated \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) according to the formulas
\[
w_k = \frac{N_k}{N}, \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\,\mathbf{x}_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{T}.
\]
Step 4 Convergence Check
Repeat Steps 2 and 3 until the parameters stop changing, i.e., until
\[
||\theta^{(t)} - \theta^{(t-1)}||_2 < \text{tol},
\]
where:
\(\theta^{(t)}\) represents the model parameters (means, covariances, and weights) at iteration \(t\),
\(||\cdot||_2\) is the Euclidean (L2) norm,
\(\text{tol}\) is a small positive number.
Maximum Number of Iterations
The EM algorithm is typically capped at a maximum number of iterations to avoid long runtimes in cases where the log-likelihood or parameters converge very slowly or never fully stabilize.
\[
t > \text{max\_iterations}
\]
Where:
\(\text{max\_iterations}\) is a predefined limit (e.g., 100 or 500 iterations).
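Putting the four steps and the stopping rules together, here is a minimal NumPy sketch of EM for a GMM. The function name, the random initialization, and the small regularization term added to the covariances are illustrative choices, not part of the lecture notes; in practice you would use sklearn.mixture.GaussianMixture, as in the example later in this section.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iterations=100, tol=1e-4, seed=0):
    """Minimal EM for a Gaussian mixture model. X has shape (N, d)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)

    # Step 1: initialization (random here; k-means would be a better choice)
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])

    for t in range(max_iterations):
        # Step 2 (E-step): posterior probabilities r_nk
        r = np.column_stack([
            w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)

        # Step 3 (M-step): update weights, means, covariances
        Nk = r.sum(axis=0)                       # shape (K,)
        w_new = Nk / N
        mu_new = (r.T @ X) / Nk[:, None]
        Sigma_new = np.empty_like(Sigma)
        for k in range(K):
            diff = X - mu_new[k]
            Sigma_new[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            Sigma_new[k] += 1e-6 * np.eye(d)     # keep covariances well conditioned

        # Step 4: convergence check on the change in parameters
        change = (np.linalg.norm(w_new - w) + np.linalg.norm(mu_new - mu)
                  + np.linalg.norm(Sigma_new - Sigma))
        w, mu, Sigma = w_new, mu_new, Sigma_new
        if change < tol:
            break

    return w, mu, Sigma, r
```

sklearn's GaussianMixture implements the same iteration, with additional numerical safeguards.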
Storage and Computational Costs
It is important to be aware of the computational costs of our algorithm.
Storage costs:
There are \(N\) \(d\)-dimensional points
There are \(K\) clusters
There are \(N\times K\) coefficients \(r_{nk}\)
There are \(K\) \(d\)-dimensional cluster centers \(\boldsymbol{\mu}_k\)
There are \(K\) \(d\times d\) covariance matrices \(\Sigma_k\)
There are \(K\) weights \(w_k\)
Computational costs:
Computing each \(r_{nk}\) requires a sum of \(K\) evaluations of the Gaussian PDF.
Updating \(w_k\) requires a division (though we must compute \(N_k\))
k-means vs GMMs
Let’s pause for a minute and compare GMM/EM with \(k\)-means.
GMM/EM
Initialize randomly or using some rule
Compute the probability that each point belongs in each cluster
Update the clusters (weights, means, and covariances).
Repeat 2-3 until convergence.
\(k\)-means
Initialize randomly or using some rule
Assign each point to a single cluster
Update the clusters (means).
Repeat 2-3 until convergence.
From a practical standpoint, the main difference is that in GMMs, data points do not belong to a single cluster, but have some probability of belonging to each cluster.
In other words, as stated previously, GMMs use soft assignment.
For that reason, GMMs are also sometimes called soft \(k\)-means.
However, there is also an important conceptual difference.
The GMM starts by making an explicit assumption about how the data were generated.
It says: “the data came from a collection of multivariate Gaussians.”
We made no such assumption when we came up with the \(k\)-means problem. In that case, we simply defined an objective function and declared that it was a good one.
Nonetheless, it turns out that we were implicitly making a sort of Gaussian assumption when we formulated the \(k\)-means objective function; it simply was never stated.
The point is that because the GMM makes its assumptions explicit, we can
examine them and think about whether they are valid, and
replace them with different assumptions if we wish.
For example, it is perfectly valid to replace the Gaussian assumption with some other probability distribution. As long as we can estimate the parameters of such distributions from data (e.g., have MLEs), we can use EM in that case as well.
Versatility of EM
A final statement about EM generally. EM is a versatile algorithm that can be used in many other settings. What is the main idea behind it?
Notice that the problem definition only required that we find the clusters, \(C_i\), meaning that we were to find the \((\mu_i, \Sigma_i)\).
However, the EM algorithm also finds \(P(C_j|x_i) = P(z=j | x_i)\), that is, the probability that each point is a member of each cluster.
This is the true heart of what EM does.
By adding parameters to the problem, it actually finds a way to make the problem solvable.
These are the latent variables we introduced earlier; they don’t show up in the final solution.
Example
Here is an example using GMM.
We’re going to create two clusters, one spherical, and one highly skewed.
Code
```python
import numpy as np

# Number of samples of larger component
n_samples = 1000

# C is a transformation that will make a heavily skewed 2-D Gaussian
C = np.array([[0.1, -0.1], [1.7, .4]])
print(f'The covariance matrix of our skewed cluster will be:\n{C.T @ C}')
```
The covariance matrix of our skewed cluster will be:
[[2.9 0.67]
[0.67 0.17]]
Code
```python
rng = np.random.default_rng(0)

# now we construct a data matrix that has n_samples from the skewed distribution,
# and n_samples/2 from a symmetric distribution offset to position (-4, 2)
X = np.r_[(rng.standard_normal((n_samples, 2)) @ C),
          .7 * rng.standard_normal((n_samples // 2, 2)) + np.array([-4, 2])]
```
Code
```python
plt.scatter(X[:, 0], X[:, 1], s=10, alpha=0.8)
plt.axis('equal')
plt.axis('off')
plt.show()
```
Code
```python
# Fit a mixture of Gaussians with EM using two components
import sklearn.mixture

gmm = sklearn.mixture.GaussianMixture(n_components=2,
                                      covariance_type='full',
                                      init_params='kmeans')
y_pred = gmm.fit_predict(X)
```
Code
```python
colors = ['bg'[p] for p in y_pred]
plt.title('Clustering via GMM')
plt.axis('off')
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.show()
```
Code
```python
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f'  weight: {gmm.weights_[clus]:0.3f}')
    print(f'  mean: {gmm.means_[clus]}')
    print(f'  cov: \n{gmm.covariances_[clus]}\n')
```
Code
```python
# y_pred_over and gmm here are assumed to come from a GaussianMixture fit
# to two heavily overlapping clusters (that fit is not reproduced here)
colors = ['bgrky'[p] for p in y_pred_over]
plt.title('GMM for overlapping clusters\nNote they have nearly the same center')
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:, 0], gmm.means_[:, 1], 'ro')
plt.show()
```
Tied Clusters
Perhaps you believe that all clusters should have the same shape and orientation, i.e., share a single \(d\times d\) covariance matrix.
This is specified by the GMM parameter covariance_type='tied'.
Code
```python
# assumes gmm has been refit with covariance_type='tied', e.g.
#   gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='tied')
#   y_pred = gmm.fit_predict(X)
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.title('Covariance type = tied')
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:, 0], gmm.means_[:, 1], 'ok')
plt.show()
```
Non-Skewed Clusters
Perhaps you believe in even more restricted shapes: all clusters should have their axes aligned with the coordinate axes.
That is, clusters are not skewed.
Then you only need to estimate the diagonals of the covariance matrices - just \(Kd\) parameters.
This is specified by the GMM parameter covariance_type='diag'.
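The diagonal-covariance refit itself is not shown in the notes; a minimal version, reusing X and the two components from the example above, might look like this:

```python
import sklearn.mixture

# refit the mixture, constraining each covariance matrix to be diagonal
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='diag')
y_pred = gmm.fit_predict(X)

# with 'diag', sklearn stores one length-d vector of variances per cluster
print(gmm.covariances_.shape)   # (2, 2) here: K = 2 clusters, d = 2 features
```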
Code
```python
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:, 0], gmm.means_[:, 1], 'oc')
plt.show()
```
Round Clusters
Finally, if you believe that all clusters should be round, then you only need to estimate the \(K\) variances.
This is specified by the GMM parameter covariance_type='spherical'.
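Again, the refit is not shown in the notes; a minimal version, reusing X from the example above, might look like this:

```python
import sklearn.mixture

# refit the mixture, constraining each cluster to a single (spherical) variance
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='spherical')
y_pred = gmm.fit_predict(X)

# with 'spherical', sklearn stores just one variance per cluster
print(gmm.covariances_.shape)   # (2,): K = 2 variances
```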
Code
```python
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:, 0], gmm.means_[:, 1], 'oc')
plt.show()
```