Soft Clustering with Gaussian Mixture Models

From Hard to Soft Clustering

So far, we have seen how to cluster objects using \(k\)-means:

  1. Start with an initial set of cluster centers,
  2. Assign each object to its closest cluster center
  3. Recompute the centers of the new clusters
  4. Repeat 2 \(\rightarrow\) 3 until convergence

In \(k\)-means clustering every object is assigned to a single cluster.

This is called hard assignment.

However, there may be cases where we either cannot use hard assignments or do not want to.

In particular, we might have reason to believe that the best description of the data is a set of overlapping clusters.

Overlapping Clusters

Income Level Example

Consider a population in terms of a simple binary income level: higher or lower than some threshold.

How might we model such a population in terms of the single feature age?

Income Level Sampling

Let’s sample 20,000 individuals above the income threshold and 20,000 below it.

From this sample we get the following histograms.

Code
# original inspiration for this example from
# https://www.cs.cmu.edu/~./awm/tutorials/gmm14.pdf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
np.random.seed(4)
df = pd.DataFrame(multivariate_normal.rvs(mean = np.array([37, 45]), cov = np.diag([196, 121]), size = 20000),
                   columns = ['below', 'above'])
df.hist(bins = range(80), sharex = True, sharey=True, figsize=(12, 2))
plt.show()

We find that the ages of the below-threshold group have mean \(\mu=37\) and standard deviation \(\sigma=14\), while the ages of the above-threshold group have mean \(\mu=45\) and standard deviation \(\sigma=11\).

Income Level by Age

We can plot Gaussian distributions for the two income levels.

Code
from scipy.stats import norm
plt.figure(figsize=(7, 4.5))
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'below')
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'above')
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()

Soft Clustering

  • We have two overlapping clusters.

  • Given a particular individual of a given age, say 25, we cannot say for sure which cluster they belong to.

  • We use probability to quantify our uncertainty about cluster membership.

  • An individual belongs to cluster 1 with some probability \(p_1\) and to cluster 2 with a different probability \(p_2\).

  • Naturally, we expect the probabilities to sum to 1.

  • This is called soft assignment, and a clustering that uses this principle is called soft clustering.

Conditional Probability

More formally, we say that an object can belong to each particular cluster with some probability, such that the sum of the probabilities adds up to 1 for each object.

For example, assuming that we have two clusters \(C_1\) and \(C_2\), we can have that an object \(x_1\) belongs to \(C_1\) with probability \(0.3\) and to \(C_2\) with probability \(0.7\).

Note that the distribution over \(C_1\) and \(C_2\) only refers to object \(x_1\).

Thus, it is a conditional probability:

\[ \begin{align*} P(C_1 \,|\, x_1 = 25) &= 0.3, \\ P(C_2 \,|\, x_1 = 25) &= 0.7. \end{align*} \]

And to return to our previous example

\[ P(\text{above income threshold}\,|\,\text{age 25}) + P(\text{below income threshold}\,|\,\text{age 25}) = 1. \]

GMM Overview

Our goal with Gaussian mixture models (GMMs) is to show how to compute the probabilities that a data point belongs to a particular cluster.

The main ideas behind GMMs are

  • Assume each cluster is normally distributed
  • Use the Expectation Maximization (EM) algorithm to iteratively determine:
    • the parameters of the Gaussian clusters using MLE, and
    • the probabilities that a data point belongs to a particular cluster

We will see that GMMs are better suited than \(k\)-means for clustering non-spherical distributions of points.

To help us in understanding GMMs we will first review maximum likelihood estimation.

Maximum Likelihood Estimation (MLE)

Motivation

Probability distributions are specified by their parameters.

The Gaussian (Normal) distribution is determined by the parameters \(\mu\) and \(\sigma^{2}\), i.e.,

\[ \begin{align*} f(x\vert\mu, \sigma^{2}) & = \mathcal{N}(x\vert \mu, \sigma^{2}) \\ & = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^2}}. \end{align*} \]

Maximum likelihood estimation is a method for estimating the parameters of a probability distribution that best fit a sample of observed data.

Likelihood Function

The likelihood function \(L(\boldsymbol{\theta}, x)\) represents the probability of observing the given data \(x\) as a function of the parameters \(\boldsymbol{\theta}\) of the distribution.

The primary purpose of the likelihood function is to estimate the parameters that make the observed data \(x\) most probable.

The likelihood function for a set of independent and identically distributed (i.i.d.) samples \(x_{n},~n=1, \ldots, N,\) drawn from a Gaussian distribution is

\[ L(\mu, \sigma^{2}, x_1, \ldots, x_n) \ = \prod_{n=1}^{N}\mathcal{N}(x_n\vert \mu, \sigma^{2}) \ = \prod_{n=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}. \]

Maximizing the Likelihood

For a particular set of parameters \(\mu, \sigma^{2}\)

  • large values of \(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\) indicate the observed data is very probable (high likelihood) and thus well modeled by the parameters
  • small values of \(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\) indicate the observed data is very improbable (low likelihood) and thus poorly modeled by the parameters

The parameters that maximize the likelihood function are called the maximum likelihood estimates.

Log-likelihood

A common manipulation to obtain a more useful form of the likelihood function is to take its natural logarithm.

Advantages of the log-likelihood:

  • The log function is monotonically increasing, so maximizing the log-likelihood yields the same estimates as maximizing the likelihood
  • The product of probabilities becomes a sum of logarithms, which is more numerically stable
  • The resulting sums are easier to differentiate and maximize

Applying the Log

\[ \begin{align*} \ell(\mu, \sigma^{2}, x_1, \ldots, x_n) \ & = \log{\left(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\right)} \\ & = \log{\left(\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)} \\ & = \sum_{n=1}^{N}\log{\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)} \\ & = \sum_{n=1}^{N}\left(\log{\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)} + \log{\left(e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)}\right) \\ & = -\frac{N}{2}\log{(2\pi\sigma^2)} - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n -\mu)^{2}. \end{align*} \]

Using the log-likelihood we will be able to derive formulas for the maximum likelihood estimates.
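
As a sanity check on this closed form, we can compare it against a direct sum of log-densities computed with scipy. This is a minimal sketch; the sample and the parameter values (reusing \(\mu=37\), \(\sigma=14\) from the income example) are arbitrary.

Code
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 37, 14
x = rng.normal(mu, sigma, size=1000)   # i.i.d. Gaussian sample

# Direct evaluation: sum of the individual log-densities
ll_direct = norm.logpdf(x, loc=mu, scale=sigma).sum()

# Closed form derived above
N = len(x)
ll_closed = -N / 2 * np.log(2 * np.pi * sigma**2) - ((x - mu)**2).sum() / (2 * sigma**2)

print(np.isclose(ll_direct, ll_closed))   # should print True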

Maximizing the log-likelihood

How do we maximize (optimize) a function of parameters?

To find the parameters that optimize a function, we compute its partial derivatives with respect to the parameters and set them equal to zero. Solving these equations gives locally optimal values for the parameters.

For the case of the Gaussian we compute

\[ \begin{align*} \nabla_{\mu} \ell(\mu, \sigma^{2}, x_1, \ldots, x_n) &= 0, \\ \nabla_{\sigma} \ell(\mu, \sigma^{2}, x_1, \ldots, x_n) &= 0. \\ \end{align*} \]

Gaussian MLEs

The maximum likelihood estimates for a Gaussian distribution are given by

\[ \begin{align*} \bar{\mu} &= \frac{1}{N}\sum_{n=1}^{N} x_{n}, \\ \bar{\sigma}^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \bar{\mu})^{2}. \end{align*} \]

The full derivation of these results is provided here.

Tip: Try deriving these results yourself!

Summary of MLE

Given samples \(x_{1}, \ldots, x_n\) drawn from a Gaussian distribution, we compute the maximum likelihood estimates by maximizing the log likelihood function

\[ \ell (\mu, \sigma^{2}, x_1, \ldots, x_n) = -\frac{N}{2}\log{2\pi} - N\log{\sigma} - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n -\mu)^{2}. \]

This gives us the maximum likelihood estimates

\[ \begin{align*} \bar{\mu} &= \frac{1}{N}\sum_{n=1}^{N} x_{n}, \\ \bar{\sigma}^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \bar{\mu})^{2}. \end{align*} \]
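
As a quick illustration (a sketch with an arbitrary synthetic sample), these estimates are just the sample mean and the biased sample variance, which numpy computes directly:

Code
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=45, scale=11, size=5000)   # synthetic Gaussian sample

# Closed-form MLEs: sample mean and *biased* sample variance (divide by N, not N-1)
mu_hat = x.sum() / len(x)
var_hat = ((x - mu_hat)**2).sum() / len(x)

# numpy equivalents: x.mean() and x.var() (ddof=0 by default, i.e., the MLE)
print(np.isclose(mu_hat, x.mean()), np.isclose(var_hat, x.var()))
print(mu_hat, np.sqrt(var_hat))   # should be close to 45 and 11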

We use this information to help our understanding of Gaussian Mixture models.

Gaussian Mixture Models

Univariate Gaussians

In GMMs we assume that each of our clusters follows a Gaussian (normal) distribution with its own parameters.

Code
import numpy as np
import matplotlib.pyplot as plt

# Define the parameters for the 4 Gaussians
params = [
    {"mean": 0, "variance": 1, "color": "#377eb8"},  # Blue
    {"mean": 2, "variance": 0.5, "color": "#4daf4a"},  # Green
    {"mean": -2, "variance": 1.5, "color": "#e41a1c"},  # Red
    {"mean": 1, "variance": 2, "color": "#984ea3"}  # Purple
]

# Generate x values
x = np.linspace(-10, 10, 1000)

# Plot each Gaussian
for param in params:
    mean = param["mean"]
    variance = param["variance"]
    color = param["color"]
    
    # Calculate the Gaussian
    y = (1 / np.sqrt(2 * np.pi * variance)) * np.exp(- (x - mean)**2 / (2 * variance))
    
    # Plot the Gaussian
    plt.plot(x, y, color=color, label=f"$\\mu$: {mean}, $\\sigma^2$: {variance}")

# Add legend
plt.legend()

# Add title and labels
plt.title("1D Gaussians with Different Means and Variances")
plt.xlabel("x")
plt.ylabel("Probability Density")

# Show the plot
plt.show()

This is a similar situation to our previous example where we considered labeling a person as above or below an income threshold based on the single feature age.

This could be a hypothetical situation with \(K=4\) normally distributed clusters.

Multivariate Gaussians

Given data points \(\mathbf{x}\in\mathbb{R}^{d}\), i.e., \(d\)-dimensional points, then we use the multivariate Gaussian to describe the clusters. The formula for the multivariate Gaussian is

\[ f(\mathbf{x}\vert \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\vert \Sigma \vert^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \]

where \(\Sigma\in\mathbb{R}^{d\times d}\) is the covariance matrix, \(\vert \Sigma\vert\) denotes the determinant of \(\Sigma\), and \(\boldsymbol{\mu}\) is the \(d\)-dimensional vector of expected values.

The covariance matrix is the multidimensional analog of the variance. It determines the extent to which vector components are correlated.

See here for a refresher on the covariance matrix.
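
To see the formula in action, here is a minimal sketch evaluating the multivariate Gaussian density both with scipy's multivariate_normal and directly from the expression above; the mean vector and covariance matrix are made-up values for illustration.

Code
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])              # d = 2 mean vector (illustrative)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])         # symmetric, positive-definite covariance (illustrative)
x = np.array([1.0, -0.5])

# Density via scipy
p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

# Density via the formula above
d = len(mu)
diff = x - mu
p_formula = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
            np.sqrt((2 * np.pi)**d * np.linalg.det(Sigma))

print(np.isclose(p_scipy, p_formula))  # should print True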

Notation

The notation \(x \sim \mathcal{N}(\mu, \sigma^2)\) is used to denote a univariate random variable that is normally distributed with expected value \(\mu\) and variance \(\sigma^{2}.\)

The notation \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\) is used to denote a multivariate random variable that is normally distributed with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\).

Multivariate Example: Cars

To illustrate a particular model, let us consider the properties of cars produced in the US, Europe, and Asia.

For example, let’s say we are looking to classify cars based on their model year and miles per gallon (mpg).

It seems that the data can be described (roughly) as a mixture of three multivariate Gaussian distributions.

Expectation-Maximization for Gaussian Mixture Models

The Clustering Problem

Suppose we observe \(n\) data points \(X = \{x_1, x_2, \ldots, x_n\}\) in \(\mathbb{R}^d\) that appear to come from multiple clusters.

We want to model this data as a mixture of \(K\) Gaussian distributions, each representing a different cluster.

The Gaussian Mixture Model

A Gaussian Mixture Model (GMM) assumes each data point is generated by the following process:

  1. Choose a cluster \(k \in \{1, 2, \ldots, K\}\) with probability \(\pi_k\) (where \(\sum_{k=1}^K \pi_k = 1\))
  2. Generate the point from a Gaussian distribution: \(x_i \sim \mathcal{N}(\mu_k, \Sigma_k)\)

The probability density of observing point \(x_i\) is:

\[ p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \]

where \(\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K\) are the parameters we need to estimate.
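
The two-step generative process above is easy to simulate. The following sketch uses made-up weights and parameters for \(K=2\) components in two dimensions:

Code
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (K = 2 components, d = 2 dimensions)
pi = np.array([0.3, 0.7])                                   # mixing weights, sum to 1
mus = [np.array([-4.0, 2.0]), np.array([0.0, 0.0])]         # component means
Sigmas = [0.5 * np.eye(2),
          np.array([[2.0, 0.6], [0.6, 0.3]])]               # component covariances

n = 500
z = rng.choice(len(pi), size=n, p=pi)                       # step 1: pick component k with probability pi_k
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])   # step 2: draw x_i ~ N(mu_k, Sigma_k)

print(X.shape)                      # (500, 2)
print(np.bincount(z) / n)           # empirical component frequencies, roughly [0.3, 0.7]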

The Likelihood Function

Assuming independent and identically distributed (i.i.d.) data, the likelihood function is:

\[ L(X, \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \]

The log-likelihood function is:

\[ \ell(X, \theta) = \log L(X, \theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right) \]

Why is This Hard?

Assume \(Z = \{z_1, z_2, \ldots, z_n\}\) are the cluster assignments for each \(i=1, \ldots, n\).

If we knew which cluster generated each point, finding the maximum likelihood estimates of the parameters would be straightforward:

  • \(\mu_k\) = mean of points in cluster \(k\)
  • \(\Sigma_k\) = covariance of points in cluster \(k\)
  • \(\pi_k\) = fraction of the total points in cluster \(k\)

But we don’t know the cluster assignments!

And if we knew the parameters, we could infer the cluster assignments.

This is a classic 🐓 and 🥚 problem.

The Log-Likelihood Challenge

The log-likelihood we want to maximize is:

\[ \ell(X,\theta) = \log p(X \mid \theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right) \]

The sum inside the logarithm makes this difficult to optimize directly.

We can’t separate the terms, and taking derivatives leads to a system of coupled equations with no closed-form solution.

EM to the Rescue

EM treats the cluster assignments \(Z = \{z_1, z_2, \ldots, z_n\}\) as latent (hidden) variables.

Instead of solving for a hard assignment of points to clusters, EM finds the parameters of the MLE by iterating between two steps:

  1. E-Step: Computes soft assignments (responsibilities) - the probability each point belongs to each cluster given current parameters
  2. M-Step: Updates parameters assuming these soft assignments are the true (weighted) cluster memberships

By iterating between these steps, EM monotonically increases the likelihood until convergence to a local maximum.

The Complete Data Likelihood

If we knew the cluster assignments \(z_i \in \{1, \ldots, K\}\) for each point, the complete data log-likelihood would be:

\[ \log p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(z_i = k) \left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right] \]

This is much easier to work with because the log operates on individual Gaussian densities rather than on a sum of densities.

See here for a detailed derivation.

Tip: Try deriving this yourself!

Why EM Works for GMMs

  • E-Step: Since we don’t know \(z_i\), we compute the expected value of the indicator \(\mathbb{1}(z_i = k)\), which is the posterior probability \(\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)})\), where \(\theta^{(t)}\) is the parameter vector at iteration \(t\).
  • M-Step: With these soft assignments, maximizing the expected complete log-likelihood yields closed-form updates that look like weighted versions of the standard MLE formulas

This elegant approach transforms an intractable optimization problem into a simple iterative procedure with intuitive updates at each step.

See here for a detailed derivation.

Tip: Try deriving this yourself!

EM Algorithm for Gaussian Mixture Models

Setup and Goal

  • We observe data points \(x_1, x_2, \ldots, x_n\) in \(d\)-dimensional space
  • We assume the data comes from \(K\) Gaussian distributions, each with mean \(\mu_k\), covariance \(\Sigma_k\), and mixing weight \(\pi_k\)
  • The latent variables are the cluster assignments \(z_i\) (which component generated each point)
  • Goal: Find parameters \(\theta = \{\mu_k, \Sigma_k, \pi_k\}_{k=1}^K\) that maximize the likelihood of the observed data

Initialization

  • Randomly initialize the \(K\) means \(\mu_k\) (or use k-means clustering)
  • Initialize covariances \(\Sigma_k\) (often as identity matrices or sample covariance)
  • Initialize mixing weights \(\pi_k = \frac{1}{K}\) (uniform distribution over components)

E-Step: Compute Responsibilities

  • For each data point \(x_i\) and each component \(k\), compute the responsibility \(\gamma_{ik}\) (the probability that point \(i\) belongs to component \(k\) given current parameters)

\[ \gamma_{ik} = \frac{\pi_k \cdot \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \]

  • This is a soft assignment: each point gets a fractional membership in each cluster
  • Responsibilities sum to 1 for each data point: \(\sum_{k=1}^{K} \gamma_{ik} = 1\)
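
In code the E-step is only a few lines. Here is a sketch, assuming the data matrix X (one row per point) and the current parameters pi, mus, and Sigmas are defined as in the sampling sketch above (all names are illustrative):

Code
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Return the n-by-K matrix of responsibilities gamma[i, k]."""
    K = len(pi)
    # Numerator: pi_k * N(x_i | mu_k, Sigma_k) for every point and component
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    # Normalize each row so the responsibilities of each point sum to 1
    return weighted / weighted.sum(axis=1, keepdims=True)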

M-Step: Update Parameters

  • Update means as weighted averages:

\[\mu_k = \frac{\sum_{i=1}^{n} \gamma_{ik} x_i}{\sum_{i=1}^{n} \gamma_{ik}}\]

  • Update covariances using weighted samples:

\[\Sigma_k = \frac{\sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} \gamma_{ik}}\]

  • Update mixing weights:

\[\pi_k = \frac{\sum_{i=1}^{n} \gamma_{ik}}{n}\]

  • Each parameter update has an intuitive interpretation as a weighted MLE
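
A matching sketch of the M-step, consuming the responsibilities gamma produced by the E-step sketch above (again, the names are illustrative):

Code
import numpy as np

def m_step(X, gamma):
    """Return updated (pi, mus, Sigmas) given the responsibilities."""
    n, K = gamma.shape
    Nk = gamma.sum(axis=0)                                # effective number of points per component
    pi = Nk / n                                           # new mixing weights
    mus = [gamma[:, k] @ X / Nk[k] for k in range(K)]     # responsibility-weighted means
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]                                 # (n, d) deviations from the new mean
        Sigmas.append((gamma[:, k] * diff.T) @ diff / Nk[k])   # weighted covariance
    return pi, mus, Sigmas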

Iteration and Convergence

  • Alternate between E and M steps until convergence
  • Monitor the log-likelihood: it’s guaranteed to increase (or stay the same) at each iteration
  • Stop when log-likelihood change falls below a threshold or after a maximum number of iterations
  • Note: EM finds a local maximum, so results depend on initialization (run multiple times with different initializations)

Key Intuition

  • E-step: Given current cluster parameters, probabilistically assign points to clusters
  • M-step: Given these soft assignments, update cluster parameters as if they were the true assignments
  • This 🐓 and 🥚 problem is resolved through iteration

Income Level Example

We have

  • Our single data point \(x=25\), and
  • We have 2 Gaussians representing our clusters,
    • \(C_1\) (above the income threshold) with parameters \(\mu_1 = 45\), \(\sigma_1=11\), and
    • \(C_2\) (below the income threshold) with parameters \(\mu_2 = 37\), \(\sigma_2=14\).

Our latent variable is the cluster assignment \(z\); its prior probabilities (the mixing weights) are \[ \gamma_1 = P(z = 1) = P(\text{above income threshold}) \] and \[ \gamma_2 = P(z = 2) = P(\text{below income threshold}). \]

Income Level Example, cont.

Our interest in this problem is computing the posterior probability, which is,

\[ P(C_1 | x) = P(z=1 \vert x) = P(\text{above income threshold} \vert \text{age 25}). \]

How do we determine these posterior probabilities \(P(C_i \vert x) = P(z=i \vert x)\)?

The answer is Bayes’ Rule.

Income Level Example, cont.

If we know the parameters of the Gaussians, we can determine the value of \(P(x | C_1)\) when \(x=25\). This is shown with the red dot.

Code
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'below')
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'above')
plt.plot(25, norm.pdf(25, loc = 45, scale = 11), 'ro')
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()


Bayes’ Rule then allows us to compute

\[ P(C_1 \vert x)=\frac{P(x\vert C_1)}{P(x)}P(C_1). \]

We can always use the law of total probability to compute

\[ \begin{align*} P(x) &= P(x \vert C_1)P(C_1) + P(x\vert C_2)P(C_2), \\ &= P(z=1) P(x \vert z=1) + P(z=2) P(x\vert z=2), \\ &= \gamma_1 P(x \vert z=1) + \gamma_2 P(x\vert z=2), \\ &= \sum_{i=1}^{2} \gamma_i \, f(x\vert \mu_i, \sigma_i^{2}). \end{align*} \]

The final formula is illustrated in the following figure.


Code
plt.figure()
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'below')
plt.plot(25, norm.pdf(25, loc = 37, scale = 14), 'ko', markersize = 8)
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'above')
plt.plot(25, norm.pdf(25, loc = 45, scale = 11), 'ro', markersize = 8)
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()

\[ P(\text{above}\,|\,\text{age 25}) = \frac{\text{red}}{\text{red} \cdot P(\text{above}) + \text{black} \cdot P(\text{below})} \cdot P(\text{above}). \]

Where:

  • “red” = \(P(\text{age 25} | \text{above})\) and “black” = \(P(\text{age 25} | \text{below})\), and
  • \(P(\text{above})\) and \(P(\text{below})\) are the prior probabilities (mixing weights) that must sum to 1: \(P(\text{above}) + P(\text{below}) = 1\),
    • e.g. 0.5 in this example since we split the 20K points equally.
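
Plugging the numbers into Bayes’ Rule (priors of 0.5 each, and the two Gaussians from the example) is a quick sanity check; the posterior for the above-threshold cluster comes out to roughly 0.26.

Code
from scipy.stats import norm

x = 25
p_above, p_below = 0.5, 0.5                    # priors (mixing weights), equal split of the sample

lik_above = norm.pdf(x, loc=45, scale=11)      # "red":   P(age 25 | above)
lik_below = norm.pdf(x, loc=37, scale=14)      # "black": P(age 25 | below)

post_above = lik_above * p_above / (lik_above * p_above + lik_below * p_below)
print(post_above, 1 - post_above)              # roughly 0.26 and 0.74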

Convergence criteria

The convergence criteria for the Expectation-Maximization (EM) algorithm assess the change across iterations of either the

  • likelihood function, or
  • model parameters

usually combined with a limit on the maximum number of iterations.

Here are the common convergence criteria used:

Log-Likelihood Convergence

\[ |\mathcal{L}^{(t)} - \mathcal{L}^{(t-1)}| < \text{tol} \]

Where:

  • \(\mathcal{L}^{(t)}\) is the log-likelihood at iteration \(t\),
  • \(\text{tol}\) is a small positive number (e.g., \(10^{-4}\)).

Parameter Convergence

\[ ||\theta^{(t)} - \theta^{(t-1)}||_2 < \text{tol} \]

Where:

  • \(\theta^{(t)}\) represents the model parameters (means, covariances, and weights) at iteration \(t\),
  • \(||\cdot||_2\) is the Euclidean (L2) norm,
  • \(\text{tol}\) is a small positive number.

Maximum Number of Iterations

The EM algorithm is typically capped at a maximum number of iterations to avoid long runtimes in cases where the log-likelihood or parameters converge very slowly or never fully stabilize.

\[ t > \text{max\_iterations} \]

Where:

  • \(\text{max\_iterations}\) is a predefined limit (e.g., 100 or 500 iterations).
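
Putting the criteria together, a typical EM driver loop might look like the following sketch. It reuses the hypothetical e_step and m_step helpers sketched earlier, checks the change in log-likelihood against tol, and caps the number of iterations at max_iterations.

Code
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mus, Sigmas):
    # Mixture log-likelihood: sum_i log( sum_k pi_k N(x_i | mu_k, Sigma_k) )
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pi))
    ])
    return np.log(dens.sum(axis=1)).sum()

def fit_gmm(X, pi, mus, Sigmas, tol=1e-4, max_iterations=100):
    ll_old = -np.inf
    for t in range(max_iterations):
        gamma = e_step(X, pi, mus, Sigmas)        # E-step (sketched earlier)
        pi, mus, Sigmas = m_step(X, gamma)        # M-step (sketched earlier)
        ll = log_likelihood(X, pi, mus, Sigmas)
        if abs(ll - ll_old) < tol:                # log-likelihood convergence criterion
            break
        ll_old = ll
    return pi, mus, Sigmas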

Storage and Computational Costs

It is important to be aware of the computational costs of our algorithm.

Storage costs:

  • There are \(N\) \(d\)-dimensional points
  • There are \(K\) clusters
  • There are \(N\times K\) responsibilities \(\gamma_{nk}\)
  • There are \(K\) \(d\)-dimensional cluster centers \(\boldsymbol{\mu}_k\)
  • There are \(K\) \(d\times d\) covariance matrices \(\Sigma_k\)
  • There are \(K\) mixing weights \(\pi_k\)

Computational costs:

  • Computing each \(\gamma_{nk}\) requires a sum of \(K\) evaluations of the Gaussian PDF
  • Updating \(\boldsymbol{\mu}_k\) requires \(N\) vector summations
  • Updating \(\Sigma_k\) requires \(N\) outer products
  • Updating \(\pi_k\) requires a division (though we must first compute \(N_k = \sum_{n} \gamma_{nk}\))

k-means vs GMMs

Let’s pause for a minute and compare GMM/EM with \(k\)-means.

GMM/EM

  1. Initialize randomly or using some rule
  2. Compute the probability that each point belongs in each cluster
  3. Update the clusters (weights, means, and variances).
  4. Repeat 2-3 until convergence.

\(k\)-means

  1. Initialize randomly or using some rule
  2. Assign each point to a single cluster
  3. Update the clusters (means).
  4. Repeat 2-3 until convergence.

From a practical standpoint, the main difference is that in GMMs, data points do not belong to a single cluster, but have some probability of belonging to each cluster.

In other words, as stated previously, GMMs use soft assignment.

For that reason, GMMs are also sometimes called soft \(k\)-means.


However, there is also an important conceptual difference.

The GMM starts by making an explicit assumption about how the data were generated.

It says: “the data came from a collection of multivariate Gaussians.”

We made no such assumption when we came up with the \(k\)-means problem. In that case, we simply defined an objective function and declared that it was a good one.

Nonetheless, it appears that we were making a sort of Gaussian assumption when we formulated the \(k\)-means objective function. However, it wasn’t explicitly stated.

The point is that because the GMM makes its assumptions explicit, we can

  • examine them and think about whether they are valid, and
  • replace them with different assumptions if we wish.

For example, it is perfectly valid to replace the Gaussian assumption with some other probability distribution. As long as we can estimate the parameters of such distributions from data (e.g., have MLEs), we can use EM in that case as well.

Versatility of EM

A final note on EM in general: it is a versatile algorithm that can be used in many other settings. What is the main idea behind it?

Notice that the problem definition only required that we find the clusters, \(C_i\), meaning that we were to find the \((\mu_i, \Sigma_i)\).

However, the EM algorithm also sets out to find \(P(C_j|x_i) = P(z_i=j \vert x_i)\), that is, the probability that each point is a member of each cluster.

This is the true heart of what EM does.

By adding parameters to the problem, it actually finds a way to make the problem solvable.

These are the latent variables we introduced earlier; they do not show up in the final solution.

Examples

Example 1

Here is an example using GMM.

We’re going to create two clusters, one spherical, and one highly skewed.

Code
# Number of samples of larger component
n_samples = 1000

# C is a transformation that will make a heavily skewed 2-D Gaussian
C = np.array([[0.1, -0.1], [1.7, .4]])

print(f'The covariance matrix of our skewed cluster will be:\n {C.T@C}')
The covariance matrix of our skewed cluster will be:
 [[2.9  0.67]
 [0.67 0.17]]
Code
import warnings
warnings.filterwarnings('ignore')

rng = np.random.default_rng(0)

# now we construct a data matrix that has n_samples from the skewed distribution,
# and n_samples/2 from a symmetric distribution offset to position (-4, 2)
X = np.r_[(rng.standard_normal((n_samples, 2)) @ C),
          .7 * rng.standard_normal((n_samples//2, 2)) + np.array([-4, 2])]
Code
plt.scatter(X[:, 0], X[:, 1], s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.show()


Code
# Fit a mixture of Gaussians with EM using two components
import sklearn.mixture
gmm = sklearn.mixture.GaussianMixture(n_components=2, 
                                      covariance_type='full', 
                                      init_params = 'kmeans')
y_pred = gmm.fit_predict(X)
Code
colors = ['bg'[p] for p in y_pred]
plt.title('Clustering via GMM')
plt.axis('off')
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' weight: {gmm.weights_[clus]:0.3f}')
    print(f' mean: {gmm.means_[clus]}')
    print(f' cov: \n{gmm.covariances_[clus]}\n')
Cluster 0:
 weight: 0.667
 mean: [-0.04719455 -0.00741777]
 cov: 
[[2.78299944 0.64624931]
 [0.64624931 0.16589271]]

Cluster 1:
 weight: 0.333
 mean: [-4.03209555  1.96770685]
 cov: 
[[ 0.46440554 -0.01263282]
 [-0.01263282  0.48113597]]

Comparison with k-means

Code
import sklearn.cluster
kmeans = sklearn.cluster.KMeans(init = 'k-means++', n_clusters = 2, n_init = 100)
y_pred_kmeans = kmeans.fit_predict(X)
colors = ['bg'[p] for p in y_pred_kmeans]
plt.title('Clustering via $k$-means\n$k$-means centers: red, GMM centers: black')
plt.axis('off')
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 'ro')
plt.plot(gmm.means_[:,0], gmm.means_[:, 1], 'ko')
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' center: {kmeans.cluster_centers_[clus]}\n')
Cluster 0:
 center: [0.20678138 0.04903169]

Cluster 1:
 center: [-3.91417527  1.61676661]

Overlapping Clusters

Now, let’s construct overlapping clusters. What will happen?

Code
X = np.r_[(rng.standard_normal((n_samples, 2)) @ C),
          .7 * rng.standard_normal((n_samples//2, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='full')
y_pred_over = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred_over]
plt.title('GMM for overlapping clusters\nNote they have nearly the same center')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'ro')
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' weight: {gmm.weights_[clus]:0.3f}')
    print(f' mean: {gmm.means_[clus]}\n')
    # print(f' cov: \n{gmm.covariances_[clus]}\n')
Cluster 0:
 weight: 0.647
 mean: [0.10975427 0.0237165 ]

Cluster 1:
 weight: 0.353
 mean: [-0.02170599  0.0046345 ]

How many parameters are estimated?

Most of the parameters in the model are contained in the covariance matrices.

In the most general case, for \(K\) clusters of points in \(d\) dimensions, there are \(K\) covariance matrices each of size \(d \times d\).

So we need on the order of \(Kd^2\) parameters for the covariances alone (somewhat fewer if we account for symmetry).

It can happen that you may not have enough data to estimate so many parameters.

Also, it can happen that you believe that clusters should have some constraints on their shapes.

Here is where the GMM assumptions become really useful.
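
As a rough count (for, say, \(K=2\) clusters in \(d=2\) dimensions), here is how the covariance parameters scale under the covariance_type options demonstrated below; this is a back-of-the-envelope sketch counting all \(d \times d\) entries, so symmetry would shave the full and tied counts down a bit.

Code
# Rough covariance-parameter counts per covariance_type (means and weights not included)
K, d = 2, 2
counts = {
    'full':      K * d * d,   # one d-by-d covariance matrix per cluster
    'tied':      d * d,       # a single covariance matrix shared by all clusters
    'diag':      K * d,       # one diagonal (axis-aligned) covariance per cluster
    'spherical': K,           # a single variance per cluster
}
print(counts)                 # {'full': 8, 'tied': 4, 'diag': 4, 'spherical': 2}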

Clusters with Equal Variance

Let’s say you believe all the clusters should have the same shape, but the shape can be arbitrary.

Then you only need to estimate one covariance matrix - just \(d^2\) parameters.

This is specified by the GMM parameter covariance_type='tied'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='tied')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.title('Covariance type = tied')
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0],gmm.means_[:,1], 'ok')
plt.show()

Non-Skewed Clusters

Perhaps you believe in even more restricted shapes: all clusters should have their axes aligned with the coordinate axes.

That is, clusters are not skewed.

Then you only need to estimate the diagonals of the covariance matrices - just \(Kd\) parameters.

This is specified by the GMM parameter covariance_type='diag'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=4, covariance_type='diag')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'oc')
plt.show()

Round Clusters

Finally, if you believe that all clusters should be round, then you only need to estimate the \(K\) variances.

This is specified by the GMM parameter covariance_type='spherical'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='spherical')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'oc')
plt.show()
