Gaussian Mixture Models

From Hard to Soft Clustering

So far, we have seen how to cluster objects using \(k\)-means:

  1. Start with an initial set of cluster centers,
  2. Assign each object to its closest cluster center,
  3. Recompute the centers of the new clusters,
  4. Repeat 2 \(\rightarrow\) 3 until convergence.
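
As a quick refresher, here is a minimal NumPy sketch of these four steps (a simplified version of Lloyd's algorithm; the examples later in this lecture use scikit-learn instead):

Code
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. start with an initial set of cluster centers (k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each object to its closest center (hard assignment)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute the centers of the new clusters
        # (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. repeat until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels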

In \(k\)-means clustering every object is assigned to a single cluster.

This is called hard assignment.

However, there are cases where we either cannot use hard assignments or do not want to.

In particular, we might have reason to believe that the best description of the data is a set of overlapping clusters.

Overlapping Clusters

Rich or Poor Example

Let’s consider as an example a society that consists of just two kinds of individuals: poor or rich.

How might we model society as a mixture of poor and rich, when viewed in terms of the single feature age?

Rich or Poor Sampling

Let’s sample 20,000 rich individuals and 20,000 poor individuals. From this sample we get the following histograms.

Code
# original inspiration for this example from
# https://www.cs.cmu.edu/~./awm/tutorials/gmm14.pdf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

np.random.seed(4)
# ages: poor group with mean 37, sd 14; rich group with mean 45, sd 11
df = pd.DataFrame(multivariate_normal.rvs(mean = np.array([37, 45]), cov = np.diag([196, 121]), size = 20000),
                   columns = ['poor', 'rich'])
df.hist(bins = range(80), sharex = True)
plt.show()

We find that ages of the poor set have mean \(\mu=37\) with standard deviation \(\sigma=14\), while the ages of the rich set have mean \(\mu=45\) with standard deviation \(\sigma=11\).

Rich or Poor by Age

Code
from scipy.stats import norm
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'poor')
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'rich')
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()

Soft Clustering

Viewed along the age dimension we observe that there are two overlapping clusters.

Furthermore, given some particular individual at a given age, say 25, we cannot say for sure which cluster they belong to.

Rather, we will use probability to quantify our uncertainty about the cluster that any single individual belongs to.

Thus, we could say that a given individual (“John Smith”, age 25) belongs to the rich cluster with some probability and the poor cluster with a different probability.

Naturally we expect the probabilities for John Smith to sum up to 1.

This is called soft assignment, and a clustering using this principle is called soft clustering.

Conditional Probability

More formally, we say that an object can belong to each particular cluster with some probability, such that the sum of the probabilities adds up to 1 for each object.

For example, assuming that we have two clusters \(C_1\) and \(C_2\), we can have that an object \(x_1\) belongs to \(C_1\) with probability \(0.3\) and to \(C_2\) with probability \(0.7\).

Note that the distribution over \(C_1\) and \(C_2\) only refers to object \(x_1\).

Thus, it is a conditional probability:

\[ \begin{align*} P(C_1 \,|\, x_1) &= 0.3, \\ P(C_2 \,|\, x_1) &= 0.7. \end{align*} \]

And to return to our previous example

\[ P(\text{rich}\,|\,\text{age 25}) + P(\text{poor}\,|\,\text{age 25}) = 1. \]

Lecture Overview

Our goal with Gaussian mixture models (GMMs) is to show how to compute the probabilities that a data point belongs to a particular cluster.

The main ideas behind GMMs are

  • Assume each cluster is normally distributed
  • Use the Expectation Maximization (EM) algorithm to iteratively determine the parameters of the Gaussian clusters
  • Determine the probability that a data point belongs to each particular cluster

We will see that GMMs are better suited for clustering non-spherical distributions of points.

To help us in understanding GMMs we will first review maximum likelihood estimation.

Maximum Likelihood Estimation (MLE)

Motivation

Probability distributions are specified by their parameters.

The Gaussian (Normal) distribution is determined by the parameters \(\mu\) and \(\sigma^{2}\), i.e.,

\[ f(x\vert\mu, \sigma^{2}) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^2}}. \]
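
As a quick check of the formula, we can evaluate the density by hand and compare with scipy.stats.norm.pdf (parameter values chosen arbitrarily):

Code
import numpy as np
from scipy.stats import norm

mu, sigma = 37, 14   # example parameters (the "poor" cluster from above)
x = 25.0

# evaluate the Gaussian density directly from the formula ...
f_manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
# ... and with scipy (scale is the standard deviation)
f_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(f_manual, f_scipy)   # the two values agree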

Parameter estimation is the process of determining the parameters of a distribution based on a sample of observed data.

Maximum likelihood estimation is a method to estimate the parameters of an assumed probability distribution given a sample of observed data.

Likelihood Function

The likelihood function \(L(\boldsymbol{\theta}, x)\) represents the probability of observing the given data \(x\) as a function of the parameters \(\boldsymbol{\theta}\) of the distribution.

The primary purpose of the likelihood function is to estimate the parameters that make the observed data \(x\) most probable.

The likelihood function for a set of samples \(x_{n}~\text{for}~n=1, \ldots, N\) drawn from a Gaussian distribution is

\[ L(\mu, \sigma^{2}, x_1, \ldots, x_n) \ = \prod_{n=1}^{N}\mathcal{N}(x_n\vert \mu, \sigma^{2}) \ = \prod_{n=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}. \]


For a particular set of parameters \(\mu, \sigma^{2}\)

  • large values of \(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\) indicate the observed data is very probable (high likelihood)
  • small values of \(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\) indicate the observed data is very improbable (low likelihood)

The parameters that maximize the likelihood function are called the maximum likelihood estimates.

Log-likelihood

A common manipulation to obtain a more useful form of the likelihood function is to take its natural logarithm.

Advantages of the log-likelihood:

  • The log function is monotonically increasing, so maximizing the log-likelihood yields the same parameter estimates as maximizing the likelihood
  • The product of probabilities becomes a sum of logarithms, which is more numerically stable
  • The log-likelihood is easier to differentiate and manipulate analytically

Applying the Log

\[ \begin{align*} \log{\left(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\right)} \ & = \log{\left(\prod_{n=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)} \\ & = \sum_{n=1}^{N}\log{\left(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)} \\ & = \sum_{n=1}^{N}\left(\log{\left(\frac{1}{\sigma\sqrt{2\pi}}\right)} + \log{\left(e^{-\frac{(x_n-\mu)^{2}}{2\sigma^2}}\right)}\right) \\ & = -\frac{N}{2}\log{2\pi} - N\log{\sigma} - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n -\mu)^{2}. \end{align*} \]

Using the log-likelihood we will be able to derive formulas for the maximum likelihood estimates.
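
To check the algebra, the closed-form expression above can be compared against a direct numerical evaluation of the log-likelihood (a small sketch using an arbitrary sample):

Code
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=37, scale=14, size=1000)   # arbitrary sample
mu, sigma = 37.0, 14.0
N = len(x)

# direct evaluation: sum of log densities
ll_direct = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# closed-form expression derived above
ll_formula = (-N / 2 * np.log(2 * np.pi)
              - N * np.log(sigma)
              - np.sum((x - mu)**2) / (2 * sigma**2))

print(np.isclose(ll_direct, ll_formula))   # True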

Maximizing the log-likelihood

How do we maximize (optimize) a function of parameters?

To find the optimal parameters of a function, we compute partial derivatives of the function and set them equal to zero. The solution to these equations gives us a local optimal value for the parameters.

For the case of the Gaussian we compute

\[ \begin{align*} \nabla_{\mu} \log L(\mu, \sigma^{2}, x_1, \ldots, x_n) &= 0, \\ \nabla_{\sigma} \log L(\mu, \sigma^{2}, x_1, \ldots, x_n) &= 0. \end{align*} \]

Gaussian MLEs

The maximum likelihood estimates for a Gaussian distribution are given by

\[ \begin{align*} \bar{\mu} &= \frac{1}{N}\sum_{n=1}^{N} x_{n}, \\ \bar{\sigma}^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \bar{\mu})^{2}. \end{align*} \]
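
A quick numerical illustration of these estimators (note that the maximum likelihood variance divides by \(N\), not \(N-1\)):

Code
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=45, scale=11, size=20000)   # sample from N(45, 11^2)

mu_hat = np.mean(x)                  # MLE of the mean
var_hat = np.mean((x - mu_hat)**2)   # MLE of the variance (divides by N)

print(mu_hat, np.sqrt(var_hat))      # close to 45 and 11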

Summary of MLE

Given samples \(x_{1}, \ldots, x_n\) drawn from a Gaussian distribution, we compute the maximum likelihood estimates by maximizing the log likelihood function

\[ \log{\left(L(\mu, \sigma^{2}, x_1, \ldots, x_n)\right)} = -\frac{N}{2}\log{2\pi} - N\log{\sigma} - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n -\mu)^{2}. \]

This gives us the maximum likelihood estimates

\[ \begin{align*} \bar{\mu} &= \frac{1}{N}\sum_{n=1}^{N} x_{n}, \\ \bar{\sigma}^2 &= \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \bar{\mu})^{2}. \end{align*} \]

We use this information to help our understanding of Gaussian mixture models.

Gaussian Mixture Models

Univariate Gaussians

In GMMs we assume that each of our clusters follows a Gaussian (normal) distribution with their own parameters.

Code
import numpy as np
import matplotlib.pyplot as plt

# Define the parameters for the 4 Gaussians
params = [
    {"mean": 0, "variance": 1, "color": "#377eb8"},  # Blue
    {"mean": 2, "variance": 0.5, "color": "#4daf4a"},  # Green
    {"mean": -2, "variance": 1.5, "color": "#e41a1c"},  # Red
    {"mean": 1, "variance": 2, "color": "#984ea3"}  # Purple
]

# Generate x values
x = np.linspace(-10, 10, 1000)

# Plot each Gaussian
for param in params:
    mean = param["mean"]
    variance = param["variance"]
    color = param["color"]
    
    # Calculate the Gaussian
    y = (1 / np.sqrt(2 * np.pi * variance)) * np.exp(- (x - mean)**2 / (2 * variance))
    
    # Plot the Gaussian
    plt.plot(x, y, color=color, label=f"$\\mu$: {mean}, $\\sigma^2$: {variance}")

# Add legend
plt.legend()

# Add title and labels
plt.title("1D Gaussians with Different Means and Variances")
plt.xlabel("x")
plt.ylabel("Probability Density")

# Show the plot
plt.show()

This is a similar situation to our previous example where we considered labeling a person as poor or rich based on the single feature age.

This could be a hypothetical situation with \(K=4\) normally distributed clusters.

Multivariate Gaussians

Given data points \(\mathbf{x}\in\mathbb{R}^{d}\), i.e., \(d\)-dimensional points, then we use the multivariate Gaussian to describe the clusters. The formula for the multivariate Gaussian is

\[ f(\mathbf{x}\vert \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\vert \Sigma \vert^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \]

where \(\Sigma\in\mathbb{R}^{d\times d}\) is the covariance matrix, \(\vert \Sigma\vert\) denotes the determinant of \(\Sigma\), and \(\boldsymbol{\mu}\) is the \(d\)-dimensional vector of expected values.
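
To make the formula concrete, here is a small check that evaluates the density by hand and compares it with scipy.stats.multivariate_normal.pdf (mean and covariance chosen arbitrarily):

Code
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])          # example mean vector
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])     # example covariance matrix
x = np.array([0.5, 0.8])
d = len(mu)

# evaluate the density directly from the formula ...
diff = x - mu
f_manual = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
            / ((2 * np.pi)**(d / 2) * np.sqrt(np.linalg.det(Sigma))))
# ... and with scipy
f_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

print(f_manual, f_scipy)           # the two values agree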

Notation

The notation \(x \sim \mathcal{N}(\mu, \sigma^2)\) is used to denote a univariate random variable that is normally distributed with expected value \(\mu\) and variance \(\sigma^{2}.\)

The notation \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\) is used to denote a multivariate random variable that is normally distributed with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\).

Multivariate Example: Cars

To illustrate a particular model, let us consider the properties of cars produced in the US, Europe, and Asia.

For example, let’s say we are looking to classify cars based on their model year and miles per gallon (mpg).

It seems that the data can be described (roughly) as a mixture of three multivariate Gaussian distributions.

Intuition

Given \(K\) clusters, the goal with a GMM is to determine the probability for whether a point \(\mathbf{x}\in\mathbb{R}^{d}\) belongs to a particular cluster.

We assume that each cluster is distributed normally, according to some (unknown) parameters \(\boldsymbol{\mu}_i, \Sigma_i\). We also assume that the probability that a point belongs to a particular cluster is given by \(w_i\).

A Gaussian mixture model is defined by these parameters \(w_i, \boldsymbol{\mu}_i, \Sigma_i\) for \(i=1, \ldots, K\).

We can use these parameters to compute the probability that a point \(\mathbf{x}\) belongs to a particular cluster \(C_k\).

Similar to MLE, we must compute the parameters \(w_i, \boldsymbol{\mu}_i, \Sigma_i\).

Learning the Parameters of a GMM

Given some data how do we estimate the

  • cluster probability \(w_i\),
  • cluster mean \(\boldsymbol{\mu}_i\), and
  • cluster covariance \(\Sigma_i\)

for each \(i=1, \ldots, K\)?

We will formulate a likelihood function for these parameters \(\boldsymbol{\theta}_i = (w_i, \boldsymbol{\mu}_i, \Sigma_i)\) and then optimize this function.

Latent Variables

We assume that each data point \(x\) is produced by a latent variable \(z\).

This variable is called latent because it is never actually observed. It is used to help indicate cluster membership and is helpful for the derivation of the likelihood function.

We can view \(z\) as a one-hot encoding that identifies membership in one of the \(K\) clusters.

We impose a distribution over \(z\) representing a soft assignment

\[ p(z) = (w_1, \cdots, w_K)^{T}, \quad \text{with} \quad 0\leq w_i \leq 1 \quad \text{and} \quad \sum_{i=1}^{K}w_i = 1. \]

PDF of a GMM

With our distribution weights \(w_i\) for the \(K\) clusters, we can compute the probability density for a GMM at any point \(\mathbf{x}\) via the formula

\[ p(\mathbf{x}) = \sum_{i=1}^{K} w_i \cdot f(\mathbf{x}\vert \mu_i, \Sigma_i). \]
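
As a small sketch, the mixture density is just a weighted sum of component densities (the two components below are hypothetical):

Code
import numpy as np
from scipy.stats import multivariate_normal

# a hypothetical 2-component mixture in 2 dimensions
weights = [0.3, 0.7]                                      # w_i, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]      # mu_i
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # Sigma_i

def gmm_pdf(x):
    """p(x) = sum_i w_i * f(x | mu_i, Sigma_i)"""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

print(gmm_pdf(np.array([1.0, 1.0])))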

If this is feeling a bit abstract, let’s see how this applies to the initial example where we were clustering a person as rich or poor based on their age.

Rich vs Poor Example

We have

  • our single data point \(x=25\), and
  • we have 2 Gaussians representing our clusters,
    • \(C_1\) (rich) with parameters \(\mu_1 = 45, \sigma_1=11\), and
    • \(C_2\) (poor) with parameters \(\mu_2 = 37, \sigma_2=14\),

our mixture weights \(w_i\) are the prior probabilities that we classify someone as either rich or poor, i.e., \[ w_1 = P(z = 1) = P(\text{rich}) \] and \[ w_2 = P(z = 2) = P(\text{poor}). \]

Our interest in this problem is computing the posterior probability, which is,

\[ P(C_1 | x) = P(z=1 \vert x) = P(\text{rich} \vert \text{age 25}). \]

How do we determine these posterior probabilities \(P(C_i \vert x) = P(z=i \vert x)\)?

The answer is Bayes’ Rule.

Rich vs Poor Example Part 2

If we know the parameters of the Gaussians, we can determine the value of \(P(x | C_1)\) when \(x=25\). This is shown with the red dot.

Code
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'poor')
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'rich')
plt.plot(25, norm.pdf(25, loc = 45, scale = 11), 'ro')
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()


Bayes’ Rule then allows us to compute

\[ P(C_1 \vert x)=\frac{P(x\vert C_1)}{P(x)}P(C_1). \]

We can always use the law of total probability to compute

\[ \begin{align*} P(x) &= P(x \vert C_1)P(C_1) + P(x\vert C_2)P(C_2), \\ &= P(z=1) P(x \vert z=1) + P(z=2) P(x\vert z=2), \\ &= w_1 P(x \vert z=1) + w_2 P(x\vert z=2), \\ &= \sum_{i=1}^{2} w_i \cdot f(\mathbf{x}\vert \mu_i, \sigma_i). \end{align*} \]

The final formula is illustrated in the following figure.


Code
plt.figure()
x = np.linspace(norm.ppf(0.001, loc = 37, scale = 14), norm.ppf(0.999, loc = 37, scale = 14), 100)
plt.plot(x, norm.pdf(x, loc = 37, scale = 14),'b-', lw = 5, alpha = 0.6, label = 'poor')
plt.plot(25, norm.pdf(25, loc = 37, scale = 14), 'ko', markersize = 8)
x = np.linspace(norm.ppf(0.001, loc = 45, scale = 11), norm.ppf(0.999, loc = 45, scale = 11), 100)
plt.plot(x, norm.pdf(x, loc = 45, scale = 11),'g-', lw = 5, alpha = 0.6, label = 'rich')
plt.plot(25, norm.pdf(25, loc = 45, scale = 11), 'ro', markersize = 8)
plt.xlim([15, 70])
plt.xlabel('Age', size=14)
plt.legend(loc = 'best')
plt.title('Age Distributions')
plt.ylabel(r'$p(x)$', size=14)
plt.show()

\[ P(\text{rich}\,|\,\text{age 25}) = \frac{\text{red}}{\text{red} \cdot P(\text{rich}) + \text{black} \cdot P(\text{poor})} \cdot P(\text{rich}).\]
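
Numerically, using the two fitted Gaussians and equal priors \(P(\text{rich}) = P(\text{poor}) = 0.5\) (consistent with sampling 20,000 individuals from each group), the calculation looks like this sketch:

Code
from scipy.stats import norm

# class-conditional densities at age 25 (the red and black dots above)
p_x_rich = norm.pdf(25, loc=45, scale=11)   # "red"
p_x_poor = norm.pdf(25, loc=37, scale=14)   # "black"

# equal priors, since the two groups were sampled in equal numbers
p_rich, p_poor = 0.5, 0.5

# Bayes' Rule
p_rich_given_x = p_x_rich * p_rich / (p_x_rich * p_rich + p_x_poor * p_poor)
print(p_rich_given_x)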

GMM Likelihood

Let’s gather for all \(i=1, \ldots, K\), the coefficients \(w_i,\boldsymbol{\mu}_i, \Sigma_i\) into a vector \(\boldsymbol{\theta}\).

The likelihood function for a Gaussian mixture model with \(N\) data points is

\[ L(\boldsymbol{\theta}, x_1, \ldots, x_n) = \prod_{n=1}^{N}\sum_{k=1}^{K}w_k \frac{1}{(2\pi)^{d/2}\vert \Sigma_k \vert^{1/2}}e^{-\frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\Sigma_{k}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_{k})} \]

The log-likelihood is

\[ \log{\left(L(\boldsymbol{\theta}, x_1, \ldots, x_n)\right)} = \sum_{n=1}^{N}\log{\left(\sum_{k=1}^{K}w_k \frac{1}{(2\pi)^{d/2}\vert \Sigma_k \vert^{1/2}}e^{-\frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\Sigma_{k}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_{k})}\right)}. \]

Unlike the log of a product, the log of a sum does not simplify nicely. As a result, the partial derivatives with respect to \(\boldsymbol{\mu}_k\) depend on the covariances and mixture weights (and similarly for the partial derivatives with respect to \(\Sigma_k\) and \(w_k\)), so setting them to zero does not give closed-form solutions.
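
Although the maximization has no closed form, the log-likelihood itself is easy to evaluate numerically for any candidate parameters; here is a small sketch (with hypothetical data and parameters) that uses scipy.special.logsumexp for numerical stability:

Code
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, covs):
    """Evaluate sum_n log( sum_k w_k f(x_n | mu_k, Sigma_k) ) stably."""
    # log_probs[n, k] = log w_k + log f(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    ])
    return logsumexp(log_probs, axis=1).sum()

# hypothetical data and parameters
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
print(gmm_log_likelihood(X,
                         weights=[0.5, 0.5],
                         means=[np.zeros(2), np.ones(2)],
                         covs=[np.eye(2), np.eye(2)]))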

To solve this problem we turn to Expectation Maximization.

Expectation Maximization

This is another famous algorithm, in the same “super-algorithm” league as \(k\)-means.

EM is formulated using a probabilistic model for data. It can solve a problem like

Given a set of data points and a parameter \(K\), find the \((w_k, \boldsymbol{\mu}_k, \Sigma_k)\) for \(k = 1,\dots,K\) that maximize the likelihood of the data, assuming a GMM with those parameters.

It can also solve lots of other problems involving maximizing the likelihood of data under a different model.

Similar to \(k\)-means, this problem is NP-hard.

Furthermore, EM only guarantees that we will find a local optimum of the objective function.

Expectation Maximization for GMM – The Algorithm

Here is the EM algorithm.

Step 1 Initialization

Initialize the parameters \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) for \(k=1, \dots, K\). The final result will be sensitive to this choice, so a good (and fast) initialization procedure is \(k\)-means.

Step 2 Expectation

Use the current values for \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) and for each of the \(N\) data points \(x_n\), compute the posterior probabilities

\[ r_{nk} = \frac{w_k f(\mathbf{x}_n \vert \boldsymbol{\mu}_{k}, \Sigma_k)}{\sum_{i=1}^{K}w_i f(\mathbf{x}_n \vert \boldsymbol{\mu}_{i}, \Sigma_i)}, \]

where \(f(\mathbf{x}_n \vert \boldsymbol{\mu}_{k}, \Sigma_k)\) is the multivariate Gaussian.

Step 3 Maximization

Using the values \(r_{nk}\) for \(n=1,\ldots, N\) and \(k=1, \ldots, K\), first compute \(N_{k} = \sum_{n=1}^{N} r_{nk}\). Then compute the updated \(w_k, \boldsymbol{\mu}_k, \Sigma_k\) according to the formulas

\[ \boldsymbol{\mu}_{k}= \frac{1}{N_k}\sum_{n=1}^{N} r_{nk} \mathbf{x}_n, \quad \Sigma_{k} = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n -\boldsymbol{\mu}_k)^{T}, \quad w_k = \frac{N_k}{N}. \]

Step 4 Stop if convergence criterion is satisfied. Otherwise repeat Steps 2 and 3.
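
Below is a minimal NumPy/SciPy sketch of these four steps. It assumes a simple random initialization (random data points as means) rather than the k-means initialization recommended above, and it adds a small ridge to each covariance for numerical stability; scikit-learn's GaussianMixture, used later in this lecture, is the practical implementation.

Code
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-4, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)

    # Step 1: initialization (random data points as means)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = [np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)]
    w = np.full(K, 1.0 / K)
    old_ll = -np.inf

    for _ in range(n_iter):
        # Step 2: Expectation -- responsibilities r_nk
        dens = np.column_stack([
            w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)

        # Step 3: Maximization -- update w_k, mu_k, Sigma_k
        Nk = r.sum(axis=0)                       # N_k
        w = Nk / N                               # w_k
        mu = (r.T @ X) / Nk[:, None]             # mu_k
        Sigma = []
        for k in range(K):
            diff = X - mu[k]
            Sigma.append((r[:, k, None] * diff).T @ diff / Nk[k]
                         + 1e-6 * np.eye(d))     # small ridge for stability

        # Step 4: stop when the log-likelihood change is below tol
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll

    return w, mu, Sigma, r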

Convergence criteria

The convergence criteria for the Expectation-Maximization (EM) algorithm assess the change across iterations of either the

  • likelihood function, or
  • model parameters

usually combined with a limit on the maximum number of iterations.

Here are the common convergence criteria used:

Log-Likelihood Convergence

\[ |\mathcal{L}^{(t)} - \mathcal{L}^{(t-1)}| < \text{tol} \]

Where:

  • \(\mathcal{L}^{(t)}\) is the log-likelihood at iteration \(t\),
  • \(\text{tol}\) is a small positive number (e.g., \(10^{-4}\)).

Parameter Convergence

\[ ||\theta^{(t)} - \theta^{(t-1)}||_2 < \text{tol} \]

Where:

  • \(\theta^{(t)}\) represents the model parameters (means, covariances, and weights) at iteration \(t\),
  • \(||\cdot||_2\) is the Euclidean (L2) norm,
  • \(\text{tol}\) is a small positive number.

Maximum Number of Iterations

The EM algorithm is typically capped at a maximum number of iterations to avoid long runtimes in cases where the log-likelihood or parameters converge very slowly or never fully stabilize.

\[ t > \text{max\_iterations} \]

Where:

  • \(\text{max\_iterations}\) is a predefined limit (e.g., 100 or 500 iterations).
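
A sketch of how these checks might be combined into a single stopping rule (a hypothetical helper; in scikit-learn's GaussianMixture the analogous controls are the tol and max_iter parameters):

Code
import numpy as np

def should_stop(ll, prev_ll, theta, prev_theta, t,
                tol=1e-4, max_iterations=100):
    """Return True if any convergence criterion above is satisfied."""
    if t > max_iterations:                          # iteration cap
        return True
    if abs(ll - prev_ll) < tol:                     # log-likelihood change
        return True
    if np.linalg.norm(theta - prev_theta) < tol:    # parameter change (L2 norm)
        return True
    return False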

Storage and Computational Costs

It is important to be aware of the computational costs of our algorithm.

Storage costs:

  • There are \(N\) \(d\)-dimensional points
  • There are \(K\) clusters
  • There are \(N\times K\) coefficients \(r_{nk}\)
  • There are \(K\) \(d\)-dimensional cluster centers \(\boldsymbol{\mu}_k\)
  • There are \(K\) \(d\times d\) covariance matrices \(\Sigma_k\)
  • There are \(K\) weights \(w_k\)

Computational costs:

  • Computing each \(r_{nk}\) requires a sum of \(K\) evaluations of the Gaussian PDF.
  • Updating \(\boldsymbol{\mu}_k\) requires \(N\) vector summations
  • Updating \(\Sigma_k\) requires \(N\) outer products
  • Updating \(w_k\) requires a division (though we must compute \(N_k\))

k-means vs GMMs

Let’s pause for a minute and compare GMM/EM with \(k\)-means.

GMM/EM

  1. Initialize randomly or using some rule
  2. Compute the probability that each point belongs in each cluster
  3. Update the clusters (weights, means, and variances).
  4. Repeat 2-3 until convergence.

\(k\)-means

  1. Initialize randomly or using some rule
  2. Assign each point to a single cluster
  3. Update the clusters (means).
  4. Repeat 2-3 until convergence.

From a practical standpoint, the main difference is that in GMMs, data points do not belong to a single cluster, but have some probability of belonging to each cluster.

In other words, as stated previously, GMMs use soft assignment.

For that reason, GMMs are also sometimes called soft \(k\)-means.


However, there is also an important conceptual difference.

The GMM starts by making an explicit assumption about how the data were generated.

It says: “the data came from a collection of multivariate Gaussians.”

We made no such assumption when we came up with the \(k\)-means problem. In that case, we simply defined an objective function and declared that it was a good one.

Nonetheless, in formulating the \(k\)-means objective function we were implicitly making a sort of Gaussian assumption (spherical clusters with equal variance); it just wasn't stated explicitly.

The point is that because the GMM makes its assumptions explicit, we can

  • examine them and think about whether they are valid, and
  • replace them with different assumptions if we wish.

For example, it is perfectly valid to replace the Gaussian assumption with some other probability distribution. As long as we can estimate the parameters of such distributions from data (e.g., have MLEs), we can use EM in that case as well.

Versatility of EM

A final statement about EM generally. EM is a versatile algorithm that can be used in many other settings. What is the main idea behind it?

Notice that the problem definition only required that we find the clusters, \(C_i\), meaning that we were to find the \((\mu_i, \Sigma_i)\).

However, the EM algorithm posited that we should find as well the \(P(C_j|x_i) = P(z=j | x_i)\), that is, the probability that each point is a member of each cluster.

This is the true heart of what EM does.

By adding variables to the problem, it actually finds a way to make the problem solvable.

These are the latent variables we introduced earlier; they do not appear in the final solution.

Example

Here is an example using GMM.

We’re going to create two clusters, one spherical, and one highly skewed.

Code
# Number of samples of larger component
n_samples = 1000

# C is a transformation that will make a heavily skewed 2-D Gaussian
C = np.array([[0.1, -0.1], [1.7, .4]])

print(f'The covariance matrix of our skewed cluster will be:\n {C.T@C}')
The covariance matrix of our skewed cluster will be:
 [[2.9  0.67]
 [0.67 0.17]]
Code
rng = np.random.default_rng(0)

# now we construct a data matrix that has n_samples from the skewed distribution,
# and n_samples/2 from a symmetric distribution offset to position (-4, 2)
X = np.r_[(rng.standard_normal((n_samples, 2)) @ C),
          .7 * rng.standard_normal((n_samples//2, 2)) + np.array([-4, 2])]
Code
plt.scatter(X[:, 0], X[:, 1], s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.show()


Code
# Fit a mixture of Gaussians with EM using two components
import sklearn.mixture
gmm = sklearn.mixture.GaussianMixture(n_components=2, 
                                      covariance_type='full', 
                                      init_params = 'kmeans')
y_pred = gmm.fit_predict(X)
Code
colors = ['bg'[p] for p in y_pred]
plt.title('Clustering via GMM')
plt.axis('off')
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' weight: {gmm.weights_[clus]:0.3f}')
    print(f' mean: {gmm.means_[clus]}')
    print(f' cov: \n{gmm.covariances_[clus]}\n')
Cluster 0:
 weight: 0.667
 mean: [-0.04719455 -0.00741777]
 cov: 
[[2.78299944 0.64624931]
 [0.64624931 0.16589271]]

Cluster 1:
 weight: 0.333
 mean: [-4.03209555  1.96770685]
 cov: 
[[ 0.46440554 -0.01263282]
 [-0.01263282  0.48113597]]

Comparison with k-means

Code
import sklearn.cluster
kmeans = sklearn.cluster.KMeans(init = 'k-means++', n_clusters = 2, n_init = 100)
y_pred_kmeans = kmeans.fit_predict(X)
colors = ['bg'[p] for p in y_pred_kmeans]
plt.title('Clustering via $k$-means\n$k$-means centers: red, GMM centers: black')
plt.axis('off')
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 'ro')
plt.plot(gmm.means_[:,0], gmm.means_[:, 1], 'ko')
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' center: {kmeans.cluster_centers_[clus]}\n')
Cluster 0:
 center: [0.20678138 0.04903169]

Cluster 1:
 center: [-3.91417527  1.61676661]

Overlapping Clusters

Now, let’s construct overlapping clusters. What will happen?

Code
X = np.r_[(rng.standard_normal((n_samples, 2)) @ C),
          .7 * rng.standard_normal((n_samples//2, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='full')
y_pred_over = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred_over]
plt.title('GMM for overlapping clusters\nNote they have nearly the same center')
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'ro')
plt.show()

Code
for clus in range(2):
    print(f'Cluster {clus}:')
    print(f' weight: {gmm.weights_[clus]:0.3f}')
    print(f' mean: {gmm.means_[clus]}\n')
    # print(f' cov: \n{gmm.covariances_[clus]}\n')
Cluster 0:
 weight: 0.647
 mean: [0.10975427 0.0237165 ]

Cluster 1:
 weight: 0.353
 mean: [-0.02170599  0.0046345 ]

How many parameters are estimated?

Most of the parameters in the model are contained in the covariance matrices.

In the most general case, for \(K\) clusters of points in \(d\) dimensions, there are \(K\) covariance matrices each of size \(d \times d\).

So we need roughly \(Kd^2\) parameters for the covariances alone (exactly \(Kd(d+1)/2\), since each matrix is symmetric), in addition to the \(Kd\) means and \(K\) weights.

You may not have enough data to estimate so many parameters.

You may also believe that the clusters should satisfy certain constraints on their shapes.

Here is where the GMM assumptions become really useful.

Clusters with Equal Variance

Let’s say you believe all the clusters should have the same shape, but the shape can be arbitrary.

Then you only need to estimate one covariance matrix - just \(d^2\) parameters.

This is specified by the GMM parameter covariance_type='tied'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='tied')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10, alpha=0.8)
plt.title('Covariance type = tied')
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0],gmm.means_[:,1], 'ok')
plt.show()

Non-Skewed Clusters

Perhaps you believe in even more restricted shapes: all clusters should have their axes aligned with the coordinate axes.

That is, clusters are not skewed.

Then you only need to estimate the diagonals of the covariance matrices - just \(Kd\) parameters.

This is specified by the GMM parameter covariance_type='diag'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=4, covariance_type='diag')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'oc')
plt.show()

Round Clusters

Finally, if you believe that all clusters should be round, then you only need to estimate the \(K\) variances.

This is specified by the GMM parameter covariance_type='spherical'.


Code
X = np.r_[np.dot(rng.standard_normal((n_samples, 2)), C),
          0.7 * rng.standard_normal((n_samples, 2))]
gmm = sklearn.mixture.GaussianMixture(n_components=2, covariance_type='spherical')
y_pred = gmm.fit_predict(X)
Code
colors = ['bgrky'[p] for p in y_pred]
plt.scatter(X[:, 0], X[:, 1], color = colors, s = 10, alpha = 0.8)
plt.axis('equal')
plt.axis('off')
plt.plot(gmm.means_[:,0], gmm.means_[:,1], 'oc')
plt.show()
