GMM Appendix D: Deriving the E-Step: From Indicators to Responsibilities

The Complete Data Log-Likelihood¹

Recall that if we knew the cluster assignments \(z_1, z_2, \ldots, z_n\), the complete data log-likelihood would be:

\[ \log p(X, Z \mid \theta) = \sum_{i=1}^{n} \log p(x_i, z_i \mid \theta) \]

For a GMM, each data point \(x_i\) is generated by:

  1. Choosing cluster \(z_i = k\) with probability \(\pi_k\)
  2. Drawing \(x_i\) from \(\mathcal{N}(\mu_k, \Sigma_k)\)

So the joint probability is:

\[ p(x_i, z_i \mid \theta) = p(z_i \mid \theta) \cdot p(x_i \mid z_i, \theta) \]
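
As a quick illustration of this generative process (not part of the derivation), here is a minimal sketch in code. The parameter values are the same illustrative ones used in the numerical example later in this appendix.

Code
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (same values as the numerical example below)
pi = np.array([0.3, 0.5, 0.2])             # mixing weights
mu = np.array([[0, 0], [3, 3], [0, 5]])    # component means
Sigma = [np.eye(2), np.eye(2), np.eye(2)]  # component covariances

# Generative process for one data point:
# 1. choose a component z_i ~ Categorical(pi)
z_i = rng.choice(len(pi), p=pi)
# 2. draw x_i ~ N(mu_{z_i}, Sigma_{z_i})
x_i = rng.multivariate_normal(mu[z_i], Sigma[z_i])

print(f"z_i = {z_i}, x_i = {x_i}")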


Using Indicator Variables

We can encode the assignment \(z_i = k\) with an indicator variable. Let:

\[ \mathbb{1}(z_i = k) = \begin{cases} 1 & \text{if } z_i = k \\ 0 & \text{otherwise} \end{cases} \]

Then we can express the complete data log-likelihood as:

\[ \log p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(z_i = k) \log [\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)] \]

Why? For each point \(i\), only one value of \(k\) has \(\mathbb{1}(z_i = k) = 1\), so we pick out exactly the right term.

Expanding:

\[ \log p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(z_i = k) [\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)] \]
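
If the assignments were known, this quantity is straightforward to evaluate. The sketch below does so for a handful of hypothetical points and assignments (the arrays z and X are made up purely for illustration), reusing numpy and the parameters pi, mu, Sigma from the sketch above.

Code
from scipy.stats import multivariate_normal

# Hypothetical known assignments and observations (illustration only)
z = np.array([0, 1, 1, 2])
X = np.array([[0.5, -0.2], [2.8, 3.1], [3.2, 2.9], [0.1, 5.3]])

K = len(pi)
indicators = np.eye(K)[z]   # one-hot rows: 1(z_i = k), shape (n, K)

# log[pi_k * N(x_i | mu_k, Sigma_k)] for every (i, k) pair
log_terms = np.array([
    [np.log(pi[k]) + multivariate_normal.logpdf(X[i], mean=mu[k], cov=Sigma[k])
     for k in range(K)]
    for i in range(len(X))
])

# The indicators pick out exactly the assigned component for each point
complete_ll = np.sum(indicators * log_terms)
print(f"Complete-data log-likelihood: {complete_ll:.4f}")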


The E-Step: Taking Expectations

The problem is that we don’t observe \(Z\), so we can’t compute the complete data log-likelihood directly. The EM algorithm solves this by taking the expected value of the complete data log-likelihood with respect to the distribution of \(Z\) given \(X\) and the current parameters \(\theta^{(t)}\).

The Q-Function

\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [\log p(X, Z \mid \theta)] \]

Substituting our expression:

\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(z_i = k) [\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)] \right] \]

Since expectation is linear, and the bracketed term \([\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)]\) does not depend on \(Z\), we can move the expectation inside the sums so that it acts only on the indicator:

\[ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{E}_{Z \mid X, \theta^{(t)}} [\mathbb{1}(z_i = k)] \cdot [\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)] \]


Computing the Expected Value of the Indicator

Now we need to compute:

\[ \mathbb{E}_{Z \mid X, \theta^{(t)}} [\mathbb{1}(z_i = k)] \]

What is this expectation?

The expectation of an indicator variable is just the probability that the indicator equals 1:

\[ \mathbb{E}[\mathbb{1}(z_i = k)] = 1 \cdot P(z_i = k) + 0 \cdot P(z_i \neq k) = P(z_i = k) \]

But we need the conditional expectation given \(X\) and \(\theta^{(t)}\):

\[ \mathbb{E}_{Z \mid X, \theta^{(t)}} [\mathbb{1}(z_i = k)] = P(z_i = k \mid X, \theta^{(t)}) \]

Factorization of the conditional distribution

Because the data points are independent, the posterior over \(z_i\) given all of the data depends only on the corresponding observation \(x_i\):

\[ P(z_i = k \mid X, \theta^{(t)}) = P(z_i = k \mid x_i, \theta^{(t)}) \]

This is the posterior probability that point \(i\) belongs to cluster \(k\), given the observed data and current parameter estimates.


Defining the Responsibility

We define the responsibility of cluster \(k\) for point \(i\) as:

\[ \gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)}) \]

So we’ve shown that:

\[ \boxed{\mathbb{E}_{Z \mid X, \theta^{(t)}} [\mathbb{1}(z_i = k)] = \gamma_{ik}} \]
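
A quick sanity check of this identity (illustrative, reusing numpy from the sketches above, with a made-up posterior vector): drawing many samples of \(z_i\) from its posterior, the empirical frequency of each cluster approaches the corresponding responsibility.

Code
# Monte Carlo check that E[1(z_i = k)] equals the posterior probability.
# The posterior vector below is hypothetical, chosen only for illustration.
posterior = np.array([0.1, 0.7, 0.2])
rng = np.random.default_rng(1)
samples = rng.choice(len(posterior), size=100_000, p=posterior)
empirical = np.array([(samples == k).mean() for k in range(len(posterior))])
print("posterior probabilities:", posterior)
print("empirical frequencies:  ", empirical.round(3))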


Computing the Responsibility Using Bayes’ Rule

Now we need to actually compute \(\gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)})\).

Bayes’ Rule

\[ P(z_i = k \mid x_i, \theta^{(t)}) = \frac{P(x_i \mid z_i = k, \theta^{(t)}) \cdot P(z_i = k \mid \theta^{(t)})}{P(x_i \mid \theta^{(t)})} \]

Breaking down each term:

  1. Prior: \(P(z_i = k \mid \theta^{(t)}) = \pi_k^{(t)}\) (the mixing weight)

  2. Likelihood: \(P(x_i \mid z_i = k, \theta^{(t)}) = \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})\)

  3. Evidence (marginal likelihood): \[ \begin{aligned} P(x_i \mid \theta^{(t)}) & = \sum_{j=1}^{K} P(x_i \mid z_i = j, \theta^{(t)}) \cdot P(z_i = j \mid \theta^{(t)}) \\ & = \sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)}) \end{aligned} \]

Final formula for responsibility:

\[ \boxed{\gamma_{ik} = \frac{\pi_k^{(t)} \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}} \]


The Complete E-Step

Putting it all together, the Q-function becomes:

\[ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} [\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)] \]

where:

\[ \gamma_{ik} = \frac{\pi_k^{(t)} \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})} \]
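
The sketch below puts these two formulas together: it computes the full responsibility matrix \(\gamma\) and then evaluates \(Q(\theta \mid \theta^{(t)})\) at \(\theta = \theta^{(t)}\) for illustration, continuing with the hypothetical points X and the parameters pi, mu, Sigma from the earlier sketches.

Code
# E-step sketch: responsibilities for every point, then the Q-function value.
n, K = len(X), len(pi)

# log pi_k + log N(x_i | mu_k, Sigma_k) for every (i, k), at theta^(t)
log_terms = np.array([
    [np.log(pi[k]) + multivariate_normal.logpdf(X[i], mean=mu[k], cov=Sigma[k])
     for k in range(K)]
    for i in range(n)
])

# Responsibilities: normalize pi_k * N(x_i | ...) across components (Bayes' rule)
weighted = np.exp(log_terms)     # pi_k * N(x_i | mu_k, Sigma_k)
gamma = weighted / weighted.sum(axis=1, keepdims=True)

# Q(theta | theta^(t)), evaluated here at theta = theta^(t)
Q = np.sum(gamma * log_terms)
print("Responsibilities (rows sum to 1):\n", gamma.round(4))
print(f"Q(theta | theta^(t)) = {Q:.4f}")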


Intuitive Interpretation

The Indicator Variable Perspective

  • If we knew \(z_i = k\), then \(\mathbb{1}(z_i = k) = 1\) and this term contributes fully to the log-likelihood
  • If we knew \(z_i \neq k\), then \(\mathbb{1}(z_i = k) = 0\) and this term contributes nothing

The Responsibility Perspective

  • Since we don’t know \(z_i\), we use a soft assignment \(\gamma_{ik} \in [0, 1]\)
  • \(\gamma_{ik}\) is the probability that point \(i\) came from cluster \(k\)
  • If \(\gamma_{ik} = 0.8\), point \(i\) is counted toward cluster \(k\)’s statistics with weight 0.8
  • The responsibilities for each point sum to 1: \(\sum_{k=1}^{K} \gamma_{ik} = 1\)

Mathematical Equivalence

\[ \text{Hard assignment: } \mathbb{1}(z_i = k) \in \{0, 1\} \]

\[ \text{Soft assignment: } \gamma_{ik} = \mathbb{E}[\mathbb{1}(z_i = k)] \in [0, 1] \]


Numerical Example

Code
import numpy as np
from scipy.stats import multivariate_normal

# Current parameters (iteration t)
pi = np.array([0.3, 0.5, 0.2])  # mixing weights
mu = np.array([[0, 0], [3, 3], [0, 5]])  # means
Sigma = [np.eye(2), np.eye(2), np.eye(2)]  # covariances

# A single data point
x_i = np.array([2, 2])

# Compute likelihood for each component
likelihoods = np.array([
    multivariate_normal.pdf(x_i, mean=mu[k], cov=Sigma[k])
    for k in range(3)
])

print("Likelihoods p(x_i | z_i=k, θ):")
for k in range(3):
    print(f"  Component {k+1}: {likelihoods[k]:.6f}")

# Compute numerators: π_k * p(x_i | z_i=k, θ)
numerators = pi * likelihoods
print(f"\nNumerators π_k * p(x_i | z_i=k, θ):")
for k in range(3):
    print(f"  Component {k+1}: {numerators[k]:.6f}")

# Compute denominator: sum over all components
denominator = np.sum(numerators)
print(f"\nDenominator (marginal): {denominator:.6f}")

# Compute responsibilities
gamma_i = numerators / denominator
print(f"\nResponsibilities γ_ik:")
for k in range(3):
    print(f"  Component {k+1}: {gamma_i[k]:.6f} ({gamma_i[k]*100:.1f}%)")

print(f"\nSum of responsibilities: {np.sum(gamma_i):.6f}")

# Interpretation
print(f"\nInterpretation:")
print(f"Point x_i = {x_i} is:")
print(f"  {gamma_i[0]*100:.1f}% likely from cluster 1 (mean {mu[0]})")
print(f"  {gamma_i[1]*100:.1f}% likely from cluster 2 (mean {mu[1]})")
print(f"  {gamma_i[2]*100:.1f}% likely from cluster 3 (mean {mu[2]})")
Likelihoods p(x_i | z_i=k, θ):
  Component 1: 0.002915
  Component 2: 0.058550
  Component 3: 0.000239

Numerators π_k * p(x_i | z_i=k, θ):
  Component 1: 0.000875
  Component 2: 0.029275
  Component 3: 0.000048

Denominator (marginal): 0.030197

Responsibilities γ_ik:
  Component 1: 0.028960 (2.9%)
  Component 2: 0.969455 (96.9%)
  Component 3: 0.001585 (0.2%)

Sum of responsibilities: 1.000000

Interpretation:
Point x_i = [2 2] is:
  2.9% likely from cluster 1 (mean [0 0])
  96.9% likely from cluster 2 (mean [3 3])
  0.2% likely from cluster 3 (mean [0 5])

Visualization of Responsibilities

Code
import matplotlib.pyplot as plt

# Generate a grid of points
x_range = np.linspace(-3, 6, 100)
y_range = np.linspace(-3, 8, 100)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
grid_points = np.column_stack([X_grid.ravel(), Y_grid.ravel()])

# Compute responsibilities for each point on the grid
responsibilities = np.zeros((len(grid_points), 3))

for i, point in enumerate(grid_points):
    likelihoods = np.array([
        multivariate_normal.pdf(point, mean=mu[k], cov=Sigma[k])
        for k in range(3)
    ])
    numerators = pi * likelihoods
    denominator = np.sum(numerators)
    responsibilities[i] = numerators / denominator

# Reshape for plotting
gamma_1 = responsibilities[:, 0].reshape(X_grid.shape)
gamma_2 = responsibilities[:, 1].reshape(X_grid.shape)
gamma_3 = responsibilities[:, 2].reshape(X_grid.shape)

# Create RGB image where each color represents a cluster
rgb_image = np.zeros((X_grid.shape[0], X_grid.shape[1], 3))
rgb_image[:, :, 0] = gamma_1  # Red for cluster 1
rgb_image[:, :, 1] = gamma_2  # Green for cluster 2
rgb_image[:, :, 2] = gamma_3  # Blue for cluster 3

# Plot
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

# RGB combined view
axes[0].imshow(rgb_image, extent=[x_range[0], x_range[-1], y_range[0], y_range[-1]], 
               origin='lower', aspect='auto')
axes[0].scatter(mu[:, 0], mu[:, 1], c='white', s=200, marker='X', 
                edgecolors='black', linewidths=2)
axes[0].set_title('Combined Responsibilities\n(Red=C1, Green=C2, Blue=C3)', fontsize=12)
axes[0].set_xlabel('x₁')
axes[0].set_ylabel('x₂')

# Individual responsibility maps
for k in range(3):
    gamma_k = responsibilities[:, k].reshape(X_grid.shape)
    im = axes[k+1].contourf(X_grid, Y_grid, gamma_k, levels=20, cmap='viridis')
    axes[k+1].scatter(mu[k, 0], mu[k, 1], c='red', s=200, marker='X', 
                      edgecolors='black', linewidths=2)
    axes[k+1].set_title(f'γ_i{k+1}: Responsibility of Component {k+1}', fontsize=12)
    axes[k+1].set_xlabel('x₁')
    axes[k+1].set_ylabel('x₂')
    plt.colorbar(im, ax=axes[k+1])

plt.tight_layout()
plt.show()


Key Properties of Responsibilities

1. Normalization

For each point \(i\): \[ \sum_{k=1}^{K} \gamma_{ik} = 1 \]

This follows from the law of total probability.
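
Explicitly, summing the responsibility formula over \(k\) makes the numerator equal to the denominator:

\[ \sum_{k=1}^{K} \gamma_{ik} = \frac{\sum_{k=1}^{K} \pi_k^{(t)} \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})} = 1 \]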

Code
# Verify normalization
print("Verification that responsibilities sum to 1:")
for i in range(5):
    point = grid_points[i]
    likelihoods = np.array([
        multivariate_normal.pdf(point, mean=mu[k], cov=Sigma[k])
        for k in range(3)
    ])
    numerators = pi * likelihoods
    gamma = numerators / np.sum(numerators)
    print(f"Point {i}: sum of γ = {np.sum(gamma):.10f}")
Verification that responsibilities sum to 1:
Point 0: sum of γ = 1.0000000000
Point 1: sum of γ = 1.0000000000
Point 2: sum of γ = 1.0000000000
Point 3: sum of γ = 1.0000000000
Point 4: sum of γ = 1.0000000000

2. Soft Assignment

  • \(\gamma_{ik} = 1\): point \(i\) definitely belongs to cluster \(k\)
  • \(\gamma_{ik} = 0\): point \(i\) definitely does not belong to cluster \(k\)
  • \(0 < \gamma_{ik} < 1\): partial membership (uncertainty)

3. Relationship to Hard K-Means

K-means uses hard assignments: \[ z_i = \arg\max_k \gamma_{ik} \]

EM uses the full probability distribution over assignments.

Code
# Compare soft vs hard assignments
test_point = np.array([1.5, 1.5])

likelihoods = np.array([
    multivariate_normal.pdf(test_point, mean=mu[k], cov=Sigma[k])
    for k in range(3)
])
numerators = pi * likelihoods
gamma = numerators / np.sum(numerators)

print(f"Test point: {test_point}")
print(f"\nSoft assignment (EM):")
for k in range(3):
    print(f"  Cluster {k+1}: {gamma[k]:.4f}")

print(f"\nHard assignment (K-means):")
hard_assignment = np.argmax(gamma) + 1
print(f"  Cluster {hard_assignment}")
Test point: [1.5 1.5]

Soft assignment (EM):
  Cluster 1: 0.3744
  Cluster 2: 0.6239
  Cluster 3: 0.0017

Hard assignment (K-means):
  Cluster 2

Summary

We’ve derived that:

\[ \boxed{\mathbb{E}_{Z \mid X, \theta^{(t)}} [\mathbb{1}(z_i = k)] = \gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)})} \]

And computed it using Bayes’ rule:

\[ \boxed{\gamma_{ik} = \frac{\pi_k^{(t)} \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}} \]

Key insight: Since we don’t know which cluster generated each point, we replace the binary indicator \(\mathbb{1}(z_i = k)\) with its expected value under the posterior distribution, which is the probability (responsibility) \(\gamma_{ik}\).


Footnotes

  1. Courtesy of Claude.ai