Dimensionality Reduction - PCA + t-SNE

We previously learned how to use the SVD as a tool for constructing low-rank matrices.

We now consider it as a tool for transforming (i.e., reducing the dimension of) our data.

Overview

Collected data is often high-dimensional. The high dimensionality of a dataset, i.e., its large number of features, makes it challenging to work with.

High-Dimensional Challenges

We have seen some of these challenges already, in particular:

  • the curse of dimensionality, where data points become sparse in higher dimensions and distance metrics have less meaning,
  • overfitting, where high-dimensional data can lead to overly complex models that fit to noise in the data as opposed to the actual signal,
  • computational complexity, where high-dimensional data requires more computing power and memory,
  • visualization, where high-dimensional data makes understanding and interpreting the data difficult.

How can we reduce the dimension of our data but still preserve the most important information in our dataset?

We consider two techniques:

  • Principal Component Analysis (PCA)
  • t-distributed stochastic neighbor embedding (t-SNE)

We will demonstrate the relationship between PCA and the SVD.

t-SNE is an alternative nonlinear method for dimensionality reduction.

PCA

Dimensionality Reduction

Input: \(\mathbf{x}_1,\ldots, \mathbf{x}_m\) with \(\mathbf{x}_i \in \mathbb{R}^n \: \: \forall \: i \in \{1, \ldots, m\}.\)

Output: \(\mathbf{y}_1,\dots, \mathbf{y}_m\) with \(\mathbf{y}_i \in \mathbb{R}^d \: \: \forall \: i \in \{1, \dots, m\}\).

The goal is to compute the new data points \(\mathbf{y}_i\) such that \(d \ll n\) while still preserving as much as possible of the information contained in the data points \(\mathbf{x}_i\).

Note that row \(i\) of the data matrix \(X_0\) is the \(n\)-dimensional vector \(\mathbf{x}_i\), so the matrix has \(m\) rows (data points) and \(n\) columns (features). The same is true for \(Y\) (i.e., there are \(m\) rows with \(d\) features).

\[ X_0 = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix} \: \xrightarrow[\text{PCA}]{} \: Y = \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1d} \\ y_{21} & y_{22} & \dots & y_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \dots & y_{md} \end{bmatrix} \]

PCA Application: Genes Mirror Geography

We consider a dataset from Novembre et al. (2008). The authors collected DNA data called SNPs (single nucleotide polymorphism) from 3000 individuals in Europe.

SNPs record changes in the DNA at a single position relative to a common base pair (A, T, C, G). A value of 0 means no change to the base pair, a value of 1 means one of the two copies changed, and a value of 2 means both copies changed.

The data for each individual consisted of approximately 500k SNPs. This means the data matrix we are working with is 3000 x 500k.


The authors performed PCA and plotted 1387 of the individuals in the reduced dimensions.

Image Credit Novembre et al. (2008)

For comparison, a color-coded map of western Europe is shown, and the same color coding is applied to the data samples by country of origin.


Key observations:

  • the first two principal components of the data almost reproduce the map of Europe, i.e., they appear to correspond to latitude and longitude.
  • SNPs are similar geographically
  • DNA of an individual reveals their birthplace within a few hundred kilometers

Would a similar study in the USA be effective?

PCA Overview

The following describes the process used to perform PCA and obtain the reduced dataset. The goal is: given

Input: \(X_0\in\mathbb{R}^{m\times n}\), produce

Output: \(Y\in\mathbb{R}^{m\times d}\) with \(d \ll n\).

The first step is to center the data by subtracting the column means (the mean of each feature, computed across the rows): \(X_0 \rightarrow X\).

Example: \[ X_0 = \begin{bmatrix} 90 & 60 & 60\\ 80 & 60 & 70 \end{bmatrix} \]

The vector of column means is: \[ \boldsymbol{\mu} = \begin{bmatrix} 85 & 60 & 65 \end{bmatrix} \]

The mean-centered dataset is

\[ X = \begin{bmatrix} 5 & 0 & -5\\ -5 & 0 & 5 \end{bmatrix} \]
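As a quick check, here is a minimal NumPy sketch of the centering step for this example (the variable names are just illustrative):

Code
import numpy as np

# Example data matrix: 2 samples (rows), 3 features (columns)
X0 = np.array([[90., 60., 60.],
               [80., 60., 70.]])

mu = X0.mean(axis=0)   # column (feature) means
X = X0 - mu            # mean-centered data

print(mu)  # [85. 60. 65.]
print(X)   # [[ 5.  0. -5.]
           #  [-5.  0.  5.]]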

The next step is to determine the directions of the data that correspond to the largest variances. These are the principal components.


Least Squares Interpretation

Centered data often clusters along a line (or other low-dimensional subspace of \(\mathbb{R}^{n}\)). PCA finds the best-fitting such subspace in the following least-squares sense:

The sum of variances (squared distances of the projected points to the mean) is maximized over all choices of subspace.

The sum of residuals (squared distances from the points to the line or subspace) is minimized.

These two criteria are equivalent: by the Pythagorean theorem, each point's squared distance to the mean splits into its squared projection plus its squared residual, and their total is fixed by the data.

What is the statistical entity that measures the variability in the data?

Answer: the covariance matrix.

Covariance Matrix

Let \(X\in\mathbb{R}^{m\times n}\) contain the centered data. Recall that the sample covariance matrix is defined by

\[ S = \frac{1}{m-1}X^{T}X \]

Example:
\[ X = \begin{bmatrix} 5 & 0 & -5\\ -5 & 0 & 5 \end{bmatrix} \]

\[ S = \frac{1}{2-1}\begin{bmatrix} 5 & -5\\ 0 & 0\\ -5 & 5 \end{bmatrix} \begin{bmatrix} 5 & 0 & -5\\ -5 & 0 & 5 \end{bmatrix} = \begin{bmatrix} 50 & 0 & -50\\ 0 & 0 & 0\\ -50 & 0 & 50 \end{bmatrix} \]

The matrix \(S\) is symmetric, i.e., \(S = S^{T}.\)
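As a quick numerical check of this example, the sketch below computes \(S\) both from the formula and with `np.cov` (which applies the same \(\frac{1}{m-1}\) scaling when `rowvar=False`):

Code
import numpy as np

# Centered data from the example: 2 samples, 3 features
X = np.array([[ 5., 0., -5.],
              [-5., 0.,  5.]])
m = X.shape[0]

S_manual = (X.T @ X) / (m - 1)       # 1/(m-1) * X^T X
S_numpy = np.cov(X, rowvar=False)    # rows are observations, columns are features

print(S_manual)                        # [[ 50.   0. -50.]
                                       #  [  0.   0.   0.]
                                       #  [-50.   0.  50.]]
print(np.allclose(S_manual, S_numpy))  # True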

Spectral Decomposition

We use the fact that \(S\) is symmetric to apply the Spectral Decomposition, which states:

Every real symmetric matrix \(S\) has the factorization \(S = V\Lambda V^{T}\), where \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(S\) and the columns of \(V\) are orthonormal eigenvectors of \(S\).

You can refresh your memory about this in the Linear Algebra Refresher.


The covariance matrix \(S \in \mathbb{R}^{n\times n}\) has the spectral decomposition \(S = V\Lambda V^T\) where

\[ \Lambda = \begin{bmatrix} \lambda_1 & & & & \\ & \lambda_2 & & \text{\Large0} & \\ & & \ddots & & \\ & \text{\Large0} & & \lambda_{n-1} & \\ & & & & \lambda_n \end{bmatrix} \] with \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{n-1} \geq \lambda_n\), and \[ V = \begin{bmatrix} \bigg| & \bigg| & & \bigg| & \bigg| \\ \mathbf{v}_1 & \mathbf{v}_2 & \dots & \mathbf{v}_{n-1} & \mathbf{v}_n \\ \bigg| & \bigg| & & \bigg| & \bigg| \end{bmatrix} \]

with \(S\mathbf{v}_i = \lambda_i \mathbf{v}_i\) and \(\mathbf{v}_i \perp \mathbf{v}_j\) for \(i\neq j\).


The previous decomposition was for all \(n\) dimensions. To obtain our reduced data, we take the first \(d\) columns of \(V\), i.e.,

\[ V' = \begin{bmatrix} \bigg| & \bigg| & & \bigg| \\ \mathbf{v}_1 & \mathbf{v}_2 & \dots & \mathbf{v}_{d} \\ \bigg| & \bigg| & & \bigg| \end{bmatrix}, \] and the \(d\times d\) upper block of matrix \(\Lambda\) \[ \Lambda' = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_d \end{bmatrix}. \]


The direction \(\mathbf{v}_i\) is the \(i\)-th principal component and the corresponding \(\lambda_i\) accounts for the \(i\)-th largest variance in the dataset.

In other words, \(\mathbf{v}_1\) is the first principal component and is the direction that accounts for the most variance in the dataset.

The vector \(\mathbf{v}_d\) is the \(d\)-th principal component and is the direction that accounts for the \(d\)-th most variance in the dataset.

The reduced data matrix is obtained by computing

\[ Y = XV'. \]


Returning to our example \[X = \begin{bmatrix} 5 & 0 & -5\\ -5 & 0 & 5 \end{bmatrix}, \quad S = \begin{bmatrix} 50 & 0 & -50\\ 0 & 0 & 0\\ -50 & 0 & 50 \end{bmatrix} \]

Eigenvalues of \(S\): \(\lambda_1 = 100, \: \lambda_2 = \lambda_3 =0\)

Eigenvectors of \(S\): \[ \mathbf{v}_1 = \begin{bmatrix} 0.7071 \\ 0 \\ -0.7071 \end{bmatrix}, \: \mathbf{v}_2 = \begin{bmatrix} 0.7071 \\ 0 \\ 0.7071 \end{bmatrix}, \: \mathbf{v}_3 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \]

What is the value of the total variance in the data?

Total variance: \(T = 100\).

The first principal component accounts for all the variance in the dataset.
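A short sketch that reproduces this example with `numpy.linalg.eigh` and projects the data onto the first principal component (the sign of an eigenvector is arbitrary, so the projected values may come out negated):

Code
import numpy as np

X = np.array([[ 5., 0., -5.],
              [-5., 0.,  5.]])
S = (X.T @ X) / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order, so sort them in descending order
eigenvalues, V = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

print(eigenvalues)        # approximately [100.   0.   0.]
print(eigenvalues.sum())  # total variance T = 100

# Keep d = 1 principal component and project the data onto it
d = 1
V_prime = V[:, :d]
Y = X @ V_prime
print(Y)                  # approximately [[ 7.07], [-7.07]] (sign may be flipped)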

PCA Summary

  • The columns of \(V\) are the principal directions (or components).
  • The reduced dimension data is the projection of the data in the direction of the principal directions, i.e., \(Y=XV'\).
  • The total variance \(T\) in the data is the sum of all eigenvalues: \(T = \lambda_1 + \lambda_2 + \dots + \lambda_n.\)
  • The first eigenvector \(\mathbf{v}_1\) points in the most significant direction of the data. This direction explains the largest fraction \(\lambda_1/T\) of the total variance.
  • The second eigenvector \(\mathbf{v}_2\) accounts for a smaller fraction \(\lambda_2/T\).
  • The explained variance of component \(i\) is the value \(\lambda_i/T\). The explained variances are non-increasing in \(i\), with the last component explaining the fraction \(\lambda_n/T\).
  • The cumulative (total) explained variance is the running sum of the explained variances; summed over all components it equals 1 (see the short sketch after this list).
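The explained and cumulative explained variance ratios can be computed directly from the sorted eigenvalues; here is a minimal sketch using the eigenvalues of our running example:

Code
import numpy as np

# Eigenvalues of the covariance matrix, sorted in descending order (from the example above)
eigenvalues = np.array([100., 0., 0.])

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)        # [1. 0. 0.]
print(cumulative_explained_variance)   # [1. 1. 1.]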

Case Study: Random Data

Let’s consider some randomly generated data consisting of 2 features.

Code
import numpy as np
import matplotlib.pyplot as plt

# Set the seed for reproducibility
rng = np.random.default_rng(seed=42)

n_samples = 500

# Linear map that correlates the two features, plus a shift of the mean
C = np.array([[0.1, 0.6], [2., .6]])
X0 = rng.standard_normal((n_samples, 2)) @ C + np.array([-6, 3])

# Center the data
X = X0 - X0.mean(axis=0)

plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.show()


Let’s do PCA and plot the principal components over our dataset. As expected, the principal components are orthogonal and point in the directions of maximum variance.

Code
import numpy as np
import matplotlib.pyplot as plt

origin = [0, 0]

# Compute the covariance matrix
cov_matrix = np.cov(X, rowvar=False)

# Calculate the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort the eigenvalues and eigenvectors in descending order
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Plot the original dataset
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)

# Plot the principal components
for i in range(len(eigenvalues)):
    plt.quiver(origin[0], origin[1], -eigenvectors[0, i], -eigenvectors[1, i],
               angles='xy', scale_units='xy', scale=1, color=['r', 'g'][i])

plt.title('Principal Components on Original Dataset')
plt.axis('equal')
plt.show()

Case Study: Digits

Let’s consider another example using the digits dataset from scikit-learn (8×8 grayscale images of handwritten digits).

Code
from sklearn import datasets
digits = datasets.load_digits()

plt.figure(figsize=(8, 8),)
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.imshow(digits.images[i], cmap='gray_r')
    plt.axis('off')
plt.tight_layout()
plt.show()


We first flatten each 8×8 digit image into a single vector by stacking its columns (this operation is called vectorization) and perform PCA on the resulting 64-dimensional representation of the digits.

We can plot the explained variance ratio and the total (cumulative) explained variance ratio.

Code
from sklearn.decomposition import PCA

X = digits.data
y = digits.target

pca = PCA()
X_pca = pca.fit_transform(X)

# Create subplots
fig, axs = plt.subplots(2, 1, figsize=(6, 5))

# Scree plot (graph of eigenvalues corresponding to PC number)
# This shows the explained variance ratio
axs[0].plot(np.arange(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', linestyle='--')
axs[0].set_title('Scree Plot')
axs[0].set_xlabel('Principal Component')
axs[0].set_ylabel('Explained Variance Ratio')
axs[0].grid(True)

# Cumulative explained variance plot
axs[1].plot(np.arange(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
axs[1].set_title('Cumulative Explained Variance Plot')
axs[1].set_xlabel('Principal Component')
axs[1].set_ylabel('Cumulative Explained Variance')
axs[1].grid(True)

# Adjust layout
plt.tight_layout()
plt.show()


Let’s plot the data in the first 2 principal component directions. We’ll use the digit labels to color each digit in the reduced space.

Code
plt.figure(figsize=(7, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab20', edgecolor='k', s=50)
plt.title('Digits in PC1 and PC2 Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Create a legend with discrete labels
legend_labels = np.unique(y)
handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=plt.cm.tab20(i / 9), markersize=10) for i in legend_labels]
plt.legend(handles, legend_labels, title="Digit Label", loc="best")

plt.show()


We observe the following in our plot of the digits in the first two principal components:

  • There is a decent clustering of some of our digits, in particular 0, 2, 3, 4, and 6.
  • The numbers 0 and 6 seem to be relatively close to each other in this space.
  • There is not a very clear separation of the number 5 from some of the other points.

To scale or not to scale

Consider a situation where we have age (years) and height (feet) data for 4 people.

Person   Age [years]   Height [feet]
A        25            6.232
B        30            6.232
C        25            5.248
D        30            5.248

Notice that the dominant direction of the variance is aligned horizontally.


What if the height is in cm?

Person   Age [years]   Height [cm]
A        25            189.95
B        30            189.95
C        25            159.96
D        30            159.96

Notice that the dominant direction of the variance is aligned vertically.


Let’s standardize our data.

Person   Age (standardized)   Height (standardized)
A        -1                    1
B         1                    1
C        -1                   -1
D         1                   -1

When we standardize our data, we observe an equal distribution of the variance.


What quantity is represented by \(\frac{Cov(X, Y)}{\sigma_X\sigma_Y}\), where \(X\) and \(Y\) represent the random variables associated to sampling a person with that Age and Height?

This is the correlation coefficient; collecting these values over all pairs of features gives the correlation matrix. When you standardize your data, you are no longer working with the covariance matrix but with the correlation matrix of the original data.

Performing PCA on a standardized dataset, i.e., working with the correlation matrix, is best practice when your features have very different scales.

PCA identifies the directions of maximum variance, so features measured on large scales can disproportionately influence the results of your PCA.
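As a sketch of this point, the snippet below standardizes the age/height data from the tables above and checks that the covariance matrix of the standardized data matches the correlation matrix of the original data (plain NumPy, population scaling):

Code
import numpy as np

# Age [years] and Height [cm] for persons A-D (from the tables above)
X0 = np.array([[25., 189.95],
               [30., 189.95],
               [25., 159.96],
               [30., 159.96]])

# Standardize: subtract each column's mean and divide by its (population) standard deviation
Z = (X0 - X0.mean(axis=0)) / X0.std(axis=0)
print(Z)  # the +/-1 values from the standardized table

cov_standardized = np.cov(Z, rowvar=False, bias=True)   # population covariance of Z
corr_original = np.corrcoef(X0, rowvar=False)           # correlation matrix of X0

print(np.allclose(cov_standardized, corr_original))     # True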

Relationship to the SVD

Recall that the SVD of a mean-centered data matrix \(X\in\mathbb{R}^{m\times n}\) (\(m>n\)) is

\[ X = U\Sigma V^T. \]


How can we relate this to what we learned in PCA?

In PCA, we computed an eigen-decomposition of the covariance matrix, i.e., \(S=V\Lambda V^{T}\).

Consider the following computation using the SVD of \(X\)

\[ \begin{align*} X^{T}X = (U\Sigma V^T)^{T}(U\Sigma V^T) = V\Sigma U^TU \Sigma V^{T} = V\Sigma^{2}V^T \end{align*} \]

Comparing with \(S = \frac{1}{m-1}X^{T}X = V\,\frac{\Sigma^{2}}{m-1}\,V^{T}\), we see that the eigenvalues \(\lambda_i\) of \(S\) are the squared singular values scaled by \(\frac{1}{m-1}\), i.e., \(\lambda_i = \frac{\sigma_i^{2}}{m-1}\).


In practice, PCA is done by computing the SVD of your data matrix as opposed to forming the covariance matrix and computing the eigenvalues.

The principal components are the right singular vectors \(V\). The projected data is computed as \(XV\). The eigenvalues of \(S\) are obtained from squaring the entries of \(\Sigma\) and scaling them by \(\frac{1}{m-1}\).
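Here is a minimal sketch of PCA via the SVD on some synthetic data, checked against the covariance-eigendecomposition route (NumPy only; the data is just a random placeholder):

Code
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.standard_normal((100, 5))
X = X0 - X0.mean(axis=0)        # center the data
m = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix
S = (X.T @ X) / (m - 1)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: SVD of the centered data matrix (no covariance matrix formed)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                        # principal components (right singular vectors)

# The eigenvalues of S are the squared singular values scaled by 1/(m-1)
print(np.allclose(eigvals, sigma**2 / (m - 1)))   # True

# Reduced data using the first d principal components
d = 2
Y = X @ V[:, :d]
print(Y.shape)                  # (100, 2)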

Reasons to compute PCA this way are

  • Numerical Stability: SVD is a numerically stable method, which means it can handle datasets with high precision and avoid issues related to floating-point arithmetic errors.

  • Efficiency: SVD is computationally efficient, especially for large datasets. Many optimized algorithms and libraries (like those in NumPy and scikit-learn) leverage SVD for fast computations.

  • Direct Computation of Principal Components: SVD directly provides the principal components (right singular vectors) and the singular values, which are related to the explained variance. This makes the process straightforward and avoids the need to compute the covariance matrix explicitly.

  • Memory Efficiency: For large datasets, SVD can be more memory-efficient. Techniques like truncated SVD can be used to compute only the top principal components, reducing memory usage.

  • Versatility: SVD is a fundamental linear algebra technique used in various applications beyond PCA, such as signal processing, image compression, and solving linear systems. This versatility makes it a well-studied and widely implemented method.

Pros and Cons

Here are some advantages and disadvantages of PCA.

Advantages

  • Allows for visualizations
  • Removes redundant variables
  • Prevents overfitting
  • Speeds up other ML algorithms

Disadvantages

  • reduces interpretability
  • can result in information loss
  • can be less effective than non-linear methods

t-SNE

What is t-SNE?

  • t-SNE stands for t-Distributed Stochastic Neighbor Embedding.
  • It is a non-linear dimensionality reduction technique.
  • Primarily used for visualizing high-dimensional data in 2 or 3 dimensions.
  • Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, see Van der Maaten and Hinton (2008).

How t-SNE Works

  • t-SNE converts high-dimensional Euclidean distances into conditional probabilities.
  • It aims to preserve the local structure of the data.
  • Uses a heavy-tailed Student-t distribution in the low-dimensional space to prevent crowding. This is where the \(t\) in \(t\)-SNE comes from.

We will describe how this process works.

Step 1: Compute Pairwise Similarities

  • Calculate pairwise similarities between points in the high-dimensional space.
  • Use a Gaussian distribution to convert distances into probabilities.
  • The probability \(p_{j\vert i}\) between points \(i\) and \(j\) is given by: \[ p_{j\vert i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}. \]
  • The \(p_{j\vert i}\) value represents the conditional probability that point \(x_i\) would choose \(x_j\) as its neighbor.
  • Note that we set \(p_{i\vert i} = 0\) – we don’t consider a point to be its own neighbor.
  • t-SNE defines the joint probability \(p_{ij} = \frac{p_{j\vert i} + p_{i\vert j}}{2m}\), where \(m\) is the number of data points. This symmetrization is used because \(p_{j\vert i}\) and \(p_{i\vert j}\) are generally different, and the factor \(2m\) ensures that \(\sum_{i \neq j} p_{ij} = 1\). Using this value of \(p_{ij}\) is called symmetric t-SNE (a small sketch of this step follows this list).
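Here is a small sketch of Step 1, using one fixed value of \(\sigma\) for every point for simplicity (in t-SNE proper, each \(\sigma_i\) is chosen from the perplexity, as described next):

Code
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Compute the matrix of p_{j|i} values, using one fixed sigma for every point."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # enforce p_{i|i} = 0
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)          # normalize each row over k != i
    return P

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                # 5 points in 3 dimensions

P_cond = conditional_probabilities(X)
print(P_cond.sum(axis=1))                      # each row sums to 1

# Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2m)
m = X.shape[0]
P_joint = (P_cond + P_cond.T) / (2 * m)
print(P_joint.sum())                           # sums to 1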

Visualizing the Pairwise Similarities

The Gaussian distribution is centered at the point \(x_i\), and points \(x_j\) that are farther away have a lower probability of being chosen as a neighbor.

Code
mean = 0
std_dev = 1
x = np.linspace(-5, 5, 1000)
y = (1/(std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev)**2)

# Points on the x-axis
points_x = [0, 1, 4, 4.5]
points_y = [0, 0, 0, 0]
colors = ['#009E73', '#009E73', '#CC79A7', '#CC79A7']

# Create the plot
plt.plot(x, y, label='Gaussian Distribution')
plt.scatter(points_x, points_y, color=colors, zorder=5)

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Probability Density')
plt.title('Gaussian Distribution with Individual Points')
plt.legend()

# Show the plot
plt.show()

Choosing \(\sigma\)

How do we determine the variance for the Gaussian?

  • The variance is determined by a hyperparameter called the perplexity.
  • As a result the perplexity controls the effective number of neighbors.
  • A high perplexity means more neighbors are considered for each point. A low perplexity means fewer neighbors are considered for each point.
  • The perplexity is defined as \(\operatorname{Perp}(P_i) = 2^{H(P_i)}\), where \(H(P_i)\) is the entropy of \(P_i\). \(P_i\) is the probability distribution induced by \(\sigma_i\).
  • The entropy \(H(P_i) = -\sum_j p_{j\vert i} \log_2{(p_{j\vert i})}\).
  • The value of \(\sigma_i\) is found (typically by a binary search) so that the perplexity of the resulting distribution \(P_i\) equals the chosen value; a small sketch of this search follows below.

The perplexity commonly ranges between 5 and 50.
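Here is a minimal sketch of how \(\sigma_i\) can be found for a single point by binary search on the perplexity (the function names are illustrative, not from any library):

Code
import numpy as np

def perplexity_for_sigma(sq_dists_i, sigma):
    """Perplexity of P_i for one point, given its squared distances to every other point."""
    p = np.exp(-sq_dists_i / (2 * sigma ** 2))
    p /= p.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))   # H(P_i)
    return 2.0 ** entropy                       # Perp(P_i) = 2^H(P_i)

def find_sigma(sq_dists_i, target_perplexity, tol=1e-4, max_iter=100):
    """Binary search for the sigma whose induced distribution P_i has the target perplexity."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        perp = perplexity_for_sigma(sq_dists_i, sigma)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:   # too many effective neighbors -> shrink sigma
            hi = sigma
        else:                          # too few effective neighbors -> grow sigma
            lo = sigma
    return sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)   # squared distances from point 0 to the others

sigma_0 = find_sigma(sq_dists_0, target_perplexity=30)
print(sigma_0, perplexity_for_sigma(sq_dists_0, sigma_0))   # perplexity is close to 30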

Step 2: Define Low-Dimensional Map

  • Initialize points randomly in the low-dimensional space.
  • Define a similar probability distribution using a Student-t distribution: \[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}. \]
  • The heavy tails of the Student-t distribution help to spread out the points and prevent crowding of the points in the lower dimensional space.

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

# Generate data for Gaussian distribution
mean = 0
std_dev = 1
x = np.linspace(-5, 5, 1000)
gaussian_y = (1/(std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev)**2)

# Generate data for Student's t-distribution with 1 degree of freedom (as used in t-SNE)
df = 1
t_y = t.pdf(x, df)

# Create the plot
plt.plot(x, gaussian_y, label='Gaussian Distribution')
plt.plot(x, t_y, label="Student's t-Distribution", linestyle='dashed')

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Probability Density')
plt.title('Gaussian Distribution and Student\'s t-Distribution')
plt.legend()

# Show the plot
plt.show()

Step 3: Minimize Kullback-Leibler Divergence

  • Minimize the Kullback-Leibler (KL) divergence between the high-dimensional and low-dimensional distributions: \[ KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \]
  • The KL divergence measures how different distribution \(P\) is from distribution \(Q\), i.e., how poorly the low-dimensional similarities \(q_{ij}\) match the high-dimensional similarities \(p_{ij}\).
  • By minimizing this function, we ensure that the lower dimensional point clusters are similar to the higher dimensional clusters.
  • To minimize the KL divergence we use gradient descent to iteratively adjust the positions of points in the low-dimensional space (a compact sketch of this step follows this list).
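Here is a compact sketch of Steps 2 and 3 for a handful of points: compute \(q_{ij}\), evaluate the KL divergence, and take a few plain gradient descent steps on the map points. The gradient expression is the one derived in Van der Maaten and Hinton (2008); the matrix `P` below is a random stand-in for the symmetrized similarities from Step 1, and the learning rate is an arbitrary illustrative value.

Code
import numpy as np

def q_matrix(Y):
    """Student-t joint probabilities q_ij for the low-dimensional points Y."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)          # q_ii = 0
    return inv / inv.sum(), inv

def kl_divergence(P, Q):
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.default_rng(0)
m = 5

# Stand-in for the symmetrized high-dimensional similarities p_ij from Step 1
P = rng.random((m, m))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()

# Step 2: random initialization of the low-dimensional map (2-D)
Y = 1e-2 * rng.standard_normal((m, 2))

# Step 3: gradient descent on the KL divergence
# Gradient (Van der Maaten and Hinton 2008):
#   dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)
learning_rate = 10.0
for step in range(5):
    Q, inv = q_matrix(Y)
    print(f"step {step}: KL divergence = {kl_divergence(P, Q):.4f}")
    W = (P - Q) * inv
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
    Y -= learning_rate * grad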

Pros of t-SNE

  • Excellent for visualizing complex datasets: Reveals clusters and patterns that are not visible in high-dimensional space.
  • Captures non-linear relationships: Preserves local structure and relationships between points.
  • Preserves local structure effectively: Ensures that similar points in high-dimensional space remain close in the low-dimensional representation.

Cons of t-SNE

  • Computationally intensive: Requires significant computational resources, especially for large datasets.
  • Results can vary depending on hyperparameters: Parameters like perplexity and learning rate can significantly affect the outcome.
  • Not suitable for all types of data analysis: Primarily used for visualization, not for tasks like feature extraction or predictive modeling.

Case Study: Digits

Let’s consider how t-SNE performs at clustering the scikit-learn digits dataset.

Code
from sklearn.manifold import TSNE
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.figure(figsize=(7, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab20', edgecolor='k', s=50)
plt.title('Digits in t-SNE Space')
plt.xlabel('tSNE Component 1')
plt.ylabel('tSNE Component 2')

# Create a legend with discrete labels
legend_labels = np.unique(y)
handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=plt.cm.tab20(i / 9), markersize=10) for i in legend_labels]
plt.legend(handles, legend_labels, title="Digit Label", loc="best")

plt.show()

t-SNE vs PCA: Dimensionality Reduction

  • PCA: Linear method, reduces dimensions by maximizing variance along principal components.
  • t-SNE: Non-linear method, focuses on preserving local structure and relationships between points.

t-SNE vs PCA: Visualization

  • PCA: Good for understanding global structure and variance in the data.
  • t-SNE: Superior for visualizing clusters, local relationships, and non-linear structures.

t-SNE vs PCA: Computational Complexity

  • PCA: Faster and less computationally intensive, suitable for large datasets.
  • t-SNE: Slower, requires more computational resources, but provides more detailed visualizations.

Recap

Today we discussed two dimensionality reduction techniques: PCA and t-SNE.

We applied both to the scikit-learn digits dataset.

In summary

  • PCA is a faster, linear dimensionality reduction technique.
  • PCA captures the maximum amount of variance in each principal direction.
  • No hyperparameter tuning is needed when computing PCA via the SVD.
  • t-SNE is a powerful tool for reducing and visualizing high-dimensional data.
  • Choose t-SNE for detailed visualizations of local structures and clusters.
  • t-SNE is sensitive to changes in its parameters.
  • Both methods have their own strengths and are chosen based on the specific needs of the analysis.


References

Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R Boyko, Adam Auton, Amit Indap, et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456 (7218): 98–101.
Strang, Gilbert. 2016. Introduction to Linear Algebra. 5th ed. Wellesley-Cambridge Press, Wellesley, MA.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).