We will demonstrate the relationship between PCA and the SVD.
t-SNE is an alternative nonlinear method for dimensionality reduction.
PCA
Dimensionality Reduction
Input: \(\mathbf{x}_1,\ldots, \mathbf{x}_m\) with \(\mathbf{x}_i \in \mathbb{R}^n \: \: \forall \: i \in \{1, \ldots, m\}.\)
Output: \(\mathbf{y}_1,\dots, \mathbf{y}_m\) with \(\mathbf{y}_i \in \mathbb{R}^d \: \: \forall \: i \in \{1, \dots, m\}\).
The goal is to compute the new data points \(\mathbf{y}_i\) such that \(d \ll n\) while preserving as much of the information contained in the data points \(\mathbf{x}_i\) as possible.
Note that row \(i\) of the data matrix \(X_0\) is the \(n\)-dimensional vector \(\mathbf{x}_i\), so the matrix has \(m\) rows (data points) and \(n\) columns (features). The same holds for \(Y\): it has \(m\) rows and \(d\) feature columns.
We consider a dataset from Novembre et al. (2008). The authors collected DNA data called SNPs (single nucleotide polymorphism) from 3000 individuals in Europe.
SNPs record changes in DNA at a single base-pair position relative to the common base pair (A, T, C, G). A value of 0 means neither copy differs from the common base pair, a value of 1 means one of the two copies differs, and a value of 2 means both copies differ.
The data for each individual consisted of approximately 500k SNPs. This means the data matrix we are working with is 3000 x 500k.
The authors performed PCA and plotted 1387 of the individuals in the reduced dimensions.
For comparison, a color-coded map of western Europe is shown, and the same color coding is applied to the data samples according to each individual's country of origin.
Key observations:
The first two principal components of the data almost reproduce the map of Europe, i.e., they appear to correspond to latitude and longitude.
Individuals who are geographically close tend to have similar SNPs.
An individual's DNA reveals their birthplace to within a few hundred kilometers.
Would a similar study in the USA be effective?
PCA Overview
The following describes the process of performing PCA to obtain the reduced data set. The goal is: given
Input: \(X_0\in\mathbb{R}^{m\times n}\), produce
Output: \(Y\in\mathbb{R}^{m\times d}\) with \(d \ll n\).
The first step is to center the data by subtracting the mean of each column (feature) from that column: \(X_0 \rightarrow X\).
The second step is to form the sample covariance matrix \(S = \frac{1}{m-1}X^{T}X\).
The matrix \(S\) is symmetric, i.e., \(S = S^{T}.\)
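As a minimal sketch of these two steps (using a small randomly generated matrix standing in for \(X_0\)), centering and forming the covariance matrix look like this:

```python
import numpy as np

# Hypothetical data matrix with m = 5 samples (rows) and n = 3 features (columns)
rng = np.random.default_rng(0)
X0 = rng.standard_normal((5, 3))

# Step 1: center the data by subtracting each column (feature) mean
X = X0 - X0.mean(axis=0)

# Step 2: form the sample covariance matrix S = X^T X / (m - 1)
m = X.shape[0]
S = X.T @ X / (m - 1)

# S is symmetric and agrees with NumPy's covariance of the same (row-oriented) data
assert np.allclose(S, S.T)
assert np.allclose(S, np.cov(X0, rowvar=False))
```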
Spectral Decomposition
We use the fact that \(S\) is symmetric to apply the Spectral Decomposition, which states:
Every real symmetric matrix \(S\) has the factorization \(S = V\Lambda V^{T}\), where \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(S\) and the columns of \(V\) are orthonormal eigenvectors of \(S\).
The direction \(\mathbf{v}_i\) is the \(i\)-th principal component, and the corresponding eigenvalue \(\lambda_i\) is the \(i\)-th largest variance in the dataset (assuming the eigenvalues are sorted in decreasing order).
In other words, \(\mathbf{v}_1\) is the first principal component and is the direction that accounts for the most variance in the dataset.
The vector \(\mathbf{v}_d\) is the \(d\)-th principal component and is the direction that accounts for the \(d\)-th most variance in the dataset.
What is the value of the total variance in the data?
In this example, the total variance is \(T = 100\), and the first principal component accounts for all of the variance in the dataset.
PCA Summary
The columns of \(V\) are the principal directions (or components).
The reduced-dimension data is the projection of the data onto the principal directions, i.e., \(Y = XV_d\), where \(V_d \in \mathbb{R}^{n\times d}\) contains the first \(d\) columns of \(V\).
The total variance \(T\) in the data is the sum of all eigenvalues: \(T = \lambda_1 + \lambda_2 + \dots + \lambda_n.\)
The first eigenvector \(\mathbf{v}_1\) points in the most significant direction of the data. This direction explains the largest fraction \(\lambda_1/T\) of the total variance.
The second eigenvector \(\mathbf{v}_2\) accounts for a smaller fraction \(\lambda_2/T\).
The explained variance of component \(i\) is \(\lambda_i/T\). It decreases as \(i\) increases, reaching its smallest value \(\lambda_n/T\) at the last component.
The cumulative (total) explained variance is the running sum of the explained variances. Summed over all \(n\) eigenvalues, the explained variance is 1.
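Here is a short sketch of the full recipe on a small randomly generated dataset; the reduced dimension \(d = 2\) is chosen only for illustration:

```python
import numpy as np

# Hypothetical data: m = 200 samples with n = 5 correlated features, reduced to d = 2
rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X0 - X0.mean(axis=0)
m, n = X.shape
d = 2

# Covariance matrix and its eigen-decomposition (eigh exploits symmetry)
S = X.T @ X / (m - 1)
eigenvalues, V = np.linalg.eigh(S)

# eigh returns eigenvalues in ascending order; reorder so lambda_1 >= ... >= lambda_n
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# Project onto the first d principal directions: Y = X V_d
Y = X @ V[:, :d]

# Explained variance lambda_i / T and the cumulative explained variance
T = eigenvalues.sum()
explained = eigenvalues / T
print(explained, np.cumsum(explained))
```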
Case Study: Random Data
Let’s consider some randomly generated data consisting of 2 features.
Code
import numpy as np
import matplotlib.pyplot as plt

# Set the seed for reproducibility
seed = 42
rng = np.random.default_rng(seed)

n_samples = 500
C = np.array([[0.1, 0.6], [2., .6]])
X0 = rng.standard_normal((n_samples, 2)) @ C + np.array([-6, 3])
X = X0 - X0.mean(axis=0)

plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.show()
Let’s do PCA and plot the principal components over our dataset. As expected, the principal components are orthogonal and point in the directions of maximum variance.
Code
import numpy as np
import matplotlib.pyplot as plt

origin = [0, 0]

# Compute the covariance matrix
cov_matrix = np.cov(X, rowvar=False)

# Calculate the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort the eigenvalues and eigenvectors in descending order
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Plot the original dataset
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)

# Plot the principal components
for i in range(len(eigenvalues)):
    plt.quiver(origin[0], origin[1], -eigenvectors[0, i], -eigenvectors[1, i],
               angles='xy', scale_units='xy', scale=1, color=['r', 'g'][i])

plt.title('Principal Components on Original Dataset')
plt.axis('equal')
plt.show()
Case Study: Digits
Let’s consider another example using the handwritten digits dataset from scikit-learn (8×8 grayscale images, a small stand-in for MNIST).
Code
from sklearn import datasets

digits = datasets.load_digits()

plt.figure(figsize=(8, 8))
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.imshow(digits.images[i], cmap='gray_r')
    plt.axis('off')
plt.tight_layout()
plt.show()
We first stack the columns of each digit image on top of each other (this operation is called vectorization) and perform PCA on the resulting 64-dimensional representation of the digits.
We can plot the explained variance ratio and the total (cumulative) explained variance ratio.
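A sketch of one way to do this with scikit-learn's `PCA`; the variable names `X_pca` and `y` defined here are reused in the scatter plot below:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()
X = digits.data      # each row is a vectorized 8x8 image (64 features)
y = digits.target    # digit labels 0-9

# Fit PCA keeping all 64 components so we can inspect the explained variance
pca = PCA(n_components=64)
X_pca = pca.fit_transform(X)

# Explained variance ratio of each component and the cumulative explained variance
evr = pca.explained_variance_ratio_
plt.plot(evr, label='Explained variance ratio')
plt.plot(np.cumsum(evr), label='Cumulative explained variance')
plt.xlabel('Principal component')
plt.legend()
plt.show()
```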
Let’s plot the data in the first 2 principal component directions. We’ll use the digit labels to color each digit in the reduced space.
Code
plt.figure(figsize=(7, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab20', edgecolor='k', s=50)
plt.title('Digits in PC1 and PC2 Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Create a legend with discrete labels
legend_labels = np.unique(y)
handles = [plt.Line2D([0], [0], marker='o', color='w',
                      markerfacecolor=plt.cm.tab20(i / 9), markersize=10)
           for i in legend_labels]
plt.legend(handles, legend_labels, title="Digit Label", loc="best")
plt.show()
We observe the following in our plot of the digits in the first two principal components:
There is a decent clustering of some of our digits, in particular 0, 2, 3, 4, and 6.
The numbers 0 and 6 seem to be relatively close to each other in this space.
There is not a very clear separation of the number 5 from some of the other points.
To scale or not to scale
Consider a situation where we have age (years) and height (feet) data for 4 people.
| Person | Age [years] | Height [feet] |
|--------|-------------|---------------|
| A      | 25          | 6.232         |
| B      | 30          | 6.232         |
| C      | 25          | 5.248         |
| D      | 30          | 5.248         |
Notice that the dominant direction of the variance is aligned horizontally.
What if the height is in cm?
| Person | Age [years] | Height [cm] |
|--------|-------------|-------------|
| A      | 25          | 189.95      |
| B      | 30          | 189.95      |
| C      | 25          | 159.96      |
| D      | 30          | 159.96      |
Notice that the dominant direction of the variance is aligned vertically.
Let’s standardize our data.
| Person | Age (standardized) | Height (standardized) |
|--------|--------------------|-----------------------|
| A      | -1                 | 1                     |
| B      | 1                  | 1                     |
| C      | -1                 | -1                    |
| D      | 1                  | -1                    |
When we standardize our data, the variance is spread equally across the two features.
What quantity is represented by \(\frac{\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y}\), where \(X\) and \(Y\) are the random variables associated with sampling a person's age and height?
This is the correlation between \(X\) and \(Y\). When you standardize your data, you are no longer working with the covariance matrix but with the correlation matrix: the covariance matrix of the standardized data is the correlation matrix of the original data.
Performing PCA on a standardized dataset, i.e., working with the correlation matrix, is best practice when your features have very different scales.
PCA identifies the directions of maximum variance, so features measured on large scales can disproportionately influence the results of your PCA.
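As a small numerical check of this relationship, using the age/height values from the tables above:

```python
import numpy as np

# Age in years and height in cm for persons A, B, C, D (from the tables above)
data = np.array([[25.0, 189.95],
                 [30.0, 189.95],
                 [25.0, 159.96],
                 [30.0, 159.96]])

# Standardize each column: subtract its mean and divide by its (population) standard deviation
Z = (data - data.mean(axis=0)) / data.std(axis=0)
print(Z)  # reproduces the +/-1 entries of the standardized table above

# Covariance matrix of the standardized data equals the correlation matrix of the original data
print(np.cov(Z, rowvar=False, ddof=0))
print(np.corrcoef(data, rowvar=False))
```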
Relationship to the SVD
Recall that the SVD of a mean-centered data matrix \(X\in\mathbb{R}^{m\times n}\) (\(m>n\)) is
\[
X = U\Sigma V^T.
\]
How can we relate this to what we learned in PCA?
In PCA, we computed an eigen-decomposition of the covariance matrix, i.e., \(S=V\Lambda V^{T}\).
Consider the following computation using the SVD of \(X\):
\[
S = \frac{1}{m-1}X^{T}X = \frac{1}{m-1}V\Sigma^{T}U^{T}U\Sigma V^{T} = V\left(\frac{\Sigma^{T}\Sigma}{m-1}\right)V^{T}.
\]
Comparing with \(S = V\Lambda V^{T}\), we see that the eigenvalues \(\lambda_i\) of \(S\) are the squared singular values \(\sigma_{i}^{2}\) scaled by \(\frac{1}{m-1}\), and the eigenvectors of \(S\) are the right singular vectors of \(X\).
In practice, PCA is done by computing the SVD of your data matrix as opposed to forming the covariance matrix and computing the eigenvalues.
The principal components are the right singular vectors, i.e., the columns of \(V\). The projected data is computed as \(XV\). The eigenvalues of \(S\) are obtained by squaring the diagonal entries of \(\Sigma\) and scaling them by \(\frac{1}{m-1}\).
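A quick numerical sketch (on a small random matrix) confirming that PCA via the SVD matches PCA via the covariance eigen-decomposition; singular vectors and eigenvectors are only determined up to sign, so we compare up to sign:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, d = 100, 6, 2
X = rng.standard_normal((m, n))
X = X - X.mean(axis=0)              # mean-centered data matrix

# PCA via the covariance matrix
S = X.T @ X / (m - 1)
eigvals, V_eig = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, V_eig = eigvals[order], V_eig[:, order]

# PCA via the SVD of X
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt.T

# Eigenvalues of S are the squared singular values scaled by 1/(m-1)
assert np.allclose(eigvals, sigma**2 / (m - 1))

# Principal directions agree up to sign, and so does the projected data Y = XV
for i in range(n):
    assert np.allclose(V_eig[:, i], V_svd[:, i]) or np.allclose(V_eig[:, i], -V_svd[:, i])
print((X @ V_svd[:, :d]).shape)     # reduced data Y has shape (m, d)
```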
There are several reasons to compute PCA this way:
Numerical Stability: SVD is a numerically stable method, which means it can handle datasets with high precision and avoid issues related to floating-point arithmetic errors.
Efficiency: SVD is computationally efficient, especially for large datasets. Many optimized algorithms and libraries (like those in NumPy and scikit-learn) leverage SVD for fast computations.
Direct Computation of Principal Components: SVD directly provides the principal components (right singular vectors) and the singular values, which are related to the explained variance. This makes the process straightforward and avoids the need to compute the covariance matrix explicitly.
Memory Efficiency: For large datasets, SVD can be more memory-efficient. Techniques like truncated SVD can be used to compute only the top principal components, reducing memory usage.
Versatility: SVD is a fundamental linear algebra technique used in various applications beyond PCA, such as signal processing, image compression, and solving linear systems. This versatility makes it a well-studied and widely implemented method.
Pros and Cons
Here are some advantages and disadvantages of PCA.
Advantages
Allows for visualizations
Removes redundant variables
Prevents overfitting
Speeds up other ML algorithms
Disadvantages
Reduces interpretability
Can result in information loss
Can be less effective than non-linear methods
t-SNE
What is t-SNE?
t-SNE stands for t-Distributed Stochastic Neighbor Embedding.
It is a non-linear dimensionality reduction technique.
Primarily used for visualizing high-dimensional data in 2 or 3 dimensions.
Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, see Van der Maaten and Hinton (2008).
How t-SNE Works
t-SNE converts high-dimensional Euclidean distances into conditional probabilities.
It aims to preserve the local structure of the data.
Uses a heavy-tailed Student-t distribution in the low-dimensional space to prevent crowding. This is where the \(t\) in \(t\)-SNE comes from.
We will describe how this process works.
Step 1: Compute Pairwise Similarities
Calculate pairwise similarities between points in the high-dimensional space.
Use a Gaussian distribution to convert distances into probabilities.
The probability \(p_{j\vert i}\) between points \(i\) and \(j\) is given by: \[
p_{j\vert i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}.
\]
The \(p_{j\vert i}\) value represents the conditional probability that point \(x_i\) would choose \(x_j\) as its neighbor.
Note that we set \(p_{i\vert i} = 0\) – we don’t consider a point to be its own neighbor.
t-SNE defines the joint probability \(p_{ij} = \frac{p_{j\vert i} + p_{i\vert j}}{2m}\), where \(m\) is the number of data points. The symmetrization is needed because the conditional probabilities \(p_{j\vert i}\) and \(p_{i\vert j}\) are in general different. Using this value of \(p_{ij}\) is called symmetric t-SNE.
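Below is a small sketch of Step 1 for fixed bandwidths \(\sigma_i\) (choosing each \(\sigma_i\) from the perplexity is sketched in the next section); the function name `conditional_probabilities` is purely illustrative:

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Matrix P with entries p_{j|i} computed from pairwise distances and per-point sigmas."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)            # a point is not its own neighbor: p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)   # normalize each row so that sum_j p_{j|i} = 1
    return P

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))         # 5 points in 3 dimensions
P_cond = conditional_probabilities(X, sigmas=np.ones(5))

# Symmetric t-SNE joint probabilities
m = X.shape[0]
P = (P_cond + P_cond.T) / (2 * m)
print(P.sum())                           # joint probabilities sum to 1
```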
Visualizing the Pairwise Similarities
The Gaussian distribution is centered at the point \(x_i\), and points \(x_j\) that are farther away have a lower probability of being chosen as a neighbor.
Code
mean = 0
std_dev = 1
x = np.linspace(-5, 5, 1000)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev)**2)

# Points on the x-axis
points_x = [0, 1, 4, 4.5]
points_y = [0, 0, 0, 0]
colors = ['#009E73', '#009E73', '#CC79A7', '#CC79A7']

# Create the plot
plt.plot(x, y, label='Gaussian Distribution')
plt.scatter(points_x, points_y, color=colors, zorder=5)

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Probability Density')
plt.title('Gaussian Distribution with Individual Points')
plt.legend()

# Show the plot
plt.show()
Choosing \(\sigma\)
How do we determine the variance for the Gaussian?
The variance is determined by a hyperparameter called the perplexity.
As a result, the perplexity controls the effective number of neighbors.
A high perplexity means more neighbors are considered for each point; a low perplexity means fewer neighbors are considered.
The perplexity is defined as \(\operatorname{Perp}(P_i) = 2^{H(P_i)}\), where \(H(P_i)\) is the entropy of \(P_i\). \(P_i\) is the probability distribution induced by \(\sigma_i\).
The entropy \(H(P_i) = -\sum_j p_{j\vert i} \log_2{(p_{j\vert i})}\).
The value of \(\sigma_i\) is computed (typically by binary search) so that the perplexity of the induced distribution \(P_i\) equals the chosen perplexity value.
The perplexity commonly ranges between 5 and 50.
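A sketch of how \(\sigma_i\) can be found by binary search for a single point, using the entropy and perplexity definitions above (the helper names `perplexity_for_sigma` and `find_sigma` are illustrative, not library functions):

```python
import numpy as np

def perplexity_for_sigma(sq_dists_i, sigma):
    """Perplexity 2^{H(P_i)} of the conditional distribution induced by a given sigma."""
    p = np.exp(-sq_dists_i / (2.0 * sigma**2))
    p /= p.sum()
    H = -np.sum(p * np.log2(p + 1e-12))   # entropy H(P_i) in bits
    return 2.0 ** H

def find_sigma(sq_dists_i, target_perplexity, tol=1e-4, max_iter=100):
    """Binary search for sigma so that the induced perplexity matches the target."""
    lo, hi = 1e-10, 1e4
    sigma = 0.5 * (lo + hi)
    for _ in range(max_iter):
        perp = perplexity_for_sigma(sq_dists_i, sigma)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:      # too many effective neighbors: shrink sigma
            hi = sigma
        else:
            lo = sigma
        sigma = 0.5 * (lo + hi)
    return sigma

# Squared distances from one point x_0 to all other points
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)
print(find_sigma(sq_dists_0, target_perplexity=30))
```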
Step 2: Define Low-Dimensional Map
Initialize points randomly in the low-dimensional space.
Define a similar probability distribution using a Student-t distribution: \[
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}.
\]
The heavy tails of the Student-t distribution help to spread out the points and prevent crowding of the points in the lower dimensional space.
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

# Generate data for the Gaussian distribution
mean = 0
std_dev = 1
x = np.linspace(-5, 5, 1000)
gaussian_y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev)**2)

# Generate data for the Student's t-distribution with 1 degree of freedom
df = 1
t_y = t.pdf(x, df)

# Create the plot
plt.plot(x, gaussian_y, label='Gaussian Distribution')
plt.plot(x, t_y, label="Student's t-Distribution", linestyle='dashed')

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Probability Density')
plt.title("Gaussian Distribution and Student's t-Distribution")
plt.legend()

# Show the plot
plt.show()
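A matching sketch of the low-dimensional joint probabilities \(q_{ij}\) for a set of 2-D map points (the function name `joint_q` is illustrative):

```python
import numpy as np

def joint_q(Y):
    """Student-t (one degree of freedom) joint probabilities q_{ij} for low-dimensional points Y."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)        # exclude i = j from the normalization
    return inv / inv.sum()

rng = np.random.default_rng(0)
Y = 1e-4 * rng.standard_normal((5, 2))   # random initialization of the 2-D map points
Q = joint_q(Y)
print(Q.sum())                            # q_{ij} sum to 1 over all pairs
```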
Step 3: Minimize Kullback-Leibler Divergence
Minimize the Kullback-Leibler (KL) divergence between the high-dimensional and low-dimensional distributions: \[
KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\]
The KL divergence measures how different the distribution \(P\) is from the distribution \(Q\), i.e., how different the high-dimensional neighbor distributions \(P_i\) are from the low-dimensional distributions \(Q_i\).
By minimizing this function, we ensure that the lower dimensional point clusters are similar to the higher dimensional clusters.
To minimize the KL divergence we use gradient descent to iteratively adjust the positions of points in the low-dimensional space.
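A simplified end-to-end sketch of this optimization (it fixes \(\sigma = 1\) instead of calibrating each \(\sigma_i\) from the perplexity, and uses plain gradient descent without the momentum and early exaggeration of the full algorithm, so it is illustrative rather than a faithful t-SNE implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))           # toy high-dimensional data
m = X.shape[0]

# High-dimensional joint probabilities P (fixed sigma = 1 for brevity)
sq_d = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
P_cond = np.exp(-sq_d / 2.0)
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)
P = (P_cond + P_cond.T) / (2 * m)

# Gradient descent on the low-dimensional map points Y
Y = 1e-4 * rng.standard_normal((m, 2))
lr = 10.0
for _ in range(500):
    diff = Y[:, None] - Y[None, :]                   # y_i - y_j
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))   # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # Gradient of KL(P || Q): 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    grad = 4.0 * np.sum((P - Q)[:, :, None] * diff * inv[:, :, None], axis=1)
    Y -= lr * grad

# Final KL divergence between P and Q
kl = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)))
print(kl)
```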
Pros of t-SNE
Excellent for visualizing complex datasets: Reveals clusters and patterns that are not visible in high-dimensional space.
Captures non-linear relationships: Preserves local structure and relationships between points.
Preserves local structure effectively: Ensures that similar points in high-dimensional space remain close in the low-dimensional representation.
Cons of t-SNE
Computationally intensive: Requires significant computational resources, especially for large datasets.
Results can vary depending on hyperparameters: Parameters like perplexity and learning rate can significantly affect the outcome.
Not suitable for all types of data analysis: Primarily used for visualization, not for tasks like feature extraction or predictive modeling.
Case Study: Digits
Let’s consider how t-SNE performs at clustering the handwritten digits dataset.
Code
from sklearn.manifold import TSNE
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.figure(figsize=(7, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab20', edgecolor='k', s=50)
plt.title('Digits in t-SNE Space')
plt.xlabel('tSNE Component 1')
plt.ylabel('tSNE Component 2')

# Create a legend with discrete labels
legend_labels = np.unique(y)
handles = [plt.Line2D([0], [0], marker='o', color='w',
                      markerfacecolor=plt.cm.tab20(i / 9), markersize=10)
           for i in legend_labels]
plt.legend(handles, legend_labels, title="Digit Label", loc="best")
plt.show()
t-SNE vs PCA: Dimensionality Reduction
PCA: Linear method, reduces dimensions by maximizing variance along principal components.
t-SNE: Non-linear method, focuses on preserving local structure and relationships between points.
t-SNE vs PCA: Visualization
PCA: Good for understanding global structure and variance in the data.
t-SNE: Superior for visualizing clusters, local relationships, and non-linear structures.
t-SNE vs PCA: Computational Complexity
PCA: Faster and less computationally intensive, suitable for large datasets.
t-SNE: Slower, requires more computational resources, but provides more detailed visualizations.
Recap
Today we discussed two dimensionality reduction techniques: PCA and t-SNE.
We applied both to the handwritten digits dataset.
In summary
PCA is a faster, linear dimensionality reduction technique.
PCA captures the maximum amount of variance in each principal direction.
Computing PCA via the SVD requires no hyperparameter tuning (beyond choosing the reduced dimension \(d\)).
t-SNE is a powerful tool for reducing and visualizing high-dimensional data.
Choose t-SNE for detailed visualizations of local structures and clusters.
t-SNE is sensitive to changes in its parameters.
Both methods have their own strengths and are chosen based on the specific needs of the analysis.
Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R. Boyko, Adam Auton, Amit Indap, et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456 (7218): 98–101.
Strang, Gilbert. 2016. Introduction to Linear Algebra. 5th ed. Wellesley, MA: Wellesley-Cambridge Press.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).