We now consider applications of the Singular Value Decomposition (SVD).
SVD is “the Swiss Army Knife of Numerical Linear Algebra.”
Dianne O’Leary, MMDS ’06 (Workshop on Algorithms for Modern Massive Data Sets)
We will see how the SVD is used for
low rank approximations
dimensionality reduction
Singular Vectors and Values
For \(A\in\mathbb{R}^{m\times n}\) with \(m>n\) and rank \(k\), there exists a set of orthonormal vectors \(\{\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n\}\) and a set of orthonormal vectors \(\{\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_m\}\) such that
\[
A\mathbf{v}_i = \sigma_i \mathbf{u}_i, \qquad i = 1, \dots, n,
\]
where \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k > 0\) and \(\sigma_{k+1} = \dots = \sigma_n = 0\).
The SVD of a matrix \(A\in\mathbb{R}^{m\times n}\) (where \(m>n\)) is
\[
A = U\Sigma V^{T},
\]
where
\(U\) has dimension \(m\times n\). The columns of \(U\) are orthonormal. The columns of \(U\) are the left singular vectors.
\(\Sigma\) has dimension \(n\times n\). It is a diagonal matrix whose diagonal entries are the nonnegative real numbers \(\sigma_1\geq \sigma_2 \geq \ldots \geq \sigma_k > 0\) and \(\sigma_{k+1} = \ldots = \sigma_n = 0\). These are called the singular values of \(A\).
\(V\) has dimension \(n \times n\). The columns of \(V\) are orthonormal. The columns of \(V\) are the right singular vectors.
The SVD decomposes \(A\) into a linear combination of rank-1 matrices.
Each singular value \(\sigma_i\) tells us the weight (contribution) of the corresponding rank-1 matrix to the matrix \(A\).
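Concretely, in terms of the singular values and vectors:
\[
A = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T.
\]
As a quick check, here is a minimal sketch (on a small random matrix, purely for illustration) that rebuilds \(A\) from its rank-1 terms:
Code
import numpy as np

# small random matrix, purely for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

u, s, vt = np.linalg.svd(A, full_matrices=False)

# sum of rank-1 matrices sigma_i * u_i * v_i^T
A_rebuilt = sum(s[i] * np.outer(u[:, i], vt[i, :]) for i in range(len(s)))

print(np.allclose(A, A_rebuilt))  # True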
Topics Covered
In this lecture we first discuss:
Theoretical properties of the SVD related to
matrix rank
determining the best low rank approximations to a matrix
We will then apply these results when we consider data matrices from the following applications
internet traffic data
social media data
image data
movie data
SVD Properties
Matrix Rank
Let’s review some definitions.
Let \(A\in\mathbb{R}^{m\times n}\) be a real matrix with \(m>n\).
The rank of \(A\) is the number of linearly independent rows or columns of the matrix.
The largest value that a matrix rank can take is \(\min(m,n)\). Since we assumed \(m>n\), the largest value of the rank is \(n\).
If the matrix \(A\) has rank equal to \(n\), then we say it is full rank.
However, it can happen that the rank of a matrix is less than \(\min(m,n)\). In this case we say that \(A\) is rank-deficient.
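For illustration, here is a minimal sketch (with a made-up matrix, not data from this lecture) of a rank-deficient matrix: its third column is the sum of the first two, so its rank is 2 rather than \(\min(m,n) = 3\).
Code
import numpy as np

# 4 x 3 matrix whose third column equals the sum of the first two
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 3., 5.],
              [4., 1., 5.]])

print(np.linalg.matrix_rank(A))  # 2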
Recall that the dimension of a vector space is the smallest number of linearly independent vectors needed to span the space.
The dimension of the column space of \(A\) is the smallest number of vectors needed to construct every column of \(A\) as a linear combination.
If the dimension of the column space is \(k\), then there exists a set of vectors \(\{\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_k\}\) such that every column \(\mathbf{a}_i\) of \(A\) can be expressed as:
\[
\mathbf{a}_i = \sum_{j=1}^{k} \beta_{ij}\,\mathbf{c}_j \qquad \text{for some scalars } \beta_{ij}.
\]
This \(k\) is exactly the rank of \(A\).
A key result (the Eckart–Young theorem) is that the best rank-\(k\) approximation \(A^{(k)}\) of \(A\) is obtained by keeping only the \(k\) largest singular values (and their singular vectors) in the SVD, and that its distance (in Frobenius norm) from \(A\) is equal to \(\sqrt{\sum_{i=k+1}^n\sigma^2_i}\).
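As a quick numerical check, here is a minimal sketch (using a random matrix purely for illustration) that forms the rank-\(k\) truncation from the SVD and confirms that its Frobenius distance from \(A\) matches the tail singular values:
Code
import numpy as np

# random matrix, purely for illustration
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 5
# best rank-k approximation: keep only the top k singular triplets
Ak = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# these two quantities agree
print(np.linalg.norm(A - Ak, 'fro'))
print(np.sqrt(np.sum(s[k:]**2)))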
Low Rank Approximations in Practice
Models are simplifications
One way of thinking about modeling or clustering is that we are building a simplification of the data.
That is, a model is a description of the data that is simpler than the data itself.
In particular, instead of thinking of the data as thousands or millions of individual data points, we might think of it in terms of a small number of clusters, or a parametric distribution, etc.
From this simpler description, we hope to gain insight.
There is an interesting question here: why does this process often lead to insight?
That is, why does it happen so often that a large dataset can be described in terms of a much simpler model?
Among competing hypotheses, the one with the fewest assumptions should be selected.
This has come to be known as “Occam’s razor.”
Occam’s Razor
William of Ockham was saying that it is more common for a set of observations to be determined by a simple process than by a complex process.
In other words, the world is full of simple (but often hidden) patterns.
From this, one can justify the observation that modeling works surprisingly often.
Low Effective Rank of Data Matrices
In general, a data matrix \(A\in\mathbb{R}^{m\times n}\) is usually full rank, meaning that \(\operatorname{Rank}(A)\equiv p = \min(m, n)\).
However, it is possible to encounter data matrices that have low effective rank.
This means that we can approximate \(A\) by some \(A^{(k)}\) for which \(k \ll p\).
For any data matrix, we can judge when this is the case by looking at its singular values, because the singular values tell us the distance to the nearest rank-\(k\) matrix.
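A minimal sketch of such a check follows (the helper name is an illustrative assumption, not part of the lecture): it computes the relative Frobenius-norm error of the best rank-\(k\) approximation directly from the singular values.
Code
import numpy as np

def relative_rank_k_error(A, k):
    """Relative Frobenius-norm error of the best rank-k approximation of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum(s[k:]**2)) / np.sqrt(np.sum(s**2))

# If this is small for some k much less than min(A.shape),
# then A has low effective rank.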
Traffic Data
Let’s see how this theory can be used in practice and investigate some real data.
We’ll look at data traffic on the Abilene network:
Source: Internet2, circa 2005
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

with open('data/net-traffic/AbileneFlows/odnames', 'r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq='10min', periods=1008)
Atraf = pd.read_table('data/net-traffic/AbileneFlows/X', sep=' ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
[Output: a DataFrame with 1008 rows × 121 columns. Rows are timestamps at 10-minute intervals from 2003-09-01 00:00 to 2003-09-07 23:50; columns are origin-destination flows (ATLA-ATLA, ATLA-CHIN, ..., WASH-STTL, WASH-WASH); entries are traffic volumes.]
Atraf.shape
(1008, 121)
As we would expect, our traffic matrix has rank 121:
np.linalg.matrix_rank(Atraf)
np.int64(121)
However – perhaps it has low effective rank.
The numpy routine for computing the SVD is np.linalg.svd:
u, s, vt = np.linalg.svd(Atraf)
Now let’s look at the singular values of Atraf to see if it can be usefully approximated as a low-rank matrix:
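A minimal sketch for plotting the singular values s computed in the cell above might look like this:
Code
import matplotlib.pyplot as plt

plt.plot(range(1, len(s) + 1), s, '.-')
plt.xlabel('Index $i$')
plt.ylabel(r'$\sigma_i$')
plt.title('Singular Values of Atraf')
plt.show()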
Recall that the SVD expresses \(A\) as a weighted sum of rank-1 matrices \(\sigma_i \mathbf{u}_i \mathbf{v}_i^T\). In this view, \(\mathbf{u}_1\) (the first column of \(U\)) is the “strongest” pattern occurring in \(A\), and its strength is measured by \(\sigma_1\).
Here is a view of the first 2 columns of \(U\Sigma\) for the traffic matrix data.
These are the strongest patterns occurring across all of the 121 traces.
Code
u, s, vt = np.linalg.svd(Atraf, full_matrices=False)
uframe = pd.DataFrame(u @ np.diag(s), index=pd.date_range('9/1/2003', freq='10min', periods=1008))
uframe[0].plot(color='r', label='Column 1')
uframe[1].plot(label='Column 2')
plt.legend(loc='best')
plt.title(r'First Two Columns of $U\Sigma$')
plt.show()
Low Rank Defines Latent Factors
The next interpretation of low-rank behavior is that it exposes “latent factors” that describe the data.
In this interpretation, we think of each element of \(A^{(k)}=U'\Sigma'(V')^T\) as the inner product of a row of \(U'\Sigma'\) and a column of \((V')^{T}\) (equivalently, a row of \(V'\)). Here \(U'\) and \(V'\) consist of the first \(k\) columns of \(U\) and \(V\), and \(\Sigma'\) is the \(k\times k\) diagonal matrix of the \(k\) largest singular values.
Let’s say we are working with a matrix of users and items.
In particular, let the items be movies and the matrix entries be ratings, as in the Netflix Prize:
the rows of \(A\) correspond to users and the columns to movies, so entry \(A_{ij}\) is the rating user \(i\) gives to movie \(j\).
Then the rating that a user gives a movie is (approximately) the inner product of a \(k\)-element vector that corresponds to the user and a \(k\)-element vector that corresponds to the movie.
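Here is a minimal sketch of this idea, using a small made-up ratings matrix (purely illustrative; not the Netflix data):
Code
import numpy as np

# made-up ratings: 4 users x 5 movies
A = np.array([[5., 4., 1., 1., 2.],
              [4., 5., 1., 2., 1.],
              [1., 1., 5., 4., 5.],
              [2., 1., 4., 5., 4.]])

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 2
user_factors = u[:, :k] @ np.diag(s[:k])   # row i: k-element vector for user i
movie_factors = vt[:k, :]                  # column j: k-element vector for movie j

# approximate rating of user 0 for movie 3: inner product of the two factor vectors
print(user_factors[0] @ movie_factors[:, 3])
print(A[0, 3])  # compare to the actual rating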
Social Media Activity
Here, the matrices capture user activity on a social media site over time.
Source: [Viswanath et al., Usenix Security, 2014]