We now consider applications of the Singular Value Decomposition (SVD).
SVD is “the Swiss Army Knife of Numerical Linear Algebra.”
Dianne O’Leary, MMDS ’06 (Workshop on Algorithms for Modern Massive Data Sets)
We will see how the SVD is used for
low rank approximations
dimensionality reduction
Singular Vectors and Values
For \(A\in\mathbb{R}^{m\times n}\) with \(m>n\) and rank \(k\), there exists a set of orthonormal vectors \(\{\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n\}\) and a set of orthonormal vectors \(\{\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_m\}\) such that
\[
A\mathbf{v}_i = \sigma_i \mathbf{u}_i, \qquad i = 1, \dots, n,
\]
where \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k > 0\) and \(\sigma_{k+1} = \dots = \sigma_n = 0\).
The SVD of a matrix \(A\in\mathbb{R}^{m\times n}\) (where \(m>n\)) is
\[
A = U\Sigma V^{T},
\]
where
\(U\) has dimension \(m\times n\). The columns of \(U\) are orthonormal. The columns of \(U\) are the left singular vectors.
\(\Sigma\) has dimension \(n\times n\). It is a diagonal matrix whose diagonal entries are the nonnegative real numbers \(\sigma_1\geq \sigma_2 \geq \ldots \geq \sigma_k > 0\) and \(\sigma_{k+1} = \ldots = \sigma_n = 0\). These are called the singular values of \(A\).
\(V\) has dimension \(n \times n\). The columns of \(V\) are orthonormal. The columns of \(V\) are the right singular vectors.
The SVD decomposes \(A\) into a linear combination of rank-1 matrices.
Each singular value \(\sigma_i\) tells us the weight (contribution) of the corresponding rank-1 matrix to the matrix \(A\).
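Concretely, in terms of the singular values and vectors:
\[
A = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T.
\]
As a quick check, here is a minimal sketch (on a small random matrix, purely for illustration) that rebuilds \(A\) from its rank-1 terms:
Code
import numpy as np

# small random matrix, purely for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

u, s, vt = np.linalg.svd(A, full_matrices=False)

# sum of rank-1 matrices sigma_i * u_i * v_i^T
A_rebuilt = sum(s[i] * np.outer(u[:, i], vt[i, :]) for i in range(len(s)))

print(np.allclose(A, A_rebuilt))  # True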
Topics Covered
In this lecture we first discuss:
Theoretical properties of the SVD related to
matrix rank
determining the best low rank approximations to a matrix
We will then apply these results when we consider data matrices from the following applications
internet traffic data
social media data
image data
movie data
SVD Properties
Matrix Rank
Let’s review some definitions.
Let \(A\in\mathbb{R}^{m\times n}\) be a real matrix with \(m>n\).
The rank of \(A\) is the number of linearly independent rows or columns of the matrix.
The largest value that a matrix rank can take is \(\min(m,n)\). Since we assumed \(m>n\), the largest value of the rank is \(n\).
If the matrix \(A\) has rank equal to \(n\), then we say it is full rank.
However, it can happen that the rank of a matrix is less than \(\min(m,n)\). In this case we say that \(A\) is rank-deficient.
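For illustration, here is a minimal sketch (with a made-up matrix, not data from this lecture) of a rank-deficient matrix: its third column is the sum of the first two, so its rank is 2 rather than \(\min(m,n) = 3\).
Code
import numpy as np

# 4 x 3 matrix whose third column equals the sum of the first two
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 3., 5.],
              [4., 1., 5.]])

print(np.linalg.matrix_rank(A))  # 2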
Recall that the dimension of a vector space is the smallest number of linearly independent vectors needed to span the space.
The dimension of the column space of \(A\) is the smallest number of vectors needed to construct every column of \(A\) as a linear combination.
If the dimension of the column space is \(k\), then there exists a set of vectors \(\{\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_k\}\) such that every column \(\mathbf{a}_i\) of \(A\) can be expressed as:
\[
\mathbf{a}_i = \sum_{j=1}^{k} \beta_{ij}\,\mathbf{c}_j \qquad \text{for some scalars } \beta_{ij}.
\]
This \(k\) is exactly the rank of \(A\).
A key result (the Eckart–Young theorem) is that the best rank-\(k\) approximation \(A^{(k)}\) of \(A\) is obtained by keeping only the \(k\) largest singular values (and their singular vectors) in the SVD, and that its distance (in Frobenius norm) from \(A\) is equal to \(\sqrt{\sum_{i=k+1}^n\sigma^2_i}\).
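As a quick numerical check, here is a minimal sketch (using a random matrix purely for illustration) that forms the rank-\(k\) truncation from the SVD and confirms that its Frobenius distance from \(A\) matches the tail singular values:
Code
import numpy as np

# random matrix, purely for illustration
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 5
# best rank-k approximation: keep only the top k singular triplets
Ak = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# these two quantities agree
print(np.linalg.norm(A - Ak, 'fro'))
print(np.sqrt(np.sum(s[k:]**2)))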
Low Rank Approximations in Practice
Models are simplifications
One way of thinking about modeling or clustering is that we are building a simplification of the data.
That is, a model is a description of the data that is simpler than the data itself.
In particular, instead of thinking of the data as thousands or millions of individual data points, we might think of it in terms of a small number of clusters, or a parametric distribution, etc.
From this simpler description, we hope to gain insight.
There is an interesting question here: why does this process often lead to insight?
That is, why does it happen so often that a large dataset can be described in terms of a much simpler model?
Among competing hypotheses, the one with the fewest assumptions should be selected.
This has come to be known as “Occam’s razor.”
Occam’s Razor
William of Ockham was saying that it is more common for a set of observations to be determined by a simple process than by a complex process.
In other words, the world is full of simple (but often hidden) patterns.
From this, one can justify the observation that modeling works surprisingly often.
Low Effective Rank of Data Matrices
In general, a data matrix \(A\in\mathbb{R}^{m\times n}\) is usually full rank, meaning that \(\operatorname{Rank}(A)\equiv p = \min(m, n)\).
However, it is possible to encounter data matrices that have low effective rank.
This means that we can approximate \(A\) by some \(A^{(k)}\) for which \(k \ll p\).
For any data matrix, we can judge when this is the case by looking at its singular values, because the singular values tell us the distance to the nearest rank-\(k\) matrix.
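A minimal sketch of such a check follows (the helper name is an illustrative assumption, not part of the lecture): it computes the relative Frobenius-norm error of the best rank-\(k\) approximation directly from the singular values.
Code
import numpy as np

def relative_rank_k_error(A, k):
    """Relative Frobenius-norm error of the best rank-k approximation of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum(s[k:]**2)) / np.sqrt(np.sum(s**2))

# If this is small for some k much less than min(A.shape),
# then A has low effective rank.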
Traffic Data
Let’s see how this theory can be used in practice and investigate some real data.
We’ll look at data traffic on the Abilene network:
Source: Internet2, circa 2005
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

with open('data/net-traffic/AbileneFlows/odnames', 'r') as f:
    odnames = [line.strip() for line in f]
dates = pd.date_range('9/1/2003', freq='10min', periods=1008)
Atraf = pd.read_table('data/net-traffic/AbileneFlows/X', sep=' ', header=None, names=odnames, engine='python')
Atraf.index = dates
Atraf
[Output: a DataFrame with 1008 rows × 121 columns. Rows are timestamps at 10-minute intervals from 2003-09-01 00:00 to 2003-09-07 23:50; columns are origin-destination flows (ATLA-ATLA, ATLA-CHIN, ..., WASH-STTL, WASH-WASH); entries are traffic volumes.]
Atraf.shape
(1008, 121)
As we would expect, our traffic matrix has rank 121:
np.linalg.matrix_rank(Atraf)
np.int64(121)
However – perhaps it has low effective rank.
The numpy routine for computing the SVD is np.linalg.svd:
u, s, vt = np.linalg.svd(Atraf)
Now let’s look at the singular values of Atraf to see if it can be usefully approximated as a low-rank matrix:
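A minimal sketch for plotting the singular values s computed in the cell above might look like this:
Code
import matplotlib.pyplot as plt

plt.plot(range(1, len(s) + 1), s, '.-')
plt.xlabel('Index $i$')
plt.ylabel(r'$\sigma_i$')
plt.title('Singular Values of Atraf')
plt.show()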
Recall that the SVD expresses \(A\) as a weighted sum of rank-1 matrices \(\sigma_i \mathbf{u}_i \mathbf{v}_i^T\). In this view, \(\mathbf{u}_1\) (the first column of \(U\)) is the “strongest” pattern occurring in \(A\), and its strength is measured by \(\sigma_1\).
Here is a view of the first 2 columns of \(U\Sigma\) for the traffic matrix data.
These are the strongest patterns occurring across all of the 121 traces.
Code
u, s, vt = np.linalg.svd(Atraf, full_matrices=False)
uframe = pd.DataFrame(u @ np.diag(s), index=pd.date_range('9/1/2003', freq='10min', periods=1008))
uframe[0].plot(color='r', label='Column 1')
uframe[1].plot(label='Column 2')
plt.legend(loc='best')
plt.title(r'First Two Columns of $U\Sigma$')
plt.show()
Low Rank Defines Latent Factors
The next interpretation of low-rank behavior is that it exposes “latent factors” that describe the data.
In this interpretation, we think of each element of \(A^{(k)}=U'\Sigma'(V')^T\) as the inner product of a row of \(U'\Sigma'\) and a column of \((V')^{T}\) (equivalently, a row of \(V'\)). Here \(U'\) and \(V'\) consist of the first \(k\) columns of \(U\) and \(V\), and \(\Sigma'\) is the \(k\times k\) diagonal matrix of the \(k\) largest singular values.
Let’s say we are working with a matrix of users and items.
In particular, let the items be movies and the matrix entries be ratings, as in the Netflix Prize:
the rows of \(A\) correspond to users and the columns to movies, so entry \(A_{ij}\) is the rating user \(i\) gives to movie \(j\).
Then the rating that a user gives a movie is (approximately) the inner product of a \(k\)-element vector that corresponds to the user and a \(k\)-element vector that corresponds to the movie.
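Here is a minimal sketch of this idea, using a small made-up ratings matrix (purely illustrative; not the Netflix data):
Code
import numpy as np

# made-up ratings: 4 users x 5 movies
A = np.array([[5., 4., 1., 1., 2.],
              [4., 5., 1., 2., 1.],
              [1., 1., 5., 4., 5.],
              [2., 1., 4., 5., 4.]])

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 2
user_factors = u[:, :k] @ np.diag(s[:k])   # row i: k-element vector for user i
movie_factors = vt[:k, :]                  # column j: k-element vector for movie j

# approximate rating of user 0 for movie 3: inner product of the two factor vectors
print(user_factors[0] @ movie_factors[:, 3])
print(A[0, 3])  # compare to the actual rating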
Social Media Activity
Here, the matrices capture user activity on a social media site over time.
Source: [Viswanath et al., Usenix Security, 2014]