Linear Algebra Refresher

Introduction

Introduction to linear algebra

Linear algebra is the branch of mathematics concerned with vectors and matrices and, in particular, with how vectors are transformed. Knowledge of linear algebra is essential in data science.

Linear algebra

Linear algebra allows us to understand the operations and transformations used to manipulate and extract information from data.

Examples

  • Deep neural networks use matrix-vector and matrix-matrix multiplication.
  • Natural language processing uses the dot product to determine word similarity.
  • Least squares uses matrix inverses and matrix factorizations to compute models for predicting continuous values.
  • PCA (dimensionality reduction) uses the matrix factorization called the Singular Value Decomposition (SVD).
  • Graphs are described by adjacency matrices. Eigenvectors and eigenvalues of this matrix provide information about the graph structure.

Linear algebra is used to implement data science algorithms efficiently and accurately.

You will not have to program linear algebra algorithms. You will use appropriate Python packages.

This lecture is a review of some aspects of linear algebra that are important for data science. Given the prerequisites for this course, I assume that you previously learned this material.

The goal of this lecture is to refresh the following topics:

  • vectors,
  • matrices,
  • operations with vectors and matrices,
  • eigenvectors and eigenvalues,
  • linear systems and least squares,
  • matrix factorizations.

Below is a list of very useful resources for learning about linear algebra:

  • Linear Algebra and Its Applications (6th edition), David C. Lay, Judi J. McDonald, and Steven R. Lay, Pearson, 2021,
  • Introduction to Linear Algebra (6th edition), Gilbert Strang, Wellesley-Cambridge Press, 2023,
  • Linear Algebra and Learning from Data, Gilbert Strang, Wellesley-Cambridge Press, 2019,
  • Numerical Linear Algebra, Lloyd N. Trefethen and David Bau, SIAM, 1997.

Vectors and Vector Operations

Vectors

A vector of length \(n\), \(\mathbf{x}\in\mathbb{R}^{n}\), is a 1-dimensional (1-D) array of real numbers

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \]

When discussing vectors we will only consider column vectors. A row vector can always be obtained from a column vector via transposition

\[ \mathbf{x}^{T} = [x_1, x_2, \ldots, x_n]. \]
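As a quick illustration (a minimal sketch, assuming NumPy is available), a vector can be represented as a 1-D NumPy array, and an explicit column vector as a 2-D array with a single column:

import numpy as np

# A 1-D array is the usual representation of a vector.
x = np.array([2.0, 2.0])

# An explicit column vector is a 2-D array with one column;
# .T then gives the corresponding row vector (the transpose).
x_col = x.reshape(-1, 1)
print(x_col)
print(x_col.T)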

Geometric interpretation of vectors

  • Vectors in \(\mathbb{R}^{2}\) can be visualized as points in a 2-D plane (or arrows originating at the origin),
  • Vectors in \(\mathbb{R}^{3}\) can be visualized as points in a 3-D space (or arrows originating at the origin).

Let

\[ \mathbf{x}=\begin{bmatrix} 2 \\ 2 \end{bmatrix},~ \mathbf{y} = \begin{bmatrix} 3 \\ -1 \end{bmatrix},~ \mathbf{z} = \begin{bmatrix} -2 \\ -1 \end{bmatrix}. \]

These vectors are illustrated in Figure 1.

Code
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.gca()
x = np.array([2, 2])
y = np.array([3, -1])
z = np.array([-2, -1])
V = np.array([x, y, z])
origin = np.array([[0, 0, 0], [0, 0, 0]])
plt.quiver(*origin, V[:, 0], V[:, 1], 
           color=['r', 'b', 'g'], 
           angles='xy', 
           scale_units='xy', 
           scale=1)
ax.set_xlim([-6, 6])
ax.set_ylim([-2, 4])
ax.text(3.3, -1.1, '$(3,-1)$', size=16)
ax.text(2.3, 1.9, '$(2,2)$', size=16)
ax.text(-3.7, -1.3, '$(-2,-1)$', size=16)
ax.grid()
plt.show()
Figure 1: Illustration of vectors

Vector Operations

Scalar multiplication: Let \(c\in\mathbb{R}\), \(\mathbf{x}\in\mathbb{R}^{n}\), then \[ c\mathbf{x} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{bmatrix}. \]


Multiplication by a scalar \(c\in\mathbb{R}\) changes the length, and possibly the direction, of the vector:

  • For \(c>1\) the vector is lengthened.
  • For \(0<c<1\) the vector shrinks.
  • For \(c<0\) the direction of the vector is flipped 180 degrees.

Figure 2 shows the vector \(\mathbf{x} = [2, 2]\) multiplied by the scalar value \(c=2\).

Code
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.gca()
x = np.array([2, 2])
y = np.array([4, 4])
V = np.array([x, y])
origin = np.array([[0, 0], [0, 0]])
plt.quiver(*origin, V[:, 0], V[:, 1], 
           color=['r', 'b'], 
           angles='xy', 
           scale_units='xy', 
           scale=1,
           alpha= 0.5)
ax.set_xlim([-5, 5])
ax.set_ylim([-1, 5])
ax.text(2.3, 1.9, '$x$', size=16)
ax.text(4.3, 3.9, '$cx$', size=16)
ax.grid()
plt.show()
Figure 2: Scalar multiplication of a vector

Vector addition: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\) then \[ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}. \]


We plot the sum of \(\mathbf{u} = [1, 2]\) and \(\mathbf{v} = [4, 1]\) in Figure 3. The sum \(\mathbf{u} + \mathbf{v} = [5, 3]\) is obtained geometrically by placing the tail of one vector at the tip of the other.

Code
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.gca()
u = np.array([1, 2])
v = np.array([4, 1])
w = np.array([5, 3])
V = np.array([u, v, w])
origin = np.array([[0, 0, 0], [0, 0, 0]])
plt.quiver(*origin, V[:, 0], V[:, 1], 
           color=['b', 'b', 'r'], 
           angles='xy', 
           scale_units='xy', 
           scale=1)
ax.set_xlim([-1, 6])
ax.set_ylim([-1, 4])
ax.text(1.3, 1.9, '$u$', size=16)
ax.text(4.3, 1.2, '$v$', size=16)
ax.text(5.3, 2.9, '$u+v$', size=16)
plt.plot([1, 5], [2, 3], 'g--')
plt.plot([4, 5], [1, 3], 'g--')
ax.grid()
plt.show()
Figure 3: Vector addition

Dot product: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\), then the dot product is defined as \[ \mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^n u_i v_i. \]

Vector 2-norm: The 2-norm of a vector \(\mathbf{v}\in\mathbb{R}^{n}\) is defined as \[ \Vert \mathbf{v}\Vert_2 = \sqrt{\mathbf{v}\cdot\mathbf{v}} = \sqrt{\sum_{i=1}^n v_i^2}. \]

This norm is referred to as the \(\ell_2\) norm. In these notes, the notation \(\Vert \mathbf{v} \Vert\) indicates the 2-norm.


Unit vector: A unit vector \(\mathbf{v}\) is a vector such that \(\Vert \mathbf{v} \Vert_2 = 1\). For any nonzero vector \(\mathbf{v}\), the vector \(\frac{\mathbf{v}}{\Vert \mathbf{v} \Vert_2 }\) is a unit vector.

Distance: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\), the distance between \(\mathbf{u}\) and \(\mathbf{v}\) is \[ \Vert \mathbf{u} - \mathbf{v} \Vert_2. \]

Orthogonality: Two vectors \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\) are orthogonal if and only if \(\mathbf{u}\cdot\mathbf{v}=0\).
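The following short sketch (assuming NumPy, with illustrative vectors) computes the dot product, the 2-norm, the distance between two vectors, and an orthogonality check:

import numpy as np

u = np.array([1.0, 2.0])
v = np.array([4.0, 1.0])

print(np.dot(u, v))           # dot product: 1*4 + 2*1 = 6
print(np.linalg.norm(v))      # 2-norm of v
print(np.linalg.norm(u - v))  # distance between u and v

# Orthogonality check: the standard basis vectors have zero dot product.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
print(np.dot(e1, e2))         # 0.0, so e1 and e2 are orthogonal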

Angle between vectors: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\) be nonzero, then the angle \(\theta\) between these vectors satisfies \[ \cos{\theta} = \frac{\mathbf{u}\cdot\mathbf{v}}{\Vert \mathbf{u}\Vert_2 \Vert\mathbf{v}\Vert_2}. \]


The dot product of two vectors \(\mathbf{u}, \mathbf{v}\) can be used to project \(\mathbf{u}\) onto \(\mathbf{v}\)

\[ \mathrm{proj}_{\mathbf{v}}\mathbf{u} = \frac{\mathbf{u}\cdot\mathbf{v}}{\Vert \mathbf{v} \Vert^2}\mathbf{v}. \]

This is illustrated in Figure 4.

Code
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.gca()

# Define vectors u and v
u = np.array([1, 2])
v = np.array([4, 1])

# Project u onto v
projection = np.dot(u, v) * v / np.dot(v, v)

V = np.array([u, v, projection])

origin = np.array([[0, 0, 0], [0, 0, 0]])
plt.quiver(*origin, V[:, 0], V[:, 1], color=['b', 'b', 'r'], angles='xy', scale_units='xy', scale=1)

ax.set_xlim([-1, 6])
ax.set_ylim([-1, 4])
ax.text(1.3, 1.9, r'$\mathbf{u}$', size=16)
ax.text(4.3, 1.2, r'$\mathbf{v}$', size=16)
ax.text(0.4, -0.3, r'$\mathrm{proj}_{\mathbf{v}}\mathbf{u}$', size=16)

plt.plot([u[0], projection[0]], [u[1], projection[1]], 'g--')

ax.grid()
plt.show()
Figure 4: Dot product

Observe that a right angle forms between the vectors \(\mathbf{u}\) and \(\mathbf{v}\) when \(\mathbf{u}\cdot \mathbf{v} = 0\).

We can also calculate the angle between \(\mathbf{u}\) and \(\mathbf{v}\).

Code
# Define vectors u and v
u = np.array([1, 2])
v = np.array([4, 1])

theta = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(f"The angle between u and v is {theta} radians or {np.degrees(theta)} degrees.")
The angle between u and v is 0.8621700546672264 radians or 49.398705354995535 degrees.

Linear dependence: A set of vectors \(\mathbf{v}_{1}, \ldots, \mathbf{v}_{k}\in\mathbb{R}^n\) is linearly dependent if there exist scalars \(a_1,\ldots, a_k\), not all zero, such that \[ \sum_{i=1}^{k} a_i \mathbf{v}_i = \mathbf{0}. \]

Linear independence: The vectors \(\mathbf{v}_{1}, \ldots, \mathbf{v}_{k}\) are linearly independent if they are not linearly dependent, i.e., the equation \[ a_1 \mathbf{v}_1 + \cdots + a_k \mathbf{v}_k = \mathbf{0}, \]

is only satisfied when \(a_i=0\) for \(i=1, \ldots,k\).

Span: Given a set of vectors \(V = \{\mathbf{v}_{1}, \ldots, \mathbf{v}_{k}\}\), where \(\mathbf{v}_i\in\mathbb{R}^n\), \(\mathrm{span}(V)\) is the set of all linear combinations of the vectors in \(V\).
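One practical way to check linear independence numerically is to stack the vectors as columns of a matrix and compare its rank to the number of vectors. A minimal sketch using NumPy (the vectors here are chosen purely for illustration):

import numpy as np

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = v1 + v2                  # deliberately a linear combination of v1 and v2

# The vectors are linearly independent exactly when the rank of the
# matrix with those vectors as columns equals the number of vectors.
print(np.linalg.matrix_rank(np.column_stack([v1, v2])))      # 2 -> independent
print(np.linalg.matrix_rank(np.column_stack([v1, v2, v3])))  # 2 < 3 -> dependent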

Matrices

Matrices

A matrix \(A\in\mathbb{R}^{m\times n}\) is a 2-D array of numbers

\[ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} &a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}, \]

with \(m\) rows and \(n\) columns. The element at row \(i\) and column \(j\) is denoted \(a_{ij}\). If \(m=n\) we call it a square matrix.


Similar to vectors, we can multiply matrices by scalar values and add matrices of the same dimension, i.e.,

Scalar multiplication: Let \(c\in\mathbb{R}\) and \(A\in\mathbb{R}^{m\times n}\), then \[ cA = \begin{bmatrix} ca_{11} & ca_{12} & \cdots & ca_{1n} \\ ca_{21} & ca_{22} & \cdots & ca_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ ca_{m1} & ca_{m2} & \cdots & ca_{mn} \\ \end{bmatrix}. \]


Matrix addition: Let \(A, B\in\mathbb{R}^{m\times n}\), then \[ A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \\ \end{bmatrix} \]


Transpose: The transpose \(A^{T}\) is defined as

\[ A^{T} = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} &a_{2n} & \cdots & a_{mn} \\ \end{bmatrix}. \]

The transpose turns columns of the matrix into rows (equivalently rows into columns). A square matrix is called symmetric if \(A=A^{T}\).

Matrix multiplication

We discuss the following two important matrix multiplication operations

  • matrix-vector multiplication,
  • matrix-matrix multiplication.

Matrix-vector multiplication

Let \(A\in\mathbb{R}^{m\times n}\) and \(\mathbf{x}\in\mathbb{R}^{n}\), then \(A\mathbf{x}\in\mathbb{R}^{m}\) can be defined row-wise as

\[ A\mathbf{x} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_1a_{11} + x_2 a_{12} + \cdots + x_na_{1n} \\ x_1a_{21} + x_2 a_{22} + \cdots + x_na_{2n} \\ \vdots \\ x_1a_{m1} + x_2 a_{m2} + \cdots + x_na_{mn} \\ \end{bmatrix}. \]

Equivalently, this means that \(A\mathbf{x}\) is a linear combination of the columns of \(A\), i.e.,

\[ A\mathbf{x} = x_1 \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{bmatrix} + x_2 \begin{bmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2} \end{bmatrix} + \cdots + x_n \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{bmatrix}. \]

Observe that the matrix \(A\) defines a linear transformation that maps vectors in \(\mathbb{R}^{n}\) to vectors in \(\mathbb{R}^{m}\).

Matrix-matrix multiplication

Let \(A\in\mathbb{R}^{m\times n}\) and \(B\in\mathbb{R}^{n\times p}\), then the elements of \(C=AB\in\mathbb{R}^{m\times p}\) are

\[ c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}, \]

for \(i=1,\ldots, m\) and \(j=1, \ldots, p\).
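In NumPy both products are written with the @ operator. The sketch below (illustrative values only) also verifies the column-combination view of the matrix-vector product:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # 3 x 2
x = np.array([1.0, 2.0])        # vector of length 2
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]]) # 2 x 3

# Matrix-vector product: A @ x is a linear combination of the columns of A.
print(A @ x)
print(x[0] * A[:, 0] + x[1] * A[:, 1])  # same result

# Matrix-matrix product: (3 x 2) @ (2 x 3) gives a 3 x 3 matrix.
print(A @ B)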

Vector spaces

Vector spaces

An essential concept in linear algebra is the notion of a vector space.

A vector space is a set \(V\) that is closed under linear combinations: for any two elements \(\mathbf{u},\mathbf{v}\in V\) and any scalars \(c\) and \(d\), we have \(c\mathbf{u} + d\mathbf{v}\in V\).

In addition, a vector space must satisfy the following properties

  1. \(\mathbf{u} + (\mathbf{v} + \mathbf{w}) = (\mathbf{u} + \mathbf{v}) + \mathbf{w}\) (associativity).
  2. \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\) (commutativity).
  3. There exists \(\mathbf{0}\in V\) such that \(\mathbf{v} + \mathbf{0} = \mathbf{v}\) for all \(\mathbf{v}\in V\) (identity element).
  4. For every \(\mathbf{v}\in V\), there exists \(-\mathbf{v}\in V\) such that \(\mathbf{v} + (-\mathbf{v}) = \mathbf{0}\) (inverse).
  5. \(c(d\mathbf{v}) = (cd)\mathbf{v}\) (compatibility of scalar multiplication).
  6. \(1\mathbf{v} = \mathbf{v}\) (scalar identity).
  7. \(c(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}\) (distributivity over vector addition).
  8. \((c + d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}\) (distributivity over scalar addition).

Some examples of vector spaces are:

  • The set of \(n\)-dimensional vectors with real entries, i.e., \(\mathbb{R}^n\).
  • Given a matrix \(A\in\mathbb{R}^{m\times n}\),
    • the column space \(col(A)\), which is the span of the columns of \(A\), is a vector space,
    • the similarly defined row space is also a vector space.
  • Given a matrix \(A\in\mathbb{R}^{n\times n}\), the set of all solutions to the equation \(A\mathbf{x} = \mathbf{0}\) is a vector space. This space is called the null space of matrix \(A\) (a small numerical check follows this list).
  • The set of all \(m\times n\) matrices with real entries is also a vector space.
  • The set of all polynomials of degree at most \(n\) is a vector space.
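As a small check of the null space example above (a sketch with an illustrative singular matrix), we verify that a particular vector solves \(A\mathbf{x} = \mathbf{0}\) and that its scalar multiples remain solutions, as required of a vector space:

import numpy as np

# A singular 3x3 matrix: its third column is the sum of the first two,
# so x = [1, 1, -1] satisfies A x = 0.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])
x = np.array([1.0, 1.0, -1.0])

print(np.allclose(A @ x, 0))          # True: x is in the null space
print(np.allclose(A @ (2.5 * x), 0))  # True: scalar multiples stay in the null space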

Important matrices

Important matrices

We introduce notation for some commonly used and important matrices.

The \(n \times n\) identity matrix is

\[ \mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix}. \]

For every \(A\in\mathbb{R}^{n\times n}\), \(AI = IA = A\).

The inverse of \(A\in\mathbb{R}^{n\times n}\), denoted \(A^{-1}\), is the matrix for which \[AA^{-1} = A^{-1}A = I.\]

When \(A^{-1}\) exists, the matrix \(A\) is said to be invertible.

Note that \((AB)^{-1} = B^{-1}A^{-1}\) for invertible \(A, B\in\mathbb{R}^{n\times n}\).

A diagonal matrix \(D\in\mathbb{R}^{n\times n}\) has entries \(d_{ij}=0\) if \(i\neq j\), i.e.,

\[ D = \begin{bmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \\ \end{bmatrix}. \]

A square matrix \(Q\in\mathbb{R}^{n\times n}\) is orthogonal if

\[QQ^{T}=Q^{T}Q=I.\]

In particular, the inverse of an orthogonal matrix is its transpose.

A lower triangular matrix \(L\in\mathbb{R}^{n\times n}\) is a matrix where all the entries above the main diagonal are zero

\[ L = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \\ \end{bmatrix}. \]

An upper triangular matrix \(U\in\mathbb{R}^{n\times n}\) is a matrix where all the entries below the main diagonal are zero

\[ U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \\ \end{bmatrix}. \]

The inverse of a lower triangular matrix is itself a lower triangular matrix. This is also true for upper triangular matrices, i.e., the inverse is also upper triangular.
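NumPy provides convenient constructors for several of these matrices. The sketch below (illustrative values) builds an identity, a diagonal, and triangular matrices, and checks the defining property of an orthogonal matrix obtained from a QR factorization:

import numpy as np

I = np.eye(3)                  # 3 x 3 identity matrix
D = np.diag([1.0, 2.0, 3.0])   # diagonal matrix with the given entries

A = np.arange(1.0, 10.0).reshape(3, 3)
L = np.tril(A)                 # lower triangular part of A
U = np.triu(A)                 # upper triangular part of A

# An orthogonal matrix Q from the QR factorization of a random matrix:
# its inverse equals its transpose.
Q, _ = np.linalg.qr(np.random.rand(3, 3))
print(np.allclose(Q.T @ Q, I))             # True
print(np.allclose(np.linalg.inv(Q), Q.T))  # True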

Eigenvalues and eigenvectors

An eigenvector of an \(n\times n\) matrix \(\mathbf{A}\) is a nonzero vector \(\mathbf{x}\) such that

\[ A\mathbf{x} = \lambda\mathbf{x} \]

for some scalar \(\lambda.\)

The scalar \(\lambda\) is called an eigenvalue.

An \(n \times n\) matrix has at most \(n\) distinct eigenvalues and at most \(n\) linearly independent eigenvectors.
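Eigenvalues and eigenvectors can be computed with np.linalg.eig. A minimal sketch with an illustrative 2x2 matrix, verifying \(A\mathbf{x} = \lambda\mathbf{x}\) for each pair:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns the eigenvalues and a matrix whose columns are the
# corresponding eigenvectors (the ordering is not guaranteed).
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)

for i in range(len(eigvals)):
    v = eigvecs[:, i]
    print(np.allclose(A @ v, eigvals[i] * v))  # True: A v = lambda v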

Matrix decompositions

We now introduce several important matrix decompositions. These are useful for solving linear systems and play an important role in various data science applications.

LU factorization

An LU decomposition of a square matrix \(A\in\mathbb{R}^{n\times n}\) is a factorization of \(A\) into a product of matrices

\[ A = LU, \]

where \(L\) is a lower triangular square matrix and \(U\) is an upper triangular square matrix. For example, when \(n=3\), we have

\[ \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0\\ l_{31} & l_{32} & l_{33} \\ \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \\ \end{bmatrix} \]

The LU decomposition simplifies the process of solving linear equations and is more numerically stable than computing the inverse of \(A\).


Given the factorization \(A = LU\), you can solve the linear system \(A\mathbf{x} = \mathbf{b}\) in two steps:

  1. Forward substitution: First, solve the equation \(L\mathbf{y} = \mathbf{b}\) for \(\mathbf{y}\).
  2. Backward substitution: Then, solve \(U\mathbf{x} = \mathbf{y}\) for \(\mathbf{x}\).

This method is particularly useful because triangular matrices are much easier to work with. It can make solving linear systems faster and more efficient, especially for large matrices.
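A minimal sketch of this two-step solve, assuming SciPy is available (scipy.linalg.lu_factor computes a pivoted LU factorization and lu_solve performs the forward and backward substitutions); the matrix and right-hand side are illustrative:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0, 0.0],
              [6.0, 3.0, 1.0],
              [0.0, 2.0, 5.0]])
b = A @ np.array([1.0, 1.0, 1.0])   # so the exact solution is [1, 1, 1]

# Factor once, then solve by forward and backward substitution.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)
print(x)                            # approximately [1. 1. 1.]
print(np.allclose(A @ x, b))        # True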

QR decomposition

A QR decomposition of a square matrix \(A\in\mathbb{R}^{n\times n}\) is a factorization of \(A\) into a product of matrices

\[ A=QR, \] where \(Q\) is an orthogonal square matrix and \(R\) is an upper-triangular square matrix.

QR factorization is useful in solving linear systems and performing least squares fitting.


This factorization has a couple of important benefits:

  1. Solving Linear Systems: When you’re working with a system of equations represented by \(A\mathbf{x} = \mathbf{b}\), you can substitute the QR factorization into this equation:

    \[ QR\mathbf{x} = \mathbf{b} \]

    Since \(Q\) is orthogonal, you can multiply both sides by \(Q^T\) (the transpose of \(Q\)) to simplify it:

    \[ R\mathbf{x} = Q^T\mathbf{b} \]

    Now, you can solve this upper triangular system for \(\mathbf{x}\) using backward substitution, which is typically easier and more stable.

  2. Least Squares Problems: In many data science applications, you want to find the best fit line or hyperplane for your data. QR factorization is particularly useful here because it helps in minimizing the error when the system is overdetermined (more equations than unknowns). You can solve the least squares problem by leveraging the QR decomposition to find:

    \[ \hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b} \]

By converting the problem into a triangular system, QR factorization often provides a more stable numerical solution than other methods, especially for poorly conditioned matrices.
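A small sketch of the square-system case, using np.linalg.qr and then solving the triangular system \(R\mathbf{x} = Q^T\mathbf{b}\) (here with np.linalg.solve for brevity; a dedicated triangular solver such as scipy.linalg.solve_triangular could also be used). The matrix and right-hand side are illustrative:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)  # solve the upper triangular system R x = Q^T b
print(x)                         # [0.8 1.4]
print(np.allclose(A @ x, b))     # True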

Eigendecomposition

Let \(A\in\mathbb{R}^{n\times n}\) have \(n\) linearly independent eigenvectors \(\mathbf{x}_i\) for \(i=1,\ldots, n\), then \(A\) can be factorized as

\[ A = X\Lambda X^{-1}, \]

where the columns of matrix \(X\) are the eigenvectors of \(A\), and

\[ \Lambda = \begin{bmatrix} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n} \\ \end{bmatrix}, \]

is a diagonal matrix of the eigenvalues.

In this case, the matrix \(A\) is said to be diagonalizable.


Instead of calculating \(A\mathbf{x}\) directly, we can first transform \(\mathbf{x}\) using \(X^{-1}\), then stretch it by \(\Lambda\), and finally transform it back using \(X\):

\[ A \mathbf{x} = X \Lambda X^{-1} \mathbf{x} \]

We’ll use eigendecomposition in Principal Component Analysis (PCA) to reduce dimensionality in datasets while preserving as much variance as possible.

A special case occurs when \(A\) is symmetric. Recall that a matrix is symmetric when \(A = A^T.\)

In this case, it can be shown that the eigenvectors of \(A\) can be chosen to be mutually orthogonal unit vectors. Consequently, \(X^{-1} = X^{T}\) and we can decompose \(A\) as:

\[A = X\Lambda X^T.\]

This is known as the spectral theorem, and this decomposition of \(A\) is its spectral decomposition. The set of eigenvalues of a matrix is also called its spectrum.
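For symmetric matrices, np.linalg.eigh returns real eigenvalues and orthonormal eigenvectors, giving the spectral decomposition directly. A minimal sketch with an illustrative matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric (Hermitian) matrices.
eigvals, X = np.linalg.eigh(A)
Lam = np.diag(eigvals)

print(np.allclose(X.T @ X, np.eye(2)))  # True: the eigenvectors are orthonormal
print(np.allclose(X @ Lam @ X.T, A))    # True: A = X Lambda X^T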

Singular value decomposition

For the previous few decompositions, we required \(A\) to be square. Now let \(A\in\mathbb{R}^{m\times n}\) with \(m>n\); then \(A\) admits a decomposition

\[ A = U\Sigma V^{T}. \] The matrices \(U\in\mathbb{R}^{m\times m}\) and \(V\in\mathbb{R}^{n\times n}\) are orthogonal. The columns of \(U\) are the left singular vectors and the columns of \(V\) are the right singular vectors.

The matrix \(\Sigma\in\mathbb{R}^{m\times n}\) is diagonal, with the singular values of \(A\) on its diagonal and zeros elsewhere:

\[ \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{nn} \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \\ \end{bmatrix}. \]

The diagonal values \(\sigma_{ii}\) are the singular values of the matrix \(A\).

Amazingly, it can be proven that every matrix \(A\in\mathbb{R}^{m\times n}\) has a singular value decomposition.
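The SVD is computed with np.linalg.svd, which returns \(U\), the singular values, and \(V^T\). The sketch below (a random tall matrix, for illustration) rebuilds the rectangular \(\Sigma\) and verifies the factorization:

import numpy as np

A = np.random.rand(5, 3)   # a tall matrix, m > n

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n "diagonal" Sigma from the vector of singular values.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ Sigma @ Vt, A))  # True: A = U Sigma V^T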

Linear Systems and Least Squares

Linear systems of equations

A system of \(m\) linear equations in \(n\) unknowns can be written as

\[ \begin{align*} a_{11} x_{1} + a_{12} x_{2} + \cdots + a_{1n} x_{n} &= b_1, \\ a_{21} x_{1} + a_{22} x_{2} + \cdots + a_{2n} x_{n} &= b_2, \\ \vdots \qquad \qquad \quad \\ a_{m1} x_{1} + a_{m2} x_{2} + \cdots + a_{mn} x_{n} &= b_m.\\ \end{align*} \]

Observe that this is simply the matrix-vector equation

\[ A\mathbf{x}=\mathbf{b}. \]

\[ A\mathbf{x} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}. \]

A linear system of equations may have:

  • infinitely many solutions,
  • a unique solution,
  • no solutions.

When \(m > n\), the system is said to be overdetermined and in general has no solution. When \(m < n\), the system is underdetermined and in general has infinitely many solutions. When \(m=n\) and the matrix has \(n\) linearly independent columns, the solution exists and is unique.

For an invertible square matrix \(A\), the unique solution of \(A\mathbf{x}=\mathbf{b}\) is \(\mathbf{x}=A^{-1}\mathbf{b}\).

We can use matrix factorizations to help us solve a linear system of equations. We demonstrate how to do this with the LU decomposition. Observe that \[A\mathbf{x} = LU\mathbf{x} = \mathbf{b}.\]

Then

\[ \mathbf{x} = U^{-1}L^{-1}\mathbf{b}. \]

In practice we do not form \(L^{-1}\) and \(U^{-1}\) explicitly; instead we solve \(L\mathbf{y}=\mathbf{b}\) by forward substitution and then \(U\mathbf{x}=\mathbf{y}\) by backward substitution.

Least squares

In data science it is often the case that we have to solve the linear system

\[ A \mathbf{x} = \mathbf{b}. \]

This problem may have no exact solution, perhaps due to noise or measurement error.

In such a case, we look for a vector \(\mathbf{x}\) such that \(A\mathbf{x}\) is a good approximation to \(\mathbf{b}.\)

The quality of the approximation can be measured using the distance from \(A\mathbf{x}\) to \(\mathbf{b},\) i.e.,

\[ \Vert A\mathbf{x} - \mathbf{b}\Vert_2. \]

The general least-squares problem is: given \(A\in\mathbb{R}^{m\times n}\) and \(\mathbf{b}\in\mathbb{R}^{m}\), find a vector \(\hat{\mathbf{x}}\in\mathbb{R}^{n}\) such that \(\Vert A\mathbf{x}-\mathbf{b}\Vert_2\) is minimized, i.e.,

\[ \hat{\mathbf{x}} = \arg\min_\mathbf{x} \Vert A\mathbf{x} - \mathbf{b}\Vert. \]

This emphasizes the fact that the least squares problem is a minimization problem.

Minimization problems are an example of a broad class of problems called optimization problems. In optimization problems we attempt to find an optimal solution that minimizes (or maximizes) a particular objective function, possibly subject to constraints.

We can connect the above minimization of the distance between vectors to the minimization of the sum of squared errors. Let \(\mathbf{y} = A\mathbf{x}\) and observe that

\[\Vert A\mathbf{x}-\mathbf{b}\Vert_2^2 = \Vert \mathbf{y}-\mathbf{b}\Vert_2^2 = \sum_i (y_i-b_i)^2.\]

The above expression is the sum of squared errors. In statistics, the \(y_i\) are the estimated values and the \(b_i\) are the measured values. The sum of squared errors is the most common measure of error used in statistics.

Minimizing the length of \(A\mathbf{x} - \mathbf{b}\) is equivalent to minimizing the sum of the squared errors.

We can find \(\hat{\mathbf{x}}\) using either

  • geometric arguments based on projecting the vector \(\mathbf{b}\) onto the column space of \(A\), or
  • calculus (taking the derivative of \(\Vert A\mathbf{x} - \mathbf{b}\Vert_2^2\) with respect to \(\mathbf{x}\) and setting it equal to zero).

Either way, we obtain the result that \(\hat{\mathbf{x}}\) is the solution of:

\[A^TA\mathbf{x} = A^T\mathbf{b}.\]

This system of equations is called the normal equations.

We can prove that these equations always have at least one solution.

When \(A^TA\) is invertible, which happens exactly when the columns of \(A\) are linearly independent, there is a unique solution

\[\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}.\]

Be aware that computing the solution using \((A^TA)^{-1}A^T\) can be numerically unstable. A more stable method is to use the QR decomposition of \(A\), i.e., \(\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}\).

The NumPy function np.linalg.lstsq() solves the least squares problem in a stable way.

Consider the following example where the solution to the least squares problem is x=np.array([1, 1]).

import numpy as np
A = np.array([[-1.42382504, -1.4238264 ],
              [ 1.26372846,  1.26372911], 
              [-0.87066174, -0.87066138]])
b = A @ np.array([1, 1])

# Normal equations solved via the explicit inverse of A.T @ A
x1 = np.linalg.inv(A.T @ A) @ A.T @ b
print(x1)
# QR
[Q, R] = np.linalg.qr(A)
x2 = np.linalg.solve(R, Q.T @ b)
print(x2)
# np.linalg.lstsq
x3, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
print(x3) 
print(np.linalg.norm(x1-np.array([1, 1])))
print(np.linalg.norm(x2-np.array([1, 1])))
print(np.linalg.norm(x3-np.array([1, 1])))
[1.00010437 0.99918931]
[1. 1.]
[1. 1.]
0.0008173801402491371
2.56558995595757e-10
6.814380261227873e-10