Linear algebra is the branch of mathematics concerned with vectors and matrices, and in particular with how vectors are transformed. Knowledge of linear algebra is essential in data science.
Linear algebra
Linear algebra allows us to understand the operations and transformations used to manipulate and extract information from data.
Examples
Deep neural networks use matrix-vector and matrix-matrix multiplication.
Natural language processing uses the dot product to determine word similarity.
Least squares uses matrix inverses and matrix factorizations to compute models for predicting continuous values.
PCA (dimensionality reduction) uses the matrix factorization called the Singular Value Decomposition (SVD).
Graphs are described by adjacency matrices. Eigenvectors and eigenvalues of this matrix provide information about the graph structure.
Linear algebra is used to implement data science algorithms efficiently and accurately.
You will not have to program linear algebra algorithms. You will use appropriate Python packages.
This lecture is a review of some aspects of linear algebra that are important for data science. Given the prerequisites for this course, I assume that you previously learned this material.
The goal of this lecture is to refresh the following topics:
vectors,
matrices,
operations with vectors and matrices,
eigenvectors and eigenvalues,
linear systems and least squares,
matrix factorizations.
Below is a list of very useful resources for learning about linear algebra:
Linear Algebra and Its Applications (6th edition), David C. Lay, Judi J. McDonald, and Steven R. Lay, Pearson, 2021.
Introduction to Linear Algebra (6th edition), Gilbert Strang, Wellesley-Cambridge Press, 2023.
We plot the sum of \(\mathbf{u} = [1, 2]\) and \(\mathbf{v} = [4, 1]\) in Figure 3. The sum \(\mathbf{u} + \mathbf{v} = [5, 3]\) is obtained by placing the tail of one vector at the tip of the other.
Dot product: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\) then the dot product is defined as \[
\mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^n u_i v_i.
\]
Vector 2-norm: The 2-norm of a vector \(\mathbf{v}\in\mathbb{R}^{n}\) is defined as \[
\Vert \mathbf{v}\Vert_2 = \sqrt{\mathbf{v}\cdot\mathbf{v}} = \sqrt{\sum_{i=1}^n v_i^2}.
\]
This norm is referred to as the \(\ell_2\) norm. In these notes, the notation \(\Vert \mathbf{v} \Vert\) indicates the 2-norm.
Unit vector: A unit vector \(\mathbf{v}\) is a vector such that \(\Vert \mathbf{v} \Vert_2 = 1\). For any nonzero vector \(\mathbf{v}\), the vector \(\frac{\mathbf{v}}{\Vert \mathbf{v} \Vert_2 }\) is a unit vector.
Distance: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\), the distance between \(\mathbf{u}\) and \(\mathbf{v}\) is \[
\Vert \mathbf{u} - \mathbf{v} \Vert_2.
\]
Orthogonality: Two vectors \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\) are orthogonal if and only if \(\mathbf{u}\cdot\mathbf{v}=0\).
Angle between vectors: Let \(\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}\), the angle \(\theta\) between these vectors satisfies \[
\cos{\theta} = \frac{\mathbf{u}\cdot\mathbf{v}}{\Vert \mathbf{u}\Vert_2 \Vert\mathbf{v}\Vert_2}.
\]
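These definitions map directly onto NumPy operations. Below is a small sketch, using the vectors \(\mathbf{u}\) and \(\mathbf{v}\) from above, that computes the dot product, the 2-norm, and the distance, and checks orthogonality:
import numpy as np

u = np.array([1, 2])
v = np.array([4, 1])

print(np.dot(u, v))                     # dot product: 1*4 + 2*1 = 6
print(np.linalg.norm(v))                # 2-norm of v: sqrt(4**2 + 1**2)
print(np.linalg.norm(u - v))            # distance between u and v
print(np.dot(np.array([1, 0]), np.array([0, 1])) == 0)   # True: these vectors are orthogonal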
The dot product of two vectors \(\mathbf{u}, \mathbf{v}\) can be used to project \(\mathbf{u}\) onto \(\mathbf{v}\): \[
\mathrm{proj}_{\mathbf{v}}\mathbf{u} = \frac{\mathbf{u}\cdot\mathbf{v}}{\mathbf{v}\cdot\mathbf{v}}\,\mathbf{v}.
\]
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.gca()

# Define vectors u and v
u = np.array([1, 2])
v = np.array([4, 1])

# Project u onto v
projection = np.dot(u, v) * v / np.dot(v, v)

V = np.array([u, v, projection])
origin = np.array([[0, 0, 0], [0, 0, 0]])

plt.quiver(*origin, V[:, 0], V[:, 1], color=['b', 'b', 'r'], angles='xy', scale_units='xy', scale=1)
ax.set_xlim([-1, 6])
ax.set_ylim([-1, 4])
ax.text(1.3, 1.9, r'$\mathbf{u}$', size=16)
ax.text(4.3, 1.2, r'$\mathbf{v}$', size=16)
ax.text(0.4, -0.3, r'$\mathrm{proj}_{\mathbf{v}}\mathbf{u}$', size=16)
plt.plot([u[0], projection[0]], [u[1], projection[1]], 'g--')
ax.grid()
plt.show()
Observe that a right angle forms between the vectors \(\mathbf{u}\) and \(\mathbf{v}\) when \(\mathbf{u}\cdot \mathbf{v} = 0\).
And we can calculate the angle between \(\mathbf{u}\) and \(\mathbf{v}\).
# Define vectors u and v
u = np.array([1, 2])
v = np.array([4, 1])

theta = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(f"The angle between u and v is {theta} radians or {np.degrees(theta)} degrees.")
The angle between u and v is 0.8621700546672264 radians or 49.398705354995535 degrees.
Linear dependence: A set of \(n\) vectors \(\mathbf{v}_{1}, \ldots, \mathbf{v}_{n}\in\mathbb{R}^n\) is linearly dependent if there exist scalars \(a_1,\ldots, a_n\), not all zero, such that \[
\sum_{i=1}^{n} a_i \mathbf{v}_i = 0.
\]
Linear independence: The vectors \(\mathbf{v}_{1}, \ldots, \mathbf{v}_{n}\) are linearly independent if they are not linearly dependent, i.e., the equation \[
a_1 \mathbf{v}_1 + \cdots + a_n \mathbf{v}_n = 0,
\]
is only satisfied if \(a_i=0\) for \(i=1, \ldots,n\).
Span: Given a set of vectors \(V = \{\mathbf{v}_{1}, \ldots, \mathbf{v}_{n}\}\), where \(\mathbf{v}_i\in\mathbb{R}^n\), the span of \(V\), denoted \(\text{span}(V)\), is the set of all linear combinations of the vectors in \(V\).
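One practical way to check linear (in)dependence numerically is to compute the rank of the matrix whose columns are the given vectors. A minimal sketch with made-up vectors:
import numpy as np

# v3 is the sum of v1 and v2, so the set {v1, v2, v3} is linearly dependent
v1 = np.array([1, 0, 1])
v2 = np.array([0, 1, 1])
v3 = v1 + v2

V = np.column_stack([v1, v2, v3])
print(np.linalg.matrix_rank(V))   # prints 2 (< 3), confirming linear dependence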
Matrices
A matrix \(A\in\mathbb{R}^{m\times n}\) is a 2-D array of numbers with \(m\) rows and \(n\) columns.
Observe that the matrix \(A\) defines a linear transformation that maps vectors in \(\mathbb{R}^{n}\) to vectors in \(\mathbb{R}^{m}\).
Matrix-matrix multiplication
Let \(A\in\mathbb{R}^{m\times n}\) and \(B\in\mathbb{R}^{n\times p}\), then the elements of \(C=AB\in\mathbb{R}^{m\times p}\) are
\[
c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj},
\]
for \(i=1,\ldots, m\) and \(j=1, \ldots, p\).
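As a quick illustration of this formula (the matrices below are made up), NumPy's @ operator computes the same product:
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # 3 x 2

C = A @ B                        # 2 x 2 matrix-matrix product
print(C)

# Check one entry against the definition c_ij = sum_k a_ik * b_kj
print(C[0, 1] == sum(A[0, k] * B[k, 1] for k in range(3)))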
Vector spaces
An essential concept in linear algebra is the notion of a vector space.
A vector space is a set \(V\) such that for any two elements of the set, say \(\mathbf{u},\mathbf{v}\in V\), and any scalars \(c\) and \(d\), the combination \(c\mathbf{u} + d\mathbf{v}\) is also in \(V\).
In addition, a vector space must satisfy a standard set of axioms (e.g., commutativity and associativity of addition, the existence of a zero vector, and the existence of additive inverses).
Examples of vector spaces include:
The set of n-dimensional vectors with real entries, i.e., \(\mathbb{R}^n\).
Given a matrix \(A\in\mathbb{R}^{m\times n}\),
the column space \(col(A)\), which is the span of all columns of the matrix \(A\), is a vector space.
The similarly defined row space is also a vector space.
Given a matrix \(A\in\mathbb{R}^{n\times n}\), the set of all solutions to the equation \(A\mathbf{x} = \mathbf{0}\) is a vector space. This space is called the null space of matrix \(A\).
The set of all \(m\times n\) matrices with real numbers is also a vector space.
The set of all polynomials of degree at most \(n\) is a vector space.
Important matrices
We introduce notation for some commonly used and important matrices.
The inverse of an invertible lower triangular matrix is itself lower triangular. The same holds for upper triangular matrices: the inverse of an invertible upper triangular matrix is also upper triangular.
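A quick numerical check of this fact, using a made-up invertible lower triangular matrix:
import numpy as np

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])        # invertible lower triangular matrix

L_inv = np.linalg.inv(L)
print(np.allclose(L_inv, np.tril(L_inv)))   # True: the inverse is also lower triangular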
Eigenvalues and eigenvectors
An eigenvector of an \(n\times n\) matrix \(A\) is a nonzero vector \(\mathbf{x}\) such that
\[
A\mathbf{x} = \lambda\mathbf{x}
\]
for some scalar \(\lambda.\)
The scalar \(\lambda\) is called an eigenvalue.
An \(n \times n\) matrix has at most \(n\) distinct eigenvalues and at most \(n\) linearly independent eigenvectors.
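In NumPy, np.linalg.eig returns eigenvalues and eigenvectors; the sketch below (with a made-up 2x2 matrix) verifies the defining equation \(A\mathbf{x} = \lambda\mathbf{x}\) for each eigenpair:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the eigenvectors
for i in range(len(eigvals)):
    lam = eigvals[i]
    x = eigvecs[:, i]
    print(np.allclose(A @ x, lam * x))  # True for each eigenpair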
Matrix decompositions
We introduce here important matrix decompositions. These are useful for solving linear equations. Furthermore, they play an important role in various data science applications.
LU factorization
An LU decomposition of a square matrix \(A\in\mathbb{R}^{n\times n}\) is a factorization of \(A\) into a product of matrices
\[
A = LU,
\]
where \(L\) is a lower triangular square matrix and \(U\) is an upper triangular square matrix. For example, when \(n=3\), we have
It simplifies the process of solving linear equations and is more numerically stable than computing the inverse of \(A\).
To solve the linear system \(A\mathbf{x} = LU\mathbf{x} = \mathbf{b}\), you can proceed in two steps:
Forward substitution: First, solve the equation \(L\mathbf{y} = \mathbf{b}\) for \(\mathbf{y}\).
Backward substitution: Then, solve \(U\mathbf{x} = \mathbf{y}\) for \(\mathbf{x}\).
This method is particularly useful because triangular matrices are much easier to work with. It can make solving linear systems faster and more efficient, especially for large matrices.
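A minimal sketch of this two-step process using SciPy (the 3x3 system below is made up; scipy.linalg.lu_factor and lu_solve perform the factorization and the forward/backward substitutions):
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0, 0.0],
              [3.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([24.0, 30.0, -24.0])

lu, piv = lu_factor(A)          # LU factorization (with partial pivoting)
x = lu_solve((lu, piv), b)      # forward substitution, then backward substitution

print(np.allclose(A @ x, b))    # True: x solves A x = b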
QR decomposition
A QR decomposition of a square matrix \(A\in\mathbb{R}^{n\times n}\) is a factorization of \(A\) into a product of matrices
\[
A=QR,
\] where \(Q\) is an orthogonal square matrix and \(R\) is an upper-triangular square matrix.
QR factorization is useful in solving linear systems and performing least squares fitting.
This factorization has a couple of important benefits:
Solving Linear Systems: When you’re working with a system of equations represented by \(A\mathbf{x} = \mathbf{b}\), you can substitute the QR factorization into this equation:
\[
QR\mathbf{x} = \mathbf{b}
\]
Since \(Q\) is orthogonal, you can multiply both sides by \(Q^T\) (the transpose of \(Q\)) to simplify it:
\[
R\mathbf{x} = Q^T\mathbf{b}
\]
Now, you can solve this upper triangular system for \(\mathbf{x}\) using backward substitution, which is typically easier and more stable.
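Here is a short sketch of this procedure with NumPy and SciPy (the 2x2 system is made up):
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)                # A = QR, with Q orthogonal and R upper triangular
x = solve_triangular(R, Q.T @ b)      # backward substitution on R x = Q^T b

print(np.allclose(A @ x, b))          # True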
Least Squares Problems: In many data science applications, you want to find the best fit line or hyperplane for your data. QR factorization is particularly useful here because it helps in minimizing the error when the system is overdetermined (more equations than unknowns). You can solve the least squares problem by leveraging the QR decomposition to find:
\[
\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}
\]
By converting the problem into a triangular system, QR factorization often provides a more stable numerical solution than other methods, especially for poorly conditioned matrices.
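A sketch of the least squares use case with synthetic, made-up data: the QR-based solution \(R^{-1}Q^T\mathbf{b}\) matches NumPy's np.linalg.lstsq.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                    # overdetermined: 100 equations, 3 unknowns
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

Q, R = np.linalg.qr(A)                           # reduced QR: Q is 100x3, R is 3x3
x_qr = solve_triangular(R, Q.T @ b)              # x_hat = R^{-1} Q^T b

x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_lstsq))                # True: both solve the least squares problem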
Eigendecomposition
Let \(A\in\mathbb{R}^{n\times n}\) have \(n\) linearly independent eigenvectors \(\mathbf{x}_i\) for \(i=1,\ldots, n\), then \(A\) can be factorized as
\[
A = X\Lambda X^{-1},
\]
where the columns of the matrix \(X\) are the eigenvectors of \(A\), and \(\Lambda\) is a diagonal matrix whose diagonal entries are the corresponding eigenvalues.
In this case, the matrix \(A\) is said to be diagonalizable.
Instead of calculating \(A\mathbf{x}\) directly, we can compute it by first transforming \(\mathbf{x}\) using \(X^{-1}\), then stretching it by \(\Lambda\), and finally transforming it back using \(X\).
\[
A \mathbf{x} = X \Lambda X^{-1} \mathbf{x}
\]
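The following sketch (with a made-up diagonalizable matrix) verifies this interpretation numerically:
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])              # diagonalizable example

eigvals, X = np.linalg.eig(A)           # columns of X are eigenvectors
Lam = np.diag(eigvals)                  # Lambda: diagonal matrix of eigenvalues

x = np.array([1.0, -3.0])
direct = A @ x
via_decomposition = X @ (Lam @ (np.linalg.inv(X) @ x))
print(np.allclose(direct, via_decomposition))   # True: A x = X Lambda X^{-1} x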
We’ll use eigendecomposition in Principal Component Analysis (PCA) to reduce dimensionality in datasets while preserving as much variance as possible.
A special case occurs when \(A\) is symmetric. Recall that a matrix is symmetric when \(A = A^T.\)
In this case, it can be shown that the eigenvectors of \(A\) can be chosen to be mutually orthonormal. Consequently, \(X^{-1} = X^{T}\) and we can decompose \(A\) as:
\[A = X\Lambda X^T.\]
This is known as the spectral theorem and this decomposition of \(A\) is its spectral decomposition. The eigenvalues of a matrix are also called its spectrum.
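For symmetric matrices, NumPy provides np.linalg.eigh, which exploits the symmetry. A quick check of the spectral decomposition with a made-up symmetric matrix:
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])                     # symmetric: A == A.T

lam, X = np.linalg.eigh(A)                     # eigendecomposition for symmetric matrices
print(np.allclose(X.T @ X, np.eye(2)))         # True: the eigenvectors are orthonormal
print(np.allclose(A, X @ np.diag(lam) @ X.T))  # True: A = X Lambda X^T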
Singular value decomposition
For the previous factorizations, we required \(A\) to be square. Now let \(A\in\mathbb{R}^{m\times n}\) with \(m>n\); then \(A\) admits the decomposition
\[
A = U\Sigma V^{T}.
\] The matrices \(U\in\mathbb{R}^{m\times m}\) and \(V\in\mathbb{R}^{n\times n}\) are orthogonal. The columns of \(U\) are the left singular vectors and the columns of \(V\) are the right singular vectors.
The matrix \(\Sigma\in\mathbb{R}^{m\times n}\) is a diagonal matrix whose diagonal entries are the singular values \(\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0\).
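A small sketch computing the SVD with NumPy (the matrix is randomly generated for illustration):
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))              # m = 5 > n = 3

U, s, Vt = np.linalg.svd(A)              # full SVD: U is 5x5, Vt is V^T (3x3)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)               # place the singular values on the diagonal

print(np.allclose(A, U @ Sigma @ Vt))    # True: A = U Sigma V^T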
Linear systems
Consider a linear system \(A\mathbf{x} = \mathbf{b}\), where \(A\in\mathbb{R}^{m\times n}\). When \(m > n\), the system is said to be overdetermined and in general has no solution. When \(m < n\), the system is underdetermined and in general has infinitely many solutions. When \(m=n\) and the matrix has \(n\) linearly independent columns, the solution exists and is unique.
For an invertible square matrix \(A\), the system \(A\mathbf{x}=\mathbf{b}\) has the unique solution \(\mathbf{x}=A^{-1}\mathbf{b}\).
We can use matrix factorizations to help us solve a linear system of equations. We demonstrate how to do this with the LU decomposition. Observe that \[A\mathbf{x} = LU\mathbf{x} = \mathbf{b}.\]
Then
\[
\mathbf{x} = U^{-1}L^{-1}\mathbf{b}.
\]
In practice, we do not form \(L^{-1}\) and \(U^{-1}\) explicitly. Instead, we solve \(L\mathbf{y}=\mathbf{b}\) by forward substitution and then \(U\mathbf{x}=\mathbf{y}\) by backward substitution.
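For illustration, here is a minimal sketch of forward and backward substitution written directly in NumPy (in practice you would use library routines such as scipy.linalg.solve_triangular); the small matrices in the usage example are made up:
import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for a lower triangular matrix L
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    # Solve U x = y for an upper triangular matrix U
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
U = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([2.0, 5.0])
y = forward_substitution(L, b)          # solves L y = b
x = backward_substitution(U, y)         # solves U x = y
print(np.allclose(L @ U @ x, b))        # True: x solves (LU) x = b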
Least squares
In data science it is often the case that we have to solve the linear system
\[
A \mathbf{x} = \mathbf{b}.
\]
This problem may have no solution – perhaps due to noise or measurement error.
In such a case, we look for a vector \(\mathbf{x}\) such that \(A\mathbf{x}\) is a good approximation to \(\mathbf{b}.\)
The quality of the approximation can be measured using the distance from \(A\mathbf{x}\) to \(\mathbf{b},\) i.e.,
\[
\Vert A\mathbf{x} - \mathbf{b}\Vert_2.
\]
The general least-squares problem is: given \(A\in\mathbb{R}^{m\times n}\) and \(\mathbf{b}\in\mathbb{R}^{m}\), find a vector \(\hat{\mathbf{x}}\in\mathbb{R}^{n}\) such that \(\Vert A\mathbf{x}-\mathbf{b}\Vert_2\) is minimized, i.e.,
\[
\hat{\mathbf{x}} = \arg\min_{\mathbf{x}\in\mathbb{R}^{n}} \Vert A\mathbf{x}-\mathbf{b}\Vert_2.
\]
This emphasizes the fact that the least squares problem is a minimization problem.
Minimization problems are an example of a broad class of problems called optimization problems. In optimization problems we attempt to find an optimal solution that minimizes (or maximizes) a particular objective function, possibly subject to constraints.
We can connect the above minimization of the distance between vectors to the minimization of the sum of squared errors. Let \(\mathbf{y} = A\mathbf{x}\) and observe that
\[
\Vert A\mathbf{x} - \mathbf{b}\Vert_2^2 = \sum_{i=1}^{m} (y_i - b_i)^2.
\]
The expression on the right is the sum of squared errors. In statistics, the \(y_i\) are the estimated values and the \(b_i\) are the measured values. This is the most common measure of error used in statistics.
Minimizing the length of \(A\mathbf{x} - \mathbf{b}\) is equivalent to minimizing the sum of the squared errors.
We can find \(\hat{\mathbf{x}}\) using either
geometric arguments based on projections of the vector \(\mathbf{b}\), or
calculus (taking the derivative of the sum of squared errors above and setting it equal to zero).
Either way, we obtain the result that \(\hat{\mathbf{x}}\) is the solution of:
\[A^TA\mathbf{x} = A^T\mathbf{b}.\]
This system of equations is called the normal equations.
We can prove that these equations always have at least one solution.
When \(A^TA\) is invertible (equivalently, when the columns of \(A\) are linearly independent), the least squares problem has the unique solution
\[\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}.\]
Be aware that computing the solution using \((A^TA)^{-1}A^T\) can be numerically unstable. A more stable method is to use the QR decomposition of \(A\), i.e., \(\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}\).
The NumPy function np.linalg.lstsq() solves the least squares problem in a stable way.
Consider the following example where the solution to the least squares problem is x=np.array([1, 1]).
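As a minimal sketch of such an example (the data matrix and noise below are made up; the true coefficient vector [1, 1] is the intended least squares solution):
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 2))                  # overdetermined: 20 equations, 2 unknowns
x_true = np.array([1, 1])
b = A @ x_true + 0.1 * rng.normal(size=20)    # noisy measurements

x_hat, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)                                  # approximately [1, 1]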