Linear Algebra Review

CSE 547 / STAT 548 at the University of Washington

Acknowledgment: This note was originally compiled by Jessica Su for CS 224W at Stanford, with substantial modifications by Yikun Zhang in Winter 2023 and Spring 2024 for CSE 547 / STAT 548 at UW. Parts of this note are adapted from Lecture 8 of Professor Yen-Chi Chen’s lecture notes and from Professor Michael Perlman’s lecture notes for STAT 512 at UW. Other good references on linear algebra include the linear algebra review notes from CS 224W at Stanford.

Note: We only discuss vectors and matrices with real entries in this note, though the stated results also hold for complex entries.

Vector Space, Span, and Linear Independence

Vector space: A vector space over the real numbers \(\mathbb{R}\) is a set of vectors that is closed under vector addition and under scalar multiplication by elements of \(\mathbb{R}\), contains the zero vector \(\mathbf{0}\) as the additive identity, and contains the additive inverse of each of its vectors.
The most common vector space in Machine Learning is the Euclidean space \(\mathbb{R}^n\), which consists of all ordered \(n\)-tuples of real numbers. A vector of \(\mathbb{R}^n\) can be denoted by \[\mathbf{x}= \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\] or a row vector \(\mathbf{x}^T=\left[x_1,...,x_n\right]\), where \(x_i, i=1,...,n\) are called its components or coordinates.

Vector Operations

Dot/Inner product: The geometric properties of \(\mathbb{R}^n\) are derived from the Euclidean dot product defined as: \[\langle \mathbf{x}, \mathbf{y} \rangle=\mathbf{x}^T\mathbf{y}=x_1 y_1 + \cdots + x_n y_n=\sum_{i=1}^n x_iy_i,\] where \(\mathbf{x}=\left[x_1,...,x_n\right]^T\) and \(\mathbf{y}=\left[y_1,...,y_n\right]^T\) are in \(\mathbb{R}^n\).

Orthogonality: Two vectors in \(\mathbb{R}^n\) are orthogonal if and only if their dot product is zero. In \(\mathbb{R}^2\), we also call orthogonal vectors perpendicular.
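For a concrete check of these definitions, here is a minimal NumPy sketch (the vectors are arbitrary examples) that computes an inner product and tests orthogonality:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -2.0, 0.0])

# Dot/inner product <x, y> = sum_i x_i * y_i
inner = np.dot(x, y)           # equivalently x @ y
print(inner)                   # 1*4 + 2*(-2) + 3*0 = 0.0

# x and y are orthogonal exactly when their dot product is zero
print(np.isclose(inner, 0.0))  # True
```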

Norm: The standard \(\ell_2\)-norm or length of a vector \(\mathbf{x}=[x_1,..., x_n]^T\in \mathbb{R}^n\) is given by \[\left|\left| \mathbf{x} \right|\right|_2 = \sqrt{x_1^2 + \cdots + x_n^2}.\] Other possible norms in \(\mathbb{R}^n\) include the \(\ell_1\)-norm \(\left|\left| \mathbf{x} \right|\right|_1 = \sum_{i=1}^n |x_i|\) and the \(\ell_{\infty}\)-norm \(\left|\left| \mathbf{x} \right|\right|_{\infty} = \max_{1\leq i\leq n} |x_i|\).

When the context is clear, we often write the norm of a vector \(\mathbf{x}\) as \(\left|\left| \mathbf{x} \right|\right|\). The norms in \(\mathbb{R}^n\) can be used to measure distances between data points (or vectors) in \(\mathbb{R}^n\).

Triangle inequality: For two vectors \(\mathbf{x},\mathbf{y}\) and any norm \(\left|\left| \cdot \right|\right|\) in \(\mathbb{R}^n\), the triangle inequality states that \[\left|\left| \mathbf{x}+\mathbf{y} \right|\right| \leq \left|\left| \mathbf{x} \right|\right| + \left|\left| \mathbf{y} \right|\right|,\] and its reverse version goes as \[\left|\left| \mathbf{x}-\mathbf{y} \right|\right| \geq \Big|\left|\left| \mathbf{x} \right|\right| -\left|\left| \mathbf{y} \right|\right| \Big|.\]
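As a quick numerical illustration (again with arbitrary example vectors), the \(\ell_2\)-norm and both triangle inequalities can be verified with NumPy:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, -2.0])

# l2-norm: ||x||_2 = sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(x))             # 5.0

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
lhs = np.linalg.norm(x + y)
rhs = np.linalg.norm(x) + np.linalg.norm(y)
print(lhs <= rhs)                    # True

# Reverse triangle inequality: ||x - y|| >= | ||x|| - ||y|| |
print(np.linalg.norm(x - y) >= abs(np.linalg.norm(x) - np.linalg.norm(y)))  # True
```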

Subspaces and Span

Subspace of \(\mathbb{R}^n\): A subspace of \(\mathbb{R}^n\) is a subset of \(\mathbb{R}^n\) that is, by itself, a vector space over \(\mathbb{R}\) using the same operations of vector addition and scalar multiplication in \(\mathbb{R}^n\). In other words, a subset of \(\mathbb{R}^n\) is a subspace precisely when it is closed under these two operations.

Linear combination: A linear combination of the vectors \(\mathbf{v}_1,..., \mathbf{v}_k\) (in \(\mathbb{R}^n\)) is any expression of the form \(a_1 \mathbf{v}_1 + \cdots + a_k \mathbf{v}_k\), where \(k\) is a positive integer and \(a_1,..., a_k \in \mathbb{R}\). Note that some of \(a_1,..., a_k\) may be zero.

Span: The span of a set \(\mathcal{S}\) of vectors consists of all possible linear combinations of finitely many vectors in \(\mathcal{S}\), i.e., \[\mathrm{span}\,\mathcal{S} = \left\{a_1 \mathbf{v}_1 + \cdots + a_k \mathbf{v}_k: \mathbf{v}_1,...,\mathbf{v}_k \in \mathcal{S}, a_1,...,a_k\in \mathbb{R}, \text{ and } k=1,2,... \right\}.\]

Linear Independence

The vectors \(\mathbf{v}_1,..., \mathbf{v}_n\) (in \(\mathbb{R}^n\)) are linearly dependent if and only if there exist \(a_1,...,a_n \in \mathbb{R}\), not all zero, such that \(a_1 \mathbf{v}_1 + \cdots + a_n \mathbf{v}_n = \mathbf{0}\).

A finite set of vectors \(\mathbf{v}_1,..., \mathbf{v}_n\) (in \(\mathbb{R}^n\)) is linearly independent if it is not linearly dependent. In other words, we cannot write any vector in \(\mathbf{v}_1,..., \mathbf{v}_n\) in terms of a linear combination of the other vectors.
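One convenient numerical test, sketched below with NumPy on illustrative vectors, stacks the vectors as columns of a matrix and compares the matrix rank (defined later in this note) with the number of vectors: the vectors are linearly independent exactly when the two agree.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = np.array([1.0, 1.0, 2.0])   # v3 = v1 + v2, so the set is dependent

# Stack the vectors as columns; independence <=> rank equals number of vectors
V = np.column_stack([v1, v2, v3])
print(np.linalg.matrix_rank(V))                 # 2
print(np.linalg.matrix_rank(V) == V.shape[1])   # False -> linearly dependent
```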

Matrices

An \(m\times n\) matrix \(A\in \mathbb{R}^{m\times n}\) is an array of \(mn\) numbers as \[A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n}\\ A_{21} & A_{22} & \cdots & A_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}.\] It represents the linear mapping (or linear transformation) from \(\mathbb{R}^n\) to \(\mathbb{R}^m\) as \[\mathbf{x} \mapsto A\mathbf{x} = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n}\\ A_{21} & A_{22} & \cdots & A_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}= \begin{bmatrix} \sum\limits_{i=1}^n A_{1i} x_i\\ \sum\limits_{i=1}^n A_{2i} x_i\\ \vdots\\ \sum\limits_{i=1}^n A_{mi} x_i\\ \end{bmatrix}\quad \text{ for any } \mathbf{x} = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}\in \mathbb{R}^n.\] Here, the linearity means that \(A(a\mathbf{x}+b\mathbf{y}) = aA\mathbf{x} + bA\mathbf{y}\) for any \(\mathbf{x},\mathbf{y}\in \mathbb{R}^n\) and \(a,b\in \mathbb{R}\). In particular, when \(m=n\), \(A\in \mathbb{R}^{n\times n}\) is called a square matrix.
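The following minimal NumPy sketch (with an arbitrary \(2\times 3\) matrix) illustrates a matrix acting as a linear map from \(\mathbb{R}^3\) to \(\mathbb{R}^2\) and checks the linearity property numerically:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])     # a 2x3 matrix: maps R^3 -> R^2
x = np.array([1.0, -1.0, 2.0])
y = np.array([0.0, 2.0, 1.0])
a, b = 2.0, -3.0

# Matrix-vector product A x
print(A @ x)                         # [-1.  5.]

# Linearity: A(ax + by) = a*Ax + b*Ay
print(np.allclose(A @ (a * x + b * y), a * (A @ x) + b * (A @ y)))  # True
```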

Matrix Operations

Matrix addition: If \(A,B\) are both \(m\times n\) matrices, then the matrix addition is defined as elementwise additions as: \[[A+B]_{ij} = A_{ij} + B_{ij}.\]

Example 1. Here is an example of a matrix addition for two matrices in \(\mathbb{R}^{2\times 2}\) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix}5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 + 5 & 2 + 6\\ 3 + 7 & 4 + 8 \end{bmatrix} = \begin{bmatrix} 6 & 8\\ 10 & 12 \end{bmatrix}.\]

Matrix multiplication: For two matrices \(A\in \mathbb{R}^{m\times n}, B\in \mathbb{R}^{n\times p}\), the product \(AB\) is an \(m\times p\) matrix whose \((i,j)\)-entry is \[[AB]_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\] for all \(1\leq i\leq m\) and \(1\leq j\leq p\).

Example 2. Here is an example of the matrix multiplication for two square matrices in \(\mathbb{R}^{2\times 2}\) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix}5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}.\] We can also multiply non-square matrices when their dimensions are matched (i.e., the number of columns of the first matrix should be equal to the number of rows of the second matrix) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \\ 5 & 6\end{bmatrix} \cdot \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 2 \cdot 4 & 1 \cdot 2 + 2 \cdot 5 & 1 \cdot 3 + 2 \cdot 6 \\ 3 \cdot 1 + 4 \cdot 4 & 3 \cdot 2 + 4 \cdot 5 & 3 \cdot 3 + 4 \cdot 6 \\ 5 \cdot 1 + 6 \cdot 4 & 5 \cdot 2 + 6 \cdot 5 & 5 \cdot 3 + 6 \cdot 6 \end{bmatrix} = \begin{bmatrix} 9 & 12 & 15 \\ 19 & 26 & 33 \\ 29 & 40 & 51 \end{bmatrix}.\]
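These two examples can be reproduced with NumPy; the sketch below also illustrates that matrix multiplication is generally not commutative:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A + B)      # [[ 6  8] [10 12]]   (Example 1)
print(A @ B)      # [[19 22] [43 50]]   (Example 2)

# Matrix multiplication is generally not commutative
print(np.array_equal(A @ B, B @ A))   # False
```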

Properties of matrix multiplications: Matrix multiplication is associative, i.e., \((AB)C = A(BC)\), and distributive over addition, i.e., \(A(B+C) = AB + AC\), whenever the dimensions are compatible. However, it is generally not commutative: \(AB \neq BA\) in general, even for square matrices.

Matrix transpose: If \(A = [A_{ij}] \in \mathbb{R}^{m\times n}\), then its transpose \(A^T\) is an \(n\times m\) matrix whose \((i,j)\)-entry is \(A_{ji}\). That is, \([A^T]_{ij} = A_{ji}\).

Example 3. Here is an example of transposing a \(3\times 2\) matrix, where we switch the matrix’s rows with its columns as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \\ 5 & 6\end{bmatrix}^T = \begin{bmatrix}1 & 3 & 5 \\ 2 & 4 & 6\end{bmatrix}.\]

Properties of matrix transpose: For matrices of compatible dimensions, \(\left(A^T\right)^T = A\), \((A+B)^T = A^T + B^T\), and \((AB)^T = B^T A^T\).

Identity matrix: The identity matrix \(\mathbf{I}_n\) is an \(n\times n\) (square) matrix given by \[\mathbf{I}_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix},\] i.e., it has all \(1\)’s on the diagonal and \(0\)’s everywhere else. It is sometimes abbreviated \(\mathbf{I}\) when the dimension of the matrix is clear. For any \(A \in \mathbb{R}^{m\times n}\), it holds that \(A\mathbf{I}_n = \mathbf{I}_m A = A\).

Matrix inverse: Given a square matrix \(A\in \mathbb{R}^{n\times n}\), its inverse \(A^{-1}\) (if it exists) is the unique matrix satisfying \[A A^{-1} = A^{-1} A=\mathbf{I}_n.\] Notice that the inverse of a matrix may not always exist. Those matrices that have an inverse are called invertible or nonsingular.
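A minimal NumPy sketch (using an arbitrary invertible \(2\times 2\) matrix) that computes an inverse and verifies the defining identity:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True
print(np.allclose(A_inv @ A, np.eye(2)))   # True

# A singular (non-invertible) matrix raises an error, e.g.
# np.linalg.inv(np.array([[1.0, 2.0], [2.0, 4.0]]))  # LinAlgError
```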

Properties of matrix inverse: Whenever the matrices \(A,B\in \mathbb{R}^{n\times n}\) are invertible, we have \(\left(A^{-1}\right)^{-1} = A\), \((AB)^{-1} = B^{-1}A^{-1}\), and \(\left(A^T\right)^{-1} = \left(A^{-1}\right)^T\).

Matrix rank: The rank of a matrix \(A\in \mathbb{R}^{m\times n}\) is the dimension of the linear space spanned by its rows (or columns). One can verify that the row rank and the column rank coincide, so that \(\mathrm{rank}(A) = \mathrm{rank}(A^T) \leq \min\{m,n\}\), that \(\mathrm{rank}(AB) \leq \min\{\mathrm{rank}(A), \mathrm{rank}(B)\}\) whenever the product is defined, and that a square matrix \(A\in \mathbb{R}^{n\times n}\) is invertible if and only if \(\mathrm{rank}(A) = n\).

Matrix trace: For a square matrix \(A\in \mathbb{R}^{n\times n}\), the trace of \(A\) is defined as \[\mathrm{tr}(A) = \sum_{i=1}^n A_{ii},\] i.e., it is the sum of all the diagonal entries of \(A\). Specifically, the traces of matrices satisfy the following properties: \(\mathrm{tr}(A+B) = \mathrm{tr}(A) + \mathrm{tr}(B)\), \(\mathrm{tr}(cA) = c\,\mathrm{tr}(A)\) for any \(c\in \mathbb{R}\), \(\mathrm{tr}(A) = \mathrm{tr}(A^T)\), and \(\mathrm{tr}(AB) = \mathrm{tr}(BA)\) for any \(A\in \mathbb{R}^{m\times n}\) and \(B\in \mathbb{R}^{n\times m}\).

Determinant: For a square matrix \(A\in \mathbb{R}^{n\times n}\), its determinant \(\det(A)\) or \(|A|\) is defined as \[\det(A) = \sum_{\pi} \left(\mathrm{sign}(\pi) \prod_{i=1}^n A_{i\pi(i)}\right),\] where the sum is over all \(n!\) permutations \(\pi:\{1,...,n\} \to \{1,...,n\}\) and \(\mathrm{sign}(\pi)=1\) or \(-1\) according to whether the minimum number of transpositions (i.e., pairwise interchanges) needed to obtain \(\pi\) from the identity permutation of \(\{1,...,n\}\) is even or odd. One can also calculate \(\det(A)\) through the Laplace expansion by minors along row \(i\) or column \(j\) as \[\det(A) = \sum_{k=1}^n (-1)^{i+k} A_{ik} \, \det(M_{ik}) = \sum_{k=1}^n (-1)^{k+j} A_{kj} \,\det(M_{kj}),\] where \(M_{ik} \in \mathbb{R}^{(n-1)\times (n-1)}\) denotes the submatrix of \(A\) obtained by removing row \(i\) and column \(k\) of \(A\). Geometrically, the determinant of \(A = \left[\mathbf{a}_1, \mathbf{a}_2,...,\mathbf{a}_n\right]\in \mathbb{R}^{n\times n}\) gives the signed volume of an \(n\)-dimensional parallelotope \(\mathcal{P}=\left\{c_1\mathbf{a}_1+\cdots +c_n\mathbf{a}_n: c_1,...,c_n \in [0,1]\right\}\), i.e., \[\det(A) = \pm \mathrm{Volume}(\mathcal{P}),\] where \(\mathbf{a}_1,...,\mathbf{a}_n\) are the column vectors of \(A\).

Example 4. We give explicit formulae for computing the determinants of square matrices with dimension at most 3 as: \[\begin{aligned} \det[A_{11}] &= A_{11}, \\ \det\begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{bmatrix} &= A_{11} A_{22} - A_{12} A_{21},\\ \det\begin{bmatrix} A_{11} & A_{12} & A_{13}\\ A_{21} & A_{22} & A_{23}\\ A_{31} & A_{32} & A_{33} \end{bmatrix} &= A_{11} A_{22} A_{33} + A_{12} A_{23} A_{31} + A_{13} A_{21} A_{32} \\ &\quad - A_{11} A_{23} A_{32} - A_{12} A_{21} A_{33} - A_{13} A_{22} A_{31}. \end{aligned}\]
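The \(2\times 2\) and \(3\times 3\) formulae can be cross-checked numerically; a short NumPy sketch with illustrative matrices:

```python
import numpy as np

A2 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
print(np.linalg.det(A2))       # 1*4 - 2*3 = -2.0

A3 = np.array([[2.0, 0.0, 1.0],
               [1.0, 3.0, 0.0],
               [0.0, 1.0, 4.0]])
# Compare with the 3x3 formula above
manual = (2*3*4 + 0*0*0 + 1*1*1) - (2*0*1 + 0*1*4 + 1*3*0)
print(np.isclose(np.linalg.det(A3), manual))   # True (both equal 25)
```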

Properties of determinant: For any \(A,B\in \mathbb{R}^{n\times n}\), \(\det(AB) = \det(A)\det(B)\), \(\det(A^T) = \det(A)\), and \(\det(cA) = c^n \det(A)\) for any \(c\in \mathbb{R}\). Moreover, \(A\) is invertible if and only if \(\det(A) \neq 0\), in which case \(\det(A^{-1}) = 1/\det(A)\).

Special Types of Matrices

Diagonal matrix: A matrix \(D\in \mathbb{R}^{n\times n}\) is diagonal if \(D_{ij}=0\) whenever \(i\neq j\). We write a diagonal matrix \(D\) as \[D = \mathrm{diag}(d_1, d_2, \dots, d_n)=\begin{bmatrix}d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n\end{bmatrix}.\]

One can verify that \[D^k = \begin{bmatrix}d_1^k & 0 & \cdots & 0\\ 0 & d_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n^k\end{bmatrix}.\]

Triangular matrix: A matrix \(A\in \mathbb{R}^{n\times n}\) is lower triangular if \(A_{ij}=0\) whenever \(i<j\). That is, a lower triangular matrix has all its nonzero elements on or below the diagonal. Similarly, a matrix \(A\) is upper triangular if its transpose \(A^T\) is lower triangular. When \(A\) is a lower or upper triangular matrix, \(\det(A) = \prod_{i=1}^n A_{ii}\).

Orthogonal matrix: A square matrix \(U\in \mathbb{R}^{n\times n}\) is orthogonal if \(UU^T = U^TU = \mathbf{I}_n\). This implies that \(U^{-1} = U^T\), that the columns (and rows) of \(U\) form an orthonormal basis of \(\mathbb{R}^n\), that \(U\) preserves Euclidean norms and inner products (i.e., \(\left|\left| U\mathbf{x} \right|\right|_2 = \left|\left| \mathbf{x} \right|\right|_2\) and \(\langle U\mathbf{x}, U\mathbf{y}\rangle = \langle \mathbf{x}, \mathbf{y}\rangle\) for all \(\mathbf{x},\mathbf{y}\in \mathbb{R}^n\)), and that \(\det(U) = \pm 1\).

Symmetric matrix: A square matrix \(A \in \mathbb{R}^{n\times n}\) is symmetric if \(A = A^T\), i.e., \(A_{ij} = A_{ji}\) for all entries of \(A\).

Projection matrix: A square matrix \(P\in \mathbb{R}^{n\times n}\) is a projection matrix if it is symmetric and idempotent: \(P^2=P\).

Positive definite matrix: A (real) symmetric matrix \(S \in \mathbb{R}^{n\times n}\) is positive semi-definite (PSD) if its quadratic form is nonnegative, i.e., \[\mathbf{x}^T S\mathbf{x} \geq 0\] for all \(\mathbf{x}\in \mathbb{R}^n\). Furthermore, \(S\) is positive definite (PD) if its quadratic form is strictly positive, i.e., \[\mathbf{x}^T S\mathbf{x} > 0\] for all \(\mathbf{x}\in \mathbb{R}^n\) with \(\mathbf{x} \neq \mathbf{0}\). Here are some useful properties of PSD or PD matrices: for any matrix \(A\in \mathbb{R}^{m\times n}\), both \(A^TA\) and \(AA^T\) are PSD; the sum of two PSD matrices is PSD; every PD matrix is invertible, and its inverse is also PD; and all eigenvalues of a PSD (resp. PD) matrix are nonnegative (resp. strictly positive).
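To see the first of these properties in action, here is a minimal NumPy sketch (the matrices are randomly generated purely for illustration) that forms \(S = A^TA\) and checks its quadratic form numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
S = A.T @ A           # A^T A is always symmetric positive semi-definite

# Check the quadratic form x^T S x >= 0 on a batch of random vectors
X = rng.standard_normal((1000, 3))
quad = np.einsum('ij,jk,ik->i', X, S, X)   # x^T S x for each row x of X
print(np.all(quad >= -1e-10))              # True (up to floating-point error)
```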

Eigenvalues and Eigenvectors

Given a square matrix \(A \in \mathbb{R}^{n\times n}\), \(\lambda \in \mathbb{R}\) is an eigenvalue of \(A\) with the corresponding eigenvector \(\mathbf{x} \in \mathbb{R}^n\) and \(\mathbf{x} \neq \mathbf{0}\) if \(A\mathbf{x} = \lambda \mathbf{x}\).

Here, \(\mathbf{0} \in \mathbb{R}^n\) stands for a vector whose entries are all zero. By convention, the zero vector cannot be an eigenvector of any matrix.

Example 5. If \(A = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix}\), then the vector \(\mathbf{x}=\begin{bmatrix}3 \\ -3\end{bmatrix}\) is an eigenvector with eigenvalue \(1\), because \[A\mathbf{x} = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix}\begin{bmatrix}3 \\ -3\end{bmatrix} = \begin{bmatrix}3 \\ -3\end{bmatrix} = 1 \times \begin{bmatrix}3 \\ -3\end{bmatrix}.\]
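This eigenpair can also be verified numerically; a minimal NumPy sketch using the matrix and vector from Example 5:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x = np.array([3.0, -3.0])

# Check that A x = 1 * x, i.e., (1, x) is an eigenpair of A
print(np.allclose(A @ x, 1.0 * x))    # True

# np.linalg.eig returns all eigenvalues and (column) eigenvectors
eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))               # [1. 3.]
```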

Solving for eigenvalues and eigenvectors

We exploit the fact that \(A\mathbf{x} = \lambda \mathbf{x}\) if and only if \[(A - \lambda \mathbf{I}_n) \mathbf{x} = \mathbf{0}.\] (Note that \(\lambda \mathbf{I}_n\) is the diagonal matrix whose diagonal entries are all \(\lambda\) and whose other entries are zero.)

This equation has a nonzero solution \(\mathbf{x}\) if and only if \(\det(A - \lambda \mathbf{I}_n)=0\). Therefore, we can obtain the eigenvalues of a matrix \(A\) by solving the characteristic equation \(\det(A - \lambda \mathbf{I}_n) = 0\) for \(\lambda\). Once we have the eigenvalues, we can find the corresponding eigenvector for each eigenvalue \(\lambda\) by solving the system of equations \((A - \lambda \mathbf{I}_n) \mathbf{x} = \mathbf{0}\) for \(\mathbf{x}\).

Example 6. If \(A = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix}\), then \[A - \lambda \mathbf{I}_n = \begin{bmatrix}2 - \lambda & 1 \\ 1 & 2 - \lambda\end{bmatrix}\] and \[\det(A - \lambda \mathbf{I}_n) = (2 - \lambda)^2 - 1 = \lambda^2 - 4 \lambda + 3.\] Setting it to \(0\) yields the eigenvalues \(\lambda = 1\) and \(\lambda = 3\).

(i) To find the eigenvectors for \(\lambda = 1\), we plug \(\lambda\) into the equation \((A - \lambda \mathbf{I}_n) \mathbf{x} = 0\). This gives us \[\begin{bmatrix}1 & 1 \\ 1 & 1\end{bmatrix} \begin{bmatrix}x_1 \\ x_2\end{bmatrix} = \begin{bmatrix}0 \\ 0\end{bmatrix}\] Any vector with \(x_2 = -x_1\) is a solution to this equation, and in particular, \(\begin{bmatrix} 3 \\ -3\end{bmatrix}\) is one solution.

(ii) To find the eigenvectors for \(\lambda = 3\), we again plug \(\lambda\) into the equation and obtain that \[\begin{bmatrix}-1 & 1 \\ 1 & -1\end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}0 \\ 0\end{bmatrix}\] Any vector where \(x_2 = x_1\) is a solution to this equation.

\(\star\) Note: The above method is not used in practice to compute eigenvalues and eigenvectors of large matrices. We will introduce the power iteration method in the lecture (Lecture 6: Dimensionality Reduction) to find eigenpairs instead.
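For reference, here is a minimal sketch of power iteration in NumPy, assuming the matrix has a unique eigenvalue of largest absolute value (the function name power_iteration is chosen just for illustration):

```python
import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    """A minimal sketch of power iteration: approximates the eigenvalue of A
    with the largest absolute value and a corresponding unit eigenvector."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    for _ in range(num_iters):
        x = A @ x
        x = x / np.linalg.norm(x)        # re-normalize to avoid overflow
    eigenvalue = x @ A @ x               # Rayleigh quotient for unit-norm x
    return eigenvalue, x

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_iteration(A)
print(round(lam, 6))                      # 3.0 (the dominant eigenvalue)
print(np.allclose(A @ v, lam * v))        # True
```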

Properties of eigenvalues and eigenvectors

For a square matrix \(A\in \mathbb{R}^{n\times n}\) with eigenvalues \(\lambda_1,...,\lambda_n\) (counted with multiplicity), we have \(\mathrm{tr}(A) = \sum_{i=1}^n \lambda_i\) and \(\det(A) = \prod_{i=1}^n \lambda_i\). The eigenvalues of a triangular (or diagonal) matrix are its diagonal entries, and \(A\) is invertible if and only if all of its eigenvalues are nonzero. Moreover, a real symmetric matrix has real eigenvalues, and its eigenvectors associated with distinct eigenvalues are orthogonal.

Matrix Norms

Frobenius norm: Given a matrix \(A\in \mathbb{R}^{m\times n}\), its Frobenius norm is defined as \[\left|\left| A \right|\right|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^TA)}.\] We can compute \(\left|\left| A \right|\right|_F\) as \(\left|\left| A \right|\right|_F = \sqrt{\sigma_1(A)^2 + \cdots + \sigma_q(A)^2}\), where \(\sigma_i(A), i=1,...,q\) are the singular values of \(A\) and \(q=\min\{m,n\}\); see the section on the Singular Value Decomposition below for the definition of singular values. In particular, if \(A\) is a symmetric matrix in \(\mathbb{R}^{n\times n}\), then \(\left|\left| A \right|\right|_F = \sqrt{\sum_{i=1}^n \lambda_i^2}\) with \(\lambda_1,...,\lambda_n\) being the eigenvalues of \(A\).

Maximum norm: The maximum norm (or \(\ell_{\infty}\)-norm) for \(A\in \mathbb{R}^{m\times n}\) is defined as \(\left|\left| A \right|\right|_{\max} = \max_{i,j} |A_{ij}|\). Strictly speaking, \(\left|\left| \cdot \right|\right|_{\max}\) is not a matrix norm because it does not satisfy the submultiplicativity \(\left|\left| A B \right|\right| \leq \left|\left| A \right|\right| \left|\left| B \right|\right|\). However, it is a vector norm when we consider \(\mathbb{R}^{m\times n}\) as an \(mn\)-dimensional vector space.

Operator norm: For any matrix \(A\in \mathbb{R}^{m\times n}\), given the \(\ell_p\)-norms for vectors in \(\mathbb{R}^m\) and \(\mathbb{R}^n\), the corresponding operator norm \(\left|\left| A \right|\right|_p\) is defined as \[\left|\left| A \right|\right|_p = \sup_{\mathbf{x}\neq \mathbf{0}} \frac{\left|\left| A\mathbf{x} \right|\right|_p}{\left|\left| \mathbf{x} \right|\right|_p}.\] For the special cases when \(p=1,2,\infty\), these (induced) operator norms can be computed as \(\left|\left| A \right|\right|_1 = \max_{1\leq j\leq n} \sum_{i=1}^m |A_{ij}|\) (the maximum absolute column sum), \(\left|\left| A \right|\right|_2 = \sigma_{\max}(A)\) (the largest singular value of \(A\)), and \(\left|\left| A \right|\right|_{\infty} = \max_{1\leq i\leq m} \sum_{j=1}^n |A_{ij}|\) (the maximum absolute row sum).

There are several useful inequalities between these matrix norms. For any \(A\in \mathbb{R}^{m\times n}\), \[\left|\left| A \right|\right|_2 \leq \left|\left| A \right|\right|_F \leq \sqrt{n} \left|\left| A \right|\right|_2, \; \left|\left| A \right|\right|_{\max} \leq \left|\left| A \right|\right|_2 \leq \sqrt{mn} \left|\left| A \right|\right|_{\max}, \; \text{ and } \; \left|\left| A \right|\right|_F \leq \sqrt{mn} \left|\left| A \right|\right|_{\max}.\]
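The matrix norms above, and one of these inequalities, can be computed and checked directly in NumPy; a short sketch with an arbitrary \(2\times 2\) matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

fro   = np.linalg.norm(A, 'fro')      # Frobenius norm
mx    = np.max(np.abs(A))             # maximum (entrywise) norm
op1   = np.linalg.norm(A, 1)          # operator l1-norm: max absolute column sum
op2   = np.linalg.norm(A, 2)          # operator l2-norm: largest singular value
opinf = np.linalg.norm(A, np.inf)     # operator l_inf-norm: max absolute row sum

print(fro, mx, op1, op2, opinf)
# Check ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2 with n = 2:
print(op2 <= fro <= np.sqrt(2) * op2)   # True
```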

Spectral Decomposition and Singular Value Decomposition (SVD)

Theorem 1 (Spectral Decomposition of a Real Symmetric Matrix). For a symmetric (square) matrix \(A\in \mathbb{R}^{n\times n}\), there exists a real orthogonal matrix \(U\in \mathbb{R}^{n\times n}\) such that \[A= U\Lambda U^T = \sum_{i=1}^n \lambda_i \mathbf{u}_i \mathbf{u}_i^T,\] where \(\Lambda = \mathrm{diag}(\lambda_1,...,\lambda_n)\), \(U=\left[\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_n\right]\), and \(\mathbf{u}_1,...,\mathbf{u}_n\) are orthonormal eigenvectors of \(A\) associated with eigenvalues \(\lambda_1,...,\lambda_n\).

The spectral decomposition also provides us with a convenient method for computing the power \(A^k = U \Lambda^k U^T\) and exponentiation \(\exp(A) = U \exp(\Lambda) U^T\) of a real symmetric matrix \(A\in \mathbb{R}^{n\times n}\).
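A minimal NumPy sketch (using an arbitrary symmetric \(2\times 2\) matrix) that computes a spectral decomposition with np.linalg.eigh and uses it to form a matrix power:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eigh is the routine for symmetric (Hermitian) matrices:
# it returns real eigenvalues (in ascending order) and orthonormal eigenvectors.
lam, U = np.linalg.eigh(A)
Lam = np.diag(lam)

print(np.allclose(A, U @ Lam @ U.T))            # A = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(2)))          # U is orthogonal

# Matrix power via the spectral decomposition: A^3 = U Lambda^3 U^T
print(np.allclose(np.linalg.matrix_power(A, 3), U @ np.diag(lam**3) @ U.T))  # True
```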

While the spectral decomposition (Theorem 1) only works for symmetric (square) matrices, it is also feasible to diagonalize a rectangular matrix \(A\in \mathbb{R}^{m\times n}\) through orthogonal matrices.

Theorem 2 (Singular Value Decomposition (SVD)). Let \(A\in \mathbb{R}^{m\times n}\) with \(q=\min\{m,n\}\). There exist orthogonal matrices \(\widetilde{U}=\left[\mathbf{u}_1,...,\mathbf{u}_m\right]\in \mathbb{R}^{m\times m}\) and \(\widetilde{V}=\left[\mathbf{v}_1,...,\mathbf{v}_n\right]\in \mathbb{R}^{n\times n}\) as well as a (square) diagonal matrix \(\Sigma_q = \mathrm{diag}(\sigma_1,...,\sigma_q) \in \mathbb{R}^{q\times q}\) such that \[A = \widetilde{U}\Sigma \widetilde{V}^T = \sum_{i=1}^q \sigma_i \mathbf{u}_i \mathbf{v}_i^T = U \Sigma_q V^T,\] where \(U = \left[\mathbf{u}_1,...,\mathbf{u}_q\right] \in \mathbb{R}^{m\times q}\), \(V=\left[\mathbf{v}_1,...,\mathbf{v}_q\right] \in \mathbb{R}^{n\times q}\), and \[\begin{aligned} \Sigma &= \Sigma_q \text{ if } m=n,\\ \Sigma &= \left[\Sigma_q \; \mathbf{0} \right] \in \mathbb{R}^{m\times n} \text{ if } n>m,\\ \Sigma &= \begin{bmatrix} \Sigma_q\\ \mathbf{0} \end{bmatrix} \in \mathbb{R}^{m\times n} \text{ if } m>n. \end{aligned}\] Here, \(\sigma_1,...,\sigma_q\) are called the singular values of \(A\), which are the square roots of the eigenvalues of \(AA^T\) when \(m\leq n\) or of \(A^T A\) when \(m>n\).

Notice that the number of nonzero singular values of \(A\) equals the rank of \(A\). During the lecture (Lecture 6: Dimensionality Reduction), we will leverage the singular value decomposition to reduce the dimension (or matrix rank) of a user-movie rating matrix.
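To make this concrete, here is a short NumPy sketch (with the \(3\times 2\) matrix from Example 3 as an illustration) that computes a thin SVD, recovers the rank from the nonzero singular values, and forms the best rank-1 approximation:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # m = 3, n = 2, q = 2

# full_matrices=False gives the "thin" SVD: U (m x q), s (q,), Vt (q x n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))            # True
print(s)                                              # singular values, descending

# The rank equals the number of nonzero singular values
print(np.sum(s > 1e-10), np.linalg.matrix_rank(A))    # 2 2

# Best rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A1, 2))                      # equals the second singular value
```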