Acknowledgment: This note was originally compiled by Jessica Su for CS224W at Stanford, with substantial modifications by Yikun Zhang in Winter 2023 for CSE547 / STAT548 at UW. Parts of this note are adapted from Lecture 8 of Professor Yen-Chi Chen’s 1 and Professor Michael Perlman’s lecture notes 2 for STAT512. Other good references on linear algebra include Horn and Johnson’s textbook 3, Axler’s textbook 4, and the notes from CS224W at Stanford:
http://snap.stanford.edu/class/cs224w-2014/recitation/linear_algebra/LA_Slides.pdf,
http://snap.stanford.edu/class/cs224w-2015/recitation/linear_algebra.pdf.
Note: We only discuss vectors and matrices with real entries in this note, though the stated results also hold for complex entries.
Vector space: A vector space over the real numbers \(\mathbb{R}\) is a set of vectors that is closed under vector addition, with the zero vector \(\boldsymbol{\mathbf{0}}\) as the additive identity and with the additive inverse of each vector contained in the set. It is also closed under scalar multiplication of its vectors by elements of \(\mathbb{R}\).
The most common vector space in Machine Learning is the Euclidean space
\(\mathbb{R}^n\), which consists of all
ordered \(n\)-tuples of real numbers. A
vector of \(\mathbb{R}^n\) can be
denoted by \[\boldsymbol{\mathbf{x}}=
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}\] or a row vector \(\boldsymbol{\mathbf{x}}^T=\left[x_1,...,x_n\right]\),
where \(x_i, i=1,...,n\) are called its
components or coordinates.
Dot product: The geometric properties of \(\mathbb{R}^n\) are derived from the Euclidean dot product defined as: \[\langle \boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{y}} \rangle=\boldsymbol{\mathbf{x}}^T\boldsymbol{\mathbf{y}}=x_1 y_1 + \cdots + x_n y_n=\sum_{i=1}^n x_iy_i,\] where \(\boldsymbol{\mathbf{x}}=\left[x_1,...,x_n\right]^T\) and \(\boldsymbol{\mathbf{y}}=\left[y_1,...,y_n\right]^T\) are in \(\mathbb{R}^n\).
Orthogonality: Two vectors in \(\mathbb{R}^n\) are orthogonal if and only if their dot product is zero. In \(\mathbb{R}^2\), we also call orthogonal vectors perpendicular.
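As a quick numerical illustration of the dot product and the orthogonality test, here is a minimal sketch assuming NumPy is available (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 0.0, -1.0])

# Dot product <x, y> = sum_i x_i * y_i
print(np.dot(x, y))                    # 1*3 + 2*0 + 3*(-1) = 0.0

# x and y are orthogonal since their dot product is zero
print(np.isclose(np.dot(x, y), 0.0))   # True
```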
Norm: The standard \(\ell_2\)-norm or length of a vector \(\boldsymbol{\mathbf{x}}=[x_1,..., x_n]^T\in \mathbb{R}^n\) is given by \[\left|\left| \boldsymbol{\mathbf{x}} \right|\right|_2 = \sqrt{x_1^2 + \cdots + x_n^2}.\] Other possible norms in \(\mathbb{R}^n\) include
\(\ell_p\)-norm (\(p\geq 1\)): \(\left|\left| \boldsymbol{\mathbf{x}} \right|\right|_p = \left(\sum\limits_{i=1}^n |x_i|^p\right)^{\frac{1}{p}}\). It reduces to the above \(\ell_2\)-norm when \(p=2\).
\(\ell_{\infty}\)-norm: \(\left|\left| \boldsymbol{\mathbf{x}} \right|\right|_{\infty} = \max\limits_{i=1,...,n} |x_i|\). Notice that \(\left|\left| \boldsymbol{\mathbf{x}} \right|\right|_{\infty} \leq \left|\left| \boldsymbol{\mathbf{x}} \right|\right|_p\leq n^{\frac{1}{p}} \left|\left| \boldsymbol{\mathbf{x}} \right|\right|_{\infty}\).
When the context is clear, we often write the norm of a vector \(\boldsymbol{\mathbf{x}}\) as \(\left|\left| \boldsymbol{\mathbf{x}} \right|\right|\). The norms in \(\mathbb{R}^n\) can be used to measure distances between data points (or vectors) in \(\mathbb{R}^n\).
Triangle inequality: For two vectors \(\boldsymbol{\mathbf{x}},\boldsymbol{\mathbf{y}}\) and any norm \(\left|\left| \cdot \right|\right|\) in \(\mathbb{R}^n\), the triangle inequality states that \[\left|\left| \boldsymbol{\mathbf{x}}+\boldsymbol{\mathbf{y}} \right|\right| \leq \left|\left| \boldsymbol{\mathbf{x}} \right|\right| + \left|\left| \boldsymbol{\mathbf{y}} \right|\right|,\] and its reverse version goes as \[\left|\left| \boldsymbol{\mathbf{x}}-\boldsymbol{\mathbf{y}} \right|\right| \geq \Big|\left|\left| \boldsymbol{\mathbf{x}} \right|\right| -\left|\left| \boldsymbol{\mathbf{y}} \right|\right| \Big|.\]
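The vector norms above and the triangle inequality can be checked numerically, e.g., with NumPy’s `np.linalg.norm` (a sketch; the chosen vectors and the value \(p=3\) are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
y = np.array([1.0, 2.0, -2.0])

print(np.linalg.norm(x, 2))        # ell_2 norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 3))        # ell_p norm with p = 3
print(np.linalg.norm(x, np.inf))   # ell_inf norm: max |x_i| = 4.0

# Triangle inequality ||x + y|| <= ||x|| + ||y|| (here for the ell_2 norm)
lhs = np.linalg.norm(x + y, 2)
rhs = np.linalg.norm(x, 2) + np.linalg.norm(y, 2)
print(lhs <= rhs)                  # True
```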
Subspace of \(\mathbb{R}^n\): A subspace of \(\mathbb{R}^n\) is a subset of \(\mathbb{R}^n\) that is, by itself, a vector space over \(\mathbb{R}\) using the same operations of vector addition and scalar multiplication as in \(\mathbb{R}^n\). In other words, a nonempty subset of \(\mathbb{R}^n\) is a subspace precisely when it is closed under these two operations.
Linear combination: A linear combination of the vectors \(\boldsymbol{\mathbf{v}}_1,..., \boldsymbol{\mathbf{v}}_k\) (in \(\mathbb{R}^n\)) is any expression of the form \(a_1 \boldsymbol{\mathbf{v}}_1 + \cdots + a_k \boldsymbol{\mathbf{v}}_k\), where \(k\) is a positive integer and \(a_1,..., a_k \in \mathbb{R}\). Note that some of \(a_1,..., a_k\) may be zero.
Span: The span of a set \(\mathcal{S}\) of vectors consists of all possible linear combinations of finitely many vectors in \(\mathcal{S}\), i.e., \[\mathrm{span}\,\mathcal{S} = \left\{a_1 \boldsymbol{\mathbf{v}}_1 + \cdots + a_k \boldsymbol{\mathbf{v}}_k: \boldsymbol{\mathbf{v}}_1,...,\boldsymbol{\mathbf{v}}_k \in \mathcal{S}, a_1,...,a_k\in \mathbb{R}, \text{ and } k=1,2,... \right\}.\]
The vectors \(\boldsymbol{\mathbf{v}}_1,..., \boldsymbol{\mathbf{v}}_k\) (in \(\mathbb{R}^n\)) are linearly dependent if and only if there exist \(a_1,...,a_k \in \mathbb{R}\), not all zero, such that \(a_1 \boldsymbol{\mathbf{v}}_1 + \cdots + a_k \boldsymbol{\mathbf{v}}_k = \boldsymbol{\mathbf{0}}\).
A finite set of vectors \(\boldsymbol{\mathbf{v}}_1,..., \boldsymbol{\mathbf{v}}_k\) (in \(\mathbb{R}^n\)) is linearly independent if it is not linearly dependent. In other words, no vector in \(\boldsymbol{\mathbf{v}}_1,..., \boldsymbol{\mathbf{v}}_k\) can be written as a linear combination of the other vectors.
An \(m\times n\) matrix \(A\in \mathbb{R}^{m\times n}\) is an array of \(mn\) numbers as \[A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n}\\ A_{21} & A_{22} & \cdots & A_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}.\] It represents the linear mapping (or linear transformation) from \(\mathbb{R}^n\) to \(\mathbb{R}^m\) as \[\boldsymbol{\mathbf{x}} \mapsto A\boldsymbol{\mathbf{x}} = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n}\\ A_{21} & A_{22} & \cdots & A_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}= \begin{bmatrix} \sum\limits_{i=1}^n A_{1i} x_i\\ \sum\limits_{i=1}^n A_{2i} x_i\\ \vdots\\ \sum\limits_{i=1}^n A_{mi} x_i\\ \end{bmatrix}\quad \text{ for any } \boldsymbol{\mathbf{x}} = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}\in \mathbb{R}^n.\] Here, the linearity means that \(A(a\boldsymbol{\mathbf{x}}+b\boldsymbol{\mathbf{y}}) = aA\boldsymbol{\mathbf{x}} + bA\boldsymbol{\mathbf{y}}\) for any \(\boldsymbol{\mathbf{x}},\boldsymbol{\mathbf{y}}\in \mathbb{R}^n\) and \(a,b\in \mathbb{R}\). In particular, when \(m=n\), \(A\in \mathbb{R}^{n\times n}\) is called a square matrix.
Matrix addition: If \(A,B\) are both \(m\times n\) matrices, then the matrix addition is defined as elementwise additions as: \[[A+B]_{ij} = A_{ij} + B_{ij}.\]
Example 1. Here is an example of a matrix addition for two matrices in \(\mathbb{R}^{2\times 2}\) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix}5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 + 5 & 2 + 6\\ 3 + 7 & 4 + 8 \end{bmatrix} = \begin{bmatrix} 6 & 8\\ 10 & 12 \end{bmatrix}.\]
Matrix multiplication: For two matrices \(A\in \mathbb{R}^{m\times n}\) and \(B\in \mathbb{R}^{n\times p}\), the product \(AB\) is an \(m\times p\) matrix whose \((i,j)\)-entry is \[[AB]_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\] for all \(1\leq i\leq m\) and \(1\leq j\leq p\).
Example 2. Here is an example of the matrix multiplication for two square matrices in \(\mathbb{R}^{2\times 2}\) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix}5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}.\] We can also multiply non-square matrices when their dimensions are matched (i.e., the number of columns of the first matrix should be equal to the number of rows of the second matrix) as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \\ 5 & 6\end{bmatrix} \cdot \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 2 \cdot 4 & 1 \cdot 2 + 2 \cdot 5 & 1 \cdot 3 + 2 \cdot 6 \\ 3 \cdot 1 + 4 \cdot 4 & 3 \cdot 2 + 4 \cdot 5 & 3 \cdot 3 + 4 \cdot 6 \\ 5 \cdot 1 + 6 \cdot 4 & 5 \cdot 2 + 6 \cdot 5 & 5 \cdot 3 + 6 \cdot 6 \end{bmatrix} = \begin{bmatrix} 9 & 12 & 15 \\ 19 & 26 & 33 \\ 29 & 40 & 51 \end{bmatrix}.\]
Properties of matrix multiplications:
Associativity: \((AB)C = A(BC)\).
Distributivity: \(A(B + C) = AB + AC\).
However, matrix multiplication is in general not commutative; that is, \(AB\) is not necessarily equal to \(BA\) (see the numerical check after this list).
The matrix multiplication between a \(1\)-by-\(n\) matrix and an \(n\)-by-\(1\) matrix is the same as taking the dot product of the corresponding vectors.
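These properties can be verified numerically; the following NumPy sketch (with arbitrarily chosen matrices) checks non-commutativity and the dot-product interpretation of a \(1\times n\) times \(n\times 1\) product:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Associativity and distributivity hold, but commutativity fails in general
print(np.allclose(A @ B, B @ A))   # False: AB != BA for these matrices

# A 1-by-n matrix times an n-by-1 matrix equals the dot product of the vectors
x = np.array([[1, 2, 3]])          # 1-by-3 (row)
y = np.array([[4], [5], [6]])      # 3-by-1 (column)
print(x @ y)                       # [[32]], and 1*4 + 2*5 + 3*6 = 32
```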
Matrix transpose: If \(A = [A_{ij}] \in \mathbb{R}^{m\times n}\), then its transpose \(A^T\) is an \(n\times m\) matrix whose \((i,j)\)-entry is \(A_{ji}\). That is, \([A^T]_{ij} = A_{ji}\).
Example 3. Here is an example of transposing a \(3\times 2\) matrix, where we switch the matrix’s rows with its columns as \[\begin{bmatrix}1 & 2 \\ 3 & 4 \\ 5 & 6\end{bmatrix}^T = \begin{bmatrix}1 & 3 & 5 \\ 2 & 4 & 6\end{bmatrix}.\]
Properties of matrix transpose:
\((A^T)^T = A\) for any matrix \(A \in \mathbb{R}^{m\times n}\).
\((A+B)^T = A^T + B^T\) with \(A,B \in \mathbb{R}^{m\times n}\).
\((AB)^T = B^T A^T\) with \(A\in \mathbb{R}^{m\times n}\) and \(B\in \mathbb{R}^{n\times p}\).
Proof. Let \(AB=C\) and \((AB)^T=D\). Then, \[\begin{aligned} (AB)^T_{ij} &= D_{ij} = C_{ji} \\ &= \sum_{k}A_{jk}B_{ki}\\ &= \sum_{k}(A^T)_{kj}(B^T)_{ik}\\ &= \sum_{k}(B^T)_{ik}(A^T)_{kj}. \end{aligned}\] It shows that \(D=B^TA^T\) and the result follows. ◻
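A quick numerical sanity check of \((AB)^T = B^TA^T\) (a sketch with arbitrary matrices of compatible dimensions):

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])    # 3-by-2
B = np.array([[1, 2, 3], [4, 5, 6]])      # 2-by-3

print(np.allclose((A @ B).T, B.T @ A.T))  # True
```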
Identity matrix: The identity matrix \(\boldsymbol{\mathbf{I}}_n\) is an \(n\times n\) (square) matrix given by \[\boldsymbol{\mathbf{I}}_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{bmatrix},\] where it has all \(1\)’s on the diagonal and \(0\)’s everywhere else. It is sometimes abbreviated \(\boldsymbol{\mathbf{I}}\) when the dimension of the matrix is clear. For any \(A \in \mathbb{R}^{m\times n}\), it holds that \(A\boldsymbol{\mathbf{I}}_n = \boldsymbol{\mathbf{I}}_m A = A\).
Matrix inverse: Given a square matrix \(A\in \mathbb{R}^{n\times n}\), its inverse \(A^{-1}\) (if it exists) is the unique matrix satisfying \[A A^{-1} = A^{-1} A=\boldsymbol{\mathbf{I}}_n.\] Notice that the inverse of a matrix may not always exist. Those matrices that have an inverse are called invertible or nonsingular.
Properties of matrix inverse: Whenever the matrices \(A,B\in \mathbb{R}^{n\times n}\) are invertible, we have the following properties.
\((A^{-1})^{-1} = A\).
\((AB)^{-1} = B^{-1}A^{-1}\).
\((A^{-1})^T = (A^T)^{-1}\). (It can be proved by noting that \((A^{-1})^T(A^T)=(AA^{-1})^T=\boldsymbol{\mathbf{I}}_n\).)
All the columns (or rows) of \(A\) are linearly independent, i.e., \(\mathrm{rank}(A)=n\).
\(\det(A) \neq 0\).
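These properties can be illustrated with NumPy’s `np.linalg.inv` (a sketch; the matrices below are arbitrary invertible matrices):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
B = np.array([[1.0, 2.0], [3.0, 4.0]])

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))                             # A A^{-1} = I
print(np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ A_inv))   # (AB)^{-1} = B^{-1} A^{-1}
print(np.allclose(np.linalg.inv(A.T), A_inv.T))                      # (A^T)^{-1} = (A^{-1})^T
print(np.linalg.matrix_rank(A), np.linalg.det(A))                    # rank 2 and nonzero determinant
```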
Matrix rank: The rank of a matrix \(A\in \mathbb{R}^{m\times n}\) is the dimension of the linear space spanned by its rows (or columns). One can verify that
\(\mathrm{rank}(A) \leq \min\{m,n\}\) and \(\mathrm{rank}(A) = \mathrm{rank}(A^T)\).
\(\mathrm{rank}(AB)\leq \min\left\{\mathrm{rank}(A), \mathrm{rank}(B)\right\}\) for any \(A\in \mathbb{R}^{m\times n}\) and \(B\in \mathbb{R}^{n\times p}\).
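For instance, one can compute matrix ranks with `np.linalg.matrix_rank` and check these inequalities (a sketch with arbitrary matrices):

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 4, 6]])      # second row = 2 * first row, so rank 1
B = np.array([[1, 0], [0, 1], [1, 1]])    # rank 2

print(np.linalg.matrix_rank(A))           # 1 <= min{2, 3}
print(np.linalg.matrix_rank(A.T))         # 1, same as rank(A)
print(np.linalg.matrix_rank(A @ B))       # <= min{rank(A), rank(B)} = 1
```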
Matrix trace: For a square matrix \(A\in \mathbb{R}^{n\times n}\), the trace of \(A\) is defined as \[\mathrm{tr}(A) = \sum_{i=1}^n A_{ii},\] i.e., it is the sum of all the diagonal entries of \(A\). Specifically, the traces of matrices satisfy the following properties:
\(\mathrm{tr}(aA+ bB) = a\mathrm{tr}(A) + b\mathrm{tr}(B)\) for any \(A,B\in \mathbb{R}^{n\times n}\) and \(a,b\in \mathbb{R}\).
\(\mathrm{tr}(A) = \mathrm{tr}(A^T)\) for any \(A\in \mathbb{R}^{n\times n}\).
\(\mathrm{tr}(AB) = \mathrm{tr}(BA)\) for any \(A\in \mathbb{R}^{m\times n}\) and \(B\in \mathbb{R}^{n\times m}\).
Proof. By direct calculations, \[\begin{aligned} \mathrm{tr}(AB) = \sum_{i=1}^m [AB]_{ii} &= \sum_{i=1}^m \left(\sum_{k=1}^n A_{ik} B_{ki} \right)\\ &= \sum_{k=1}^n \left(\sum_{i=1}^m B_{ki} A_{ik} \right) = \sum_{k=1}^n [BA]_{kk} = \mathrm{tr}(BA). \end{aligned}\] ◻
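A quick numerical check of the trace properties (a sketch; the matrices are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])    # 2-by-3
B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])  # 3-by-2

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True: tr(AB) = tr(BA)

C = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.isclose(np.trace(C), np.trace(C.T)))        # True: tr(C) = tr(C^T)
```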
Determinant: For a square matrix \(A\in \mathbb{R}^{n\times n}\), its determinant \(\det(A)\) or \(|A|\) is defined as \[\det(A) = \sum_{\pi} \left(\mathrm{sign}(\pi) \prod_{i=1}^n A_{i\pi(i)}\right),\] where the sum is over all \(n!\) permutations \(\pi:\{1,...,n\} \to \{1,...,n\}\) and \(\mathrm{sign}(\pi)=1\) or \(-1\) according to whether the minimum number of transpositions (i.e., pairwise interchanges) necessary to achieve \(\pi\) starting from \(\{1,...,n\}\) is even or odd. One can also calculate \(\det(A)\) through the Laplace expansion by minors along row \(i\) or column \(j\) as \[\det(A) = \sum_{k=1}^n (-1)^{i+k} A_{ik} \, \det(M_{ik}) = \sum_{k=1}^n (-1)^{k+j} A_{kj} \,\det(M_{kj}),\] where \(M_{ik} \in \mathbb{R}^{(n-1)\times (n-1)}\) denotes the submatrix of \(A\) obtained by removing row \(i\) and column \(k\) of \(A\). Geometrically, the determinant of \(A = \left[\boldsymbol{\mathbf{a}}_1, \boldsymbol{\mathbf{a}}_2,...,\boldsymbol{\mathbf{a}}_n\right]\in \mathbb{R}^{n\times n}\) gives the signed volume of the \(n\)-dimensional parallelotope \(\mathcal{P}=\left\{c_1\boldsymbol{\mathbf{a}}_1+\cdots +c_n\boldsymbol{\mathbf{a}}_n: c_1,...,c_n \in [0,1]\right\}\), i.e., \[\det A = \pm \mathrm{Volume}(\mathcal{P}),\] where \(\boldsymbol{\mathbf{a}}_1,...,\boldsymbol{\mathbf{a}}_n\) are the column vectors of \(A\).
Example 4. We give explicit formulae for computing the determinants of square matrices of dimension at most 3: \[\begin{aligned} \det[A_{11}] &= A_{11}, \\ \det\begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{bmatrix} &= A_{11} A_{22} - A_{12} A_{21},\\ \det\begin{bmatrix} A_{11} & A_{12} & A_{13}\\ A_{21} & A_{22} & A_{23}\\ A_{31} & A_{32} & A_{33} \end{bmatrix} &= A_{11} A_{22} A_{33} + A_{12} A_{23} A_{31} + A_{13} A_{21} A_{32} \\ &\quad - A_{11} A_{23} A_{32} - A_{12} A_{21} A_{33} - A_{13} A_{22} A_{31}. \end{aligned}\]
Properties of determinant: For any \(A,B\in \mathbb{R}^{n\times n}\),
\(\det(AB) = \det(A) \cdot \det(B)\).
\(\det(A^{-1}) = \left[\det(A)\right]^{-1}\) and \(\det(A^T) = \det(A)\).
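These determinant properties can be checked with `np.linalg.det` (a sketch with arbitrary invertible matrices):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
B = np.array([[0.0, 1.0], [1.0, 1.0]])

print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A)))    # det(A^{-1}) = 1 / det(A)
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))                       # det(A^T) = det(A)
```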
Diagonal matrix: A matrix \(D\in \mathbb{R}^{n\times n}\) is diagonal if \(D_{ij}=0\) whenever \(i\neq j\). We write a diagonal matrix \(D\) as \[D = \mathrm{diag}(d_1, d_2, \dots, d_n)=\begin{bmatrix}d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n\end{bmatrix}.\]
One can verify that \[D^k = \begin{bmatrix}d_1^k & 0 & \cdots & 0\\ 0 & d_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n^k\end{bmatrix}.\]
Triangular matrix: A matrix \(A\in \mathbb{R}^{n\times n}\) is lower triangular if \(A_{ij}=0\) whenever \(i<j\). That is, a lower triangular matrix has all its nonzero elements on or below the diagonal. Similarly, a matrix \(A\) is upper triangular if its transpose \(A^T\) is lower triangular. When \(A\) is a lower or upper triangular matrix, \(\det(A) = \prod_{i=1}^n A_{ii}\).
Orthogonal matrix: A square matrix \(U\in \mathbb{R}^{n\times n}\) is orthogonal if \(UU^T = U^TU = \boldsymbol{\mathbf{I}}_n\). This implies that
\(U^{-1} = U^T\), i.e., the inverse of an orthogonal matrix is its transpose. Moreover, \(\det(U)=\pm 1\).
the rows (or columns) of \(U\) form an orthonormal basis for \(\mathbb{R}^n\).
\(U\) preserves angles and lengths, i.e., for any vectors \(\boldsymbol{\mathbf{x}},\boldsymbol{\mathbf{y}} \in \mathbb{R}^n\), \[\langle U\boldsymbol{\mathbf{x}}, U\boldsymbol{\mathbf{y}} \rangle = (U\boldsymbol{\mathbf{x}})^T (U\boldsymbol{\mathbf{y}}) = \boldsymbol{\mathbf{x}}^T U^T U \boldsymbol{\mathbf{y}} = \langle \boldsymbol{\mathbf{x}}, \boldsymbol{\mathbf{y}} \rangle \quad \text{ and } \quad \left|\left| U\boldsymbol{\mathbf{x}} \right|\right|_2^2 = \left|\left| \boldsymbol{\mathbf{x}} \right|\right|_2^2.\]
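For example, a \(2\times 2\) rotation matrix is orthogonal; the following sketch verifies the properties above numerically (the rotation angle is arbitrary):

```python
import numpy as np

theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix is orthogonal

print(np.allclose(U.T @ U, np.eye(2)))            # U^T U = I
print(np.isclose(abs(np.linalg.det(U)), 1.0))     # det(U) = +-1

x = np.array([1.0, 2.0])
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # lengths are preserved
```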
Symmetric matrix: A square matrix \(A \in \mathbb{R}^{n\times n}\) is symmetric if \(A = A^T\), i.e., \(A_{ij} = A_{ji}\) for all entries of \(A\).
Projection matrix: A square matrix \(P\in \mathbb{R}^{n\times n}\) is a projection matrix if it is symmetric and idempotent: \(P^2=P\).
Positive definite matrix: A (real) symmetric matrix \(S \in \mathbb{R}^{n\times n}\) is positive semi-definite (PSD) if its quadratic form is nonnegative, i.e., \[\boldsymbol{\mathbf{x}}^T S\boldsymbol{\mathbf{x}} \geq 0\] for all \(\boldsymbol{\mathbf{x}}\in \mathbb{R}^n\). Furthermore, \(S\) is positive definite (PD) if its quadratic form is strictly positive, i.e., \[\boldsymbol{\mathbf{x}}^T S\boldsymbol{\mathbf{x}} > 0\] for all \(\boldsymbol{\mathbf{x}}\in \mathbb{R}^n\) with \(\boldsymbol{\mathbf{x}} \neq \boldsymbol{\mathbf{0}}\). Here are some useful properties of PSD or PD matrices.
A diagonal matrix \(D=\mathrm{diag}(d_1,...,d_n)\) is PSD if and only if \(d_i\geq 0\) for all \(i=1,...,n\). It is PD if and only if \(d_i> 0\) for all \(i=1,...,n\). In particular, the identity matrix \(\boldsymbol{\mathbf{I}}_n\) is PD.
If \(S \in \mathbb{R}^{n\times n}\) is PSD, then \(ASA^T\) is also PSD for any matrix \(A\in \mathbb{R}^{m\times n}\).
If \(S \in \mathbb{R}^{n\times n}\) is PD, then \(ASA^T\) is also PD for any matrix \(A\in \mathbb{R}^{m\times n}\) with full rank \(\mathrm{rank}(A) =m \leq n\).
\(AA^T\) is PSD for any matrix \(A\in \mathbb{R}^{m\times n}\). \(AA^T\) is PD for any matrix \(A\in \mathbb{R}^{m\times n}\) with full rank \(\mathrm{rank}(A) =m \leq n\).
\(S\) is PD \(\implies\) \(S\) has full rank \(\implies\) \(S^{-1}\) exists \(\implies\) \(S^{-1} = (S^{-1}) S (S^{-1})^T\) is PD.
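A minimal numerical illustration of the \(AA^T\) property (a sketch; the matrix \(A\) below is an arbitrary full-row-rank matrix, and the quadratic form is checked on randomly sampled vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])   # full row rank (rank 2)
S = A @ A.T                                        # A A^T should then be PD

# Check the quadratic form x^T S x on a batch of random (nonzero) vectors
X = rng.normal(size=(1000, 2))
quad = np.sum((X @ S) * X, axis=1)                 # x^T S x for each row x of X
print(np.all(quad > 0))                            # True on the sampled vectors
```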
Given a square matrix \(A \in \mathbb{R}^{n\times n}\), \(\lambda \in \mathbb{R}\) is an eigenvalue of \(A\) with the corresponding eigenvector \(\boldsymbol{\mathbf{x}} \in \mathbb{R}^n\) and \(\boldsymbol{\mathbf{x}} \neq \boldsymbol{\mathbf{0}}\) if \(A\boldsymbol{\mathbf{x}} = \lambda \boldsymbol{\mathbf{x}}\).
By convention, the zero vector cannot be an eigenvector of any matrix.
Example 5. If \[A = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix},\] then the vector \(\boldsymbol{\mathbf{x}}=\begin{bmatrix}3 \\ -3\end{bmatrix}\) is an eigenvector with eigenvalue \(1\), because \[A\boldsymbol{\mathbf{x}} = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix}\begin{bmatrix}3 \\ -3\end{bmatrix} = \begin{bmatrix}3 \\ -3\end{bmatrix} = 1 \times \begin{bmatrix}3 \\ -3\end{bmatrix}.\]
We exploit the fact that \(A\boldsymbol{\mathbf{x}} = \lambda \boldsymbol{\mathbf{x}}\) if and only if \[\label{eigen_eq} (A - \lambda \boldsymbol{\mathbf{I}}_n) \boldsymbol{\mathbf{x}} = 0.\] (Note that \(\lambda \boldsymbol{\mathbf{I}}_n\) is the diagonal matrix where all the diagonal entries are \(\lambda\), and all other entries are zero.)
The equation [eigen_eq] has a nonzero solution \(\boldsymbol{\mathbf{x}}\) if and only if \(\det(A - \lambda \boldsymbol{\mathbf{I}}_n)=0\); see Section 1.1 in Horn and Johnson’s textbook. Therefore, we can obtain the eigenvalues of a matrix \(A\) by solving the characteristic equation \(\det(A - \lambda \boldsymbol{\mathbf{I}}_n) = 0\) for \(\lambda\). Once we have done that, we can find the corresponding eigenvector for each eigenvalue \(\lambda\) by solving the system of equations \((A - \lambda \boldsymbol{\mathbf{I}}_n) \boldsymbol{\mathbf{x}} = \boldsymbol{\mathbf{0}}\) for \(\boldsymbol{\mathbf{x}}\).
Example 6. If \[A = \begin{bmatrix}2 & 1 \\ 1 & 2\end{bmatrix},\] then \[A - \lambda \boldsymbol{\mathbf{I}}_n = \begin{bmatrix}2 - \lambda & 1 \\ 1 & 2 - \lambda\end{bmatrix}\] and \[\det(A - \lambda \boldsymbol{\mathbf{I}}_n) = (2 - \lambda)^2 - 1 = \lambda^2 - 4 \lambda + 3.\] Setting it to \(0\) yields that \(\lambda = 1\) and \(\lambda = 3\) are the eigenvalues of \(A\).
(i) To find the eigenvectors for \(\lambda = 1\), we plug \(\lambda\) into the equation \((A - \lambda \boldsymbol{\mathbf{I}}_n) \boldsymbol{\mathbf{x}} = 0\). This gives us \[\begin{bmatrix}1 & 1 \\ 1 & 1\end{bmatrix} \begin{bmatrix}x_1 \\ x_2\end{bmatrix} = \begin{bmatrix}0 \\ 0\end{bmatrix}\] Any vector with \(x_2 = -x_1\) is a solution to this equation, and in particular, \(\begin{bmatrix} 3 \\ -3\end{bmatrix}\) is one solution.
(ii) To find the eigenvectors for \(\lambda = 3\), we again plug \(\lambda\) into the equation and obtain that \[\begin{bmatrix}-1 & 1 \\ 1 & -1\end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}0 \\ 0\end{bmatrix}\] Any vector where \(x_2 = x_1\) is a solution to this equation.
Note: The above method is never used in practice to calculate eigenvalues and eigenvectors of large matrices. We will introduce the power iteration method in our lectures to find eigenpairs instead.
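In practice one calls a library routine; for instance, `np.linalg.eig` reproduces the eigenpairs of Example 6 (a sketch; note that the returned eigenvectors are normalized to unit length and their order is not guaranteed):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

print(eigvals)    # eigenvalues 3 and 1 from Example 6 (order may vary)
print(eigvecs)    # columns are unit-length eigenvectors, proportional to [1, 1] and [1, -1]

# Verify A x = lambda x for the first returned eigenpair
x = eigvecs[:, 0]
print(np.allclose(A @ x, eigvals[0] * x))   # True
```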
If \(A \in \mathbb{R}^{n\times n}\) is symmetric, then all its eigenvalues are real.
The eigenvalues of any (lower or upper) triangular matrix \(A \in \mathbb{R}^{n\times n}\) are its diagonal entries.
The trace of a matrix \(A \in \mathbb{R}^{n\times n}\) is equal to the sum of its eigenvalues, i.e., \(\mathrm{tr}(A) = \sum_{i=1}^n \lambda_i\) with \(\lambda_1,...,\lambda_n\) being the eigenvalues of \(A\).
\(\det(A) = \prod_{i=1}^n \lambda_i\), where \(\lambda_1,...,\lambda_n\) are the eigenvalues of \(A \in \mathbb{R}^{n\times n}\).
A symmetric matrix is PSD (PD) if and only if all its eigenvalues are nonnegative (positive).
The eigenvalues of a projection matrix are either 1 or 0.
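The trace and determinant identities above can be checked numerically (a sketch with an arbitrary symmetric matrix, so that its eigenvalues are real):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])          # symmetric, hence real eigenvalues
eigvals = np.linalg.eigvalsh(A)

print(np.isclose(np.trace(A), np.sum(eigvals)))        # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))  # det(A) = product of eigenvalues
```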
Frobenius norm: Given a matrix \(A\in \mathbb{R}^{m\times n}\), its Frobenius norm is defined as \[\left|\left| A \right|\right|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^TA)}.\] We can compute \(\left|\left| A \right|\right|_F\) as \(\left|\left| A \right|\right|_F = \sqrt{\sigma_1(A)^2 + \cdots + \sigma_q(A)^2}\), where \(\sigma_i(A), i=1,...,q\) are the singular values of \(A\) and \(q=\min\{m,n\}\); see Section 3 for the definition of singular values. In particular, if \(A\) is a symmetric matrix in \(\mathbb{R}^{n\times n}\), then \(\left|\left| A \right|\right|_F = \sqrt{\sum_{i=1}^n \lambda_i^2}\) with \(\lambda_1,...,\lambda_n\) being the eigenvalues of \(A\).
Maximum norm: The maximum norm (or entrywise \(\ell_{\infty}\)-norm) for \(A\in \mathbb{R}^{m\times n}\) is defined as \(\left|\left| A \right|\right|_{\max} = \max_{i,j} |A_{ij}|\). Strictly speaking, \(\left|\left| \cdot \right|\right|_{\max}\) is not a matrix norm because it does not satisfy the submultiplicativity \(\left|\left| A B \right|\right| \leq \left|\left| A \right|\right| \left|\left| B \right|\right|\). However, it is a vector norm when we consider \(\mathbb{R}^{m\times n}\) as an \(mn\)-dimensional vector space; see Section 5.6 in Horn and Johnson’s textbook.
Operator norm: For any matrix \(A\in \mathbb{R}^{m\times n}\) and the \(\ell_p\)-norms for vectors in \(\mathbb{R}^m\) and \(\mathbb{R}^n\), the corresponding operator norm \(\left|\left| A \right|\right|_p\) is defined as \[\left|\left| A \right|\right|_p = \sup_{\boldsymbol{\mathbf{x}}\neq \boldsymbol{\mathbf{0}}} \frac{\left|\left| A\boldsymbol{\mathbf{x}} \right|\right|_p}{\left|\left| \boldsymbol{\mathbf{x}} \right|\right|_p}.\] For the special cases when \(p=1,2,\infty\), these (induced) operator norms can be computed as
\(\left|\left| A \right|\right|_1 = \max\limits_{1\leq j\leq n} \sum_{i=1}^m |A_{ij}|\), which is simply the maximum absolute column sum of the matrix.
\(\left|\left| A \right|\right|_{\infty} = \max\limits_{1\leq i \leq m} \sum_{j=1}^n |A_{ij}|\), which is simply the maximum absolute row sum of the matrix.
\(\left|\left| A \right|\right|_2 = \sqrt{\lambda_{\max}(AA^T)} = \sigma_{\max}(A)\), where \(\lambda_{\max}(AA^T)\) is the maximum eigenvalue of \(AA^T\) and \(\sigma_{\max}(A)\) is the maximum singular value of \(A\).
There are several useful inequalities between these matrix norms. For any \(A\in \mathbb{R}^{m\times n}\), \[\left|\left| A \right|\right|_2 \leq \left|\left| A \right|\right|_F \leq \sqrt{n} \left|\left| A \right|\right|_2, \; \left|\left| A \right|\right|_{\max} \leq \left|\left| A \right|\right|_2 \leq \sqrt{mn} \left|\left| A \right|\right|_{\max}, \; \text{ and } \; \left|\left| A \right|\right|_F \leq \sqrt{mn} \left|\left| A \right|\right|_{\max}.\]
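NumPy’s `np.linalg.norm` computes all of the matrix norms above; here is a sketch checking a few of the stated identities and inequalities on an arbitrary matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0], [0.0, 4.0, -1.0]])
m, n = A.shape

fro = np.linalg.norm(A, 'fro')       # Frobenius norm
one = np.linalg.norm(A, 1)           # maximum absolute column sum
inf = np.linalg.norm(A, np.inf)      # maximum absolute row sum
two = np.linalg.norm(A, 2)           # largest singular value
mx  = np.max(np.abs(A))              # maximum norm

print(one, inf)                                                   # 6.0 and 6.0 for this A
print(np.isclose(two, np.linalg.svd(A, compute_uv=False)[0]))     # ||A||_2 = sigma_max(A)
print(two <= fro <= np.sqrt(n) * two)                             # True
print(mx <= two <= np.sqrt(m * n) * mx)                           # True
```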
Theorem 1 (Spectral Decomposition of a Real Symmetric Matrix). For a symmetric (square) matrix \(A\in \mathbb{R}^{n\times n}\), there exists a real orthogonal matrix \(U\in \mathbb{R}^{n\times n}\) such that \[A= U\Lambda U^T = \sum_{i=1}^n \lambda_i \boldsymbol{\mathbf{u}}_i \boldsymbol{\mathbf{u}}_i^T,\] where \(\Lambda = \mathrm{diag}(\lambda_1,...,\lambda_n)\), \(U=\left[\boldsymbol{\mathbf{u}}_1, \boldsymbol{\mathbf{u}}_2,...,\boldsymbol{\mathbf{u}}_n\right]\), and \(\boldsymbol{\mathbf{u}}_1,...,\boldsymbol{\mathbf{u}}_n\) are orthonormal eigenvectors of \(A\) associated with eigenvalues \(\lambda_1,...,\lambda_n\).
The spectral decomposition also provides us with a convenient method for computing the power \(A^k = U \Lambda^k U^T\) and exponentiation \(\exp(A) = U \exp(\Lambda) U^T\) of a real symmetric matrix \(A\in \mathbb{R}^{n\times n}\).
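With NumPy, `np.linalg.eigh` returns exactly this factorization for a symmetric matrix, and matrix powers and exponentials can be formed from it (a sketch; the matrix below is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])        # symmetric

lam, U = np.linalg.eigh(A)                    # A = U diag(lam) U^T with orthogonal U
print(np.allclose(U @ np.diag(lam) @ U.T, A))             # reconstruction of A
print(np.allclose(U @ np.diag(lam**3) @ U.T, A @ A @ A))  # A^3 via the decomposition

expA = U @ np.diag(np.exp(lam)) @ U.T         # matrix exponential exp(A)
```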
While the spectral decomposition (Theorem 1) only works for symmetric (square) matrices, it is also feasible to diagonalize a rectangular matrix \(A\in \mathbb{R}^{m\times n}\) through orthogonal matrices.
Theorem 2 (Singular Value Decomposition (SVD)). Let \(A\in \mathbb{R}^{m\times n}\) with \(q=\min\{m,n\}\). There exist orthogonal matrices \(\widetilde{U}=\left[\boldsymbol{\mathbf{u}}_1,...,\boldsymbol{\mathbf{u}}_m\right]\in \mathbb{R}^{m\times m}\) and \(\widetilde{V}=\left[\boldsymbol{\mathbf{v}}_1,...,\boldsymbol{\mathbf{v}}_n\right]\in \mathbb{R}^{n\times n}\) as well as a (square) diagonal matrix \(\Sigma_q = \mathrm{diag}(\sigma_1,...,\sigma_q) \in \mathbb{R}^{q\times q}\) such that \[A = \widetilde{U}\Sigma \widetilde{V}^T = \sum_{i=1}^q \sigma_i \boldsymbol{\mathbf{u}}_i \boldsymbol{\mathbf{v}}_i^T = U \Sigma_q V^T,\] where \(U = \left[\boldsymbol{\mathbf{u}}_1,...,\boldsymbol{\mathbf{u}}_q\right] \in \mathbb{R}^{m\times q}\), \(V=\left[\boldsymbol{\mathbf{v}}_1,...,\boldsymbol{\mathbf{v}}_q\right] \in \mathbb{R}^{n\times q}\), and \[\begin{aligned} \Sigma &= \Sigma_q \text{ if } m=n,\\ \Sigma &= \left[\Sigma_q \; \boldsymbol{\mathbf{0}} \right] \in \mathbb{R}^{m\times n} \text{ if } n>m,\\ \Sigma &= \begin{bmatrix} \Sigma_q\\ \boldsymbol{\mathbf{0}} \end{bmatrix} \in \mathbb{R}^{m\times n} \text{ if } m>n. \end{aligned}\] Here, \(\sigma_1,...,\sigma_q \geq 0\) are called the singular values of \(A\); they are the square roots of the eigenvalues of \(AA^T\) when \(m\leq n\) (or of \(A^T A\) when \(m>n\)).
Notice that the number of nonzero singular values of \(A\) equals the rank of \(A\). During Lecture 6, we will leverage the singular value decomposition to reduce the dimension (or matrix rank) of a user-movie rating matrix.
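NumPy’s `np.linalg.svd` computes this decomposition; the sketch below reconstructs an arbitrary (square) matrix from its SVD and reads off its rank from the number of nonzero singular values:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])              # rank 2: the second row is twice the first

U, s, Vt = np.linalg.svd(A)                  # A = U diag(s) V^T
Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, s)
print(np.allclose(U @ Sigma @ Vt, A))        # reconstruction holds

tol = 1e-10
print(np.sum(s > tol))                       # number of nonzero singular values = rank(A) = 2
```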
See http://faculty.washington.edu/yenchic/20A_stat512.html.
M. Perlman. Probability and Mathematical Statistics I (STAT 512 Lecture Notes), 2020. URL https://sites.stat.washington.edu/people/mdperlma/STAT%20512%20MDP%20Notes.pdf.
R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
S. Axler. Linear Algebra Done Right. Springer, 3rd edition, 2015.