So we have some dataset with $n$ examples and $d$ dimensions. For some data, $d$ could be large (~10,000, for example), meaning that the $n \times d$ data matrix becomes quite cumbersome to deal with.

Having smaller dimensionality has several more benefits, including:

- making visualization easier (hard to visualize data in 3, 4, or more dimensions)
- discovering 'intrinsic' dimensionality of the data
- data could actually have many features that are irrelevant

Our goal then could be to reduce the "dimensionality" of the data by significantly reducing the size of $d$. Essentially we want to learn a mapping: $$ f: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times k} $$ with $k \ll d$.

We want this mapping to also be a "pretty good" representation of the data, so we need to introduce a notion of "goodness" that lets us measure how good a given mapping is.

Loosely, we'll think of our "pretty good" representation as one that is able to re-create the original data without much loss of information about it.

We're also going to do this in the unsupervised setting (we don't have the associated $y$'s).

For simplicity, let's assume I just have a 2 dimensional vector $x = (x_{1}, x_{2})$ that I want to reduce to dimensionality 1.

In [4]:

```
from IPython.core.display import Image
Image(filename='reconstruction.png', width = 600)
```

Out[4]:

Okay, so we have this idea of projection and then reconstruction:

- we project the data from the 2 dimensions down into the 1 dimension
- we reconstruct the data in the 2 dimensions using ONLY the vector we've discovered

Given $n$ data points: $x_{i} = (x_{i}[1], x_{i}[2],\ldots,x_{i}[d])$ for $i=1,\ldots,n$.

We'll represent each point as a projection: $$ \hat{x}_{i} = \bar{x} + \sum_{j=1}^{k}\left[z_{i}[j]*u_{j}\right] $$ Note that we can write our original point as $$ x_{i} = \bar{x} + \sum_{j=1}^{d}\left[z_{i}[j]*u_{j}\right] $$

Note here that each point has its own $z_{i}[j]$, and that the $u_{j}$ are shared across all points. You'll note that we have to keep track of one of these $u_{j}$ for each $j=1,\ldots,k$. Also: $$ z_{i}[j] = (x_{i} - \bar{x})\cdot u_{j} $$

Given $k<d$, find $(u_{1},...,u_{k})$ that minimize the reconstruction error: $$ \text{error}_k = \sum_{i=1}^{n}(x_{i} - \hat{x}_{i})^{2} $$ The $x_i$ here is the truth, and the $\hat{x}_i$ is the reconstructed version.
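As a quick sketch of projection, reconstruction, and this error (using made-up 2-D data and an arbitrary direction $u_1$, neither of which comes from the notebook):

```
import numpy as np

# Made-up correlated 2-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])

x_bar = X.mean(axis=0)
u = np.array([1.0, 1.0]) / np.sqrt(2)   # some unit-length direction u_1

z = (X - x_bar) @ u                     # z_i[1] = (x_i - x_bar) . u_1
X_hat = x_bar + np.outer(z, u)          # x_hat_i = x_bar + z_i[1] * u_1

error = np.sum((X - X_hat) ** 2)        # reconstruction error
print(error)
```

PCA is exactly the problem of choosing the direction(s) $u_j$ that make this error as small as possible.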

**In linear algebra terms:** identify the most meaningful basis to re-express a data set, in the hopes of filtering out noise and *revealing hidden structure*.

We'll transform the PCA problem into one you've probably seen before. We'll do this by re-writing the error: $$ \text{error}_k = \sum_{i=1}^{n}(x_{i} - \big[\bar{x} + \sum_{j=1}^{k}z_{i}[j]*u_{j}\big])^{2} $$ $$ = \sum_{i=1}^{n}\Big[\left[\bar{x} + \sum_{j=1}^{d}z_{i}[j]*u_{j}\right] - \left[\bar{x} + \sum_{j=1}^{k}z_{i}[j]*u_{j}\right]\Big]^{2} $$

In [6]:

```
from IPython.core.display import Image
Image(filename='reconstruction-math.png', width = 800)
```

Out[6]:

In [5]:

```
from IPython.core.display import Image
Image(filename='covar.png', width = 800)
```

Out[5]:

So we've completely rewritten our error such that we are:
**picking an ordered, orthonormal basis $(u_{1},...,u_{d})$ to minimize:**
$$
\text{error}_{k} = n \sum_{j=k+1}^{d}u_{j}^{T}\Sigma u_{j}
$$
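To build intuition for the quadratic form $u^{T}\Sigma u$ in this objective: for a unit vector $u$, it equals the variance of the (centered) data projected onto $u$. A small numerical check with made-up data:

```
import numpy as np

# Made-up data; Sigma is the empirical covariance matrix
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
X_c = X - X.mean(axis=0)
Sigma = (X_c.T @ X_c) / len(X)

u = np.array([3.0, 0.0, 4.0]) / 5.0     # a unit vector
quad = u @ Sigma @ u                    # u^T Sigma u
proj_var = np.mean((X_c @ u) ** 2)      # variance of the data along u

print(np.isclose(quad, proj_var))       # → True
```

So minimizing the error over the discarded directions is the same as throwing away the directions along which the data varies the least.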

So before we go full Math 308 on you, brief eigen-things review:

Def: **eigenvector**: A vector $u$ is an eigenvector of some matrix $A$ iff there exists some scalar $\lambda$ such that $Au = \lambda u$.

Def: **eigenvalue**: the $\lambda$ that corresponds to $u$, so every eigenvector is associated w/an eigenvalue and vice versa.

Def: **eigenspace**: the set of all eigenvectors corresponding to a given $\lambda$

Quickly doing some algebra: $$Au = \lambda u$$ $$Au - \lambda u = 0$$ $$(A - \lambda I)u = 0$$ So essentially we are looking for non-zero solutions to the above problem. Using the Big Theorem, the above equation has a non-zero solution iff $det(A - \lambda I) = 0$.

Example: Find the eigenvalues for: $$ A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 2 \\ 0 & 4 & 3 \end{bmatrix} $$ So we solve and recover: $$ det(A - \lambda I) = (1- \lambda)(\lambda - 5)(\lambda +1) $$ We have 3 eigenvalues here: $\lambda_{1} = 5$, $\lambda_{2} = 1$, $\lambda_{3} = -1$.
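We can sanity-check this worked example numerically (using `numpy.linalg.eigvals`, which solves the same characteristic-polynomial problem):

```
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [0.0, 4.0, 3.0]])

# Eigenvalues should be -1, 1, 5 (up to floating-point error)
eigvals = np.linalg.eigvals(A)
print(sorted(eigvals.real))
```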

Another way to think about this is minimizing the **sum of the $d-k$ eigenvalues of $\Sigma$**, or **keeping the top $k$ eigenvalues of $\Sigma$**.

- Start with the $n$ by $d$ data matrix $X$
- Recenter: subtract the mean from each row of $X$ to get $X_{c}$
- Compute the covariance matrix: $\Sigma := \frac{1}{n} X_{c}^{T}X_{c}$
- Find the eigenvalues and eigenvectors of $\Sigma$ (usually via `numpy.linalg.eig`)
- The principal components: the $k$ eigenvectors with the largest eigenvalues
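The steps above can be sketched end-to-end on synthetic data (a minimal sketch, not the notebook's own code; `eigh` is used instead of `eig` since $\Sigma$ is symmetric). It also checks that the reconstruction error equals $n$ times the sum of the discarded eigenvalues, as derived earlier:

```
import numpy as np

# Made-up data matrix X (n examples, d dimensions), keep k components
rng = np.random.default_rng(1)
n, d, k = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))

X_c = X - X.mean(axis=0)                  # recenter
Sigma = (X_c.T @ X_c) / n                 # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
order = np.argsort(eigvals)[::-1]         # sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U = eigvecs[:, :k]                        # top-k principal components
Z = X_c @ U                               # projections z_i
X_hat = X.mean(axis=0) + Z @ U.T          # reconstructions

# error_k should equal n times the sum of the discarded eigenvalues
err = np.sum((X - X_hat) ** 2)
print(np.isclose(err, n * eigvals[k:].sum()))   # → True
```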

In [4]:

```
from IPython.core.display import Image
Image(filename='eigen-ex.png', width = 1000)
```

Out[4]: