\documentclass[letterpaper]{article}
\title{CSE546 Machine Learning, Autumn 2013: Homework 1}
\date{Due: Monday, October $14^{\text{th}}$, beginning of class}
\usepackage{hyperref}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\usepackage{capt-of}
\usepackage{url}
\usepackage{graphicx}
\usepackage{color}
\usepackage{bbm}
\usepackage{enumerate}
\newcommand{\carlos}[1]{\textcolor{red}{Carlos: #1}}
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\hide}[1]{#1}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\providecommand{\m}[1]{\mathbf{#1}}
\providecommand{\norm}[1]{\left\|#1\right\|}
\providecommand{\sign}[1]{\text{sign}\left(#1\right)}
\DeclareMathOperator*{\argmin}{arg\,min}
\providecommand{\what}{\m{\hat{w}}}
\begin{document}
\maketitle
\section{Probability [9 points]}
Let $X_1$ and $X_2$ be independent, continuous random variables
uniformly distributed on $[0,1]$. Let $X=\min(X_1,X_2)$. Compute
\begin{enumerate}
\item \emph{(3 points)} $E(X)$.
\item \emph{(3 points)} $Var(X)$.
\item \emph{(3 points)} $Cov(X,X_1)$.
\end{enumerate}
\section{MLE [13 points]}
This question uses a probability distribution called the
Poisson distribution. A discrete random variable $X$ follows
a Poisson distribution with parameter $\lambda$ if
$$\Pr(X=k)=\frac{\lambda^k}{k!}e^{-\lambda},
\qquad k \in \{0,1,2,\dots\}$$
Suppose the number of eggs laid by a turtle follows
the Poisson distribution with parameter $\lambda$.
Also suppose that all turtles are mutually independent.
Once laid, every egg
hatches after a while.
When ready, hatchlings tear their shells apart and dig
through the sand. When they reach the surface, they head towards the sea.
Due to opportunistic predators, the probability that a hatchling will
reach the sea is $p$.
Assume that each egg hatches at a different time, so that
the survival of each hatchling is independent of the others.
A biologist studies 10 of these turtles, observing the number of eggs
laid by each turtle and the
number of hatchlings that reach the sea.
She records her observations in the following table.
\\
\begin{tabular}{l*{10}{c}}
Turtle & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\
\hline
Eggs Laid & 8 & 9 & 6 & 4 & 1 & 5 & 2 & 12 & 9 & 7 \\
Hatchlings that Reached the Sea & 5 & 6 & 4 & 3 & 0 & 5 & 2 & 9 & 8 & 6 \\
\end{tabular}
\\
We will establish some notation for the general case.
Let $E=(E_1,\dots,E_n)$ be a random vector corresponding to the
number of eggs laid by each of $n$ turtles. Let $H=(H_1,\dots,H_n)$ be a random vector
corresponding to
the number of hatchlings that reached the sea for each turtle.
(The above table gives realizations of $E$ and $H$.)
Compute
\begin{enumerate}
\item \emph{(5 points)}
the joint log-likelihood function of $E$ and $H$ given $\lambda$ and $p$.
\item \emph{(6 points)}
the MLE for $\lambda$ and $p$ in the general case.
\item \emph{(2 points)}
the MLE for $\lambda$ and $p$ using the observed values of $E$ and $H$.
\end{enumerate}
\section{Linear Regression and LOOCV [15 points]}
In class you learned about using cross validation
as a way to estimate the true error of a learning algorithm.
A solution that provides an almost unbiased estimate of this
true error is \emph{Leave-One-Out Cross Validation} (LOOCV),
but it can take a really long time to compute the LOOCV error.
In this problem you will derive an algorithm for
efficiently computing the LOOCV error for linear regression
using the \emph{Hat Matrix}.\footnote{Unfortunately, such an efficient algorithm may not
be easily found for other learning methods.}
(This is the \emph{cool trick} alluded to in the slides!) \\
\noindent
Assume that there are $n$ given training examples, $(X_1,y_1),(X_2,y_2),\dots,(X_n,y_n)$,
where each input data point $X_i$ has $d$ real-valued features.
The goal of regression is to learn to predict $y_i$ from $X_i$.
The \emph{linear} regression model assumes that the output $y$ is
a \emph{linear} combination of the input features plus Gaussian noise
with weights given by $w$. \\
\noindent
We can write this in matrix form by stacking the data points as
the rows of a matrix $X$ so that $x_{i}^{(j)}$ is the $i$-th feature
of the $j$-th data point. Then writing $Y$, $w$ and $\epsilon$ as column vectors,
we can write the matrix form of the linear regression model as:
\[
Y=Xw + \epsilon
\]
where:
\[
Y = \left[\begin{array}{c}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{array}\right],
X = \left[\begin{array}{cccc}
x_{1}^{(1)} & x_{2}^{(1)} & \dots & x_{d}^{(1)} \\
x_{1}^{(2)} & x_{2}^{(2)} & \dots & x_{d}^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1}^{(n)} & x_{2}^{(n)} & \dots & x_{d}^{(n)} \\
\end{array}\right],
w = \left[\begin{array}{c}
w_1 \\
w_2 \\
\vdots \\
w_d
\end{array}\right],
\,\,\,\,\,\mbox{and}\,\,
\epsilon = \left[\begin{array}{c}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n
\end{array}\right]
\]
Assume that $\epsilon_i$ is normally distributed with
variance $\sigma^2$. We saw in class that the maximum likelihood
estimate of the model parameters $w$ (which also happens to minimize the sum of
squared prediction errors) is given by the \emph{Normal equation}:
\[
\hat{w} = (X^TX)^{-1} X^TY
\]
\noindent
Define $\hat{Y}$ to be the vector of predictions using $\hat{w}$
if we were to plug in the original training set $X$:
\begin{eqnarray*}
\hat{Y} &=& X\hat{w} \\
&=& X(X^T X)^{-1} X^T Y \\
&=& H Y
\end{eqnarray*}
where we define $H=X(X^T X)^{-1} X^T$ ($H$ is often
called the \emph{Hat Matrix}). \\
\noindent
As mentioned above, $\hat{w}$ also minimizes the sum of squared errors:
\[
\mbox{SSE} = \sum_{i=1}^{n} (y_i-\hat y_i)^2
\]
Now recall that the Leave-One-Out Cross Validation score is defined to be:
\[
\mbox{LOOCV} = \sum_{i=1}^n (y_i - \hat{y}_i^{(-i)})^2
\]
where $\hat{Y}^{(-i)}$ is the estimator of $Y$ after removing the $i$-th observation
(i.e., it minimizes $\sum_{j\neq i} (y_j - \hat{y}_j^{(-i)})^2$).
\begin{enumerate}
\item \emph{(2 points)} What is the time complexity of computing
the LOOCV score naively?
(The naive algorithm is to loop through each point, performing
a regression on the $n-1$ remaining points at each iteration.)
\emph{Hint:} The complexity of matrix inversion is $O(k^3)$ for a $k\times k$
matrix.\footnote{There are faster algorithms out there, but for simplicity we'll assume that we are
using the naive $O(k^3)$ algorithm.}
\item \emph{(1 point)} Write $\hat y_i$ in terms of $H$ and $Y$.
\item \emph{(4 points)} Show that $\hat{Y}^{(-i)}$ is also the estimator which minimizes SSE for
$Z$ where
\[
Z_j = \left\{\begin{array}{cc}
y_j, & j\neq i \\
\hat y_i^{(-i)}, & j=i \\
\end{array}\right.
\]
\item \emph{(1 point)} Write $\hat y_i^{(-i)}$ in terms of $H$ and $Z$. By
definition, $\hat y_i^{(-i)} = Z_i$, but give an answer that is analogous to \emph{2}.
\item \emph{(3 points)}
Show that $\hat y_i - \hat y_i^{(-i)} = H_{ii}y_i - H_{ii}\hat y_i^{(-i)}$,
where $H_{ii}$ denotes the $i$-th element along the diagonal of $H$.
\item \emph{(4 points)}
Show that
\[
LOOCV = \sum_{i=1}^{n} \left(\frac{y_i - \hat y_i}{1-H_{ii}}\right)^2
\]
What is the algorithmic complexity of computing the LOOCV score using this
formula?
Note: We see from this formula that the diagonal elements of $H$
(often called the \emph{leverage} values) indicate the impact that
each particular observation has on the result of the regression.
\end{enumerate}
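Once you have derived the formula in part 6, it is straightforward to verify it numerically. The sketch below (in Python with NumPy; the random data and variable names are purely illustrative and not part of the assignment) compares the naive leave-one-out loop against the hat-matrix shortcut:
\begin{verbatim}
import numpy as np

# Illustrative sanity check: naive LOOCV loop vs. hat-matrix formula.
rng = np.random.RandomState(0)
n, d = 20, 3
X = rng.randn(n, d)
Y = X @ rng.randn(d) + 0.1 * rng.randn(n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # the Hat Matrix
Y_hat = H @ Y
loocv_fast = np.sum(((Y - Y_hat) / (1 - np.diag(H))) ** 2)

loocv_naive = 0.0
for i in range(n):                         # refit n times, leaving one out
    mask = np.arange(n) != i
    w_i = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
    loocv_naive += (Y[i] - X[i] @ w_i) ** 2

assert np.isclose(loocv_fast, loocv_naive)
\end{verbatim}
Note that the shortcut needs only a single regression on the full data, which is the source of the complexity savings asked about in part 6.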
\section{Regularization Constants [13 points]}
We have discussed the importance of regularization as a technique to avoid overfitting our models. For linear regression, we have mentioned both LASSO (which uses the $L_1$ norm as a penalty), and ridge regression (which uses the squared $L_2$ norm as a penalty). In practice, the scaling factor of these penalties has a significant impact on the behavior of these methods, and must often be chosen empirically for a particular dataset. In this problem, we look at what happens when we choose our regularization factor poorly.
\subsection{LASSO Regression}
Recall that the loss function to be optimized under LASSO regression is:
\[
E_L = \sum_{i=1}^n(y_i - (\hat w_0 + x^{(i)}\hat{w}))^2 + \lambda \| \hat w \|_1
\]
where
\begin{equation} \label{eq:lassopenalty}
\lambda \| \hat w \|_1 = \lambda \sum_{i=1}^d |\hat w_i|
\end{equation}
and $\lambda$ is our regularization constant.
\smallskip
\noindent
\begin{enumerate}
\item Suppose our $\lambda$ is much too small; that is,
\[
\sum_{i=1}^n(y_i - (\hat w_0 + x^{(i)}\hat{w}))^2 + \lambda \| \hat w \|_1 \approx \sum_{i=1}^n(y_i - (\hat w_0 + x^{(i)}\hat{w}))^2
\]
\noindent
How will this affect the magnitude of:
\begin{enumerate}
\item \emph{(1 point)}
The error on the training set?
\item \emph{(1 point)}
The error on the testing set?
\item \emph{(1 point)}
$\hat{w}$?
\item \emph{(1 point)}
The number of nonzero elements of $\hat{w}$?
\end{enumerate}
\noindent
\item Suppose instead that we overestimated on our selection of $\lambda$.
What do we expect to be the magnitude of:
\begin{enumerate}
\item \emph{(1 point)}
The error on the training set?
\item \emph{(1 point)}
The error on the testing set?
\item \emph{(1 point)}
$\hat{w}$?
\item \emph{(1 point)}
The number of nonzero elements of $\hat{w}$?
\end{enumerate}
\end{enumerate}
\subsection{Ridge Regression}
Recall that the loss function to be optimized under ridge regression is now:
\[
E_R = \sum_{i=1}^n(y_i - (\hat w_0 + x^{(i)}\hat{w}))^2 + \lambda \| \hat w \|_2^2
\]
where
\begin{equation} \label{eq:ridgepenalty}
\lambda\|\hat w \|_2^2 = \lambda \sum_{i=1}^d (\hat w_i)^2
\end{equation}
If $\lambda$ is too small, then the misbehavior of ridge regression is similar to that of LASSO in the previous section. Let's suppose $\lambda$ has been set too high, and compare the behavior of ridge regression to LASSO.
\bigskip
\noindent
To make this comparison, let's look at what happens to a particular feature weight $\hat w_i$. Since $\hat w$ is the solution that minimizes our error function $E$, it must be the case that $\pd{E}{\hat w_i} = 0$.
\begin{enumerate}
\item \emph{(1 point)} Suppose we use the LASSO loss function $E_L$ as defined in the previous section. Let's ignore the first term (which corresponds to the Sum Squared Error of our prediction), and calculate the partial derivative of the penalty term in Equation~\ref{eq:lassopenalty} with respect to $\hat w_i$. This can be thought of as the direction that LASSO ``pushes'' us away from the SSE solution. \emph{Note that the absolute value is not differentiable at $\hat w_i = 0$, so you only need to answer for the case that $\hat w_i \neq 0$.}
\item \emph{(1 point)} Instead, suppose we use the ridge regression loss function $E_R$ from above. Again, ignore the SSE term and calculate the partial derivative with respect to $\hat w_i$ of Equation~\ref{eq:ridgepenalty} to find how ridge regression ``pushes'' us away from the SSE solution.
\item \emph{(3 points)} Comparing these two derivatives, for what values of $\hat w_i$ will their behaviors differ? What does this mean for our estimate of $\hat w_i$ when $\lambda$ is very large?
\end{enumerate}
\section{Programming Question [50 points]}
\subsection{Implement coordinate descent to solve the Lasso}
The Lasso is the problem of solving
\[ \argmin_{\m{w}, w_0} \norm{\m{X} \m{w} + w_0 - \m{y}}_2^2 + \lambda \norm{\m{w}}_1 \]
Here $\m{X}$ is an $N \times d$ matrix of data, $\m{y}$ is an $N \times 1$ vector of response variables,
$\m{w}$ is a $d$ dimensional weight vector, $w_0$ is a scalar offset term, and $\lambda$ is a regularization
tuning parameter.
For the programming part of this homework, you are to implement the coordinate descent method to solve the Lasso.
Your solver should
\begin{itemize}
\item Include an offset term $w_0$ that is not regularized
\item Take optional initial conditions for $\m{w}$ and $w_0$
\item Be able to handle both dense and sparse $\m{X}$ matrices
\item Avoid unnecessary computation
\end{itemize}
You may use any language for your implementation, but we recommend Python. Python is a very useful language,
and you should find that it achieves reasonable performance for this problem.
You may use common computing packages (such as NumPy or SciPy), but please do not use an existing Lasso solver.
For a description of the algorithm, please refer to Murphy page 441. Note
that we recommend you initialize the algorithm slightly differently (see
the fourth hint below).
Before you get started, here are some hints that you may find helpful:
\begin{itemize}
\item With the exception of computing objective values or initial conditions, the only matrix operations required
are adding vectors, multiplying a vector by a scalar, and computing the dot product between two vectors.
If you find you are doing many large matrix operations, there is likely a more efficient implementation.
\item To ensure that a solution $(\what, \hat{w}_0)$ is correct, you can
compute the value \[ 2 \m{X}^T (\m{X} \what + \hat{w}_0 - \m{y}) \]
This is a $d$-dimensional vector that should take the value $-\lambda \text{sign}(\what_j)$ at $j$ for
each $\what_j$ that is nonzero.
For the zero indices of $\what$, this vector should take values smaller in magnitude than $\lambda$.
(This is similar to setting the gradient to zero, though more complicated
because the objective function is not differentiable.) Another simple check
is to ensure that the objective value is nonincreasing with each step.
\item It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element of
$\m{w}$ changes by more than a value $\delta$ during an iteration. If you need your algorithm to run faster,
an easy place to start is to loosen this condition.
\item For several problems, you will need to solve the Lasso on the same dataset for many values of $\lambda$. This
is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for
each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant
ratio until finished.
\item The smallest value of $\lambda$ for which the solution $\what$ is entirely zero is given by
\[ \lambda_{\max} = 2 \norm{\m{X}^T \left( \m{y} - \bar{y} \right)}_\infty \]
where $\bar{y}$ denotes the mean of the entries of $\m{y}$.
This is helpful for choosing the first $\lambda$ in a regularization path.
\end{itemize}
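To illustrate the second and last hints, here is a sketch (Python with NumPy; the function names are our own, and a dense $\m{X}$ is assumed) of the $\lambda_{\max}$ computation and the subgradient optimality check:
\begin{verbatim}
import numpy as np

def lambda_max(X, y):
    # Smallest lambda for which the all-zero solution is optimal.
    return 2 * np.max(np.abs(X.T @ (y - y.mean())))

def check_lasso_solution(X, y, w, w0, lam, tol=1e-4):
    # Optimality check: g = 2 X^T (X w + w0 - y).  For nonzero w_j,
    # g_j should equal -lam * sign(w_j); for zero w_j, |g_j| <= lam.
    g = 2 * X.T @ (X @ w + w0 - y)
    nz = w != 0
    ok_nonzero = np.allclose(g[nz], -lam * np.sign(w[nz]), atol=tol)
    ok_zero = np.all(np.abs(g[~nz]) <= lam + tol)
    return bool(ok_nonzero and ok_zero)
\end{verbatim}
For example, with $\m{w}=\m{0}$ and $\hat{w}_0$ equal to the mean of $\m{y}$, the check passes at $\lambda = \lambda_{\max}$, consistent with the last hint.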
Finally here are some pointers toward useful parts of Python:
\begin{itemize}
\item \texttt{numpy}, \texttt{scipy.sparse}, and \texttt{matplotlib} are useful computation packages.
\item For storing sparse matrices, the \texttt{scipy.sparse.csc\_matrix} (compressed sparse column) format
is fast for column operations.
\item \texttt{scipy.io.mmread} reads sparse matrices in Matrix Market Format.
\item \texttt{numpy.random.randn} is nice for generating random Gaussian arrays.
\item \texttt{numpy.linalg.lstsq} works for solving unregularized least squares.
\item If you're new to Python but experienced with Matlab, consider reading NumPy for Matlab Users
at \url{http://wiki.scipy.org/NumPy_for_Matlab_Users}.
\end{itemize}
\subsection{Try out your work on synthetic data}
We will now try out your solver with some synthetic data. A benefit of the Lasso is that if we believe many features
are irrelevant for predicting $\m{y}$, the Lasso can be used to enforce a sparse solution,
effectively differentiating between the relevant and irrelevant features.
Let's see if it actually works. Suppose that $\m{x} \in \mathbb{R}^d, y \in \mathbb{R}, k < d$, and pairs of data $(\m{x}_i, y_i)$ are generated independently according to the model
\[ y_i = w^*_0 + w^*_1 x_{i,1} + w^*_2 x_{i,2} + \ldots + w^*_k x_{i,k} + \epsilon_i \]
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is some Gaussian noise. Note that since $k < d$, the features $k + 1$ through $d$
are unnecessary (and potentially even harmful) for predicting $y$.
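For concreteness, this model can be simulated in a few lines (a NumPy sketch; the sizes shown match part 1 below, and the sign pattern is an arbitrary choice):
\begin{verbatim}
import numpy as np

# Sketch: simulate the sparse linear model above.
N, d, k, sigma = 50, 75, 5, 1.0
rng = np.random.RandomState(0)
X = rng.randn(N, d)                       # features drawn from N(0, 1)
w_star = np.zeros(d)
w_star[:k] = 10 * rng.choice([-1, 1], k)  # first k weights are +/- 10
w0_star = 0.0
eps = sigma * rng.randn(N)                # Gaussian noise
y = X @ w_star + w0_star + eps
\end{verbatim}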
With this model in mind, you are now tasked with the following:
\begin{enumerate}
\item \emph{(7 points)} Let $N = 50, d = 75, k = 5,$ and $\sigma = 1$. Generate
some data by drawing each element of $\m{X} \in \mathbb{R}^{N \times d}$ from a standard normal distribution $\mathcal{N}(0, 1)$.
Let $w_0^* = 0$ and create a $\m{w}^*$ by setting the first $k$ elements to $\pm 10$ (choose any sign pattern) and the remaining elements to $0$.
Finally, generate a Gaussian noise vector $\m{\epsilon}$ with variance $\sigma^2$ and form $\m{y} = \m{X} \m{w}^* + w_0^* + \m{\epsilon}$.
With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$, then
decreasing $\lambda$ by a constant ratio until few of the features chosen are correct. Compare
the sparsity pattern of your Lasso solution $\what$ to that of the true model parameters $\m{w}^*$. Record
values for precision (number of correct nonzeros in $\what$ / total number of nonzeros in $\what$) and recall
(number of correct nonzeros in $\what$ / $k$).
How well are you able to discover the true nonzeros? Comment on how $\lambda$ affects these results and include
plots of precision and recall vs. $\lambda$.
\item \emph{(6 points)} Change $\sigma$ to 10, regenerate the data, and solve
the Lasso problem using a value of $\lambda$ that worked well for the case
when $\sigma = 1$. Run the code a few times.
What happens? How are precision and recall affected?
How might you change $\lambda$ in order to achieve better precision or recall?
\item \emph{(7 points)} Set $\sigma$ back to 1, and solve the Lasso for the
following 6 sets of values:
\begin{center}
\begin{tabular}{c c c }
$(N = 50, d = 75)$ & $(N = 100, d = 75)$ \\
$(N = 50, d = 150)$ & $(N = 100, d = 150)$ \\
$(N = 50, d = 5000)$ & $(N = 100, d = 5000)$ \\
\end{tabular}
\end{center}
Use a regularization path, and briefly discuss the precision and recall results for each problem.
Based on your results, what might you guess the
sample complexity is for recovering the sparsity pattern using the Lasso? Possible answers include
$N = \mathcal{O}(d^2)$, $N = \mathcal{O}(d)$, or $N = \mathcal{O}(\log(d))$. If needed, try additional
values for $N$ and $d$ or repeat the experiment multiple times.
(What we're getting at is if $d$ increases substantially, how does $N$
need to increase in order to still achieve good performance? Why might this
be significant?)
\end{enumerate}
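Precision and recall, as defined in part 1 above, can be computed directly from the two sparsity patterns (a sketch; the helper name is our own):
\begin{verbatim}
import numpy as np

def precision_recall(w_hat, w_star):
    # A "correct nonzero" is an index that is nonzero in both the
    # Lasso solution w_hat and the true weight vector w_star.
    found = w_hat != 0
    true = w_star != 0
    correct = np.count_nonzero(found & true)
    precision = correct / max(np.count_nonzero(found), 1)
    recall = correct / np.count_nonzero(true)
    return precision, recall
\end{verbatim}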
\subsection{Become a data scientist at Yelp}
We'll now put the Lasso to work on some real data. Recently Yelp held a recruiting competition on the analytics
website Kaggle. Check it out at \url{http://www.kaggle.com/c/yelp-recruiting}. (As a side note, browsing other competitions
on the site may also give you some ideas for class projects.)
For this competition, the task is to predict the number of useful upvotes a particular review will receive. For extra fun,
we will add the additional task of predicting the review's number of stars based on the review's text alone.
For many Kaggle competitions (and machine learning methods in general), one of the most important requirements for doing well
is the ability to discover great features. We can use our Lasso solver for this as follows. First, generate a large
number of features from the data, even if many of them are likely unnecessary. Afterward, use the Lasso to reduce the number of features
to a more reasonable amount.
Yelp provides a variety of data, such as the review's text, date, and restaurant,
as well as data pertaining to each business, user, and check-ins. This information has already been
preprocessed for you into the following files on the course website:
\begin{center}
\begin{tabular}{l l}
\texttt{upvote\_data.csv} & Data matrix for predicting number of useful votes \\
\texttt{upvote\_labels.txt} & List of useful vote counts for each review \\
\texttt{upvote\_features.txt} & Names of each feature for interpreting results \\
\texttt{star\_data.mtx} & Data matrix for predicting number of stars \\
\texttt{star\_labels.txt} & List of number of stars given by each review \\
\texttt{star\_features.txt} & Names of each feature \\
\end{tabular}
\end{center}
For each task, data files contain data matrices, while labels
are stored in separate text files. The first data matrix is stored in CSV format, each row corresponding to one review.
The second data matrix is stored in Matrix Market Format, a format for sparse matrices. Meta information for
each feature is provided in the final text files, one feature per line. For the upvote task, these are functions of
various data attributes. For the stars task, these are strings of one, two, or
three words (n-grams). The feature values correspond roughly to how often each
word appears in the review. All columns have also been normalized.
To get you started, the following code should load the data:
\begin{verbatim}
import numpy as np
import scipy.io as io
import scipy.sparse as sparse
# Load a text file of integers:
y = np.loadtxt("upvote_labels.txt", dtype=int)
# Load a text file of strings:
featureNames = open("upvote_features.txt").read().splitlines()
# Load a csv of floats:
A = np.genfromtxt("upvote_data.csv", delimiter=",")
# Load a matrix market matrix, convert it to csc format:
B = io.mmread("star_data.mtx").tocsc()
\end{verbatim}
For this part of the problem, you have the following tasks:
\begin{enumerate}
\item \emph{(6 points)} Solve the Lasso to predict the number of useful votes a
Yelp review will receive.
Use the first 4000 samples for training, the next 1000 samples for
validation, and the remaining samples for testing.
Starting at $\lambda_{\max}$, run Lasso on the training set, decreasing $\lambda$ and using
previous solutions as initial conditions for each problem. Stop when you have
considered enough values of $\lambda$ that, based on validation error, you can
choose a good solution with confidence (for instance, when the validation
error begins increasing or stops decreasing significantly).
At each solution, record the root-mean-squared-error (RMSE) on training and validation data. In addition, record the number of nonzeros in each solution.
Plot the RMSE values together on a plot against $\lambda$. Separately plot the number of nonzeros as well.
\item \emph{(3 points)} Find the $\lambda$ that achieves best validation
performance, and test your model on the remaining set of test data. What RMSE value do you obtain?
\item \emph{(3 points)} Inspect your solution and take a look at the 10
features with weights largest in magnitude.
List the names of these features and their weights, and comment on if the weights generally make sense intuitively.
\item \emph{(6 points)} A significant issue with the Lasso is that in order to
shrink a large number of features to zero, the $\ell_1$ regularization introduces a large amount of bias into the nonzero weights. As a result, we can often achieve better results
by debiasing the estimate (see Murphy page 439). This is done by first solving the Lasso to choose a sparsity pattern.
Afterward, we solve a least squares problem on only the nonzero features and use the resulting weights in
our final parameter vector.
Run Lasso on a regularization path. At each value of $\lambda$, debias the weight vector learned by Lasso.
Record RMSE values on validation data both before and after debiasing.
Plot RMSE values both with and without debiasing together vs. $\lambda$.
Does performance change with debiasing? How does the value of $\lambda$ affect this outcome?
\item \emph{(12 points)} Repeat parts 1, 2, and 3 using the data matrix and
labels for predicting the score of a review. To avoid using too much memory, your algorithm should keep the matrix in a sparse format. Use
the first 30,000 examples for training and divide the remaining samples between validation and testing.
\end{enumerate}
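The debiasing step in part 4 can be sketched as follows (a hypothetical helper assuming a dense data matrix; for the sparse stars data, index the columns of the \texttt{csc} matrix instead):
\begin{verbatim}
import numpy as np

def debias(X, y, w):
    # Hypothetical helper: refit unregularized least squares, with an
    # intercept, using only the features where the Lasso solution w
    # is nonzero.  Other weights remain exactly zero.
    support = np.flatnonzero(w)
    A = np.column_stack([np.ones(len(y)), X[:, support]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    w_debiased = np.zeros_like(w)
    w_debiased[support] = coef[1:]
    return coef[0], w_debiased  # new intercept and debiased weights
\end{verbatim}
The sparsity pattern is unchanged; only the magnitudes of the surviving weights (and the intercept) are refit without the $\ell_1$ penalty.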
\end{document}