\documentclass{article}
% \def\showsolution{1}
\usepackage{amsmath,amsfonts,amsthm,amssymb,amsopn,bm}
% \usepackage{fullpage}
\usepackage[margin=.9in]{geometry}
\usepackage{graphicx}
% \usepackage{fullpage}
% \usepackage[paper=letterpaper,margin=1in,includeheadfoot,footskip=0.25in,headsep=0.25in]{geometry}
\usepackage{url}
\usepackage[usenames,dvipsnames]{color}
% \usepackage[pdfborder={0 0 1},colorlinks=true,citecolor=black,plainpages=false]{hyperref}
\usepackage{fancyhdr}
\usepackage[ruled]{algorithm2e}
% \usepackage{multirow}
\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\mb}[1]{\mathbf{#1}}
\newcommand{\1}{\mathbf{1}}
\newcommand{\E}{\mathbb{E}} % real domain
\newcommand{\e}{\mathbf{e}} % real domain
\newcommand{\V}{\mathbf{V}} % real domain
\newcommand{\X}{\mathbf{X}} % real domain
% \newcommand{\E}{\mathbb{E}} % real domain
\renewcommand{\P}{\mathbb{P}} % real domain
\newcommand{\R}{\field{R}} % real domain
% \newcommand{\C}{\field{C}} % complex domain
\newcommand{\F}{\field{F}} % functional domain
\newcommand{\T}{^{\textrm T}} % transpose
\def\diag{\text{diag}}
%% operator in linear algebra, functional analysis
\newcommand{\inner}[2]{#1\cdot #2}
\newcommand{\norm}[1]{\left\|#1\right\|}
\newcommand{\twonorm}[1]{\|#1\|_2^2}
% operator in functios, maps such as M: domain1 --> domain 2
\newcommand{\Map}[1]{\mathcal{#1}}
\renewcommand{\theenumi}{\alph{enumi}}
\newcommand{\Perp}{\perp \! \! \! \perp}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
\newcommand{\vct}[1]{\boldsymbol{#1}} % vector
\newcommand{\mat}[1]{\boldsymbol{#1}} % matrix
\newcommand{\cst}[1]{\mathsf{#1}} % constant
\newcommand{\ProbOpr}[1]{\mathbb{#1}}
\newcommand{\grade}[1]{\small\textcolor{magenta}{\emph{[#1 points]}} \normalsize}
\date{{}}
\setlength\parindent{0px}
\begin{document}
\title{Homework \#3 problem 5 revisited - Optional}
\author{\normalsize{CSE 546: Machine Learning}\\
\normalsize{Prof. Kevin Jamieson} \\
\normalsize{Due: 12/12 11:59 PM}}
\maketitle
\textit{This homework assignment is an \textbf{optional} replacement for problem 5 on Homework 3. If you choose to complete this assignment and turn it in on Gradescope, you will receive the maximum of the score you originally received for problem 5 on Homework 3 and the score you receive on this assignment. If you do not submit this assignment, you will keep the score you received on the original one. That is, if you received a score of $x \in \{1,\dots,8\}$ on the originally assigned problem, and you choose to do this assignment and receive a score of $y \in \{1,\dots,8\}$, the score assigned for problem 5 on Homework 3 will be $\max\{x,y\}$. In particular, if you already received a score of $8$ on the original assignment, you cannot earn additional points by completing this one.}
\section{Matrix Completion}
5. \grade{8} You will build a personalized movie recommender system. There are $m = 500$ movies and $n = 1000$ users. As historical data, every user watched a subset of movies and rated them. The goal is to recommend movies the users haven't seen.
The historical rating is represented by a matrix $R\in \R^{n \times m}$. The entry $R_{i,j}$ represents the user $i$'s rating on movie $j$. The rating is a real number in $[-10,10]$: a higher value represents that the user is more satisfied with the movie.
You are provided with two files:
\begin{itemize}
\item $\tt train.txt$ contains the user-movie-score data representing the training set. Each line takes the form ``$\tt i,j,s$'', where $\tt i$ is the user index, $\tt j$ is the movie index, and $\tt s$ is the user's score in $[-10,10]$ describing how much they liked the movie (higher is better).
\item $\tt test.txt$ has the same format, with the same users rating movies held out from the training set.
\end{itemize}
The latent factor model is a state-of-the-art approach to personalized recommendation. It learns a vector representation $u_i\in \R^d$ for each user and a vector representation $v_j\in \R^d$ for each movie, such that
the inner product $\langle u_i,v_j\rangle$ approximates the rating~$R_{i,j}$. You will build a simple latent factor model.
We will evaluate the learned vector representations using two metrics:
\begin{itemize}
\item Mean squared error: $\frac{1}{|S|} \sum_{(i,j)\in S} (\langle u_i,v_j\rangle - R_{i,j})^2$, where $S$ (and the corresponding $R_{i,j}$ values) are from the test set
\item Mean absolute error: $\frac{1}{n} \sum_{i=1}^n \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} |\langle u_i,v_j\rangle - R_{i,j}|$, where $\mathcal{N}_i$ is the set of movies rated by user $i$ in the test set
\end{itemize}
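For concreteness, the two metrics above can be computed as follows. This is a minimal sketch: the function name \texttt{error\_metrics}, the predictor-callable interface, and the list-of-triples format for the test data are illustrative choices, not part of the assignment's starter code.

```python
import numpy as np

def error_metrics(pred, ratings):
    """Compute MSE and (per-user-averaged) MAE on (i, j, s) triples.

    `pred(i, j)` returns the predicted rating <u_i, v_j>;
    `ratings` is a list of (user, movie, score) triples.
    """
    errs = np.array([pred(i, j) - s for i, j, s in ratings])
    mse = np.mean(errs ** 2)

    # MAE averages over each user's rated movies first, then over users.
    by_user = {}
    for (i, j, s), e in zip(ratings, errs):
        by_user.setdefault(i, []).append(abs(e))
    mae = np.mean([np.mean(v) for v in by_user.values()])
    return mse, mae
```

Note that the two metrics weight users differently: MSE weights every rating equally, while MAE gives every user equal weight regardless of how many movies they rated.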
You will implement multiple estimators and use the inner product $\langle u_i,v_j\rangle$ to predict if user $i$ likes movie $j$ in the test data.
You will choose hyperparameters like $d$ or the amount of regularization by creating a validation set from the training set.
\begin{enumerate}
\item The first estimator pools all the users together and predicts, for each movie, the average rating it received in the training set. This is equivalent to $d=1$ with $u$ fixed to the all-ones vector and $v$ chosen to minimize the least-squares error.
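A sketch of this baseline (assuming the training data has been parsed into a list of $(i,j,s)$ triples; the function name \texttt{movie\_averages} is illustrative):

```python
import numpy as np

def movie_averages(train, n_movies):
    """Per-movie average training rating; the prediction for any
    pair (i, j) is avg[j], independent of the user i.

    Movies with no training ratings default to a prediction of 0.
    """
    sums = np.zeros(n_movies)
    counts = np.zeros(n_movies)
    for _, j, s in train:
        sums[j] += s
        counts[j] += 1
    avg = np.divide(sums, counts, out=np.zeros(n_movies), where=counts > 0)
    return avg
```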
\item Now replace all values of $R_{i,j}$ not in the training set with zero. Then use the singular value decomposition (SVD) to learn a lower-dimensional vector representation for users and movies. Recall this means projecting the data vectors onto lower-dimensional subspaces of their corresponding spaces, spanned by singular vectors. Refer to the lecture materials on SVD, PCA, and dimensionality reduction. You should use an efficient solver; we recommend \texttt{scipy.sparse.linalg.svds}. Try $d=1,2,5,10,20,50$ and plot the error metrics on the train and test sets as a function of $d$.
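One way to set this up with \texttt{svds} (a sketch, again assuming the $(i,j,s)$ triple format; \texttt{svd\_factors} is an illustrative name):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_factors(train, n, m, d):
    """Rank-d factors of the zero-filled rating matrix.

    Returns (U, V) such that the prediction for (i, j) is <U[i], V[j]>.
    Requires d < min(n, m).
    """
    rows, cols, vals = zip(*train)
    # Unobserved entries of the sparse matrix are implicitly zero.
    R = csr_matrix((vals, (rows, cols)), shape=(n, m))
    # svds returns the d largest singular triplets of R.
    U, s, Vt = svds(R, k=d)
    return U * s, Vt.T  # fold the singular values into the user factors
```

Building the matrix in sparse format keeps memory proportional to the number of observed ratings rather than $n \times m$.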
\item For sparse data, replacing all missing values with zero is not a completely satisfying solution. A missing value means that the user has not seen the movie; it does not mean that the rating should be zero. A more reasonable choice is to minimize the MSE only on rated movies. Let's define a loss function:
\[
L\Big(\{u_i\},\{v_j\}\Big) := \sum_{(i,j)\in T} (\langle u_i,v_j\rangle - R_{i,j})^2
+ \lambda \sum_{i=1}^n \|u_i\|_2^2 + \lambda \sum_{j=1}^m \|v_j\|_2^2,
\]
where $T$ and $R_{i,j}$ here are from the training set and $\lambda > 0$ is the regularization coefficient. Implement an algorithm to learn vector representations by minimizing the loss function $L(\{u_i\},\{v_j\})$.
Try $d=1,2,5,10,20,50$ and plot the error metrics on the train and test as a function of $d$.
Note that you may need to tune the hyper-parameter $\lambda$ to optimize the performance.
{\bf Hint:} you may want to employ an alternating minimization scheme. First, randomly initialize $\{u_i\}$ and $\{v_j\}$. Then minimize the loss function with respect to $\{u_i\}$ while treating $\{v_j\}$ as constant vectors, and then minimize the loss function with respect to $\{v_j\}$ while treating $\{u_i\}$ as constant vectors. Iterate these two steps until both $\{u_i\}$ and $\{v_j\}$ converge. Note that when one of $\{u_i\}$ or $\{v_j\}$ is held fixed, minimizing the loss function with respect to the other has a closed-form solution. You should never allocate a full $n \times m$ matrix for this problem.
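The alternating scheme in the hint can be sketched as follows. This is one possible implementation under stated assumptions (triple-format input, a fixed iteration count rather than a convergence test, and the illustrative name \texttt{als}), not the required solution; each update solves the ridge-regression normal equations for one user or one movie at a time.

```python
import numpy as np

def als(train, n, m, d, lam, iters=20, seed=0):
    """Alternating least squares for the regularized loss L({u_i},{v_j}).

    Only per-user / per-movie rating lists are stored,
    never the full n x m matrix.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, d))
    V = rng.normal(scale=0.1, size=(m, d))
    by_user = [[] for _ in range(n)]
    by_movie = [[] for _ in range(m)]
    for i, j, s in train:
        by_user[i].append((j, s))
        by_movie[j].append((i, s))

    def update(rows, F, ratings):
        # Closed-form ridge update: (F_S^T F_S + lam I)^{-1} F_S^T r_S,
        # where S is the set of observed partners for this row.
        for i, obs in enumerate(ratings):
            if not obs:
                continue
            idx = [j for j, _ in obs]
            r = np.array([s for _, s in obs])
            A = F[idx].T @ F[idx] + lam * np.eye(d)
            rows[i] = np.linalg.solve(A, F[idx].T @ r)

    for _ in range(iters):
        update(U, V, by_user)  # fix V, solve for each u_i
        update(V, U, by_movie)  # fix U, solve for each v_j
    return U, V
```

Each user update costs $O(|\mathcal{N}_i| d^2 + d^3)$, so a full sweep is linear in the number of observed ratings.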
\end{enumerate}
\end{document}