%
% This is the LaTeX template file for lecture notes for CS294-8,
% Computational Biology for Computer Scientists. When preparing
% LaTeX notes for this class, please use this template.
%
% To familiarize yourself with this template, the body contains
% some examples of its use. Look them over. Then you can
% run LaTeX on this file. After you have LaTeXed this file then
% you can look over the result either by printing it out with
% dvips or using xdvi.
%
% This template is based on the template for Prof. Sinclair's CS 270.
\documentclass[twoside]{article}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\definecolor{darkgreen}{rgb}{0.0,0,0.9}
\usepackage[colorlinks=true,pdfpagemode=UseNone,citecolor=OliveGreen,linkcolor=BrickRed,urlcolor=BrickRed,
%pagebackref,
pdfstartview=FitW]{hyperref}
%\usepackage[backref=true, backend=biber, isbn=false, url=true, firstinits=true, maxnames=20, style=alphabetic,]{biblatex}
%\addbibresource{references.bib}
\usepackage{graphics,amsmath,amsthm,thmtools,amssymb,mathtools,enumerate}
\setlength{\oddsidemargin}{0 in}
\setlength{\evensidemargin}{0 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\def\X{b}
\def\tX{\tilde{b}}
\def\mX{B}
\def\tmX{\tilde{B}}
\def\bone{{\bf 1}}
\def\eps{\epsilon}
\def\R{\mathbb{R}}
\def\mP{\mathbb{P}}
\def\mE{\mathbb{E}}
\def\cT{{\cal T}}
\def\reff{\textup{Reff}}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
\newcommand{\PP}[2]{\mP_{#1}\left[#2\right]}
\newcommand{\weight}[1]{w(#1)}
\renewcommand{\P}[1]{\mP\left[#1\right]}
\newcommand{\E}[1]{\mE\left[#1\right]}
\newcommand{\EE}[2]{\mE_{#1}\left[#2\right]}
\newcommand{\norm}[1]{\|#1\|}
\DeclareMathOperator{\trace}{trace}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\Var}{Var}
%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#1}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf CSE 521: Design and Analysis of Algorithms
\hfill Winter 2017} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Midterm \hfill} }
\vspace{2mm}
\hbox to 6.28in { \hfill {\it Deadline: #2} \hfill}
\vspace{2mm}}
}
\end{center}
\markboth{Midterm}{Midterm}
% {\bf Disclaimer}: {\it These notes have not been subjected to the
% usual scrutiny reserved for formal publications. They may be distributed
% outside this class only with the permission of the Instructor.}
% \vspace*{4mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newenvironment{proofof}[1]{{\em Proof of #1.}}{\hfill%\rule{2mm}{2mm}
\qed}
% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:
\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}
\lecture{2}{Feb 10th (at 12:00PM) in \href{http://canvas.uw.edu}{Canvas}}
%\footnotetext{These notes are partially based on those of Nigel Mansell.}
In this assignment you are not allowed to collaborate. But, you may contact the instructor or TA for hints. We will study locally sensitive hash functions in more depth. We also have one question related to the Shwartz-Zippel Lemma.
\begin{enumerate}[1)]
%\item Show that all eigenvalues of any symmetric matrix are real.
\item Let $a,b$ be arbitrary real numbers. Fix $w>0$ and let $s\in [0,w)$ chosen uniformly at random. Show that
$$ \P{\left\lfloor \frac{a-s}{w} \right\rfloor = \left\lfloor \frac{b-s}{w} \right\rfloor} = \max\left\{0,1-\frac{|a-b|}{w} \right\}.$$
Recall that for any real number $c$, $\lfloor c\rfloor$ is the largest integer which is at most $c$.
{\bf Hint: }Start with the case where $a=0$.
\item In this problem we design an LSH for points in $\R^d$, with the $\ell_1$ distance, i.e.
$$d(p,q) =\sum_i |p_i - q_i|.$$
Define a class of hash functions as follows: Fix $w \gg r$. Each hash function is defined via a choice of $d$ independently selected random real numbers $s_1,s_2,\dots,s_d$, each uniform in $[0,w)$. The hash function associated with this random set of choices is
$$h(x_1,\dots ,x_d) = \left(\left\lfloor \frac{x_1 - s_1}{w}\right\rfloor ,\left\lfloor \frac{x_2 - s_2}{w}\right\rfloor,\dots,\left\lfloor \frac{x_d - s_d}{w}\right\rfloor\right).$$
Let $\alpha_i = |p_i - q_i|$. What is the probability that $h(p) = h(q)$ in terms of the $\alpha_i$ values? For what values of $p_1$ and $p_2$ is this family of functions $(r,c\cdot r,p_1,p_2)$-sensitive? You can do your calculations assuming that you are in a regime where $1-x$ is well approximated by $e^{-x}$. %
\item In this problem we design an LSH for points in $\R^d$ with the $\ell_2$ distance function,
$$ d(p,q) = \|p-q\|_2 = \sqrt{\sum_i (p_i-q_i)^2}.$$
Let $w\gg r$, and let $s$ be uniformly distributed in $[0,w)$.
Let $g$ be a $d$-dimensional Gaussian vector, i.e., for all $1\leq i\leq d$, $g_i\sim {\cal N}(0,1)$ and all coordinates of $g$ are chosen independently.
Consider the hash function
$$ h(p)=\left\lfloor \frac{\langle g,p\rangle - s}{w} \right\rfloor $$
\begin{enumerate}[a)]
\item Show that for any two points $p,q$, $\langle g,p\rangle-\langle g,q\rangle $ is distributed as a normal random variable. What is the mean and variance of this random variable?
In this part you can use the fact that any linear combination of independent normal random variables is also a normal random variable.
\item Use Problem (1) to estimate the probability that $h(p)=h(q)$. Note that this probability is over the randomness of $g$ and $s$. In this part you can use the fact that for a random variable $X\sim {\cal N}(0,\sigma^2)$, $\E{|X|}=\sigma \sqrt{2/\pi}$. To make calculations simple, assume that $w$ is large enough such that $\P{|\langle g,p\rangle - \langle g,q\rangle| > w} =0$.
\item Use the statement of part (b) to determine for what values of $p_1,p_2$, is this family $(r,c\cdot r, p_1,p_2)$ sensitive?
\end{enumerate}
\item In this problem we design an algorithm to estimate the size of the largest matching of a bipartite graph. Recall that for a matrix $A$, $\rank(A)$ is the number of linearly independent columns of $A$; it is also the same as the number of linearly independent rows of $A$. It turns out that for any matrix $A\in \R^{n\times n}$, $\det(A)\neq 0$ if and only if $\rank(A)=n$. Let $G=(X,Y,E)$ be a given bipartite graph. Using the above terminology, we can rewrite the algorithm that tests whether $G$ has a perfect matching as follows: For each edge $(x_i,y_j)$ of $G$, choose $A_{i,j}$ uniformly and independently from the set $\{0,1,\dots,n^2\}$, and let the rest of entries of $A$ be $0$. Return yes if $\rank(A)=n$ and no otherwise. Since $\det(A)$ is a polynomial of degree $n$, you will use the Shwartz-Zippel Lemma to show this algorithm succeeds with probability $1-1/n$.
\begin{enumerate}[a)]
%\item {\bf extra credit:} Design an $O(n^3)$ time algorithm that computes $\rank(A)$.
\item Let $A$ be the following matrix: For each nonadjacent pair $x_i,y_j$, let $A_{i,j}=0$; choose the rest of the entries of $A$ arbitrarily. Show that if $\rank(A)=k$, then $G$ has a matching of size at least $k$.
\item Now, suppose for any edge $(x_i,y_j)$ we choose $A_{i,j}$ uniformly and independently from $\{0,1,\dots,n^2\}$, and we let the rest of the entries be $0$. Show that with high probability, $\rank(A)$ is equal to the size of the largest matching of $G$.
\end{enumerate}
\item {\bf Extra Credit.} In this problem you are supposed to implement the NNS algorithm for the hamming distance. You are given $n$ points $P\subseteq \{0,1\}^d$ that you are supposed to preprocess and store based on the algorithm that we discussed in class.
Then, you will be given $t$ query points; for each query point you need to find a point at distance no more than twice the closest point.
In the input file \href{lsh.in}{lsh.in} you are given $n,d,t$ in this order. The input is followed by points of $P$, the $i+1$-st row of the input is contains the $i$-th point of $P$.
Then, the input is followed by query points (so the $n+1+i$-th row of the input has the $i$-th query point). In the $i$-th line of the output, write the index of the point $P$ that is closest to the $i$-th query point. Please submit your code together with the output to Canvas.
\end{enumerate}
%\printbibliography
\end{document}