%
% This is the LaTeX template file for lecture notes for CS294-8,
% Computational Biology for Computer Scientists. When preparing
% LaTeX notes for this class, please use this template.
%
% To familiarize yourself with this template, the body contains
% some examples of its use. Look them over. Then you can
% run LaTeX on this file. After you have LaTeXed this file then
% you can look over the result either by printing it out with
% dvips or using xdvi.
%
% This template is based on the template for Prof. Sinclair's CS 270.
\documentclass[twoside]{article}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\definecolor{darkgreen}{rgb}{0.0,0.5,0.0}
\usepackage[colorlinks=true,pdfpagemode=UseNone,citecolor=OliveGreen,linkcolor=BrickRed,urlcolor=BrickRed,
%pagebackref,
pdfstartview=FitW]{hyperref}
%\usepackage[backref=true, backend=biber, isbn=false, url=true, firstinits=true, maxnames=20, style=alphabetic,]{biblatex}
%\addbibresource{references.bib}
\usepackage{graphics,amsmath,amsthm,thmtools,amssymb,mathtools,enumerate}
\setlength{\oddsidemargin}{0 in}
\setlength{\evensidemargin}{0 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\def\X{b}
\def\tX{\tilde{b}}
\def\mX{B}
\def\tmX{\tilde{B}}
\def\bone{{\bf 1}}
\def\eps{\epsilon}
\def\R{\mathbb{R}}
\def\mP{\mathbb{P}}
\def\mE{\mathbb{E}}
\def\cT{{\cal T}}
\def\reff{\textup{Reff}}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
\newcommand{\PP}[2]{\mP_{#1}\left[#2\right]}
\newcommand{\weight}[1]{w(#1)}
\renewcommand{\P}[1]{\mP\left[#1\right]}
\newcommand{\E}[1]{\mE\left[#1\right]}
\newcommand{\EE}[2]{\mE_{#1}\left[#2\right]}
\newcommand{\norm}[1]{\|#1\|}
\DeclareMathOperator{\trace}{trace}
\DeclareMathOperator{\vol}{vol}
\DeclareMathOperator{\Var}{Var}
%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[2]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#1}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf CSE 521: Design and Analysis of Algorithms
\hfill Winter 2017} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Problem Set #1 \hfill} }
\vspace{2mm}
\hbox to 6.28in { \hfill {\it Deadline: #2} \hfill}
\vspace{2mm}}
}
\end{center}
\markboth{Problem Set #1: #2}{Problem Set #1: #2}
% {\bf Disclaimer}: {\it These notes have not been subjected to the
% usual scrutiny reserved for formal publications. They may be distributed
% outside this class only with the permission of the Instructor.}
% \vspace*{4mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% Use these for theorems, lemmas, proofs, etc.
\newtheorem{theorem}{Theorem}[lecnum]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newenvironment{proofof}[1]{{\em Proof of #1.}}{\hfill%\rule{2mm}{2mm}
\qed}
% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:
\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**PROBLEM-SET-NUMBER**}{**DEADLINE**}
\lecture{2}{Jan 30th (at 12:00PM) in \href{http://canvas.uw.edu}{Canvas}}
%\footnotetext{These notes are partially based on those of Nigel Mansell.}
The goal of this problem set is to introduce the idea of {\em minhash}. Minhash is a hashing technique commonly used in practice to estimate the {\em Jaccard similarity} of two sets.
\begin{enumerate}[1)]
%\item Show that all eigenvalues of any symmetric matrix are real.
\item Suppose we have a universe $U$ of elements. For $A,B\subseteq U$, the Jaccard similarity of $A$ and $B$ is defined as
$$ J(A,B)=\frac{|A\cap B|}{|A\cup B|}.$$
This definition is used in practice to calculate a notion of similarity between documents, webpages, etc.
For example, suppose $U$ is the set of English words, and each set $A$ represents a document viewed as a bag of words. Note that for any $A,B\subseteq U$, $0\leq J(A,B)\leq 1$, and if $J(A,B)$ is close to 1, then $A$ and $B$ are similar.
Let $h: U\to [0,1]$ where for each $i\in U$, $h(i)$ is chosen uniformly and independently at random.
For a set $S\subseteq U$, let $h_S:=\min_{i\in S} h(i)$. Show that
$$ \P{h_A=h_B} = J(A,B).$$
Now, if we have sets $A_1, A_2,\dots,A_n$, we can use the above idea to figure out which pairs of sets are ``close'' in time essentially $O(n|U|)$. We can also obtain a $1\pm\eps$ approximation of $J(A,B)$ with high probability by using $O(\log(n)/\eps^2)$ independently chosen hash functions. Note that the naive algorithm would take $O(n^2|U|)$ time to calculate all pairwise similarities.
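As a sanity check on the identity above, one can draw many independent random hash functions and compare the empirical collision frequency of the minhashes against the true Jaccard similarity. A minimal Python sketch (the function names and example sets are illustrative, not part of the assignment):

```python
import random

def minhash(S, h):
    """Minhash of a set S under the hash table h: the minimum hash value in S."""
    return min(h[i] for i in S)

def estimate_jaccard(A, B, trials=20000, seed=0):
    """Estimate J(A, B) as the fraction of random hash functions h with h_A == h_B."""
    rng = random.Random(seed)
    universe = A | B  # hash values of elements outside A and B never matter
    hits = 0
    for _ in range(trials):
        h = {i: rng.random() for i in universe}  # fresh uniform hash function
        if minhash(A, h) == minhash(B, h):
            hits += 1
    return hits / trials

A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
# True similarity: J(A, B) = |{3, 4}| / |{1, ..., 6}| = 1/3.
print(estimate_jaccard(A, B))
```

With $20{,}000$ trials the empirical collision frequency should land within a couple of hundredths of $1/3$.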
\item Let $X_1,\dots,X_n$ be independent random variables uniformly distributed in $[0,1]$ and let $Y=\min\{X_1,\dots,X_n\}$. Show that $\E{Y}=\frac{1}{n+1}$ and $\Var(Y)\leq \frac{1}{(n+1)^2}$.
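The claimed value of $\E{Y}$ can be checked numerically before proving it; a small Monte Carlo sketch (assuming nothing beyond the problem statement):

```python
import random

def mean_min_uniform(n, trials=200000, seed=1):
    """Monte Carlo estimate of E[min(X_1, ..., X_n)] for i.i.d. Uniform[0,1] X_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(n))
    return total / trials

# Theory predicts E[Y] = 1/(n+1), e.g. 1/6 for n = 5.
print(mean_min_uniform(5))
```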
\item Consider the following algorithm for estimating $F_0$, the number of distinct elements in a sequence $x_1,\dots,x_m$ of elements of $\{0,1,\dots,n-1\}$. Let $h: \{0,1,\dots,n-1\}\to [0,1]$ be a hash function such that $h(i)$ is chosen uniformly and independently at random in $[0,1]$ for each $i$. We start with $Y=1$; after reading each element $x_i$ of the sequence, we set $Y=\min\{Y,h(x_i)\}$.
\begin{enumerate}[a)]
\item Show that, at the end of the stream, $\frac{1}{\E{Y}}-1$ is equal to $F_0$.
\item Use the above idea to design a streaming algorithm that estimates the number of distinct elements in the sequence with multiplicative error $1\pm \eps$. For the analysis, you can assume access to $k$ independent hash functions as described above. Show that $k = O(1/\eps^2)$ such hash functions are enough to estimate the number of distinct elements within a $1+\eps$ factor with probability at least $9/10$.
\end{enumerate}
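A one-pass sketch of the idea in parts (a) and (b), keeping $k$ running minima and averaging them. Note that the $k$ ``hash functions'' are simulated here by memoizing fresh uniform values per element, which forfeits the small space a genuine streaming algorithm needs, so this is only a correctness sketch (all names are my own):

```python
import random

def estimate_distinct(stream, k=200, seed=2):
    """Estimate F0 with k simulated hash functions: keep Y_j = min over the
    stream of h_j(x), average the Y_j (which concentrates near 1/(F0+1)),
    and invert."""
    rng = random.Random(seed)
    hashes = {}        # element -> its k hash values (simulates k hash functions)
    mins = [1.0] * k   # Y_j for each hash function j
    for x in stream:
        if x not in hashes:
            hashes[x] = [rng.random() for _ in range(k)]
        for j, hx in enumerate(hashes[x]):
            if hx < mins[j]:
                mins[j] = hx
    avg = sum(mins) / k
    return 1.0 / avg - 1.0

stream = [i % 100 for i in range(1000)]  # exactly 100 distinct elements
print(estimate_distinct(stream))
```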
\item We are given a directed graph $G=(V,E)$ with $n$ vertices and $m$ edges. For each vertex $v\in V$, let $N_v$ denote the set of all nodes reachable from $v$ (including $v$ itself), and let $n_v=|N_v|$. Our goal is to find $n_v$ for every vertex $v\in V$. The best exact algorithm for this problem runs in time $O(\min\{m\cdot n, n^{2.373}\})$. Here, we will explore a randomized approximation algorithm that runs in time $O(m\log n)$. The algorithm works as follows:
\begin{enumerate}[Step 1)]
\item For each vertex $v\in V$, choose a number $s_v\in[0,1]$ uniformly and independently at random. Call this number the {\em minhash} of $v$.
\item For each vertex $v\in V$, let $h_v:=\min_{w\in N_v} s_w$. The $h_v$ values for all vertices $v\in V$ can be found in time $O(m \log n)$!
\item Let $\tilde{n}_v=\frac{1}{h_v}$.
\end{enumerate}
Repeat all of the above steps $k$ times and, for each $v$, output the median of the $\tilde{n}_v$ values from the $k$ runs as your estimate of $n_v$.
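The three steps above (plus the median over $k$ runs) can be sketched as follows. Minima are propagated to a fixpoint along the edges, which is simple but slower than the $O(m\log n)$ implementation part (a) asks for, so treat this only as a reference for checking answers (names are my own):

```python
import random
import statistics

def estimate_reach_sizes(n, edges, k=20, seed=3):
    """For each vertex v of a directed graph on vertices 0..n-1, estimate
    n_v = |N_v| as the median over k runs of 1/h_v, where h_v is the minimum
    minhash s_w over all w reachable from v."""
    rng = random.Random(seed)
    estimates = [[] for _ in range(n)]
    for _ in range(k):
        s = [rng.random() for _ in range(n)]  # Step 1: minhash s_v of each vertex
        h = s[:]                              # Step 2: h_v = min over N_v of s_w,
        changed = True                        # computed by relaxing along edges
        while changed:
            changed = False
            for v, w in edges:                # edge v -> w: N_w is a subset of N_v
                if h[w] < h[v]:
                    h[v] = h[w]
                    changed = True
        for v in range(n):
            estimates[v].append(1.0 / h[v])   # Step 3: one-run estimate of n_v
    return [statistics.median(e) for e in estimates]

# Tiny example: the path 0 -> 1 -> 2 has n_0 = 3, n_1 = 2, n_2 = 1.
print(estimate_reach_sizes(3, [(0, 1), (1, 2)]))
```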
\begin{enumerate}[a)]
\item Describe an implementation of Step 2 that runs in $O(m \log n)$ time.
\item Implement the above algorithm with $k=20$ to approximate $n_v$ for each vertex $v$ of a directed graph. Also implement an exact algorithm for computing these $n$ numbers. Run your algorithms on the \href{http://courses.cs.washington.edu/courses/cse521/17wi/oregon2\_010331.txt}{oregon2\_010331.txt} file on the course website. The first line of the input gives the number of nodes and directed edges; each following line gives the source and destination of a directed edge, in that order.
Use the following format for your output: in each line, write three numbers separated by spaces for a node $v$: (i) the id of $v$, (ii) $n_v$, and (iii) the estimate $\tilde{n}_v$ returned by your implementation of the randomized algorithm. Sort the lines in increasing order of node id.
Please write down your output in the Canvas textbox for Problem 3.
\item Give the best bound you can on the error of the estimates of $n_v$ output by the approximation algorithm, as a function of $k$. You can use Chebyshev's inequality or Chernoff bounds.
\end{enumerate}
\end{enumerate}
%\printbibliography
\end{document}