\documentclass[letterpaper,11pt]{article}
\newcommand{\removefromfile}[1]{}
\newcommand{\Prob}[1]{\mathbb{P} \left(#1 \right)}
\newcommand{\expect}[1]{\mathbb{E}\left[#1\right]}
\newcommand{\Var}[1]{\mathrm{Var}\left(#1\right)}
\newcommand{\Exp}[1]{\expect{#1}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\V}{\mathrm{Var}}
\renewcommand{\Pr}{\mathbb{P}}
\renewcommand{\P}[1]{\Prob{#1}}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage[most]{tcolorbox}
\usepackage[margin=1.0in]{geometry}
\begin{document}
\title{{\bf Problem Set 5} }
\author{Name: TODO}
\date{}
\maketitle
\section*{Collaborators}
TODO: List collaborators, if any.
\newpage
\section*{Problem 1 (On Day or Off Day?)}
A student is getting ready to take an important oral examination and is concerned about the possibility of having an ``on" day or an ``off" day. If the student has an on day, each of the examiners will pass him independently with probability 0.8, whereas if he has an off day, this probability will be reduced to 0.4. Suppose that the student will pass the exam if a majority of the examiners pass him. If the student feels that he is twice as likely to have an off day as he is to have an on day, should he request an examination with 3 examiners or with 5 examiners?
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\newpage
\section*{Problem 2 (Geometric and Poisson)}
Let $X$ be Geometric with parameter $p$, let $Y$ be Poisson with parameter $\lambda$ and $Z = \max(X,Y)$. Assume that $X$ and $Y$ are independent. For each of the following problems, your final answers should not have summations. You may want to use the Taylor series expansion of $e^x$, that is, that for any $x$, $$e^x = \sum_{k=0}^\infty \frac{x^k}{k!} $$
\begin{enumerate}[label=\alph*.]
\item (3 points) What is $\P{X > k}$? [Think about it from first principles]
\item (6 points) Compute $\P{X > Y}$. Hint: Use the law of total probability to obtain that $$\P{X >Y} = \sum_{k=0}^\infty \P{X > Y | Y=k}\cdot \P{Y=k}.$$
\item (4 points) Compute $\P{Z \ge X}$.
\item (5 points) Compute $\P{Z \le Y}$.
\end{enumerate}
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\newpage
\section*{Problem 3 (Partnerships)}
In a class of $N$ students, there are $M=\binom{N}{2}$ possible ``partnerships" (where a partnership is just a pair of people that will work on a problem set together).
Answer each of the following questions. Make sure that each of your answers is {\textbf not} in the form of a summation for this problem. {\em In each case include the expectation and variance of $N$ as part of your answer.}\\
Hint: Be careful about exactly when rules involving expectations apply.
\begin{enumerate}[label=\alph*.]
\item What is the expected value of $M$ if $N$ equals some fixed positive integer $c$ with probability 1? (Your answer will be a function of $c$.)
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item What is the expected value of $M$ if $N$ has a Poisson distribution with parameter $\lambda$? (Your answer will be a function of $\lambda$.)
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item What is the expected value of $M$ if $N$ has a geometric distribution with parameter $p$? (Your answer will be a function of $p$.)
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item What is the expected value of $M$ if $N= 10 X + 7$, where $X$ is a Bernoulli random variable with parameter $p$? (Your answer will be a function of $p$.)
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\end{enumerate}
\newpage
\section*{Problem 4 (Roll Away)}
Suppose that a fair 6-sided die is rolled repeatedly, with each roll independent of the others.
Let $Z$ be the number of rolls until (and including) the first time either a 2 or a 3 is rolled,
and let $W$ be the number of 6's rolled until the first 2 or 3 is rolled.
So, for example if the sequence of die values until the first 2 or 3 is 1,5,4,4,5,6,1,6,2,
then $Z$ is 9 and $W$ is 2.
Define $$p(j) := \begin{cases}
\mathbb{P} (W = j \mid Z = i)& j \in \{0, 1, \ldots, i-1\}\\
0 & \text{otherwise}
\end{cases}$$
Show that $p(j)$ is the probability mass function of a binomially distributed random variable and determine its parameters $n$ and $p$.
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\newpage
\section*{Problem 5 (Sample Sampling Algorithm)}
Consider the following algorithm for generating a random sample $S$ of size $n$ from the set of integers $\{1, 2, \ldots, N\}$, where $0 < n < N$.
\begin{quote}
{\bf Sample}$(N, n)$:\\
\hspace*{1em} $S \leftarrow \emptyset$ \hfill\ // $S$ is a set of distinct integers, initially an empty set\\
\hspace*{1em} {\bf while} $|S| < n$ {\bf do}\\
%\hspace*{2em} $I = I + 1$ \hfill // $I$ is counting the total number of rolls of the die\\
\hspace*{2em} $x \leftarrow $ RollDie($N$) \hfill\ // $x$ is the outcome of rolling a fair $N$-sided die\\
\hspace*{2em} $S \leftarrow S \cup \{x\}$\hfill\ // if $x$ is already in $S$ it doesn't change\\
\hspace*{1em} {\bf return} $S$
\end{quote}
% I is the number of rolls it takes to get a sum above n.
Let $I$ be the number of die rolls until $S$ is returned. Also, let $I_i$ be the random variable which describes the number of rolls it takes from the time the set $S$ has
$i - 1$ values to the first time a new value is added after that (i.e., the set $S$ has $i$ values).
\begin{enumerate}[label=\alph*.]
\item What type of random variable from our zoo is $I_i$ and what is/are the relevant parameter(s) for that random variable?
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item What is $I$ in terms of the random variables $I_i$?
Calculate $\expect{I}$, expressing the result as a summation that depends on both $N$ and $n$.
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item
What is $\Var{I}$? You can leave your answer in summation form.
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\end{enumerate}
\newpage
\section*{Problem 6 (Bloom Filters)}
Google Chrome has a huge database of malicious URLs, but it takes a long time to do a database lookup (think of this as a typical {\tt Set}). They want to have a quick check in the web browser itself, so a space-efficient data structure must be used. A \textbf{Bloom filter} is a \textbf{probabilistic data structure} which only supports the following two operations:
\begin{itemize}
\item {\tt add(x)}: Add an element $x$ to the structure.
\item {\tt contains(x)}: Check if an element $x$ is in the structure. If either returns ``definitely not in the set'' or ``could be in the set''.
\end{itemize}
It does \textbf{not} support the following two operations:
\begin{itemize}
\item delete an element from the structure.
\item return an element that is in the structure.
\end{itemize}
The idea is that we can check our Bloom filter to see if a URL is in the set. The Bloom filter is always correct in saying a URL definitely isn't in the set, but may have false positives -- it may say that a URL is in the set when it isn't. Only in these rare cases does Chrome have to perform an expensive database lookup to know for sure.
Suppose that we have $k$ \textbf{bit arrays} $t_1,\dots,t_k$ each of length $m$ (all entries are 0 or 1), so the total space required is only $km$ bits or $km/8$ bytes (as a byte is 8 bits). Suppose that the universe of URL's is the set $\mathcal{U}$ (think of this as all strings with less than 100 characters), and we have $k$ \textit{\textbf{independent and uniform}} hash functions $h_1,\dots,h_k:\mathcal{U}\to\{0,1,\dots,m-1\}$. That is, for an element $x$ and hash function $h_i$, pretend $h_i(x)$ is a \textbf{discrete} $\textrm{Unif}[0,m-1]$ random variable. Suppose that we implement the {\tt add} and {\tt contains} function as follows:
\begin{algorithm}
\caption{Bloom Filter Operations}
\begin{algorithmic}[1]
\Function{initialize}{\textsf{k,m}}
\For {$i=1,\dots,k$:}
\State $t_i=\text{new bit array of m 0's}$
\EndFor
\EndFunction
\Function{add}{\textsf{x}}
\For {$i=1,\dots,k$:}
\State $t_i[h_i(x)]=1$
\EndFor
\EndFunction
\Function{contains}{\textsf{x}}
\Return $t_1[h_1(x)] == 1\wedge t_2[h_2(x)]==1 \wedge\dots\wedge t_k[h_k(x)]==1$
\EndFunction
\end{algorithmic}
\end{algorithm}
Refer to Section 9.4 of the textbook and the relevant lecture for more details on Bloom filters.
\begin{enumerate}[label=\alph*.]
\item Implement the functions {\tt add} and {\tt contains} in the {\tt BloomFilter} class of {\tt bloom.py}.
To solve this task, we have set up a corresponding edstem lesson \href{https://edstem.org/us/courses/32027/lessons/51249/slides/286994}{here}. Press the Mark button above the terminal to run the unit tests we have written for you. Passing these unit tests is not enough. We have written a number of different tests for the Gradescope autograder. Your score on Gradescope will be your actual score - you have unlimited attempts to submit.
\begin{tcolorbox}
Nothing to write for this question.
\end{tcolorbox}
\item Let's compare this approach to using a typical {\tt Set} data structure. Google wants to store 1 million URLs, with each URL taking (on average) 23 bytes.
\begin{itemize}
\item How much space (in MB, 1 MB = 1 million bytes) is required if we store all the elements in a set?
\item How much space (in MB) is required if we store all the elements in a Bloom filter with $k=10$ hash functions and $m=800,000$ buckets? Recall that 1 byte = 8 bits.
\end{itemize}
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\item Let's analyze the time improvement as well. Let's say an average Chrome user attempts to visit 36,500 URLs in a year, only 1,000 of which are actually malicious. Suppose it takes half a second for Chrome to make a call to the database (the {\tt Set}), and only 1 millisecond for Chrome to check containment in the Bloom filter. Suppose the false positive rate on the Bloom filter is $4\%$; that is, if a website is not malicious, the Bloom filter will incorrectly report it as malicious with probability $0.04$. What is the time (in seconds) taken if we only use the database, and what is the \textit{expected} time taken (in seconds) to check all 36,500 strings if we used the Bloom filter + database combination described earlier?
\begin{tcolorbox}
TODO: Your solution here.
\end{tcolorbox}
\end{enumerate}
\end{document}