CSE 312 – Section 9 Solutions
Spring 2026
Review of Main Concepts
Law of Total Expectation
- Conditional Expectation: Let \(X\) and \(Y\) be discrete random variables. Then, the conditional expectation of \(X\) given \(Y=y\) is \[\expect {X|Y=y}=\sum _x{x\cdot p_{X|Y}(x|y)}=\sum _x{x\cdot \Pr (X=x|Y=y)}. \] Note that linearity of expectation still applies to conditional expectation: \[\expect {X+Y|A}=\expect {X|A} + \expect {Y|A}.\]
- Discrete Law of Total Expectation (event version): Let \(A_1 \ldots , A_n\) be a partition of the sample space. Then \[\expect {X}=\sum _{i=1}^n \expect {X|A_i} \Pr (A_i).\]
- Discrete Law of Total Expectation (r.v. version): Let \(X\) and \(Y\) be two random variables. Then \[\expect {X}=\sum _{y \in \Omega _Y} \expect {X|Y=y} \cdot \Pr (Y=y).\]
- Continuous Law of Total Expectation: \[\expect {X}=\int _{y \in \Omega _Y}{\expect {X|Y=y}f_Y(y)\dif y}\]
- Expected value of \(X\) conditioned on r.v. \(Y\): Suppose that \(Y\) is a random variable that takes values \(y_1, \ldots , y_k\). Then \(\expect {X|Y}\) is the following random variable \[\expect {X|Y} = \begin {cases} \expect {X|Y=y_1} & \text {with probability }\Pr (Y=y_1)\\ \expect {X|Y=y_2} & \text {with probability }\Pr (Y=y_2)\\ \ldots & \\ \expect {X|Y=y_k} & \text {with probability }\Pr (Y=y_k)\\ \end {cases}\]
- Law of total expectation (rewritten): Given the above definition, we can write \[\expect {X} = \expect {\expect {X | Y}} = \sum _{i=1}^k \expect {X|Y=y_i}\cdot \Pr (Y=y_i).\]
- Covariance: We may not get to this in class, but it’s important to know about. To find out more, check out section 5.4 in the Tsun book. And now the definition: For any two random variables \(X,Y\) the covariance is defined as \[\Covariance {X,Y}=\expect {(X-\expect {X})(Y-\expect {Y})}.\] It can also be shown that \[\Covariance {X,Y}=\expect {XY}- \expect {X}\expect {Y}.\]
-
Conditional distributions: We are not explicitly covering this topic in class, but it is highly recommended that you study it. Much of the above can be more appropriately rewritten in terms of conditional distributions. See Tsun, Section 5.3.
Discrete Continuous Conditional PMF/PDF \(p_{X|Y}(x|y)=\frac {p_{X,Y}(x,y)}{p_Y(y)}\) \(f_{X|Y}(x|y)=\frac {f_{X,Y}(x,y)}{f_Y(y)}\) Conditional Expectation \(\expect {X|Y=y}=\sum _{x}{xp_{X|Y}(x|y)}\) \(\expect {X|Y=y}=\int _{-\infty }^\infty {xf_{X|Y}(x|y)\dif x}\)
Maximum Likelihood Estimation
We assume access to samples \(X_1, X_2, \ldots , X_n\) from a distribution (e.g. Bernoulli, Geometric, Normal, etc) with unknown parameters. Our goal is to estimate this parameters. When the parameter is unknown we denote it by \(\theta \).
- Realization/Sample: A realization/sample \(x\) of a random variable \(X\) is the value that is actually observed.
- Likelihood: Let \(X_1 =x_{1},\ldots X_n=x_{n}\) be iid samples of a random variable with probability mass function \(p_{X}( \text {x\ } ; \theta )\) (if \(X\) discrete) or density \(f_{X}( \text {x\ } ; \theta )\) (if \(X\) continuous), where \(\theta \) is a parameter (or a vector of parameters). We define the likelihood function to be the probability of seeing the data (in the discrete case). If \(X\) is discrete: \[L(X_1 =x_{1},\ldots X_n=x_{n} \,;\, \theta ) = \prod _{i = 1}^{n}{p_{X}( x_{i} \,;\, \theta )}\] If \(X\) is continuous: \[L( X_1 =x_{1},\ldots X_n=x_{n} \,;\, \theta ) = \prod _{i = 1}^{n}{f_{X}( x_{i} \,;\, \theta )}\]
- Log-Likelihood: We define the log-likelihood as the natural logarithm of the likelihood function. Since the logarithm is a strictly increasing function, the value of \(\theta \) that maximizes the likelihood will be exactly the same as the value that maximizes the log-likelihood. If \(X\) is discrete: \[\ln {L( x_{1},\ldots ,x_{n} \,;\, \theta )} = \sum _{i = 1}^{n}{\ln {p_{X}( x_{i} \,;\, \theta )}}\] If \(X\) is continuous: \[\ln {L( x_{1},\ldots ,x_{n} \,;\, \theta )} = \sum _{i = 1}^{n}{\ln {f_{X}( x_{i} \,;\, \theta )}}\]
-
Maximum Likelihood Estimator (MLE): We denote the MLE of \(\theta \) as \({\hat {\theta }}_{\text {MLE}}\) or simply \(\hat {\theta }\), the parameter (or vector of parameters) that maximizes the likelihood function (probability of seeing the data) or equivalently, maximizes the log-likelihood. For a fixed set of realizations/samples \(x_1, \ldots , x_n\) \[\ {\hat {\theta }}_{\text {MLE}} (x_1, \ldots , x_n)= \arg \max _{\theta } L( x_{1},\ldots ,x_{n} \,;\, \theta ) = \arg \max _{\theta } \ln {L( x_{1},\ldots ,x_{n} \,;\, \theta )} \]
Viewing the estimator \(\hat {\theta }_{\text {MLE}}\) as a function of the random variables \(X_1, \ldots , X_n\), the estimator \(\hat {\theta }_{\text {MLE}}(X_1, \ldots , X_n) \) is itself a random variable.
- Bias: The bias of an estimator \(\hat {\theta }\) for a true parameter \(\theta \) is defined as \[\text {Bias}( \hat {\theta },\theta ) = \mathbb {E}[ \hat {\theta }(X_1, \ldots , X_n)] - \theta .\] Here, \(\hat {\theta }\) is treated as a random variable, computed from i.i.d. random variables \(X_1,\dots . X_n\).
- Unbiased Estimator: An estimator \(\hat {\theta }\) of \(\theta \) is unbiased iff \(\text {Bias}( \hat {\theta },\theta ) = 0\), or equivalently \(\mathbb {E}[ \hat {\theta }(X_1, \ldots , X_n)] = \theta \).
-
Steps to find the maximum likelihood estimator \(\hat {\theta }\) :
- a)
- Find the likelihood and log-likelihood of the data.
- b)
- Take the derivative of the log-likelihood
- c)
- Set it to \(0\) to find a candidate for the MLE, \(\hat {\theta }\). (note: at this step, we change from the \(\theta \) to the \(\hat {\theta }\) because in this step we are solving for the maximum likelihood estimator for \(\theta \))
- d)
- Take the second derivative and show that \(\hat {\theta }\) indeed is a maximizer, that \(\frac {\partial ^{2}L}{\partial \theta ^{2}} < 0\) at \(\hat {\theta }\). Also ensure that it is the global maximizer: check points of non-differentiability and boundary values.
Similar steps apply when the distribution has multiple parameters, e.g. a normal distribution with unknown mean and variance. In this case, maximizing the likelihood or log-likelihood involves maximizing a function of multiple variables (the different parameters). See, for example, problem 11 on this worksheet.
Tail bounds ++
-
Union Bound: The union bound is a simple application of inclusion/exclusion and says that for any two events \(A\) and \(B\), it holds that \[\Pr (A\cup B) \le \Pr (A) + \Pr (B).\] The union bound is used to bound probabilities when, e.g. \(\Pr (A \cap B)\) is difficult to compute.
We may or may not get to cover the following tail bounds.
- Markov’s Inequality: Let \(X\) be a non-negative random variable, and \(\alpha >0\). Then, \[\Pr \left ( X \geq \alpha \right ) \leq \frac {\expect {X}}{\alpha }\]
- Chebyshev’s Inequality: Suppose \(Y\) is a random variable with \(\expect {Y} = \mu \) and \(\text {Var}(Y) = \sigma ^{2}\). Then, for any \(\alpha >0\), \[\Pr \left ( \left | Y - \mu \right | \geq \alpha \right ) \leq \frac {\sigma ^{2}}{\alpha ^{2}}\]
-
(Multiplicative) Chernoff Bound: Let \( X_1, X_2, ..., X_n \) be independent Bernoulli random variables. Let \(X = \Sigma _{i=1}^n X_i\), and \( \mu = \expect {X}\). Then, for any \(0 \le \delta \le 1\),
- \(\Pr \left ( X \ge \left ( 1 + \delta \right )\mu \right ) \leq e^{- \ \frac {\delta ^{2}\mu }{3}}\)
- \(\Pr \left ( X \le \left ( 1 - \delta \right )\mu \right ) \leq e^{- \ \frac {\delta ^{2}\mu }{2}}\)
Section plan
- Content review I
- Problem 7
- Problem 8
- Problem 11
- If time, Problem 12
1 Content Review I
- a)
- Law of Total Expectation. Let \(X\) and \(Y\) be discrete random variables.
Which of the following represents the Law of Total Expectation for
\(X\)?
- \(\Exp [X] = \sum _y \Exp [X \mid Y=y] p_X(x)\)
- \(\Exp [X] = \sum _y \Exp [X \mid Y=y] p_Y(y)\)
- \(\Exp [X] = \sum _x \Exp [X \mid Y=y] p_Y(y)\)
- \(\Exp [X] = \sum _y \Exp [X] p_{X|Y}(x|y)\)
Answer: \(\Exp [X] = \sum _y \Exp [X \mid Y=y] p_Y(y)\)
The Law of Total Expectation states that the overall expected value of \(X\) is the weighted average of the conditional expectations of \(X\) given \(Y=y\), weighted by the probability of \(Y=y\) occurring (\(p_Y(y)\)). (Note: Option 1 uses the wrong PMF weight. Option 3 sums over the wrong variable. Option 4 makes no mathematical sense.) - b)
- Linearity of Conditional Expectation. True or False: Linearity of
expectation does not apply to conditional expectations. That is, \(\Exp [X+Y \mid A]\) is
generally NOT equal to \(\Exp [X \mid A] + \Exp [Y \mid A]\).
- True
- False
Answer: False
Linearity of expectation holds unconditionally and conditionally. As long as the expectations exist, \(\Exp [X+Y \mid A] = \Exp [X \mid A] + \Exp [Y \mid A]\) is always true. - c)
- Suppose \(X\) and \(Y\) are random variables and \(A\) is an event. Given that \(\expect {X|A} = 4\) and \(\expect {Y|A} = 10\),
what is \(\expect {2X+\frac {Y}{2} \:\middle |\: A}\)?
- 14
- 18
- 9
- 13
Choice d is the correct answer since the linearity of expectation still applies to conditional expectation: \[\expect {2X+Y/2 |A} = \expect {2X|A} + \expect {Y/2|A} = 2\expect {X|A} + \expect {Y|A}/2 = 2 \cdot 4 + 10/2 = 8+5 = 13.\]
- d)
- True or False: We get a different answer if we maximize log-likelihood
rather than likelihood, but because it is close enough and easier to
compute, we use it for our estimate of \(\theta \).
False: Since the logarithm is a strictly increasing function, the value of \(\theta \) that maximizes the likelihood will be exactly the same as the value that maximizes the log-likelihood.
- e)
- True or False: [Notation question] \(\hat {\theta }\) is the true parameter and \(\theta \) is our
estimate.
False: It is the other way around. Remember to switch to \(\hat {\theta }\) when you set your equation to zero!
- f)
- True or False: An estimator is unbiased if \(\text {Bias}( \hat {\theta },\theta ) = \mathbb {E}[ \hat {\theta }(X_1,\dots ,X_n)] - \theta \) = 0 or equivalently
\(\mathbb {E}[\hat {\theta }(X_1, \ldots , X_n)] = \theta \)
True by definition.
- g)
- You flip a coin 10 times and observe HHHTHHTHHH (8 heads, 2
tails). What is the MLE of \(\theta \), where \(\theta \) is the true probability of seeing
tails?
- \(\hat {\theta } = .2\)
- \(\hat {\theta } = .25\)
- \(\hat {\theta } = .8\)
- \(\hat {\theta } = .3\)
Option 1: \(\hat {\theta } = .2\)
- h)
- True or False: The Union Bound always gives a result in \([0,1]\).
False. Consider \(X\) and \(Y\), which are independent indicator random variables. Suppose \(p_X(x) = \begin {cases} 0.75 & x=0 \\ 0.25 & x=1 \end {cases}\) and \(p_Y(y) = \begin {cases} 0.75 & y=0 \\ 0.25 & y=1 \end {cases}\). Then we may apply the Union Bound to place a bound on \(P(X = 0 \cup Y=0)\): \[P(X = 0 \cup Y=0) \leq P(X=0) + P(Y=0) = 0.75 + 0.75 = 1.5.\] In these cases, the Union Bound tells us very little, since the probability of any event occurring is at most \(1\).
2 Content Review II (material not properly covered in class)
- a)
- Suppose \(C\) and \(D\) are discrete random variables. Then \(\expect {C|D = d} =\)
- \(\sum _d dp_{D|C} (d|c)\)
- \(\sum _c cp_{C|D} (c|d)\)
- \(\int _{-\infty }^{\infty } cf_{C|D} \dif x\)
- \(\frac {\expect {C}}{\expect {D}}\)
Choice b is the correct answer from the definition of conditional expectation for discrete random variables.
- b)
- True or false: Markov’s Inequality always gives a non-negative result.
True. Markov’s Inequality is \[\Pr (X \geq \alpha ) \leq \frac {\expect {X}}{\alpha }\] as long as \(X\) is a non-negative random variable and \(\alpha > 0\). Since \(X\) is a non-negative random variable, \(\expect {X} \geq 0\), so \(\frac {\expect {X} }{\alpha } \geq 0\).
- c)
- True or false: Chebyshev’s Inequality can best be described as giving an
upper bound on the distribution’s right tail.
False. Chebyshev’s Inequality gives an upper bound on the sum of the probabilities of the left and right tails of the distribution.
3 Trapped Miner
A miner is trapped in a mine containing 3 doors.
- \({D}_{1}\): The \(1^{st}\) door leads to a tunnel that will take him to safety after 3 hours.
- \({D}_{2}\): The \(2^{nd}\) door leads to a tunnel that returns him to the mine after 5 hours.
- \({D}_{3}\): The \(3^{rd}\) door leads to a tunnel that returns him to the mine after a number of hours that is Binomial with parameters \((12, \frac {1}{3})\).
At all times, he is equally likely to choose any one of the doors. What is the expected number of hours for this miner to reach safety?
Let T = number of hours for the miner to reach safety. (T is a random
variable)
Let \({D}_{i}\) be the event the \({i}^{th}\) door is chosen. \(i \in \{1, 2, 3\}\). Finally, let \(T_3\) be the time it takes to return
to the mine in the third case only (a random variable). Note that the expectation
of \(T_3\) is \(12 * \frac {1}{3}\) because it is binomially distributed with parameters \(n = 12, p = \frac {1}{3}\). By Law of Total
Expectation, linearity of expectation, and by applying the conditional
expectations given by the problem statement: \[ \begin {aligned} \expect {T} &= \expect {T|{D}_{1}}\Pr ({D}_{1}) + \expect {T|{D}_{2}}\Pr ({D}_{2}) + \expect {T|{D}_{3}}\Pr ({D}_{3}) \\ &= 3 \cdot \frac {1}{3} + (5 + \expect {T}) \cdot \frac {1}{3} + (\expect {T_3 + T}) \cdot \frac {1}{3} \\ &= 3 \cdot \frac {1}{3} + (5 + \expect {T}) \cdot \frac {1}{3} + (\expect {T_3} + \expect {T}) \cdot \frac {1}{3} \\ &= 3 \cdot \frac {1}{3} + (5 + \expect {T}) \cdot \frac {1}{3} + (4 + \expect {T}) \cdot \frac {1}{3} \end {aligned} \] Solving this equation for \(\expect {T}\), we get
\(\expect {T} = 12.\)
Therefore, the expected number of hours for this miner to reach safety is 12.
4 The Compound Server
A web server receives requests according to a Poisson process with a rate of \(\lambda \) requests per minute. Each incoming request is independently classified as a “database query” with probability \(p\), or a “static asset request” with probability \(1-p\). Let \(N\) be the total number of requests in a minute, and \(X\) be the number of database queries in a minute.
- a)
- Derive the marginal distribution of \(X\). Also, write down the marginal distribution of \(Y\), defined as the number of static asset requests in a minute. (You only need to prove your answer for \(X\).)
- b)
- Show that \(X\) and \(Y\) are independent.
- c)
- Given that exactly \(k\) database queries were received in a particular minute, what is the expected total number of requests \(\Exp [N \mid X = k]\) received during that minute?
- a)
- Marginal Distribution: We are given \(N \sim \text {Poisson}(\lambda )\) and \(X\) conditioned on the value of \(N\) is \(\text {Binomial}(N, p)\). Using the Law of Total Probability: \[ \begin {aligned} \Pr (X = x) &= \sum _{n=x}^{\infty } \Pr (X = x \mid N = n) \Pr (N = n) \\ &= \sum _{n=x}^{\infty } \binom {n}{x} p^x (1-p)^{n-x} \frac {e^{-\lambda } \lambda ^n}{n!} \end {aligned}. \] Rearranging terms and factoring out constants relative to the sum: \[ \begin {aligned} \Pr (X = x) &= \frac {p^x e^{-\lambda }}{x!} \sum _{n=x}^{\infty } \frac {(1-p)^{n-x} \lambda ^n}{(n-x)!} \end {aligned}. \] Let \(j = n - x\): \[ \begin {aligned} \Pr (X = x) &= \frac {(\lambda p)^x e^{-\lambda }}{x!} \sum _{j=0}^{\infty } \frac {(\lambda (1-p))^j}{j!} \\ &= \frac {(\lambda p)^x e^{-\lambda }}{x!} e^{\lambda (1-p)} \\ &= \frac {(\lambda p)^x e^{-\lambda p}}{x!} \end {aligned}. \] This proves \(X \sim \text {Poisson}(\lambda p)\), a classic result known as Poisson thinning. Similarly \(Y\sim \text {Poisson}(\lambda (1-p)).\)
- b)
- Independence: To show that \(X\) and \(Y\) are independent, we must show
that their joint PMF is equal to the product of their marginal PMFs:
\(\Pr (X=x, Y=y) = \Pr (X=x)\Pr (Y=y)\) for all integers \(x, y \ge 0\).
We can find the joint PMF by relating \(X\) and \(Y\) back to \(N\). Note that the event \(\{X=x \cap Y=y\}\) is equivalent to the event \(\{X=x \cap N=x+y\}\). \[ \begin {aligned} \Pr (X=x, Y=y) &= \Pr (X=x, N=x+y) \\ &= \Pr (X=x \mid N=x+y) \Pr (N=x+y) \\ &= \binom {x+y}{x} p^x (1-p)^y \frac {e^{-\lambda } \lambda ^{x+y}}{(x+y)!} \end {aligned} \]
Expanding the binomial coefficient, substituting \(\lambda ^{x+y} = \lambda ^x \lambda ^y\), and splitting the \(e^{-\lambda }\) term into \(e^{-\lambda p} e^{-\lambda (1-p)}\): \[ \begin {aligned} \Pr (X=x, Y=y) &= \frac {(x+y)!}{x!y!} p^x (1-p)^y \frac {e^{-\lambda p} e^{-\lambda (1-p)} \lambda ^x \lambda ^y}{(x+y)!} \\ &= \left ( \frac {(\lambda p)^x e^{-\lambda p}}{x!} \right ) \left ( \frac {(\lambda (1-p))^y e^{-\lambda (1-p)}}{y!} \right ) \\ &= \Pr (X=x) \Pr (Y=y) \end {aligned} \]
Since the joint PMF factors perfectly into the product of the marginal PMFs for all valid \(x\) and \(y\), \(X\) and \(Y\) are independent.
- c)
- Conditional Expectation: Let \(Y = N - X\) be the number of static asset requests. By the same thinning logic, \(Y \sim \text {Poisson}(\lambda (1-p))\), and crucially, \(X\) and \(Y\) are independent. Because \(N = X + Y\): \[ \begin {aligned} \Exp [N \mid X=k] &= \Exp [X + Y \mid X=k] \\ &= \Exp [k + Y \mid X=k] \end {aligned}. \] Due to the independence of the thinned processes, \(\Exp [Y \mid X=k] = \Exp [Y] = \lambda (1-p)\). \[ \begin {aligned} \Exp [N \mid X=k] &= k + \lambda (1-p) \end {aligned}. \]
5 Lemonade Stand
Suppose I run a lemonade stand, which costs me $100 a day to operate. I sell a drink of lemonade for $20. Every person who walks by my stand either buys a drink or doesn’t (no one buys more than one). If it is raining, \(n_1\) people walk by my stand, and each buys a drink independently with probability \(p_1\). If it isn’t raining, \(n_2\) people walk by my stand, and each buys a drink independently with probability \(p_2\). It rains each day with probability \(p_3\), independently of every other day. Let \(X\) be my profit over the next week. In terms of \(n_1, n_2, p_1, p_2\) and \(p_3\), what is \(\expect {X}\)?
Let \(R\) be the event it rains. Let \(X_i\) be how many drinks I sell on day \(i\) for \(i=1,...,7\). We are interested in \(X=\sum _{i=1}^7{(20X_i-100)}\). We have \(X_i|R\sim \textsf {Binomial}(n_1,p_1)\), so \(\expect {X_i|R}=n_1p_1\). Similarly, \(X_i|R^C\sim \textsf {Binomial}(n_2,p_2)\), so \(\expect {X_i|R^C}=n_2p_2\). By the law of total expectation, \[\mu =\expect {X_i}=\expect {X_i|R}\Pr (R)+\expect {X_i|R^C}\Pr (R^C)=n_1p_1p_3+n_2p_2(1-p_3)\] Hence, by linearity of expectation, \[\expect {X}=\expect {\sum _{i=1}^7{(20X_i-100)}}=20\sum _{i=1}^7{\expect {X_i}}-700=140\mu -700\] \[= 140\cdot (n_1p_1p_3+n_2p_2(1-p_3))-700.\]
6 The Dice and Coin Cascade
Alice repeatedly rolls a fair six-sided die until she rolls a 6. Let \(N\) be the total number of rolls she makes (including the final roll of 6). Once Alice is finished, Bob flips a fair coin exactly \(N\) times. Let \(H\) be the total number of Heads Bob flips.
- a)
- Compute the exact probability that Bob flips zero Heads, \(\Pr (H = 0)\).
- b)
- Compute the expected value \(\Exp [H]\) and variance \(\Var (H)\).
- a)
- Probability of Zero Heads: We have \(N \sim \text {Geometric}(1/6)\) and \(H\) conditioned on \(N\) is \(\text {Binomial}(N, 1/2)\). \[ \begin {aligned} \Pr (H = 0) &= \sum _{n=1}^{\infty } \Pr (H = 0 \mid N = n) \Pr (N = n) \\ &= \sum _{n=1}^{\infty } \left (\frac {1}{2}\right )^n \left (\frac {5}{6}\right )^{n-1} \left (\frac {1}{6}\right ) \end {aligned}. \] Factoring out constants to create a clean geometric series: \[ \begin {aligned} \Pr (H = 0) &= \frac {1/6}{5/6} \sum _{n=1}^{\infty } \left (\frac {5}{12}\right )^n \\ &= \frac {1}{5} \left ( \frac {5/12}{1 - 5/12} \right ) \\ &= \frac {1}{5} \left ( \frac {5/12}{7/12} \right ) = \frac {1}{7} \end {aligned}. \]
- b)
- Expectation and Variance: Using the Law of Total Expectation: \[ \begin {aligned} \Exp [H] &= \Exp [\Exp [H \mid N]] \\ &= \sum _{n=1}^\infty \Exp [ H |N=n]\Pr (N = n)\\ &= \sum _{n=1}^\infty \frac {n}{2} \Pr (N = n)\\ &= \frac {1}{2} \Exp [N] = \frac {1}{2}(6) = 3 \end {aligned}. \] To get the variance, we can use the Law of Total Variance (which we have not covered in class): \[ \begin {aligned} \Var (H) &= \Exp [\Var (H \mid N)] + \Var (\Exp [H \mid N]) \\ &= \Exp \left [N\left (\frac {1}{2}\right )\left (\frac {1}{2}\right )\right ] + \Var \left (\frac {N}{2}\right ) \\ &= \frac {1}{4} \Exp [N] + \frac {1}{4} \Var (N) \end {aligned}. \] For \(N \sim \text {Geometric}(p)\), \(\Var (N) = \frac {1-p}{p^2} = \frac {5/6}{1/36} = 30\). Therefore, \[ \begin {aligned} \Var (H) &= \frac {1}{4}(6) + \frac {1}{4}(30) = \frac {36}{4} = 9 \end {aligned}. \]
7 Nested Uniforms
Let \(X \sim \textsf {Unif}(0, 1)\). We draw a random variable \(Y\) such that, conditioned on \(X=x\), \(Y \sim \textsf {Unif}(0, x)\). What is the expected value of \(Y\), \(\expect {Y}\)?
We can compute this using the Continuous Law of Total Expectation: \[ \expect {Y} = \int _{-\infty }^{\infty } \expect {Y \mid X=x} f_X(x) \dif x \] Because \(X \sim \textsf {Unif}(0, 1)\), its PDF is \(f_X(x) = 1\) on the interval \([0, 1]\) and \(0\) elsewhere. Therefore, we can restrict our bounds of integration to 0 and 1.
We are given that \(Y \mid X=x \sim \textsf {Unif}(0, x)\). The expected value of a continuous uniform random variable on \((a, b)\) is \(\frac {a+b}{2}\), so the conditional expectation is: \[ \expect {Y \mid X=x} = \frac {0 + x}{2} = \frac {x}{2} \]
Substituting this back into the Law of Total Expectation integral: \[ \begin {aligned} \expect {Y} &= \int _{0}^{1} \frac {x}{2} \cdot 1 \dif x \\ &= \frac {x^2}{4} \bigg |_0^1 \\ &= \frac {1}{4} - 0 = \frac {1}{4} \end {aligned} \]
8 A Red Poisson
Suppose that \(x_{1},\ldots ,x_{n}\) are i.i.d. samples from a Poisson(\(\theta \)) random variable, where \(\theta \) is unknown. Find the MLE for \(\theta \).
Because each Poisson RV is i.i.d., the likelihood of seeing that data is just the PMF of the Poisson distribution multiplied together for every \(x_i\). From there, take the log-likelihood, then the first derivative, set it equal to 0 and solve for \(\hat {\theta }\). \[ \begin {aligned} L( x_{1},\ldots ,x_{n} \,;\, \theta ) & = & \prod _{i = 1}^{n} e^{-\theta }\frac {\theta ^{x_i}}{x_i!}\\ \ln {L( x_{1},\ldots ,x_{n} \,;\, \theta )} & = & \sum _{i = 1}^{n}{\left [-\theta - \ln (x_i!) + x_i \ln (\theta )\right ]}\\ \frac {\partial }{\partial \theta }\ln {L( x_1,\ldots ,x_n \,;\, \theta )} & = & \sum _{i = 1}^n{\left [-1 + \frac {x_i}{\theta }\right ]}\\ -n + \frac {\Sigma _{i = 1}^n x_i}{\hat {\theta }} & = & 0 \\ \hat {\theta } &= &\frac {\Sigma _{i = 1}^n x_i}{n}\;. \end {aligned} \] Notice that the \(-\ln (x_i!)\) term disappears since it is a constant relative to \(\theta \), of which we take the derivative.
9 Independent Shreds, You Say?
You are given 100 independent samples \(X_1=x_1, X_2=x_2, \ldots , X_{100}=x_{100}\) from Bernoulli\((\theta )\), where \(\theta \) is unknown. (Each sample is either a 0 or a 1). These 100 samples sum to 30. We saw in class that the maximum likelihood estimate for \(\theta \) is \[\hat {\theta }(x_1, \ldots , x_{100}) = \frac {1}{100}\sum _{i=1}^{100}x_i = \frac {30}{100}.\] Is \(\hat {\theta }(X_1, \ldots , X_{100}) = \frac {\sum _{i=1}^{100} X_i}{100}\) an unbiased estimator of \(\theta \)?
An estimator is unbiased if the expectation of the estimator (as a function of the random variables \(X_1, \ldots , X_{100}\) is equal to the original parameter, i.e., \(E[\hat {\theta }(X_1, \ldots , X_{100})] = \theta \). Setting up the expectation of our estimator and plugging it in for the generic case, we get the following: \[ \begin {aligned} \mathbb {E}[\hat {\theta }(X_1, \ldots , X_{100})] &= \expect {\frac {1}{n}\sum _{i=1}^{100} X_i}\\ &= \frac {1}{100}\sum _{i=1}^{100} \expect {X_i}\\ &= \frac {1}{100} \cdot 100\theta = \theta . \end {aligned} \] So it is unbiased.
10 Laplace MLE
Suppose \(x_{1},\ldots ,x_{2n}\) are iid realizations from the Laplace density (double exponential density): for \(x \in \mathbb {R}\),
\[f_{X}\left ( x \mid \theta \right ) = \frac {1}{2}e^{- |x - \theta |}\]
Find the MLE for \(\theta \). For this problem, you need not verify that the MLE is indeed a maximizer. You may find the sign function useful:
\[\operatorname {sgn}{(x)} = \left \{ \begin {matrix} + 1,\ \ & x \geq 0 \\ - 1,\ \ & x < 0 \\ \end {matrix} \right .\ \]
We begin by setting up the likelihood as we do in any case. Since these are i.i.d. realizations, we can multiply all their PDFs together. From there, take the log-likelihood, then the first derivative, where we notice that the derivative of \(-\ln 2 - |x_i - \theta |\) is just the sign function of \(x_i - \theta \). Then, set that equal to 0 and solve for \(\hat {\theta }\). \[ \begin {aligned} L\left ( x_{1},\ldots ,x_{2n}\ \right |\ \theta ) & = & \prod _{i = 1}^{2n}{\frac {1}{2}e^{- |x_{i} - \theta |}}\\ \ln {L\left ( x_{1},\ldots ,x_{2n}\ \right |\ \theta )} & = & \sum _{i = 1}^{2n}{\lbrack - \ln 2 - |x_{i} - \theta |\rbrack }\\ \frac {\partial }{\partial \theta }\ln {L\left ( x_{1},\ldots ,x_{2n}\ \right |\ \theta )} & = & \sum _{i = 1}^{2n}{\operatorname {sgn}{(x_{i} - \theta )}} = 0\\ \hat {\theta } & = & \text {any\ }\text {value}\ \text {in}\ \lbrack x_{n}^{'},x_{n + 1}^{'}\rbrack \end {aligned} \] where \(x_{i}^{'}\) is the \(i^{\text {th}}\) order statistic: the \(i^{\text {th}}\) smallest observation (see 5.10 in the textbook for more details). If you wanted to argue that this is a global maximizer, note that the log likelihood is the sum of concave functions, so every critical point is a global maximizer.
11 Bird Watching
You are an ornithologist studying a rare species of birds in a nature reserve. Over a period of \(50\) days, you record the number of sightings of this bird (you see \(x_1, x_2, ..., x_{50}\) birds on each day). Your research has shown that the number of sightings of this species depends on the average number of monkeys in the reserve, \(\theta _1\), and the average number of eagles in the reserve, \(\theta _2\). After years of studying this rare species in other environments, you’ve found that the number of birds observed on a particular day follows the following distribution: \[p_X(k)=\frac {1}{k!}(\theta _1^k \cdot e^{-\theta _1}\cdot \theta _2^k \cdot e^{-3\theta _2})\] Find the MLE for \(\theta _1\) and \(\theta _2\) (i.e., find \(\hat {\theta _1}\) and \(\hat {\theta _2}\)).
- a)
- What is the likelihood function?
Once again, the likelihood of seeing the above samples is just their PDFs multiplied together. \[L(x;\theta _1, \theta _2) =\prod _{i=1}^{50}\left (\frac {1}{x_i!}(\theta _1^{x_i} e^{-\theta _1}\cdot \theta _2^{x_i} e^{-3\theta _2})\right )\]
- b)
- What is the log-likelihood function?
We take the log of the above and simplify: \begin {align} \ln (L(x;\theta _1, \theta _2)) &=\ln (\prod _{i=1}^{50}(\frac {1}{x_i!}(\theta _1^{x_i} \cdot e^{-\theta _1}\cdot \theta _2^{x_i} \cdot e^{-3\theta _2})))\\ &= \sum _{i=1}^{50}(\ln (\frac {1}{x_i!}) + \ln (\theta _1^{x_i}) + \ln (e^{-\theta _1}) + \ln (\theta _2^{x_i}) + \ln (e^{-3\theta _2}))\\ &=\sum _{i=1}^{50}(\ln (\frac {1}{x_i!})+ x_i\ln (\theta _1) -\theta _1 + x_i \ln (\theta _2) + -3\theta _2) \end {align}
- c)
- We want to find values of \(\theta _1\) and \(\theta _2\) that maximize the likelihood function. To
do this, we will take the partial derivative with respect to each of these
parameters and solve for the values that make them both zero. First, take
the partial derivative of the likelihood function with respect to
\(\theta _1\).
When taking the partial derivative with respect to a certain variable, we take the derivative as usual, but treat other variables as constants! So here, we treat \(\theta _2\) as a constant and differentiate with respect to \(\theta _1\). \begin {align} \frac {\partial }{\partial \theta _1}(\ln (L(x;\theta _1, \theta _2))) &= \sum _{i=1}^{50}(\frac {x_i}{\theta _1} -1) \\ &= \sum _{i=1}^{50}(\frac {x_i}{\theta _1}) - 50 \end {align}
- d)
- Now, take the partial derivative with respect to \(\theta _2\).
Now, we treat \(\theta _1\) as a constant and take the derivative with respect to \(\theta _2\). \begin {align} \frac {\partial }{\partial \theta _2}(\ln (L(x;\theta _1, \theta _2))) &=\sum _{i=1}^{50} ( \frac {x_i}{\theta _2} -3)\\ &=\sum _{i=1}^{50} ( \frac {x_i}{\theta _2} ) -150 \end {align}
- e)
- Set both these partial derivatives to 0, and solve for \(\hat {\theta _1}\) and \(\hat {\theta _2}\).
We end up with the equations (notice we added the hats to the thetas at this point - since we set these derivatives to 0, we are now solving for the maximum likelihood estimator): \[\sum _{i=1}^{50}(\frac {x_i}{\hat {\theta _1}}) - 50 = 0\] \[\sum _{i=1}^{50} ( \frac {x_i}{\hat {\theta _2}} ) -150 = 0\] We now solve this system of equations for \(\hat {\theta _1}\) and \(\hat {\theta _2}\). The first equation gives us: \[\sum _{i=1}^{50}(\frac {x_i}{\hat {\theta _1}}) = 50\] \[(\frac {\sum _{i=1}^{50}x_i}{\hat {\theta _1}}) = 50\] \[ \hat {\theta _1}=(\frac {\sum _{i=1}^{50}x_i}{50}) \] With similar steps, we get that: \[ \hat {\theta _2}=(\frac {\sum _{i=1}^{50}x_i}{150}) \]
Remark: Similar to the single-variable case, we treat the points where both partial derivatives are 0 (also known as the critical point) directly as the maximum of the function. This is just a convenient simplification for 312. For general functions, we need to do the second derivative test to confirm whether a critical point is indeed a local maximum/minimum. For multi-variable functions, the second derivative test is much more complicated - it is not enough to take the partial second derivatives, you would have to check the Hessian matrix.
12 A biased estimator
In class, we showed that the maximum likelihood estimate of the variance \(\theta _2\) of a normal distribution (when both the true mean \(\mu \) and true variance \(\sigma ^2\) are unknown) is what’s called the population variance. That is \[\hat \theta _2 = \left (\frac {1}{n} \sum _{i=1}^n (x_i - \hat \theta _1)^2 \right )\] where \(\hat \theta _1 = \frac {1}{n}\sum _{i=1}^n x_i\) is the MLE of the mean. Is \(\hat \theta _2\) unbiased?
Let \(\overline X = \frac {1}{n}\sum _{i=1}^n X_i\). Then the estimator \(\hat {\theta _2}\) as a random variable can be written as \[ \hat {\theta _2}\left (X_1,\dots , X_n\right ) = E\left (\frac {1}{n} \sum _{i=1}^n (X_i - \overline X)^2 \right ) \] Therefore, \[ \begin {aligned} E(\hat \theta _2) = E\left (\frac {1}{n} \sum _{i=1}^n (X_i - \overline X)^2 \right ) & = E\left ( \frac {1}{n} \sum _{i=1}^n (X_i^2 - 2 X_i \overline X + \overline X^2 )\right ) \\ &= \frac {1}{n} \sum _{i=1}^n E(X_i^2) - E\left (\frac {2}{n} \overline X \sum _{i=1}^n X_i\right ) + E(\overline X^2 )\\ & = \frac {1}{n} \sum _{i=1}^n E(X_i^2) - 2 E(\overline X^2) + E(\overline X^2 )\\ & = \frac {1}{n} \sum _{i=1}^n E(X_i^2) - E(\overline X^2). \quad \quad (**) \end {aligned} \] The second equality follows from the first by linearity of expectation (and distributing the sum). We know that for any random variable \(Y\), since \(Var(Y) = E(Y^2) - (E(Y))^2\) it holds that \[ E(Y^2) = Var(Y) + (E(Y))^2.\] Also, we have \(E(X_i) = \mu , ~Var(X_i) = \sigma ^2\) \(\forall i\) and \(E(\overline X) = \mu , ~Var(\overline X) = \frac {\sigma ^2}{n}\). Combining these facts, we get \[E(X_i^2) = \sigma ^2 + \mu ^2~~\forall i\quad \text { and } \quad E(\overline X^2) = \frac {\sigma ^2}{n} + \mu ^2.\] Substituting these equations into (**) we get \[ \begin {aligned} E\left (\frac {1}{n} \sum _{i=1}^n (X_i - \overline X)^2 \right ) & = \frac {1}{n} \sum _{i=1}^n E(X_i^2) - E(\overline X^2) = \sigma ^2 + \mu ^2 - \left (\frac {\sigma ^2}{n} + \mu ^2\right ) \\ & = \left (1-\frac {1}{n}\right ) \sigma ^2. \end {aligned} \] Thus \(\hat \theta _2\) is not unbiased.
13 It Means Nothing
Suppose \(x_1, x_2, \ldots , x_n\) are samples from a normal distribution whose mean is known to be \(\mu \) but the variance is unknown. How does the maximum likelihood estimator for the variance differ from the maximum likelihood estimator when both mean and variance are unknown? Which, if any, is unbiased?
Begin with the same derivation as before, however we now use \(\mu \) instead of the mean of 0, which gets us: \[ \begin {aligned} L(x_1, \ldots , x_{n} \,;\, \sigma ^2) &=& \prod _{i=1}^n \frac {1}{\sqrt {2 \pi \sigma ^2}}\exp {\frac {-(x_i-\mu )^2}{2\sigma ^2}} \\ \ln {L(x_1, \ldots , x_{n} \,;\, \sigma ^2)} &=& \sum _{i = 1}^{n} -\ln {\sqrt {2 \pi \sigma ^2}} - \frac {(x_i - \mu )^2}{2\sigma ^2} \\ &=& \sum _{i = 1}^{n} -\frac {1}{2}\ln {2 \pi \sigma ^2} - \frac {(x_i - \mu )^2}{2\sigma ^2} \\ &=& \sum _{i = 1}^{n} -\frac {1}{2}\ln {2 \pi } - \frac {1}{2}\ln {\sigma ^2} - \frac {(x_i - \mu )^2}{2\sigma ^2} \\ &=& -\frac {n}{2}\ln {2 \pi } - \frac {n}{2}\ln {\sigma ^2} - \frac {\Sigma _{i = 1}^{n} (x_i - \mu )^2}{2 \sigma ^2} \\ \frac {\partial }{\partial \sigma ^2} \ln {L(x_1, \ldots , x_{n} \,;\, \sigma ^2)} & = & - \frac {n}{2 \sigma ^2} + \frac {\Sigma _{i = 1}^{n} (x_i - \mu )^2}{2 \sigma ^4} \\ - \frac {n}{2 \hat {\sigma }^2} + \frac {\Sigma _{i = 1}^{n} (x_i - \mu )^2}{2 \hat {\sigma }^4} & = & 0\\ \frac {\Sigma _{i = 1}^{n} (x_i - \mu )^2}{2 \hat {\sigma }^4} & = & \frac {n}{2\hat {\sigma }^2} \\ \hat {\sigma }^2 & = & \frac {1}{n}\sum _{i = 1}^{n} (x_i - \mu )^2 \end {aligned} \] Then, we do the same but with two parameters, \(\theta _1, \theta _2\), the former being the mean, and the latter being the variance. We can take the derivative with respect to \(\theta _2\), and do effectively the same as before. \begin {align*} L(x_1, \ldots , x_{n} \,;\, \theta _1, \theta _2) &=& \prod _{i=1}^n \frac {1}{\sqrt {2 \pi \theta _2}}\exp {\frac {-(x_i-\theta _1)^2}{2\theta _2}} \\ \ln {L(x_1, \ldots , x_{n} \,;\, \theta _1, \theta _2)} &=& \sum _{i = 1}^{n} -\ln {\sqrt {2 \pi \theta _2}} - \frac {(x_i - \theta _1)^2}{2\theta _2} \\ &=& \sum _{i = 1}^{n} -\frac {1}{2}\ln {2 \pi \theta _2} - \frac {(x_i - \theta _1)^2}{2\theta _2} \\ &=& \sum _{i = 1}^{n} -\frac {1}{2}\ln {2 \pi } - \frac {1}{2}\ln {\theta _2} - \frac {(x_i - \theta _1)^2}{2\theta _2} \\ &=& -\frac {n}{2}\ln {2 \pi } - \frac {n}{2}\ln {\theta _2} - \frac {\Sigma _{i = 1}^{n} (x_i - \theta _1)^2}{2 \theta _2} \\ \frac {\partial }{\partial \theta _2} \ln {L(x_1, \ldots , x_{n} \,;\, \theta _1, \theta _2)} & = & - \frac {n}{2 \theta _2} + \frac {\Sigma _{i = 1}^{n} (x_i - \theta _1)^2}{2 \theta _2^2}\\ - \frac {n}{2 \hat {\theta _2}} + \frac {\Sigma _{i = 1}^{n} (x_i - \hat {\theta _1})^2}{2 \hat {\theta _2}^2} &=&0\\ \frac {\Sigma _{i = 1}^{n} (x_i - \hat {\theta _1})^2}{2 \hat {\theta _2}^2} & = & \frac {n}{2\hat {\theta _2}} \\ \hat {\theta _2} & = & \frac {1}{n}\sum _{i = 1}^{n} (x_i - \hat {\theta _1})^2 \end {align*}
Now, we just need to find if both estimators are biased or unbiased. We do this by seeing if their expected value is equal to the original parameter or not. Let’s start with the former. We move the expectation in with linearity of expectation, and then can identify that the remaining expectation is just the definition of variance (expected deviation from the mean squared) and see that it is unbiased. \[E[\hat {\sigma }^2] = \expect {\frac {1}{n}\sum _{i = 1}^{n} (X_i - \mu )^2} = \frac {1}{n}\sum _{i = 1}^{n} \expect {(X_i - \mu )^2} = \frac {1}{n}\sum _{i = 1}^{n} \sigma ^2 = \frac {1}{n} n \sigma ^2 = \sigma ^2\] For the second estimator, following the same computation in Task 12, we obtain \[E[\hat {\theta _2}] = \left (1-\frac 1n\right ) \sigma ^2\] suggesting that \(\hat {\theta _2}\) is not unbiased.
14 Maximum Likelihood Estimators
- a)
- Let \(x_{1},\ldots ,x_{n}\) i.i.d. samples from a random variable that follows a Rademacher
distribution with unknown parameter \(p \in [0,1]\), i.e., a distribution from the
family \[ \Prob {x; p} = \left \{ \begin {array}{ll} p & \textrm {if $x = +1$}\\ 1-p & \textrm {if $x = -1$} \end {array} \right . \;, \] What is the maximum likelihood estimator for \(p\)? Write this as
a function of the \(x_i\)’s and \(n\).
We can use the solution from class for Bernoulli random variables (which we used for the case of a magic coin with unknown bias). We just denote as \(n^+\) the number of \(+1\)’s. We then now that the MLE \(\hat {p}\) of \(p\) is \(\hat {p} = n^+/n\). Note that \[ \sum _{i=1}^n x_i = n^+ - (n - n^+) = 2n^+ - n \;, \] or equivalently, \[ n^+ = \frac {n + \sum _{i=1}^n x_i}{2} \;. \] And thus, \[ \hat {p} = \frac {n + \sum _{i=1}^n x_i}{2n} = \frac {1}{2} + \frac {\sum _{i=1}^n x_i}{2n} \;. \]
- b)
- Let \(x_1,\ldots ,x_n\) be i.i.d samples that follow a \(\mathsf {Two}(\theta )\) distribution with unknown parameter \(\theta \in [0,1]\),
where the probabilities from the family are given by \[\Prob {x;\theta }=\begin {cases}(1-\theta )^2&x=0\\ 2\theta (1-\theta )&x=1\\ \theta ^2&x=2 \end {cases}\] Suppose that in the
sample there are \(n_0\) 0’s, \(n_1\) 1’s, and \(n_2\) 2’s. What is the maximum likelihood
estimator for \(\theta \) in terms of \(n, n_0,n_1,n_2\)?
The likelihood is \[ \begin {aligned} \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) &= \left (1-\theta \right )^{2 \cdot n_0} \cdot \left (2\theta \left (1-\theta \right )\right )^{n_1} \cdot \theta ^{2n_2} \\ &= \left (1 - \theta \right )^{2n_0} \cdot \left (1 - \theta \right )^{n_1} \cdot 2^{n_1} \cdot \theta ^{n_1} \cdot \theta ^{2n_2} \\ &= (1-\theta )^{n_1+2n_0} \cdot 2^{n_1}\cdot \theta ^{2n_2+n_1} \;. \end {aligned} \]
Therefore the log-likelihood is \[ \begin {aligned} \ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) &= \ln ( (1-\theta )^{n_1+2n_0}) + \ln (2^{n_1})+ \ln (\theta ^{2n_2+n_1})\\ &= (n_1+2n_0)\ln (1-\theta ) + n_1 \ln 2+ (2n_2+n_1) \ln \theta \;. \end {aligned} \]
We take the derivative of the log-likelihood with respect to the parameter \(\theta \): \[ \frac {\partial }{\partial \theta }\ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) = \frac {(2n_2+n_1)}{\theta } - \frac {(n_1+2n_0)}{1 - \theta } \;. \]
Now, we set the derivative to \(0\) and solve (here we replace \(\theta \) with \(\hat \theta \)): \[ \begin {aligned} \frac {(2n_2+n_1)}{\hat \theta } - \frac {(n_1+2n_0)}{1 - \hat \theta } &= 0 \\ \frac {(2n_2+n_1)}{\hat \theta } &= \frac {(n_1+2n_0)}{1 - \hat \theta } \\ (2n_2+n_1) (1-{\hat \theta }) &= (n_1+2n_0)\hat \theta \\ (2n_2+n_1) &= 2(n_0+n_1+n_2)\hat \theta \\ \end {aligned} \]
Therefore \[ \boxed {\hat \theta = \frac {n_1 + 2n_2}{2(n_0+n_1+n_2)} = \frac {n_1 + 2n_2}{2n}} \;. \]
- c)
- Let \(x_{1},\ldots ,x_{n}\) be i.i.d. samples from a random variable that follows a so-called Borel
distribution with unknown parameter \(\theta \), i.e., a distribution from the family \[ \Prob {k; \theta } = \frac {e^{-\theta k} (\theta k)^{k-1}}{k!} \;, \]
where \(0 < \theta \leq 1\) is a real number, and \(k \geq 1\) is an integer. What is the maximum
likelihood estimator for \(\theta \)?
The likelihood is \[ \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) = \prod _{i=1}^n \frac {e^{-\theta x_i} (\theta x_i)^{x_i-1}}{x_i!} \;. \]
Therefore the log-likelihood is \[ \ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) = \sum _{i=1}^n \left [-\theta x_i +(x_i - 1) (\ln (\theta ) + \ln (x_i)) - \ln (x_i!) \right ] \]
We take the derivative of the log-likelihood with respect to the parameter \(\theta \): \[ \frac {\partial }{\partial \theta }\ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) = \sum _{i=1}^n \left (-x_i + \frac {x_i - 1}{\theta }\right ) \;. \]
Now, we set the derivative to \(0\) and solve (here we replace \(\theta \) with \(\hat {\theta }\): \[ \begin {aligned} \sum _{i=1}^n \left (-x_i + \frac {x_i - 1}{\theta }\right ) &= 0 \\ \frac {1}{\hat \theta }\sum _{i=1}^{n}(x_i - 1) &= \sum _{i=1}^{n}x_i \\ \end {aligned} \]
Therefore \[ \boxed {\hat \theta = \frac {\sum _{i=1}^n (x_i - 1)}{\sum _{i=1}^n x_i} = 1 - \frac {n}{\sum _{i=1}^n x_i}} \;. \]
Check that it’s a global maximum (not required): \[\frac {\partial ^2}{\partial \theta ^2}\ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta ) = - \sum _{i=1}^n \frac {x_i-1}{\theta ^2} < 0\] so \(\ln \mathcal {L}(x_1, x_2, \ldots , x_n \,; \theta )\) is concave downward everywhere.
- d)
- If the samples from the Borel distribution are 5, 7, 10, 2, 7, 5, 12, 13, 11,
what is the maximum likelihood estimator for \(\theta \)? Give an exact answer as a
simplified fraction.
\[\hat {\theta } = 1 - \frac {9}{5+ 7+ 10+ 2+ 7+ 5+ 12+ 13+ 11} = 1 - \frac {1}{8} = \boxed {\frac {7}{8}} \;.\]
The remaining problems cover material we may not get to this quarter.
15 Tail bounds
Suppose \(X\sim \textsf {Binomial}(6, 0.4)\). We will bound \(\Pr (X\ge 4)\) using the tail bounds we’ve learned, and compare this to the true result.
- a)
- Give an upper bound for this probability using Markov’s inequality.
Why can we use Markov’s inequality?
We know that the expected value of a binomial distribution is \(np\), so: \(\Pr (X\ge 4)\le \frac {\expect {X}}{4}=\frac {2.4}{4}=0.6\). We can use it since \(X\) is nonnegative.
- b)
- Give an upper bound for this probability using Chebyshev’s inequality. You
may have to rearrange algebraically and it may result in a weaker
bound.
\(\Pr (X\ge 4)=\Pr (X-2.4 \ge 1.6)\le \Pr (|X-2.4|\ge 1.6)\) we can add those absolute value signs because that only adds more possible values, so it is an upper bound on the probability of \(X - 2.4 \ge 1.6\). Then, using Chebyshev’s inequality we get:
\(\Pr (|X-2.4|\ge 1.6)\le \frac {Var(X)}{1.6^2}=\frac {1.44}{1.6^2} = 0.5625\) - c)
- Give an upper bound for this probability using the Chernoff bound.
First, we solve for the values of \(\delta \) that will allow us to use the Chernoff bound. We want \((1+\delta )\expect {X}=(1+\delta )2.4=4\). Solving for \(\delta \) here gives use \(\delta =\frac {2}{3}\). Now, we can directly plug into the Chernoff bound. \(\Pr (X\ge 4)=\Pr (X\ge (1+\frac {2}{3})2.4)\le e^{-(\frac {2}{3})^2 \expect {X}/3}=e^{-4\times 2.4 / 27}\approx 0.7\)
- d)
- Give the exact probability.
Since \(X\) is a binomial, we know it has a range from 0 to \(n\) (or in this case 0 to 6). Thus, the possible values to satisfy \(X \ge 4\) are 4, 5, or 6. We plug in the PMF for each to get: \(\Pr (X\ge 4)=\Pr (X=4)+\Pr (X=5)+\Pr (X=6)=\binom {6}{4}(0.4)^4(0.6)^2+\binom {6}{5}(0.4)^5(0.6)+\binom {6}{6}0.4^6 \approx 0.1792 \)
16 Exponential Tail Bounds
Let \(X \sim \mbox {Exp}(\lambda )\) and \(k > 1/\lambda \).
- a)
- Use Markov’s inequality to bound \(\Pr (X \geq k)\).
We can use Markov’s inequality here because \(X\) is non-negative since it is an exponential distribution. We also know that \(\expect {X}= 1/ \lambda \) because \(X\sim \text {Exp}(\lambda )\). By Markov’s inequality, we get that: \[\Pr (X \geq k) \leq \frac {1}{\lambda k}\]
- b)
- Use Markov’s inequality to bound \(\Pr (X < k)\).
From Markov’s inequality (and our answer in (a)), we know that \(\Pr (X\geq k)\leq \frac {1}{\lambda k}\). Then, \[ \begin {aligned} \Pr (X\geq k) & \leq \frac {1}{\lambda k}\\ -\Pr (X\geq k) & \geq -\frac {1}{\lambda k} && \text {multiplying be a negative flips the inequality} \\ 1-\Pr (X\geq k) & \geq 1-\frac {1}{\lambda k}\\ \Pr (X< k) & \geq 1-\frac {1}{\lambda k} && \text {by definition of complement} \end {aligned} \] Note that because we took the complement and the sign flipped, we have now found a lower bound for \(\Pr (X <k)\).
- c)
- Use Chebyshev’s inequality to bound \(\Pr (X \geq k)\).
We rearrange algebraically to get into the form to apply Chebyshev’s inequality. We then plug in the corresponding values and \(Var(X)=\frac {1}{\lambda ^2}\). \[\Pr (X \geq k) = \Pr \left (X-\frac {1}{\lambda } \geq k-\frac {1}{\lambda }\right ) \leq \Pr \left (\left |X-\frac {1}{\lambda }\right | \geq k-\frac {1}{\lambda }\right ) \leq \frac {1}{\lambda ^2(k-1/\lambda )^2} = \frac {1}{(\lambda k - 1)^2}\]
- d)
- What is the exact formula for \(\Pr (X \geq k)\)?
Using the CDF for an exponential distribution and the definition of complement: \[\Pr (X \geq k) =1-\Pr (X\leq k)= 1-(1-e^{-\lambda k}) = e^{-\lambda k}\]
- e)
- For \(\lambda k \geq 3\), how do the bounds given in parts (a), (c), and (d) compare?
\[e^{-\lambda k} < \frac {1}{(\lambda k - 1)^2} < \frac {1}{\lambda k}\] so Markov’s inequality gives the worst bound.
17 Robbie’s Late!
Suppose the probability Robbie is late to teaching lecture on a given day is at most 0.01. Do not make any independence assumptions.
- a)
- Use a Union Bound to bound the probability that Robbie is late at
least once over a 30-lecture quarter.
Let \(R_i\) be the event Robbie is late to lecture on day \(i\) for \(i = 1,...,30\). Then, by the union bound, \[ \begin {aligned} \Pr (\text {late at least once}) &= \Pr (\bigcup _{i=1}^{30}R_i)\\ &\leq \sum _{i=1}^{30} \Pr (R_i) && [\text {union bound}]\\ &\leq \sum _{i=1}^{30} 0.01 && [\Pr (R_i) \leq 0.01]\\ &= 0.30 \end {aligned} \]
- b)
- Use a Union Bound to bound the probability that Robbie is never late
over a 30-lecture quarter.
As in the previous part, let \(R_i\) be the event Robbie is late to lecture on day \(i\) for \(i = 1,...,30\). Then, by the union bound, we found that \[ \begin {aligned} \Pr (\text {late at least once}) &\leq 0.30 \end {aligned} \] The probability Robbie is never late is the complement of the probability he is late at least once over the 30 lectures. Taking the complement and doing algebra: \[ \begin {aligned} \Pr (\text {late at least once}) &\leq 0.30 \\ -\Pr (\text {late at least once}) &\geq -0.30 && [\text {multiplying by negative flips the inequality}] \\ 1 -\Pr (\text {late at least once}) &\geq 1-0.30 \\ \Pr (\text {never late}) &\geq 0.70 \end {aligned} \]
Note that we have now found a lower bound for this probability using the union bound because of taking the complement.
- c)
- Use a Union Bound to bound the probability that Robbie is late at least
once over a 120-lecture quarter.
Let \(R_i\) be the event Robbie is late to lecture on day \(i\) for \(i = 1,...,120\). Then, by the union bound, \[ \begin {aligned} \Pr (\text {late at least once}) &= \Pr (\bigcup _{i=1}^{120}R_i)\\ &\leq \sum _{i=1}^{120} \Pr (R_i) && [\text {union bound}]\\ &\leq \sum _{i=1}^{120} 0.01 && [\Pr (R_i) \leq 0.01]\\ &= 1.20 \end {aligned} \] Notice that \(\Pr (\text {late at least once}) \leq 1.20\) is not a very helpful bound since probabilities have to be at most 1 already.