Acknowledgment: This document has been adapted from a similar review session for CS224W at Stanford, with substantial modifications by Yikun Zhang in Winter 2023 for CSE547/STAT548 at UW. A good reference for writing proofs and basic probability theory is the textbook by Casella and Berger [1].
In this section, we will review several mathematical techniques for
writing rigorous proofs. They are useful in the field of machine
learning when we want to formally state and verify some theoretical
properties of our proposed algorithms.
Terminology
Conjecture: a statement that we
think might be true and can be proven (but hasn’t been proven
yet).
Theorem: a statement that has been
rigorously proved.
Proof: a valid argument for showing
that a theorem is true.
Premise: a condition for the
theorem.
Lemma: a small theorem (or
preliminary result) that we need to prove the main theorems or
results.
Proposition: an intuitive theorem
with a short proof.
Corollary: a small theorem that
follows from the main theorems or results.
Universal and Existence Statements
Universal statement: To prove a universal statement, like “the square of any odd number is odd”, it is always easy to show that it is true for some specific cases: for example, $3^2 = 9$, which is an odd number, and $5^2 = 25$, which is another odd number. However, to rigorously prove the statement, we must show that it works for all odd numbers, which is difficult since we cannot enumerate all of them.
On the contrary, if we want to disprove a universal statement, we only need to find one counterexample. For instance, if we want to disprove the statement “the square of any odd number is even”, it suffices to provide a specific example of an odd number whose square is not even. (For example, $3^2 = 9$, which is not an even number.)
In summary, this leads to a rule of thumb for proving or disproving universal statements:
To prove a universal statement, we must show that it works for all cases.
To disprove a universal statement, it suffices to find one counterexample.
Existence statement: The above rule of thumb is reversed when it comes to “existence” statements. For example, if the statement to be proved is “there exists at least one odd number whose square is odd”, then proving the statement just requires finding one specific case, e.g., $3^2 = 9$, while disproving the statement would require showing that none of the odd numbers have squares that are odd.
Using the statement that “the square of any odd number is odd” as an example, we showcase how to prove a universal statement. To this end, we pick an arbitrary odd number $x$ and try to prove the statement for that number. In the proof, we cannot assume anything about $x$ other than that it is an odd number. (In other words, we cannot simply set $x$ to be a specific number, like $3$, because then our proof might rely on special properties of the number $3$ that do not generalize to all odd numbers.)
Example 1. Prove that the square of any odd number is odd.
Proof. Let $x$ be an arbitrary odd number. By definition, an odd number is an integer that can be written in the form $2k + 1$ for some integer $k$. This means that we can write $x = 2k + 1$, where $k$ is some integer. Thus, $x^2 = (2k+1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1$. Since $k$ is an integer, $2k^2 + 2k$ is also an integer, so we can write $x^2 = 2\ell + 1$, where $\ell = 2k^2 + 2k$ is an integer. Therefore, $x^2$ is odd.
Since this logic works for any odd number $x$, we have shown that the square of any odd number is odd. ◻
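Although no finite check can replace the proof above, it is often helpful to test such statements numerically before proving them. Below is a minimal Python sanity check of Example 1; it supports the claim but does not prove it:

```python
# Sanity check (not a proof!): the square of every odd number in a
# finite range is odd.
for k in range(-1000, 1000):
    x = 2 * k + 1                      # x ranges over odd numbers
    assert (x * x) % 2 == 1, f"{x}^2 is not odd"
print("All tested odd numbers have odd squares.")
```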
Special Proof Techniques
In addition to the technique of “picking an arbitrary element”, here
are several other techniques commonly seen in proofs.
Proof by contrapositive: Consider the
statement that
(a) “If it is raining today, then I do not go to class.”
This is logically equivalent to the statement that
(b) “If I go to class, then it is not raining today.”
Therefore, if we want to prove statement (a), then it suffices to
prove statement (b). Statement (b) is called the
contrapositive of statement (a).
It is worth mentioning that statement (a) is not
logically equivalent to the statement:
(c) “If I do not go to class, then it is raining today.”
which is the converse of statement (a). This non-equivalence is also known as the fallacy of the converse.
Example 2. Let $x$ be an integer. Prove that $x^2$ is an odd number if and only if $x$ is an odd number.
Proof. The “if and only if” in this statement requires us to prove both directions of the implication. First, we must prove that if $x$ is an odd number, then $x^2$ is an odd number. Then we should prove that if $x^2$ is an odd number, then $x$ is an odd number.
As we already proved the first statement in Example 1, we only need to prove the second statement. The second statement is logically equivalent to its contrapositive, so it suffices to prove that “if $x$ is an even number, then $x^2$ is even.”
Suppose $x$ is an even number. Then, we can write $x = 2k$ for some integer $k$. It implies that $x^2 = 4k^2 = 2(2k^2)$. Since $k$ is an integer, $2k^2$ is also an integer, so we can write $x^2 = 2\ell$ for the integer $\ell = 2k^2$. By definition, this means that $x^2$ is an even number. ◻
Proof by contradiction: When proving a
statement by contradiction, we would assume that our statement is not
true and then derive a contradiction. This is a special case of proving
by contrapositive (where our “if” is all of mathematics, and our “then”
is the statement to be proved).
Example 3. Prove that $\sqrt{2}$ is irrational.
Proof. Suppose that $\sqrt{2}$ were rational. By definition, this means that $\sqrt{2}$ can be written as $\sqrt{2} = a/b$ for some integers $a$ and $b$. Since $\sqrt{2} = a/b$, it follows that $2 = a^2 / b^2$, which in turn shows that $a^2 = 2b^2$. Now any square number $x^2$ must have an even number of prime factors, since any prime factor found in the first $x$ must also appear in the second $x$. Therefore, $a^2$ must have an even number of prime factors. However, since $b^2$ must also have an even number of prime factors, and $2$ is a prime number, $2b^2$ must have an odd number of prime factors. This is a contradiction, since we claimed that $a^2 = 2b^2$, and no number can simultaneously have an even number of prime factors and an odd number of prime factors. Therefore, our initial assumption was wrong, and $\sqrt{2}$ must be irrational. ◻
Proof by cases: Sometimes, it might be
difficult to prove the entire theorem at once. As a result, we consider
splitting the proof into several cases and proving the theorem
separately for each case.
Example 4. Let $n$ be an integer. Show that if $n$ is not divisible by $3$, then $n^2 = 3k + 1$ for some integer $k$.
Proof. If $n$ is not divisible by $3$, then either $n = 3m + 1$ or $n = 3m + 2$ for some integer $m$.
Case 1: Suppose $n = 3m + 1$. Then $n^2 = (3m+1)^2 = 9m^2 + 6m + 1 = 3(3m^2 + 2m) + 1$. Since $3m^2 + 2m$ is an integer, it follows that we can write $n^2 = 3k + 1$ for $k = 3m^2 + 2m$.
Case 2: Suppose $n = 3m + 2$. Then $n^2 = (3m+2)^2 = 9m^2 + 12m + 4 = 3(3m^2 + 4m + 1) + 1$. Hence, we can write $n^2 = 3k + 1$ for $k = 3m^2 + 4m + 1$.
Since Case 1 and Case 2 cover all possibilities, the proof is completed. ◻
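A quick Python check of Example 4 over a finite range of integers (again, numerical evidence rather than a proof):

```python
# If n is not divisible by 3, then n^2 leaves remainder 1 when
# divided by 3.
for n in range(-300, 300):
    if n % 3 != 0:
        assert (n * n) % 3 == 1
print("n^2 = 3k + 1 holds for all tested n not divisible by 3.")
```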
Proof by induction
We can prove a statement by induction when showing that the statement is valid for all positive integers $n$. Note that this is not the only situation in which we can use induction, and that induction is (usually) not the only way to prove a statement for all positive integers.
To use induction, we need to establish two results:
Base case: The statement is true when $n = 1$.
Inductive step: If the statement is true for $n = k$, then the statement is also true for $n = k + 1$.
This allows for an infinite chain of implications:
The statement is true for $n = 1$.
If the statement is true for $n = 1$, then it is also true for $n = 2$.
If the statement is true for $n = 2$, then it is also true for $n = 3$.
If the statement is true for $n = 3$, then it is also true for $n = 4$.
…
Together, these implications prove the statement for all positive integer values of $n$. (It does not prove the statement for non-integer values of $n$, or values of $n$ less than $1$.)
Example 5. Prove that $1 + 2 + \cdots + n = \frac{n(n+1)}{2}$ for all integers $n \geq 1$.
Proof. We proceed by induction.
Base case: If $n = 1$, then the statement becomes $1 = \frac{1 \cdot 2}{2}$, which is true.
Inductive step: Suppose that the statement is true for $n = k$. This means $1 + 2 + \cdots + k = \frac{k(k+1)}{2}$. We want to show the statement is true for $n = k + 1$, i.e., $1 + 2 + \cdots + k + (k+1) = \frac{(k+1)(k+2)}{2}$.
By the induction hypothesis (i.e., because the statement is true for $n = k$), we have $1 + 2 + \cdots + k + (k+1) = \frac{k(k+1)}{2} + (k+1)$. This equals $\frac{k(k+1) + 2(k+1)}{2}$, which is equal to $\frac{(k+1)(k+2)}{2}$. This proves the inductive step.
Therefore, the statement is true for all integers $n \geq 1$. ◻
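As before, we can sanity-check the formula numerically; this supplements, but does not replace, the induction proof:

```python
# Verify 1 + 2 + ... + n = n(n + 1)/2 for the first few thousand n.
for n in range(1, 5000):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("Sum formula verified for n = 1, ..., 4999.")
```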
Strong induction
Strong induction (or complete induction) is a useful variant of induction. Here, the inductive step is changed to:
Base case: The statement is true when $n = 1$.
Inductive step: If the statement is true for all values of $1 \leq n \leq k$, then the statement is also true for $n = k + 1$.
This also produces an infinite chain of implications:
The statement is true for $n = 1$.
If the statement is true for $n = 1$, then it is true for $n = 2$.
If the statement is true for both $n = 1$ and $n = 2$, then it is true for $n = 3$.
If the statement is true for $n = 1$, $n = 2$, and $n = 3$, then it is true for $n = 4$.
…
Strong induction works on the same principle as weak induction, but it is generally easier to prove theorems under its stronger induction hypothesis.
Example 6. Prove that every integer greater than or equal to $2$ can be factored into prime numbers.
Proof. We proceed by (strong) induction.
Base case: If $n = 2$, then $n$ is a prime number, and its factorization is itself.
Inductive step: Suppose that $n$ is some integer larger than $2$, and assume that the statement is true for all numbers $2 \leq m < n$. Then, there are two cases:
Case 1: $n$ is prime. Then, its prime factorization is $n$ itself.
Case 2: $n$ is composite. This means that it can be decomposed into a product $n = ab$, where $a$ and $b$ are both greater than $1$ and less than $n$. Since $a$ and $b$ are both less than $n$, both $a$ and $b$ can be factored into prime numbers (by the inductive hypothesis). That is, $a = p_1 p_2 \cdots p_s$ and $b = q_1 q_2 \cdots q_t$, where $p_1, \ldots, p_s$ and $q_1, \ldots, q_t$ are prime numbers.
Thus, $n$ can be written as $n = ab = p_1 \cdots p_s \, q_1 \cdots q_t$, which is a factorization into prime numbers. This completes the proof of the statement. ◻
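Strong-induction arguments like this one often translate directly into recursive algorithms: the inductive hypothesis becomes a recursive call on smaller inputs. Below is a small Python sketch of this correspondence, a naive trial-division factorizer written for clarity rather than efficiency:

```python
def prime_factors(n: int) -> list[int]:
    """Return a prime factorization of n >= 2, mirroring the proof:
    find a nontrivial divisor a, then recursively factor the smaller
    numbers a and n // a (the inductive hypothesis)."""
    for a in range(2, int(n ** 0.5) + 1):
        if n % a == 0:                  # Case 2: n is composite
            return prime_factors(a) + prime_factors(n // a)
    return [n]                          # Case 1: n is prime

assert prime_factors(60) == [2, 2, 3, 5]
```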
Useful Results in Calculus
The definition of the exponential function states that $e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n = \sum_{n=0}^{\infty} \frac{x^n}{n!}$. In particular, it indicates that $e^x \geq 1 + x$ and $e^{-x} \geq 1 - x$ for all $x \in \mathbb{R}$.
Gamma Function and Stirling’s Formula: For $x > 0$, the Gamma function is $\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} \, dt$. While the exact value of $\Gamma(x)$ is intractable for some $x$, one can approximate $\Gamma(x + 1)$ when $x$ is large by Stirling’s formula: $\Gamma(x + 1) \sim \sqrt{2\pi x} \left(\frac{x}{e}\right)^x$ as $x \to \infty$. This implies that when $n$ is a sufficiently large integer, we can approximate $n! = \Gamma(n + 1)$ by $\sqrt{2\pi n} \left(\frac{n}{e}\right)^n$. More precisely, the following bound for $n!$ holds for all $n = 1, 2, \ldots$ rather than only asymptotically: $\sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n} \leq n! \leq e\, n^{n + \frac{1}{2}} e^{-n}$.
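The following Python snippet illustrates how accurate Stirling’s formula is even for moderate $n$: the ratio of $n!$ to the approximation tends to $1$ (roughly like $1 + \frac{1}{12n}$):

```python
import math

# Compare n! with Stirling's approximation sqrt(2*pi*n) * (n/e)^n.
for n in [5, 10, 20, 50]:
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    print(n, exact / stirling)   # ratio approaches 1 as n grows
```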
Basic Probability Theory
Parts of this section are adapted from Chapter 1 of Casella and Berger [1] and the STAT 512 lecture notes of Professor Yen-Chi Chen and Professor Michael D. Perlman at the University of Washington.
Probability Space
Sample space: The sample space $\Omega$ is the collection of all possible outcomes of a random experiment. For example, if we roll a die with 6 faces, then the sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$.
Events: A subset of the sample space, $A \subseteq \Omega$, is called an event. For example, the event “the number from the above die is less than 4” can be represented by the subset $A = \{1, 2, 3\}$. The event “we roll a 6 from the above die” can be represented by the subset $B = \{6\}$.
$\sigma$-algebra: While the collection of all possible events (i.e., all subsets of $\Omega$) is sometimes too large to define a valid probability space, we introduce the concept of a $\sigma$-algebra $\mathcal{F}$ as a collection of subsets of $\Omega$ satisfying:
(Nonemptiness) $\Omega \in \mathcal{F}$ and $\emptyset \in \mathcal{F}$, where $\emptyset$ is the empty set.
(Closure under complementation) If $A \in \mathcal{F}$, then $A^c = \Omega \setminus A \in \mathcal{F}$.
(Closure under countable unions) If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
The subsets of $\Omega$ in $\mathcal{F}$ are said to be measurable, and $(\Omega, \mathcal{F})$ is called a measurable space.
Probability space: A probability measure (or probability function) is a mapping $P: \mathcal{F} \to [0, 1]$ from the $\sigma$-algebra $\mathcal{F}$ to real numbers in $[0, 1]$ satisfying the following three axioms:
$P(A) \geq 0$ for all $A \in \mathcal{F}$.
$P(\Omega) = 1$.
If $A_1, A_2, \ldots$ are mutually exclusive events (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$), then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
The triplet $(\Omega, \mathcal{F}, P)$ is called a probability space.
Example 7. For a fair die with 6 faces, we can define the probability function as $P(\{i\}) = \frac{1}{6}$ for $i = 1, \ldots, 6$. All events in the probability space can be represented as unions of these six disjoint events. For instance, $P(\{1, 2, 3\}) = P(\{1\}) + P(\{2\}) + P(\{3\}) = \frac{1}{2}$. Note that we can add probabilities here because the events $\{1\}$, $\{2\}$, and $\{3\}$ are disjoint.
Properties of Probability Measure
Theorem 1 (Principle of inclusion-exclusion). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Given two subsets $A, B \in \mathcal{F}$ that are not necessarily disjoint, we have that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Proof. We can derive this theorem from the probability axioms. Note that $A \cup B$ can be split into three disjoint events: $A \setminus B$, $A \cap B$, and $B \setminus A$. Furthermore, $A$ can be split into $A \setminus B$ and $A \cap B$, and $B$ can be split into $B \setminus A$ and $A \cap B$. Thus, $P(A \cup B) = P(A \setminus B) + P(A \cap B) + P(B \setminus A) = \left[P(A) - P(A \cap B)\right] + P(A \cap B) + \left[P(B) - P(A \cap B)\right] = P(A) + P(B) - P(A \cap B)$. The result follows. ◻
Example 8. Suppose $x$ is chosen uniformly at random from the integer set $\{1, 2, \ldots, 100\}$. (This means that the probability of getting each integer is $\frac{1}{100}$.) Find the probability that $x$ is divisible by $2$ or $5$.
Solution. By the principle of inclusion-exclusion (Theorem 1), $P(x \text{ divisible by } 2 \text{ or } 5) = P(x \text{ divisible by } 2) + P(x \text{ divisible by } 5) - P(x \text{ divisible by } 10)$. There are $50$ numbers divisible by $2$, $20$ numbers divisible by $5$, and $10$ numbers divisible by $10$ (i.e., divisible by both $2$ and $5$). Therefore, the probability is $\frac{50 + 20 - 10}{100} = \frac{3}{5}$.
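Since the sample space in Example 8 is finite, the inclusion-exclusion computation can be verified by direct enumeration in Python:

```python
# Verify Example 8 by brute force over {1, ..., 100}.
omega = range(1, 101)
A = {x for x in omega if x % 2 == 0}   # divisible by 2
B = {x for x in omega if x % 5 == 0}   # divisible by 5
assert len(A | B) == len(A) + len(B) - len(A & B)   # 60 = 50 + 20 - 10
print(len(A | B) / 100)                # 0.6
```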
Theorem 2 (Union bound or Boole’s inequality). Let $(\Omega, \mathcal{F}, P)$ be a probability space. For any collection of events $A_1, A_2, \ldots, A_n \in \mathcal{F}$, we have that $P\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} P(A_i)$.
Proof. We can prove this result by induction (for finite $n$).
Base case: When $n = 1$, the statement becomes $P(A_1) \leq P(A_1)$, which is true.
Inductive step: Suppose that the statement is true for $n = k$. We must prove that the statement is true for $n = k + 1$.
Note that $\bigcup_{i=1}^{k+1} A_i = \left(\bigcup_{i=1}^{k} A_i\right) \cup A_{k+1}$, and by the principle of inclusion-exclusion (Theorem 1), $P\left(\bigcup_{i=1}^{k+1} A_i\right) = P\left(\bigcup_{i=1}^{k} A_i\right) + P(A_{k+1}) - P\left(\left(\bigcup_{i=1}^{k} A_i\right) \cap A_{k+1}\right) \leq P\left(\bigcup_{i=1}^{k} A_i\right) + P(A_{k+1})$. By the induction hypothesis, the first term is less than or equal to $\sum_{i=1}^{k} P(A_i)$. Hence, $P\left(\bigcup_{i=1}^{k+1} A_i\right) \leq \sum_{i=1}^{k+1} P(A_i)$. The proof is completed. ◻
Example 9. Suppose that the chance of winning the Mega Millions is $p$ every time a person buys a lottery ticket. If Tim buys one ticket every day of the year, how likely is he to win the Mega Millions at least once?
Answer. The union bound will not tell us the exact probability of Tim winning the Mega Millions. However, it gives us an upper bound on this probability as $365\,p$.
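The following Python sketch compares the union bound with the exact probability, assuming a hypothetical single-ticket winning probability $p$ (not the actual Mega Millions odds) and independent daily draws:

```python
# Union bound vs. exact probability for Example 9.
p = 1e-8                      # hypothetical per-ticket winning probability
n = 365                       # one ticket per day for a year

exact = 1 - (1 - p) ** n      # exact P(win at least once), assuming independence
union_bound = n * p           # upper bound from Theorem 2
print(exact, union_bound)     # the bound is valid and nearly tight for tiny p
```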
Other useful properties of probability measures: Let $(\Omega, \mathcal{F}, P)$ be a probability space.
If $A \subseteq B$, then $P(A) \leq P(B)$. More generally, $P(A \cap B) \leq \min\{P(A), P(B)\}$.
(Law of total probability) For any $A \in \mathcal{F}$ and some mutually exclusive events $B_1, B_2, \ldots \in \mathcal{F}$ with $\bigcup_{i=1}^{\infty} B_i = \Omega$, $P(A) = \sum_{i=1}^{\infty} P(A \cap B_i)$.
Monotone continuity: For a sequence of subsets $A_1 \subseteq A_2 \subseteq \cdots$ with $A_n \in \mathcal{F}$ for all $n$, we have that $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$. Similarly, if $A_1 \supseteq A_2 \supseteq \cdots$ with $A_n \in \mathcal{F}$ for all $n$, we have that $P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$.
Conditional Probability, Independence, and Bayes’ Rule
Example 10. Suppose that you are administering the GRE, and you discover that 2.5% of students get a perfect score on the math section. By itself, this is not a very useful statistic, because the scores on the math section vary substantially by major. You dive a little deeper and find that 7.5% of physical sciences students get a perfect score, 6.3% of engineering students get a perfect score, and most other majors do substantially worse.
In the language of conditional probability, we would say that the probability of getting a perfect score conditional on being an engineering major is 6.3%, i.e., $P(\text{perfect score} \mid \text{engineering major}) = 6.3\%$. If we want to compute this probability, we would take the number of engineering majors that receive a perfect score and divide it by the total number of engineering majors. This is equivalent to computing the formula $P(\text{perfect score} \mid \text{engineering major}) = \frac{P(\text{perfect score and engineering major})}{P(\text{engineering major})}$. In general, we can replace “perfect score” and “engineering major” with any two events and obtain the formal definition of conditional probability.
Conditional probability: For two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Notice that when the event $B$ is fixed, $P(\cdot \mid B)$ is another probability measure.
Example 11. Suppose that we toss a fair coin three
times. What is the probability that all three tosses come up heads,
given that the first toss came up heads?
Answer. This probability is $P(\text{three heads} \mid \text{first toss heads}) = \frac{P(\{HHH\})}{P(\text{first toss heads})} = \frac{1/8}{1/2} = \frac{1}{4}$.
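Because this sample space has only 8 equally likely outcomes, the conditional probability can be confirmed by enumeration in Python:

```python
from itertools import product

# Enumerate all 8 equally likely outcomes of three fair coin tosses.
outcomes = list(product("HT", repeat=3))
B = [w for w in outcomes if w[0] == "H"]          # first toss is heads
A_and_B = [w for w in B if w == ("H", "H", "H")]  # all three tosses are heads
print(len(A_and_B) / len(B))                      # 0.25
```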
Independence and conditional independence: Two events $A$ and $B$ are independent if $P(A \cap B) = P(A) \cdot P(B)$. In other words, the occurrence of event $B$ does not affect the probability that event $A$ happens.
For three events $A, B, C$, we say that $A$ and $B$ are conditionally independent given $C$ if $P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$. There are no implications between independence and conditional independence!
Bayes’ rule: Given an event $B$ and some mutually exclusive events $A_1, \ldots, A_n$ with $\bigcup_{i=1}^{n} A_i = \Omega$ and $P(A_i) > 0$ for $i = 1, \ldots, n$, Bayes’ rule states that $P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}$.
Example 12. Suppose that 5% of students enrolled in CSE547 at UW will get a 4.0, and a student with a 4.0 in CSE547 has an 80% chance of getting recruited by Google. A student without a 4.0 in CSE547 still has a 40% chance of getting recruited by Google. What is the probability of a student getting a 4.0 in CSE547, given that he/she has been recruited by Google?
Answer. By Bayes’ rule, $P(4.0 \mid \text{Google}) = \frac{P(\text{Google} \mid 4.0)\, P(4.0)}{P(\text{Google} \mid 4.0)\, P(4.0) + P(\text{Google} \mid \text{no } 4.0)\, P(\text{no } 4.0)} = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.4 \times 0.95} = \frac{0.04}{0.42} \approx 9.5\%$.
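The same computation in Python, with variable names spelled out for readability:

```python
# Bayes' rule for Example 12.
p_40 = 0.05                  # P(4.0)
p_google_given_40 = 0.80     # P(recruited | 4.0)
p_google_given_not = 0.40    # P(recruited | no 4.0)

p_google = p_google_given_40 * p_40 + p_google_given_not * (1 - p_40)
p_40_given_google = p_google_given_40 * p_40 / p_google
print(round(p_40_given_google, 4))   # ~0.0952
```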
Random variables
Random variable: Let $(\Omega, \mathcal{F}, P)$ be a probability space. A random variable is a (measurable) function $X: \Omega \to \mathbb{R}$ satisfying $\{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{F}$ for all $x \in \mathbb{R}$. The probability that $X$ takes on a value in a Borel set $B \subseteq \mathbb{R}$ (an element of the $\sigma$-algebra generated by the open intervals) is written as $P(X \in B) = P\left(\{\omega \in \Omega : X(\omega) \in B\}\right)$.
Example 13. Suppose that we are tossing three fair coins. Let $X$ be the number of coins that come up heads. Then, $P(X = 0) = \frac{1}{8}$, $P(X = 1) = \frac{3}{8}$, $P(X = 2) = \frac{3}{8}$, and $P(X = 3) = \frac{1}{8}$.
Cumulative distribution function (CDF): The CDF $F_X: \mathbb{R} \to [0, 1]$ of a random variable $X$ is a right continuous and nondecreasing function with left limits satisfying $F_X(x) = P(X \leq x)$. In particular, $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
Probability mass function (PMF) and probability density function (PDF): If the range of a random variable $X$ is countable, it is called a discrete random variable, whose distribution can be characterized by the PMF $p_X(x) = P(X = x)$.
If a random variable $X$ has an absolutely continuous CDF $F_X$, then we can describe its distribution through the PDF $p_X(x) = F_X'(x)$. In this case, $P(a \leq X \leq b) = \int_a^b p_X(x) \, dx$, and $P(X = x) = 0$ for any single number $x$.
Expectation and Variance
Expectation: The expected value (or mean) of a random variable $X$ with range $\mathcal{X}$ can be interpreted as a weighted average.
For a discrete random variable, $E[X] = \sum_{x \in \mathcal{X}} x \cdot P(X = x)$.
For a continuous random variable with PDF $p_X$, $E[X] = \int_{\mathcal{X}} x \, p_X(x) \, dx$.
Example 14. Suppose that Tim’s happiness score is $a$ when it is sunny outside and $b$ when it is raining outside. It is sunny 60% of the time in Seattle and raining 40%. What is the expected value of Tim’s happiness in Seattle?
Answer. $E[\text{happiness}] = 0.6\,a + 0.4\,b$.
Linearity of expectation: If $X, Y$ are two random variables and $a$ is a constant in $\mathbb{R}$, then $E[X + aY] = E[X] + a\,E[Y]$. This is true even if $X$ and $Y$ are not independent.
Variance and covariance: The variance of a random variable $X$ measures how far away it is, on average, from its mean. It is defined as $\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2$. The covariance between random variables $X$ and $Y$ is defined as $\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]\,E[Y]$. For a random variable $X$ and a constant $a$, we have $\mathrm{Var}(X + a) = \mathrm{Var}(X)$ and $\mathrm{Var}(aX) = a^2\, \mathrm{Var}(X)$. We do not have $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ unless $X$ and $Y$ are uncorrelated (which means they have covariance $0$). In particular, independent random variables are always uncorrelated, although the reverse doesn’t hold in general.
Pearson’s correlation coefficient: For two random variables $X$ and $Y$, their (Pearson’s) correlation coefficient is defined as $\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}}$, where $-1 \leq \rho(X, Y) \leq 1$ by the Cauchy-Schwarz inequality; see Section 3.7. It measures the linear relation between two random variables.
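A short simulation illustrates these identities: for correlated variables, $\mathrm{Var}(X + Y)$ differs from $\mathrm{Var}(X) + \mathrm{Var}(Y)$ by exactly $2\,\mathrm{Cov}(X, Y)$. (NumPy is assumed to be available.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10 ** 6)
y = x + rng.normal(size=10 ** 6)     # Y is positively correlated with X

print(np.var(x + y), np.var(x) + np.var(y))   # unequal, since Cov(X, Y) != 0
print(np.var(x + y),
      np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1])   # nearly identical
print(np.corrcoef(x, y)[0, 1])       # Pearson correlation, about 1/sqrt(2)
```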
Known Probability Distributions
Discrete random variables
Bernoulli: If $X$ is a Bernoulli random variable, denoted by $X \sim \mathrm{Bernoulli}(p)$, then $P(X = 1) = p$ and $P(X = 0) = 1 - p$. A Bernoulli random variable with parameter $p$ can be interpreted as a coin flip that comes up heads with probability $p$ and tails with probability $1 - p$. We know that $E[X] = p$ and $\mathrm{Var}(X) = p(1 - p)$.
Binomial: If $X$ is a binomial random variable, denoted by $X \sim \mathrm{Binomial}(n, p)$, then $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for $k = 0, 1, \ldots, n$. A binomial random variable with parameters $n$ and $p$ models the number of successes in $n$ trials, each of which has a success probability $p$. When $n = 1$, it reduces to a Bernoulli random variable. We know that $E[X] = np$ and $\mathrm{Var}(X) = np(1 - p)$.
Geometric: If $X$ is a geometric random variable, denoted by $X \sim \mathrm{Geometric}(p)$, then $P(X = k) = (1 - p)^{k - 1} p$ for $k = 1, 2, \ldots$. A geometric random variable with parameter $p$ models the number of trials until the first success occurs, where each trial has a success probability $p$. We know that $E[X] = \frac{1}{p}$ and $\mathrm{Var}(X) = \frac{1 - p}{p^2}$.
Poisson: If $X$ is a Poisson random variable, denoted by $X \sim \mathrm{Poisson}(\lambda)$, then $P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$ for $k = 0, 1, 2, \ldots$. A Poisson random variable often appears in counting processes. For instance, the number of laser photons hitting a detector in a particular time interval can be modeled as a Poisson random variable. We know that $E[X] = \mathrm{Var}(X) = \lambda$.
Indicator random variable: For an event $A$, an indicator random variable $\mathbb{1}_A$ takes value 1 when $A$ occurs and 0 otherwise, i.e., $\mathbb{1}_A(\omega) = 1$ if $\omega \in A$ and $\mathbb{1}_A(\omega) = 0$ otherwise.
The expectation of an indicator random variable is just the probability of the event occurring, i.e., $E[\mathbb{1}_A] = P(A)$, and its variance is $\mathrm{Var}(\mathbb{1}_A) = P(A)\left(1 - P(A)\right)$.
Multinomial: Suppose that $Y$ is a categorical random variable with range $\{1, \ldots, K\}$ and $P(Y = k) = p_k$ for $k = 1, \ldots, K$. We generate $n$ independently and identically distributed data points $Y_1, \ldots, Y_n$ with the above distribution and take $X_k = \sum_{i=1}^{n} \mathbb{1}\{Y_i = k\}$ for $k = 1, \ldots, K$. Then, the random vector $X = (X_1, \ldots, X_K)$ follows a multinomial distribution, denoted by $X \sim \mathrm{Multinomial}(n, p)$ with $p = (p_1, \ldots, p_K)$, whose PMF is given by $P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!}\, p_1^{x_1} \cdots p_K^{x_K}$. Here, $x = (x_1, \ldots, x_K)$ takes integer values within the simplex $\left\{x : x_k \geq 0, \sum_{k=1}^{K} x_k = n\right\}$.
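The moment formulas above can be checked empirically by sampling. A minimal NumPy sketch follows; the empirical means and variances should match the stated values up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)
size = 10 ** 6

x = rng.binomial(n=10, p=0.3, size=size)   # E = np = 3, Var = np(1-p) = 2.1
print(x.mean(), x.var())
x = rng.geometric(p=0.2, size=size)        # E = 1/p = 5, Var = (1-p)/p^2 = 20
print(x.mean(), x.var())
x = rng.poisson(lam=4.0, size=size)        # E = Var = lambda = 4
print(x.mean(), x.var())
```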
Continuous random variables
Uniform: If $X$ is a uniform random variable over the interval $[a, b]$, denoted by $X \sim \mathrm{Uniform}(a, b)$, then its PDF is given by $p_X(x) = \frac{1}{b - a}$ for $x \in [a, b]$ and $0$ otherwise. We know that $E[X] = \frac{a + b}{2}$ and $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$.
Normal: If $X$ is a normal random variable with parameters $\mu$ and $\sigma^2$, denoted by $X \sim N(\mu, \sigma^2)$, then its PDF is given by $p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$, where $x \in \mathbb{R}$. We know that $E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$.
Exponential: If $X$ is an exponential random variable with parameter $\lambda > 0$, denoted by $X \sim \mathrm{Exp}(\lambda)$, then its PDF is given by $p_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ and $0$ otherwise. We know that $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$. A double exponential random variable $Y$ with parameters $(\mu, \lambda)$ satisfies that $|Y - \mu| \sim \mathrm{Exp}(\lambda)$, so its PDF is given by $p_Y(y) = \frac{\lambda}{2} e^{-\lambda |y - \mu|}$ for $y \in \mathbb{R}$. In particular, $E[Y] = \mu$ and $\mathrm{Var}(Y) = \frac{2}{\lambda^2}$. Sometimes, $Y$ is also called a Laplace random variable.
Cauchy: If $X$ is a Cauchy random variable with parameters $(\mu, \sigma)$, denoted by $X \sim \mathrm{Cauchy}(\mu, \sigma)$, then its PDF is given by $p_X(x) = \frac{1}{\pi \sigma \left[1 + \left(\frac{x - \mu}{\sigma}\right)^2\right]}$, where $x \in \mathbb{R}$. Note that both the expectation and variance of a Cauchy distribution do not exist. The parameter $\mu$ represents its median.
Gamma: A Gamma random variable $X$ is characterized by two parameters $\alpha, \lambda > 0$ and has a PDF $p_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}$ for $x > 0$, where $\Gamma(\alpha)$ is the Gamma function; see Section 2. We denote $X \sim \mathrm{Gamma}(\alpha, \lambda)$ and have that $E[X] = \frac{\alpha}{\lambda}$ and $\mathrm{Var}(X) = \frac{\alpha}{\lambda^2}$.
Beta: A Beta random variable $X$ with parameters $(\alpha, \beta)$ has its PDF as $p_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$ for $x \in [0, 1]$, where $\alpha, \beta > 0$. Given that the Beta random variable has a continuous distribution on $[0, 1]$, it is often used to model a ratio or probability. We know that $E[X] = \frac{\alpha}{\alpha + \beta}$ and $\mathrm{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$.
Logistic: A logistic random variable $X$ with parameters $(\mu, s)$ has its CDF in the form of a logistic function as $F_X(x) = \frac{1}{1 + e^{-(x - \mu)/s}}$. Thus, its PDF is given by $p_X(x) = \frac{e^{-(x - \mu)/s}}{s \left(1 + e^{-(x - \mu)/s}\right)^2}$.
Dirichlet: A Dirichlet random vector $X = (X_1, \ldots, X_K)$ generalizes the Beta distribution to its multivariate version (or extends the multinomial distribution to its continuous version). It has a PDF defined on the simplex $\left\{x : x_k \geq 0, \sum_{k=1}^{K} x_k = 1\right\}$ as $p_X(x) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_k > 0$ is a $K$-dimensional parameter vector. The Dirichlet distribution is particularly useful in modeling the prior probabilities of a multinomial distribution that generates the latent topics of a document. When $X \sim \mathrm{Dirichlet}(\alpha)$, its mean vector is $E[X] = \frac{\alpha}{\sum_{k=1}^{K} \alpha_k}$.
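If SciPy is available, the means and variances above can be cross-checked against its implementations. Note the parameterization: scipy.stats uses a scale parameter, which equals $1/\lambda$ for the exponential and Gamma distributions:

```python
from scipy import stats

# Each call prints (mean, variance) of the frozen distribution.
print(stats.uniform(loc=0, scale=2).stats())    # Uniform(0, 2): 1, 1/3
print(stats.norm(loc=1, scale=2).stats())       # N(1, 4): 1, 4
print(stats.expon(scale=1 / 3).stats())         # Exp(3): 1/3, 1/9
print(stats.gamma(a=2, scale=1 / 3).stats())    # Gamma(2, 3): 2/3, 2/9
print(stats.beta(a=2, b=3).stats())             # Beta(2, 3): 2/5, 1/25
```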
Inequalities
Markov’s inequality: Let $X$ be a nonnegative random variable. Then, for any $t > 0$, $P(X \geq t) \leq \frac{E[X]}{t}$.
Proof. For any $t > 0$, we consider splitting the expectation into two parts as $E[X] = E\left[X \cdot \mathbb{1}\{X \geq t\}\right] + E\left[X \cdot \mathbb{1}\{X < t\}\right] \geq E\left[t \cdot \mathbb{1}\{X \geq t\}\right] = t \cdot P(X \geq t)$. The result follows by dividing by $t$ on both sides of the above inequality. ◻
Chebyshev’s inequality: Let $X$ be a random variable with $\mathrm{Var}(X) < \infty$. Then, for any $t > 0$, $P\left(|X - E[X]| \geq t\right) \leq \frac{\mathrm{Var}(X)}{t^2}$. Chebyshev’s inequality can be proved by applying Markov’s inequality to the nonnegative random variable $(X - E[X])^2$. It is a simple instance of general concentration inequalities that give a probabilistic bound on the deviation of $X$ away from its mean.
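A small simulation makes these inequalities concrete. Here $X \sim \mathrm{Exp}(1)$, so $E[X] = \mathrm{Var}(X) = 1$ and both bounds can be compared with empirical tail probabilities (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10 ** 6)   # nonnegative, E[X] = Var(X) = 1

t = 3.0
print((x >= t).mean(), 1 / t)                  # Markov: P(X >= t) <= E[X]/t
print((np.abs(x - 1) >= t).mean(), 1 / t**2)   # Chebyshev: P(|X-EX| >= t) <= Var/t^2
```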
Chernoff bound: Suppose that there is a constant $b > 0$ such that the moment generating function $\varphi(\lambda) = E\left[e^{\lambda (X - \mu)}\right]$ of a random variable $X$ exists when $\lambda \leq |b|$, where $\mu = E[X]$. Given that $P(X - \mu \geq t) = P\left(e^{\lambda (X - \mu)} \geq e^{\lambda t}\right) \leq \frac{E\left[e^{\lambda (X - \mu)}\right]}{e^{\lambda t}}$ for any $0 \leq \lambda \leq b$ by Markov’s inequality, we can optimize our choice of $\lambda$ to obtain the Chernoff bound as $P(X - \mu \geq t) \leq \inf_{0 \leq \lambda \leq b} e^{-\lambda t}\, E\left[e^{\lambda (X - \mu)}\right]$.
Cauchy-Schwarz inequality: Given two random variables $X$ and $Y$, $\left|E[XY]\right| \leq \sqrt{E[X^2]\, E[Y^2]}$, where equality holds if and only if either $P(X = 0) = 1$, or $P(Y = 0) = 1$, or $P(X = cY) = 1$ for some nonzero constant $c \in \mathbb{R}$. A useful corollary of the Cauchy-Schwarz inequality is that $\left|\mathrm{Cov}(X, Y)\right| \leq \sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}$.
Hölder inequality: Given two random variables $X$ and $Y$, with $p, q > 1$ and $\frac{1}{p} + \frac{1}{q} = 1$, $E|XY| \leq \left(E|X|^p\right)^{1/p} \left(E|Y|^q\right)^{1/q}$, where equality holds if and only if $P\left(|X|^p = c\,|Y|^q\right) = 1$ for some nonzero constant $c \in \mathbb{R}$. Specifically, when $p = q = 2$, it reduces to the Cauchy-Schwarz inequality.
Minkowski inequality: Given two random variables $X$ and $Y$, for $p \geq 1$, $\left(E|X + Y|^p\right)^{1/p} \leq \left(E|X|^p\right)^{1/p} + \left(E|Y|^p\right)^{1/p}$, where equality holds if and only if $P(X = cY) = 1$ for some nonzero constant $c$, or $P(X = 0) = 1$, or $P(Y = 0) = 1$.
Jensen’s inequality: Given a convex function $g: \mathbb{R} \to \mathbb{R}$ and a random variable $X$, $g\left(E[X]\right) \leq E\left[g(X)\right]$, where equality holds if and only if either $P(X = c) = 1$ for some constant $c$, or for every line $a + bx$ that is tangent to $g$ at $E[X]$, $P\left(g(X) = a + bX\right) = 1$.
Big $O$ and Little $o$ Symbols
In machine learning, big $O$ and little $o$ symbols are used to characterize the time or space complexity of our algorithm with respect to the sample size or data dimension. In general, these symbols describe the growth rate of functions as follows.
Let $f$ be the function to be estimated on $\mathbb{R}$ and $g$ be the comparison function that is strictly positive when $x$ is large on $\mathbb{R}$.
Big $O$ symbol: We write $f(x) = O\left(g(x)\right)$ as $x \to \infty$ if there exist constants $M > 0$ and $x_0 \in \mathbb{R}$ such that $|f(x)| \leq M g(x)$ for all $x \geq x_0$. Using the limit superior notation, we know that $f(x) = O\left(g(x)\right)$ if and only if $\limsup_{x \to \infty} \frac{|f(x)|}{g(x)} < \infty$.
Notice that $O\left(g(x)\right)$ denotes a whole class of functions, so the equality sign in $f(x) = O\left(g(x)\right)$ is a convenient abuse of notation.
Little $o$ symbol: Similarly, we write $f(x) = o\left(g(x)\right)$ as $x \to \infty$ if for any $\epsilon > 0$, there exists a constant $x_0 \in \mathbb{R}$ such that $|f(x)| \leq \epsilon \, g(x)$ for all $x \geq x_0$. Under the limit notation, we have that $f(x) = o\left(g(x)\right)$ if and only if $\lim_{x \to \infty} \frac{f(x)}{g(x)} = 0$.
Big $\Omega$ symbol: Sometimes, depending on the context, we may encounter the big $\Omega$ symbol in the machine learning literature. In most cases, the definition of $\Omega$ follows Knuth’s convention, so we write $f(x) = \Omega\left(g(x)\right)$ if there exist constants $M > 0$ and $x_0 \in \mathbb{R}$ such that $|f(x)| \geq M g(x)$ for all $x \geq x_0$, or equivalently, $\liminf_{x \to \infty} \frac{|f(x)|}{g(x)} > 0$.
Taking into account the randomness of input data, it may not be possible to bound a quantity or random variable in our algorithm through the above big $O$ and little $o$ symbols. We introduce the $o_P$ and $O_P$ symbols to handle the stochastic rate of convergence for a sequence of random variables $\{X_n\}_{n=1}^{\infty}$.
Little $o_P$ symbol: We write $X_n = o_P(a_n)$ for a sequence of constants $\{a_n\}_{n=1}^{\infty}$ if $X_n / a_n$ converges to $0$ in probability as $n \to \infty$. That is, for any $\epsilon > 0$, $\lim_{n \to \infty} P\left(\left|\frac{X_n}{a_n}\right| \geq \epsilon\right) = 0$.
Big $O_P$ symbol: We write $X_n = O_P(a_n)$ for a sequence of constants $\{a_n\}_{n=1}^{\infty}$ if $X_n / a_n$ is bounded in probability when $n$ is large. That is, for any $\epsilon > 0$, there exist a constant $M > 0$ and an integer $N$ such that $P\left(\left|\frac{X_n}{a_n}\right| > M\right) \leq \epsilon$ for all $n \geq N$.
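As an illustration, the sample mean $\bar{X}_n$ of i.i.d. standard normal data satisfies $\bar{X}_n = O_P(n^{-1/2})$. The following NumPy sketch shows that $\bar{X}_n$ shrinks with $n$ while $\sqrt{n}\,\bar{X}_n$ stays bounded in probability:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10 ** 2, 10 ** 3, 10 ** 4]:
    means = rng.standard_normal((500, n)).mean(axis=1)   # 500 sample means
    # E|mean| shrinks like 1/sqrt(n); sqrt(n) * mean stays stable.
    print(n, np.abs(means).mean(), np.abs(np.sqrt(n) * means).mean())
```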
[1] G. Casella and R. L. Berger. Statistical Inference. Duxbury Advanced Series. Thomson Learning, 2nd edition, 2002.