Review of Proof Techniques and Probability

CSE547 / STAT548 at the University of Washington

Acknowledgment: This document has been adapted from a similar review session for CS224W at Stanford, with substantial modifications by Yikun Zhang in Winter 2023 for CSE547 / STAT548 at UW. Some good references on writing proofs and basic probability theory are cited in the footnotes of this document.

Proof Techniques

In this section, we will review several mathematical techniques for writing rigorous proofs. They are useful in the field of machine learning when we want to formally state and verify some theoretical properties of our proposed algorithms.

Terminologies

Universal and Existence Statements

Universal statement: To prove a universal statement, like “the square of any odd number is odd”, it is always easy to show that this is true for some specific cases – for example, $3^2=9$, which is an odd number, and $5^2=25$, which is another odd number. However, to rigorously prove the statement, we must show that it works for all odd numbers, which is difficult as we cannot enumerate all of them.

By contrast, if we want to disprove a universal statement, we only need to find one counterexample. For instance, if we want to disprove the statement “the square of any odd number is even”, it suffices to provide a specific example of an odd number whose square is not even. (For example, $3^2=9$, which is not an even number.)

In summary, this leads to a rule of thumb: proving a universal statement requires showing that it holds for every element, whereas disproving it only requires exhibiting a single counterexample.

Existence statement: The above rule of thumb is reversed when it comes to “existence” statements. For example, if the statement to be proved is that “there exists at least one odd number whose square is odd”, then proving the statement just requires finding one specific case, e.g., $3^2=9$, while disproving the statement would require showing that none of the odd numbers have squares that are odd.
Using the statement that “the square of any odd number is odd” as an example, we showcase how to prove a universal statement. To this end, we can pick an arbitrary odd number x and try to prove the statement for that number. In the proof, we cannot assume anything about x other than that it is an odd number. (In other words, we cannot simply set x to be a specific number, like 3, because then our proof might rely on special properties of the number 3 that do not generalize to all odd numbers).

Example 1. Prove that the square of any odd number is odd.

Proof. Let $x$ be an arbitrary odd number. By definition, an odd number is an integer that can be written in the form $2k+1$, for some integer $k$. This means that we can write $x=2k+1$, where $k$ is some integer. Thus, $x^2=(2k+1)^2=4k^2+4k+1=2(2k^2+2k)+1$. Since $k$ is an integer, $2k^2+2k$ is also an integer, so we can write $x^2=2\ell+1$, where $\ell=2k^2+2k$ is an integer. Therefore, $x^2$ is odd.

Since this logic works for any odd number x, we have shown that the square of any odd number is odd. ◻
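As a quick sanity check (not a substitute for the proof), the following Python snippet verifies the claim for the odd numbers in an arbitrarily chosen finite range:

```python
# Sanity check (not a proof): every odd number in a finite range has an odd square.
for x in range(-99, 100, 2):  # the odd integers -99, -97, ..., 99
    assert (x ** 2) % 2 == 1, f"counterexample found: {x}"
print("All tested odd numbers have odd squares.")
```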

Special Proof Techniques

In addition to the technique of “picking an arbitrary element”, here are several other techniques commonly seen in proofs.

Proof by contrapositive: Consider the statement that

(a) “If it is raining today, then I do not go to class.”

This is logically equivalent to the statement that

(b) “If I go to class, then it is not raining today.”

Therefore, if we want to prove statement (a), then it suffices to prove statement (b). Statement (b) is called the contrapositive of statement (a).

It is worth mentioning that statement (a) is not logically equivalent to the statement:

(c) “If I do not go to class, then it is raining today.”

which is the converse of statement (a). This non-equivalence is also known as the fallacy of the converse.

Example 2. Let $x$ be an integer. Prove that $x^2$ is an odd number if and only if $x$ is an odd number.

Proof. The “if and only if” in this statement requires us to prove both directions of the implication. First, we must prove that if x is an odd number, then x2 is an odd number. Then we should prove that if x2 is an odd number, then x is an odd number.

As we already proved the first statement in Example 1, we only need to prove the second statement. The second statement is logically equivalent to its contrapositive, so it suffices to prove that “if $x$ is an even number, then $x^2$ is even.”

Suppose $x$ is an even number. Then, we can write $x=2k$ for some integer $k$. It implies that $x^2=4k^2=2(2k^2)$. Since $k$ is an integer, $2k^2$ is also an integer, so we can write $x^2=2\ell$ for the integer $\ell=2k^2$. By definition, this means that $x^2$ is an even number. ◻

Proof by contradiction: When proving a statement by contradiction, we would assume that our statement is not true and then derive a contradiction. This is a special case of proving by contrapositive (where our “if” is all of mathematics, and our “then” is the statement to be proved).

Example 3. Prove that $\sqrt{2}$ is irrational.

Proof. Suppose that $\sqrt{2}$ were rational. By definition, this means that $\sqrt{2}$ can be written as $m/n$ for some integers $m$ and $n$. Since $\sqrt{2}=m/n$, it follows that $2=m^2/n^2$, which in turn shows that $m^2=2n^2$. Now any square number $x^2$ must have an even number of prime factors, since any prime factor of $x$ appears twice in $x^2=x\cdot x$. Therefore, $m^2$ must have an even number of prime factors. However, since $n^2$ must also have an even number of prime factors, and 2 is a prime number, $2n^2$ must have an odd number of prime factors. This is a contradiction, since we claimed that $m^2=2n^2$, and no number can simultaneously have an even number of prime factors and an odd number of prime factors. Therefore, our initial assumption was wrong, and $\sqrt{2}$ must be irrational. ◻

Proof by cases: Sometimes, it might be difficult to prove the entire theorem at once. As a result, we consider splitting the proof into several cases and proving the theorem separately for each case.

Example 4. Let $n$ be an integer. Show that if $n$ is not divisible by 3, then $n^2=3k+1$ for some integer $k$.

Proof. If $n$ is not divisible by 3, then either $n=3m+1$ or $n=3m+2$ for some integer $m$.

Case 1: Suppose $n=3m+1$. Then $n^2=(3m+1)^2=9m^2+6m+1=3(3m^2+2m)+1$. Since $3m^2+2m$ is an integer, it follows that we can write $n^2=3k+1$ for $k=3m^2+2m$.

Case 2: Suppose $n=3m+2$. Then $n^2=(3m+2)^2=9m^2+12m+4=9m^2+12m+3+1=3(3m^2+4m+1)+1$. Hence, we can write $n^2=3k+1$ for $k=3m^2+4m+1$.

Since Case 1 and Case 2 cover all possibilities, the proof is complete. ◻

Proof by induction

We can prove a statement by induction when showing that the statement is valid for all positive integers n. Note that this is not the only situation in which we can use induction, and that induction is (usually) not the only way to prove a statement for all positive integers.

To use induction, we need to establish two results:

  1. Base case: The statement is true when n=1.

  2. Inductive step: If the statement is true for n=k, then the statement is also true for n=k+1.

It allows for an infinite chain of implications: (the statement is true for $n=1$) $\Rightarrow$ (the statement is true for $n=2$) $\Rightarrow$ (the statement is true for $n=3$) $\Rightarrow\cdots$

Together, these implications prove the statement for all positive integer values of n. (It does not prove the statement for non-integer values of n, or values of n less than 1.)

Example 5. Prove that $1+2+\cdots+n=\frac{n(n+1)}{2}$ for all integers $n\ge 1$.

Proof. We proceed by induction.

Base case: If n=1, then the statement becomes 1=1(1+1)/2, which is true.

Inductive step: Suppose that the statement is true for $n=k$. This means $1+2+\cdots+k=\frac{k(k+1)}{2}$. We want to show the statement is true for $n=k+1$, i.e., $1+2+\cdots+k+(k+1)=\frac{(k+1)(k+2)}{2}$.

By the induction hypothesis (i.e., because the statement is true for $n=k$), we have $1+2+\cdots+k+(k+1)=\frac{k(k+1)}{2}+(k+1)$. This equals $(k+1)\left(\frac{k}{2}+1\right)$, which is equal to $\frac{(k+1)(k+2)}{2}$. This proves the inductive step.

Therefore, the statement is true for all integers $n\ge 1$. ◻
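A brute-force numerical check of the formula for small $n$ can complement (but not replace) the inductive proof; the cutoff of 1000 below is an arbitrary choice.

```python
# Numerically verify 1 + 2 + ... + n = n(n + 1) / 2 for n = 1, ..., 1000.
for n in range(1, 1001):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("Formula verified for n = 1, ..., 1000.")
```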

Strong induction

Strong induction (or complete induction) is a useful variant of induction. Here, the inductive step is changed to

  1. Base case: The statement is true when n=1.

  2. Inductive step: If the statement is true for all values of $n$ with $1\le n<k$, then the statement is also true for $n=k$.

This also produces an infinite chain of implications: (the statement is true for $n=1$) $\Rightarrow$ (the statement is true for $n\in\{1,2\}$) $\Rightarrow$ (the statement is true for $n\in\{1,2,3\}$) $\Rightarrow\cdots$

Strong induction works on the same principle as weak induction, but it is generally easier to prove theorems with its stronger induction hypothesis.

Example 6. Prove that every integer n greater than or equal to 2 can be factored into prime numbers.

Proof. We proceed by (strong) induction.

Base case: If n=2, then n is a prime number, and its factorization is itself.

Inductive step: Suppose that $k$ is some integer larger than 2, and assume that the statement is true for all integers $n$ with $2\le n<k$. Then, there are two cases:

Case 1: k is prime. Then, its prime factorization is k itself.

Case 2: $k$ is composite. This means that it can be decomposed into a product $xy$, where $x$ and $y$ are both greater than 1 and less than $k$. Since $x$ and $y$ are both less than $k$, both $x$ and $y$ can be factored into prime numbers (by the inductive hypothesis). That is, $x=p_1\cdots p_s$ and $y=q_1\cdots q_t$, where $p_1,\ldots,p_s$ and $q_1,\ldots,q_t$ are prime numbers.

Thus, $k$ can be written as $(p_1\cdots p_s)(q_1\cdots q_t)$, which is a factorization into prime numbers. This completes the proof of the statement. ◻
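The strong-induction argument translates directly into a recursive procedure: a prime is returned as its own factorization, and a composite $k=xy$ is handled by recursing on the strictly smaller factors. A minimal Python sketch (the function name is our own; it assumes Python 3.9+ for the `list[int]` annotation):

```python
def prime_factorization(n: int) -> list[int]:
    """Factor n >= 2 into primes, mirroring the strong-induction proof."""
    assert n >= 2
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:          # n is composite: n = d * (n // d), both factors smaller than n
            return prime_factorization(d) + prime_factorization(n // d)
    return [n]                  # no divisor found: n is prime

print(prime_factorization(360))  # [2, 2, 2, 3, 3, 5]
```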

Useful Results in Calculus

The definition of the exponential function states that $e^x=\lim_{n\to\infty}\left(1+\frac{x}{n}\right)^n$. In particular, it indicates that $\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^n=e$ and $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n=\frac{1}{e}$.

Gamma Function and Stirling’s Formula: For $x\in(0,\infty)$, the Gamma function is $\Gamma(x)=\int_0^{\infty}t^{x-1}e^{-t}\,dt$. While the exact value of $\Gamma(x+1)$ is intractable for some $x\in(0,\infty)$, one can approximate $\Gamma(x+1)$ when $x$ is large by Stirling’s formula: $\lim_{x\to\infty}\frac{\Gamma(x+1)}{(x/e)^x\sqrt{2\pi x}}=1$. This implies that when $x=n$ is a sufficiently large integer, we can approximate $\Gamma(n+1)=n!$ by $\sqrt{2\pi n}\left(\frac{n}{e}\right)^n$. More precisely, the following bounds for $n!$ hold for all $n\ge 1$ rather than only asymptotically: $\sqrt{2\pi n}\left(\frac{n}{e}\right)^n e^{\frac{1}{12n+1}}<n!<\sqrt{2\pi n}\left(\frac{n}{e}\right)^n e^{\frac{1}{12n}}$.
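To get a feel for how tight Stirling’s formula is, the short script below (with a few arbitrarily chosen values of $n$) compares $n!$ with $\sqrt{2\pi n}(n/e)^n$ and checks the two-sided bounds stated above:

```python
import math

# Compare n! with Stirling's approximation and check the stated two-sided bounds.
for n in (1, 5, 10, 20):
    exact = math.factorial(n)
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    lower = stirling * math.exp(1 / (12 * n + 1))
    upper = stirling * math.exp(1 / (12 * n))
    assert lower < exact < upper
    print(f"n={n:2d}  n!={exact:.6g}  Stirling={stirling:.6g}  ratio={exact / stirling:.4f}")
```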

Basic Probability Theory

Parts of this section are adapted from Chapter 1 of [1] and the STAT 512 lecture notes of Professor Yen-Chi Chen [2] and Professor Michael D. Perlman [3].

Probability Space

Sample space: The sample space Ω is the collection of all possible outcomes of a random experiment. For example, if we roll a die with 6 faces, then the sample space will be {1,2,3,4,5,6}.

Events: A subset of the sample space, $A\subseteq\Omega$, is called an event. For example, the event “the number from the above die is less than 4” can be represented by the subset $\{1,2,3\}$. The event “we roll a 6 from the above die” can be represented by the subset $\{6\}$.

σ-algebra: Since the collection of all possible events (i.e., all subsets of $\Omega$) is sometimes too large for a valid probability measure to be defined on it, we introduce the concept of a σ-algebra $\mathcal{F}$ as a collection of subsets of $\Omega$ satisfying:

  1. (Nonemptiness) $\Omega\in\mathcal{F}$ and $\emptyset\in\mathcal{F}$, where $\emptyset$ is the empty set.

  2. (Closure under complementation) If $A\in\mathcal{F}$, then $A^c\in\mathcal{F}$.

  3. (Closure under countable unions) If $A_1,A_2,\ldots\in\mathcal{F}$, then $\bigcup_{i=1}^{\infty}A_i\in\mathcal{F}$.

The subsets of $\Omega$ in $\mathcal{F}$ are said to be measurable, and $(\Omega,\mathcal{F})$ is called a measurable space.

Probability space: A probability measure (or probability function) $P$ is a mapping from the σ-algebra $\mathcal{F}$ to real numbers in $[0,1]$ satisfying the following three axioms:

  1. (Nonnegativity) $P(A)\ge 0$ for all $A\in\mathcal{F}$.

  2. (Normalization) $P(\Omega)=1$.

  3. (Countable additivity) If $A_1,A_2,\ldots\in\mathcal{F}$ are pairwise disjoint, then $P\left(\bigcup_{i=1}^{\infty}A_i\right)=\sum_{i=1}^{\infty}P(A_i)$.

The triplet (Ω,F,P) is called a probability space.

Example 7. For a fair die with 6 faces, we can define the probability function as $P(\{\text{we roll the face } i\})=\frac{1}{6}$ for $i=1,\ldots,6$. All events in the probability space can be represented as unions of these six disjoint events. For instance, $P(\text{we roll an odd number})=\frac{1}{6}+\frac{1}{6}+\frac{1}{6}=\frac{1}{2}$. Note that we can add probabilities here because the events {we roll the face 1}, {we roll the face 3}, and {we roll the face 5} are disjoint.

Properties of Probability Measure

Theorem 1 (Principle of inclusion-exclusion). Let $(\Omega,\mathcal{F},P)$ be a probability space. Given two subsets $A,B\in\mathcal{F}$ that are not necessarily disjoint, we have that $P(A\cup B)=P(A)+P(B)-P(A\cap B)$.

Proof. We can derive this theorem from the probability axioms. Note that $A\cup B$ can be split into three disjoint events: $A\setminus B=A\cap B^c$, $A\cap B$, and $B\setminus A=B\cap A^c$. Furthermore, $A$ can be split into $A\setminus B$ and $A\cap B$, and $B$ can be split into $B\setminus A$ and $A\cap B$. Thus, $P(A\cup B)=P(A\setminus B)+P(A\cap B)+P(B\setminus A)=\left[P(A\setminus B)+P(A\cap B)\right]+\left[P(B\setminus A)+P(A\cap B)\right]-P(A\cap B)=P(A)+P(B)-P(A\cap B)$. The result follows. ◻

Example 8. Suppose $k$ is chosen uniformly at random from the integer set $\{1,2,\ldots,100\}$. (This means that the probability of getting each integer is $1/100$.) Find the probability that $k$ is divisible by 2 or 5.

Solution. By the principle of inclusion-exclusion (Theorem 1), $P(\{k \text{ is divisible by 2 or 5}\})=P(\{k \text{ is divisible by 2}\})+P(\{k \text{ is divisible by 5}\})-P(\{k \text{ is divisible by both 2 and 5}\})$. There are 50 numbers divisible by 2, 20 numbers divisible by 5, and 10 numbers divisible by 10 (i.e., divisible by both 2 and 5). Therefore, the probability is $P(\{k \text{ is divisible by 2 or 5}\})=\frac{50}{100}+\frac{20}{100}-\frac{10}{100}=0.6$.
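Since the sample space is small, we can also verify this solution by brute-force enumeration (integer counts avoid any floating-point issues):

```python
# Enumerate {1, ..., 100} and check the inclusion-exclusion computation.
ks = range(1, 101)
n_2 = sum(k % 2 == 0 for k in ks)             # 50 numbers divisible by 2
n_5 = sum(k % 5 == 0 for k in ks)             # 20 numbers divisible by 5
n_10 = sum(k % 10 == 0 for k in ks)           # 10 numbers divisible by both 2 and 5
n_2_or_5 = sum(k % 2 == 0 or k % 5 == 0 for k in ks)
assert n_2_or_5 == n_2 + n_5 - n_10 == 60     # inclusion-exclusion on counts
print(n_2_or_5 / 100)                         # 0.6
```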

Theorem 2 (Union bound or Boole’s inequality). Let $(\Omega,\mathcal{F},P)$ be a probability space. For any collection of $n$ events $A_1,\ldots,A_n\in\mathcal{F}$, we have that $P\left(\bigcup_{i=1}^n A_i\right)\le\sum_{i=1}^n P(A_i)$.

Proof. We can prove this result by induction (for finite n).

Base case: When $n=1$, the statement becomes $P(A_1)\le P(A_1)$, which is true.

Inductive step: Suppose that the statement is true for $n=k$. We must prove that the statement is true for $n=k+1$. Note that $\bigcup_{i=1}^{k+1}A_i=\left(\bigcup_{i=1}^{k}A_i\right)\cup A_{k+1}$ and, by the principle of inclusion-exclusion (Theorem 1), $P\left(\bigcup_{i=1}^{k+1}A_i\right)\le P\left(\bigcup_{i=1}^{k}A_i\right)+P(A_{k+1})$. By the induction hypothesis, the first term is less than or equal to $\sum_{i=1}^{k}P(A_i)$. Hence, $P\left(\bigcup_{i=1}^{k+1}A_i\right)\le\sum_{i=1}^{k+1}P(A_i)$. This completes the proof. ◻

Example 9. Suppose that the chance of winning the Mega Million is 1 in 100000 every time a person buys a lottery ticket. If Tim buys one ticket every day of the year, how likely is he to win the Mega Million at least once?

Answer. The union bound will not tell us the exact probability of Tim winning the Mega Million. However, it gives us an upper bound on this probability: $365/100000=0.00365$.
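For comparison, if we additionally assume the 365 daily tickets are independent (an assumption not needed for the union bound itself), the exact probability can be computed and checked against the bound:

```python
# Union bound vs. exact probability under an added independence assumption.
p, n = 1 / 100000, 365
union_bound = n * p                  # 0.00365
exact = 1 - (1 - p) ** n             # P(win at least once) for independent tickets
print(f"union bound = {union_bound:.6f}, exact = {exact:.6f}")
# The bound is slightly loose but very close, since p is tiny.
```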

Other useful properties of a probability measure: Let $(\Omega,\mathcal{F},P)$ be a probability space. For any events $A,B\in\mathcal{F}$: $P(\emptyset)=0$; $P(A^c)=1-P(A)$; and if $A\subseteq B$, then $P(A)\le P(B)$ (monotonicity).

Conditional Probability, Independence, and Bayes’ Rule

Example 10. Suppose that you are administering the GRE, and you discover that 2.5% of students get a perfect score on the math section [4]. By itself, this is not a very useful statistic, because the scores on the math section vary substantially by major. You dive a little deeper and find that 7.5% of physical sciences students get a perfect score, 6.3% of engineering students get a perfect score, and most other majors do substantially worse.

In the language of conditional probability, we would say that the probability of getting a perfect score conditional on being an engineering major is 6.3%, i.e., $P(\text{perfect score}\mid\text{engineering major})=0.063$. If we want to compute this probability, we would take the number of engineering majors that receive a perfect score and divide it by the total number of engineering majors. This is equivalent to computing the formula: $P(\text{perfect score}\mid\text{engineering major})=\frac{P(\text{perfect score}\cap\text{engineering major})}{P(\text{engineering major})}$. In general, we can replace “perfect score” and “engineering major” with any two events and obtain the formal definition of conditional probability.

Conditional Probability: For two events $A$ and $B$ with $P(B)>0$, the conditional probability of $A$ given $B$ is defined as: $P(A\mid B)=\frac{P(A\cap B)}{P(B)}$. Notice that when the event $B$ is fixed, $P(\cdot\mid B)$ is another probability measure.

Example 11. Suppose that we toss a fair coin three times. What is the probability that all three tosses come up heads, given that the first toss came up heads?

Answer. This probability is $\frac{P(\{\text{all three tosses come up heads and the first toss came up heads}\})}{P(\{\text{the first toss came up heads}\})}=\frac{1/8}{1/2}=\frac{1}{4}$.
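Because the sample space has only 8 equally likely outcomes, we can also compute this conditional probability by direct enumeration:

```python
from itertools import product

# Enumerate the 8 outcomes of three fair coin tosses and apply the definition
# P(A | B) = P(A and B) / P(B) with A = "all heads", B = "first toss is heads".
outcomes = list(product("HT", repeat=3))
first_heads = [w for w in outcomes if w[0] == "H"]
all_heads = [w for w in outcomes if w == ("H", "H", "H")]
print(len(all_heads) / len(first_heads))  # 0.25
```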

Independence and conditional independence: Two events $A$ and $B$ are independent if $P(A\mid B)=P(A)$ or, equivalently, $P(A\cap B)=P(A)P(B)$. In other words, the occurrence of event $B$ does not affect the probability that event $A$ happens.

For three events $A,B,C$, we say that $A$ and $B$ are conditionally independent given $C$ if $P(A\cap B\mid C)=P(A\mid C)P(B\mid C)$. There are no implications between independence and conditional independence!

Bayes’ rule: Given an event $A$ and some mutually exclusive events $B_1,\ldots,B_k$ with $\bigcup_{i=1}^k B_i=\Omega$, Bayes’ rule states that $P(B_j\mid A)=\frac{P(B_j\cap A)}{P(A)}=\frac{P(A\mid B_j)P(B_j)}{\sum_{i=1}^k P(A\mid B_i)P(B_i)}$ for all $j=1,\ldots,k$.

Example 12. Suppose that 5% of students enrolled in CSE547 at UW will get a 4.0, and a student with a 4.0 in CSE547 has an 80% chance of getting recruited by Google. A student without a 4.0 in CSE547 still has a 40% chance of getting recruited by Google. What is the probability of a student getting a 4.0 in CSE547, given that he/she has been recruited by Google?

Answer. By Bayes’ rule, $P(\text{Get 4.0}\mid\text{Recruited by Google})=\frac{P(\text{Recruited by Google}\mid\text{Get 4.0})\,P(\text{Get 4.0})}{P(\text{Recruited by Google})}=\frac{P(\text{Recruited by Google}\mid\text{Get 4.0})\,P(\text{Get 4.0})}{P(\text{Recruited by Google}\mid\text{Get 4.0})\,P(\text{Get 4.0})+P(\text{Recruited by Google}\mid\text{Not get 4.0})\,P(\text{Not get 4.0})}=\frac{0.8\times 0.05}{0.8\times 0.05+0.4\times 0.95}\approx 9.52\%$.
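The same computation in a few lines of Python (the variable names are ours):

```python
# Bayes' rule for Example 12.
p_40 = 0.05                   # P(get 4.0)
p_rec_given_40 = 0.80         # P(recruited | 4.0)
p_rec_given_not = 0.40        # P(recruited | not 4.0)

p_rec = p_rec_given_40 * p_40 + p_rec_given_not * (1 - p_40)  # law of total probability
posterior = p_rec_given_40 * p_40 / p_rec
print(f"{posterior:.4f}")     # ~0.0952
```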

Random variables

Random variable: Let $(\Omega,\mathcal{F},P)$ be a probability space. A random variable $X:\Omega\to\mathbb{R}$ is a (measurable) function satisfying $X^{-1}((-\infty,c]):=\{\omega\in\Omega : X(\omega)\le c\}\in\mathcal{F}$ for all $c\in\mathbb{R}$. The probability that $X$ takes on a value in a Borel set [5] $B\subseteq\mathbb{R}$ is written as: $P(X\in B)=P(\{\omega\in\Omega : X(\omega)\in B\})$.

Example 13. Suppose that we are tossing three fair coins. Let X be the number of coins that come up heads. Then, P(X=0)=1/8.

Cumulative distribution function (CDF): The CDF $F:\mathbb{R}\to[0,1]$ of a random variable $X$ is a right-continuous and nondecreasing function with left limits satisfying $F(x):=P(X\le x)=P(\{\omega\in\Omega : X(\omega)\le x\})$. In particular, $\lim_{x\to-\infty}F(x)=0$ and $\lim_{x\to\infty}F(x)=1$.

Probability mass function (PMF) and probability density function (PDF): When $X$ is discrete, its PMF is $p(x)=P(X=x)$ for each value $x$ in its range. When $X$ is continuous, its PDF $p$ is a nonnegative function satisfying $P(X\in B)=\int_B p(x)\,dx$; in particular, $F(x)=\int_{-\infty}^{x}p(t)\,dt$.

Expectation and Variance

Expectation: The expected value (or mean) of a random variable $X$ with range $\mathcal{X}\subseteq\mathbb{R}$ can be interpreted as a weighted average: $E(X)=\sum_{x\in\mathcal{X}}x\,P(X=x)$ when $X$ is discrete, and $E(X)=\int_{\mathcal{X}}x\,p(x)\,dx$ when $X$ is continuous with PDF $p$.

Example 14. Suppose that Tim’s happiness score is 10 when it is sunny outside and 2 when it is raining outside. It is sunny 60% of the time in Seattle and raining 40% of the time. What is the expected value of Tim’s happiness in Seattle?

Answer. 10×0.6+2×0.4=6.8.

Linearity of expectation: If X,Y are two random variables and a is a constant in R, then E(X+Y)=E(X)+E(Y) and E(aX)=aE(X). This is true even if X and Y are not independent.

Variance and covariance: The variance of a random variable measures how far away it is, on average, from the mean. It is defined as $\mathrm{Var}(X)=E\left[(X-E[X])^2\right]=E(X^2)-\left[E(X)\right]^2$. The covariance between random variables $X,Y$ is defined as: $\mathrm{Cov}(X,Y)=E\left[(X-E(X))(Y-E(Y))\right]$. For a random variable $X$ and a constant $a\in\mathbb{R}$, we have $\mathrm{Var}(X+a)=\mathrm{Var}(X)$ and $\mathrm{Var}(aX)=a^2\mathrm{Var}(X)$. We do not have $\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)$ unless $X$ and $Y$ are uncorrelated (which means they have covariance 0). In particular, independent random variables are always uncorrelated, although the reverse does not hold in general.
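A small Monte Carlo sketch illustrates the general identity $\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y)$ for correlated $X$ and $Y$ (the sample size, seed, and dependence structure below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)       # y is positively correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]
print(lhs, rhs)   # the two values agree up to Monte Carlo (and ddof) error
```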

Pearson’s correlation coefficient: For two random variables $X$ and $Y$, their (Pearson’s) correlation coefficient is defined as: $\rho_{XY}=\mathrm{Cor}(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$, where $\rho_{XY}\in[-1,1]$ by the Cauchy-Schwarz inequality; see Section 3.7. It measures the linear relation between two random variables.

Known Probability Distributions

Discrete random variables

Bernoulli: If $X$ is a Bernoulli random variable, denoted by $X\sim\mathrm{Bernoulli}(p)$, then $P(X=1)=p$ and $P(X=0)=1-p$. A Bernoulli random variable with parameter $p$ can be interpreted as a coin flip that comes up heads with probability $p$ and tails with probability $1-p$. We know that $E(X)=p$ and $\mathrm{Var}(X)=p(1-p)$.

Binomial: If $X$ is a binomial random variable, denoted by $X\sim\mathrm{Binomial}(n,p)$, then $P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}$ for $k=0,1,\ldots,n$. A binomial random variable with parameters $n$ and $p$ models the number of successes in $n$ independent trials, each of which succeeds with probability $p$. When $n=1$, it reduces to a Bernoulli random variable. We know that $E(X)=np$ and $\mathrm{Var}(X)=np(1-p)$.

Geometric: If $X$ is a geometric random variable, denoted by $X\sim\mathrm{Geometric}(p)$, then $P(X=k)=(1-p)^{k-1}p$ for $k=1,2,\ldots$. A geometric random variable with parameter $p$ models the number of trials until the first success occurs, where each trial succeeds with probability $p$. We know that $E(X)=\frac{1}{p}$ and $\mathrm{Var}(X)=\frac{1-p}{p^2}$.

Poisson: If $X$ is a Poisson random variable, denoted by $X\sim\mathrm{Poisson}(\lambda)$, then $P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}$ for $k=0,1,\ldots$. A Poisson random variable often appears in counting processes. For instance, the number of laser photons hitting a detector in a particular time interval can be modeled as a Poisson random variable. We know that $E(X)=\lambda$ and $\mathrm{Var}(X)=\lambda$.
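The following sampling sketch (with arbitrarily chosen parameters and sample size) checks the stated means and variances of the binomial and Poisson distributions empirically:

```python
import numpy as np

rng = np.random.default_rng(547)
n_samples = 200_000

binom = rng.binomial(n=10, p=0.3, size=n_samples)   # E = np = 3, Var = np(1 - p) = 2.1
poisson = rng.poisson(lam=4.0, size=n_samples)      # E = Var = lambda = 4
print(binom.mean(), binom.var())                    # ~3.0, ~2.1
print(poisson.mean(), poisson.var())                # ~4.0, ~4.0
```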

Indicator random variable: For an event $A$, an indicator random variable takes value 1 when $A$ occurs and 0 otherwise, i.e., $I_A=\begin{cases}1 & \text{if event } A \text{ occurs},\\ 0 & \text{otherwise}.\end{cases}$

The expectation of an indicator random variable is just the probability of the event occurring, i.e., $E[I_A]=1\cdot P(I_A=1)+0\cdot P(I_A=0)=P(I_A=1)=P(A)$, and its variance is $\mathrm{Var}(I_A)=P(A)\left[1-P(A)\right]$.

Multinomial: Suppose that $Z$ is a categorical random variable with range $\{1,\ldots,k\}$ and $P(Z=j)=p_j$ for $j=1,\ldots,k$. We generate independent and identically distributed data $Z_1,\ldots,Z_n$ from the above distribution and take $X_j=\sum_{i=1}^n I\{Z_i=j\}=$ the number of observations in category $j$. Then, the random vector $X=(X_1,\ldots,X_k)$ follows a multinomial distribution, denoted by $X\sim\mathrm{Multinomial}(n;p_1,\ldots,p_k)$ with $\sum_{j=1}^k p_j=1$, whose PMF is given by $P(X_1=x_1,\ldots,X_k=x_k)=\frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}$. Here, $X$ takes integer values within the simplex $\{(x_1,\ldots,x_k)\in\{0,1,\ldots,n\}^k : \sum_{j=1}^k x_j=n\}$.

Continuous random variables

Uniform: If $X$ is a uniform random variable over the interval $[a,b]$, denoted by $X\sim\mathrm{Uniform}[a,b]$, then its PDF is given by $p(x)=\frac{1}{b-a}I_{[a,b]}(x)=\begin{cases}\frac{1}{b-a} & a\le x\le b,\\ 0 & \text{otherwise}.\end{cases}$ We know that $E(X)=\frac{a+b}{2}$ and $\mathrm{Var}(X)=\frac{(b-a)^2}{12}$.

Normal: If $X$ is a normal random variable with parameters $\mu$ and $\sigma^2$, denoted by $X\sim N(\mu,\sigma^2)$, then its PDF is given by $p(x)=\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $x\in(-\infty,\infty)$. We know that $E(X)=\mu$ and $\mathrm{Var}(X)=\sigma^2$.

Exponential: If $X$ is an exponential random variable with parameter $\lambda$, denoted by $X\sim\mathrm{Exp}(\lambda)$, then its PDF is given by $p(x)=\lambda e^{-\lambda x}I_{[0,\infty)}(x)$. We know that $E(X)=\frac{1}{\lambda}$ and $\mathrm{Var}(X)=\frac{1}{\lambda^2}$. A double exponential random variable $Y$ satisfies $|Y|\sim\mathrm{Exp}(\lambda)$, so its PDF is given by $p(y)=\frac{\lambda}{2}e^{-\lambda|y|}$ with $y\in(-\infty,\infty)$. In particular, $E(Y)=0$ and $\mathrm{Var}(Y)=\frac{2}{\lambda^2}$. Sometimes, $Y$ is also called a Laplace random variable [6].

Cauchy: If $X$ is a Cauchy random variable with parameters $\mu,\sigma^2$, denoted by $X\sim\mathrm{Cauchy}(\mu,\sigma^2)$, then its PDF is given by $p(x)=\frac{1}{\pi\sigma}\left[\frac{\sigma^2}{\sigma^2+(x-\mu)^2}\right]$, where $x\in(-\infty,\infty)$. Note that both the expectation and variance of a Cauchy distribution do not exist. The parameter $\mu$ represents its median.

Gamma: A Gamma random variable $X$ is characterized by two parameters $\alpha,\lambda>0$ and has PDF $p(x)=\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}I_{[0,\infty)}(x)$, where $\Gamma(\alpha)=\int_0^{\infty}u^{\alpha-1}e^{-u}\,du$ is the Gamma function; see Section 2. We denote $X\sim\mathrm{Gamma}(\alpha,\lambda)$ and have that $E(X)=\frac{\alpha}{\lambda}$ and $\mathrm{Var}(X)=\frac{\alpha}{\lambda^2}$.

Beta: A Beta random variable $X$ with parameters $\alpha,\beta>0$ has its PDF as: $p(x)=\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}I_{[0,1]}(x)$, where $B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$. Given that the Beta random variable $X\sim\mathrm{Beta}(\alpha,\beta)$ has a continuous distribution on $[0,1]$, it is often used to model a ratio or probability. We know that $E(X)=\frac{\alpha}{\alpha+\beta}$ and $\mathrm{Var}(X)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.

Logistic: A logistic random variable $X$ with parameters $\alpha\in\mathbb{R},\beta>0$ has its CDF in the form of a logistic function: $F(x)=P(X\le x)=\frac{1}{1+e^{-\alpha-\beta x}}$. Thus, its PDF is given by $p(x)=\frac{d}{dx}F(x)=\frac{\beta e^{-\alpha-\beta x}}{(1+e^{-\alpha-\beta x})^2}=\frac{\beta e^{\alpha+\beta x}}{(1+e^{\alpha+\beta x})^2}$.

Dirichlet: A Dirichlet random vector $Z=(Z_1,\ldots,Z_k)$ generalizes the Beta distribution to its multivariate version (or extends the multinomial distribution to its continuous version). It has a PDF defined on the simplex $\{(z_1,\ldots,z_k)\in[0,1]^k : \sum_{i=1}^k z_i=1\}$ as: $p(z_1,\ldots,z_k;\alpha_1,\ldots,\alpha_k)=\frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^k z_i^{\alpha_i-1}$, where $\boldsymbol{\alpha}=(\alpha_1,\ldots,\alpha_k)$ with $\alpha_i>0$ for $i=1,\ldots,k$ is a $k$-dimensional parameter vector. The Dirichlet distribution is particularly useful in modeling the prior probabilities of a multinomial distribution that generates the latent topics of a document [7]. When $Z\sim\mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_k)$, its mean vector is $E(Z)=\left(\frac{\alpha_1}{\sum_{i=1}^k\alpha_i},\ldots,\frac{\alpha_k}{\sum_{i=1}^k\alpha_i}\right)$.
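As a quick check of the stated mean vector, one can sample from a Dirichlet distribution and compare the empirical average with $\alpha_i/\sum_{i=1}^k\alpha_i$ (the parameter vector and sample size are arbitrary choices):

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
rng = np.random.default_rng(1)
samples = rng.dirichlet(alpha, size=100_000)   # each row lies on the probability simplex

print(samples.mean(axis=0))   # ~[0.2, 0.3, 0.5]
print(alpha / alpha.sum())    # [0.2, 0.3, 0.5]
```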

Inequalities

Markov’s inequality: Let $X$ be a nonnegative random variable. Then, for any $\epsilon>0$, $P(X\ge\epsilon)\le\frac{E(X)}{\epsilon}$.

Proof. For any $\epsilon>0$, we consider splitting the expectation $E(X)$ into two parts as: $E(X)=E\left(X\,I\{X\ge\epsilon\}\right)+E\left(X\,I\{X<\epsilon\}\right)\ge E\left(X\,I\{X\ge\epsilon\}\right)\ge E\left(\epsilon\,I\{X\ge\epsilon\}\right)=\epsilon\,P(X\ge\epsilon)$. The result follows by dividing both sides of the above inequality by $\epsilon>0$. ◻
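An empirical illustration of Markov’s inequality, using an Exp(1) sample as an arbitrary choice of nonnegative random variable with $E(X)=1$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)
for eps in (0.5, 1.0, 2.0, 5.0):
    empirical = (x >= eps).mean()      # estimate of P(X >= eps)
    bound = x.mean() / eps             # Markov bound E(X) / eps
    print(f"eps={eps}: P(X >= eps) ~ {empirical:.4f} <= {bound:.4f}")
```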

Chebyshev’s inequality: Let $X$ be a random variable with $\mathrm{Var}(X)<\infty$. Then, for any $\epsilon>0$, $P\left(|X-E(X)|\ge\epsilon\right)\le\frac{\mathrm{Var}(X)}{\epsilon^2}$. Chebyshev’s inequality can be proved by applying Markov’s inequality to the nonnegative random variable $[X-E(X)]^2$. It is a simple instance of general concentration inequalities that give a probabilistic bound on the deviation of $X$ away from its mean.

Chernoff bound: Suppose that there is a constant $b>0$ such that the moment generating function $\varphi(\lambda)=E\left[e^{\lambda(X-\mu)}\right]$ of a random variable $X$ exists when $|\lambda|\le b$, where $\mu=E(X)$. Given that $P\left[(X-\mu)>t\right]=P\left[e^{\lambda(X-\mu)}\ge e^{\lambda t}\right]\le\frac{E\left[e^{\lambda(X-\mu)}\right]}{e^{\lambda t}}$ for any $\lambda\in[0,b]$, we can optimize our choice of $\lambda$ to obtain the Chernoff bound as: $P\left[(X-\mu)>t\right]\le\inf_{\lambda\in[0,b]}\frac{E\left[e^{\lambda(X-\mu)}\right]}{e^{\lambda t}}$.

Cauchy-Schwarz inequality: Given two random variables $X$ and $Y$, $|E(XY)|^2\le E(X^2)E(Y^2)$, where equality holds if and only if either $P(X=0)=1$, or $P(Y=0)=1$, or $P(X=cY)=1$ for some nonzero constant $c\in\mathbb{R}$. A useful corollary of the Cauchy-Schwarz inequality is that $|\mathrm{Cov}(X,Y)|^2\le\mathrm{Var}(X)\mathrm{Var}(Y)$.

Hölder inequality: Given two random variables $X$ and $Y$, $E|XY|\le\left(E|X|^p\right)^{\frac{1}{p}}\left(E|Y|^q\right)^{\frac{1}{q}}\equiv\|X\|_p\|Y\|_q$ with $p,q\in[1,\infty]$ and $\frac{1}{p}+\frac{1}{q}=1$, where equality holds if and only if $P\left(|X|^p=c|Y|^q\right)=1$ for some nonzero constant $c$. Specifically, when $p=\infty$, $\|X\|_{\infty}=\inf\{M\ge 0 : P(|X|>M)=0\}$.

Minkowski Inequality: Given two random variables $X$ and $Y$, $\left[E|X+Y|^p\right]^{\frac{1}{p}}\le\left[E|X|^p\right]^{\frac{1}{p}}+\left[E|Y|^p\right]^{\frac{1}{p}}$ for $p\in[1,\infty)$, where equality holds if and only if $P(X=cY)=1$ for some nonzero constant $c$, or $P(Y=0)=1$, or $P(X=0)=1$.

Jensen’s Inequality: Given a convex function $\varphi$ and a random variable $X$, $\varphi\left(E(X)\right)\le E\left[\varphi(X)\right]$, where equality holds if and only if either $P(X=c)=1$ for some constant $c$ or, for every line $a+bx$ that is tangent to $\varphi$ at $E(X)$, $P\left(\varphi(X)=a+bX\right)=1$.

Big O and OP Symbols

In machine learning, big O and little o symbols are used to characterize the time or space complexity of our algorithm with respect to the sample size or data dimension. In general, these symbols describe the growth rate of functions as follows.

Let $f(x)$ be the function to be estimated on $\mathbb{R}$ and $g(x)$ be a comparison function that is strictly positive for all sufficiently large $x\in\mathbb{R}$.

Big O symbol: We write $f(x)=O(g(x))$ if there exist constants $M>0$ and $x_0>0$ such that [8] $|f(x)|\le M g(x)$ for all $x\ge x_0$. Using the limit superior notation, we know that $f(x)=O(g(x))\iff\limsup_{x\to\infty}\frac{|f(x)|}{g(x)}<\infty$. Notice that $\limsup_{x\to\infty}h(x)=\lim_{x_0\to\infty}\left[\sup_{x\ge x_0}h(x)\right]$.

Little o symbol: Similarly, we write $f(x)=o(g(x))$ if for any $\epsilon>0$, there exists a constant $x_0>0$ such that $|f(x)|\le\epsilon g(x)$ for all $x\ge x_0$. Under the limit notation, we have that $f(x)=o(g(x))\iff\lim_{x\to\infty}\frac{|f(x)|}{g(x)}=0$.

Big Ω symbol: Sometimes, depending on the context, we may encounter the big Ω symbol in the machine learning literature. In most cases, the definition of $f(x)=\Omega(g(x))$ follows from [9], so we write $f(x)=\Omega(g(x))$ if there exist constants $m>0$ and $x_0$ such that $|f(x)|\ge m\,g(x)$ for all $x\ge x_0$, or equivalently, $f(x)=\Omega(g(x))\iff\liminf_{x\to\infty}\frac{|f(x)|}{g(x)}>0$.

Taking into account the randomness of the input data, it may not be possible to bound a quantity or random variable in our algorithm through the above big O and little o symbols. We introduce the $O_P$ and $o_P$ symbols [10] to handle the stochastic rate of convergence for a sequence of random variables $\{X_n\}_{n=1}^{\infty}$.

Little $o_P$ symbol: We write $X_n=o_P(a_n)$ for a sequence of constants $\{a_n\}_{n=1}^{\infty}$ if $\frac{X_n}{a_n}$ converges to 0 in probability as $n\to\infty$. That is, for any $\epsilon>0$, $X_n=o_P(a_n)\iff\lim_{n\to\infty}P\left(\left|\frac{X_n}{a_n}\right|\ge\epsilon\right)=0$.

Big $O_P$ symbol: We write $X_n=O_P(a_n)$ for a sequence of constants $\{a_n\}_{n=1}^{\infty}$ if $\frac{X_n}{a_n}$ is bounded in probability when $n$ is large. That is, for any $\epsilon>0$, there exist a constant $M>0$ and an integer $N>0$ such that $X_n=O_P(a_n)\iff P\left(\left|\frac{X_n}{a_n}\right|>M\right)<\epsilon$ for all $n>N$.
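As an illustration of the $O_P$ notation (a simulation sketch, not a proof): for i.i.d. data with finite variance, the sample mean satisfies $\bar{X}_n-\mu=O_P(1/\sqrt{n})$, so $\sqrt{n}\,|\bar{X}_n-\mu|$ stays bounded in probability as $n$ grows. The sample sizes, number of replications, and Uniform$[0,1]$ data below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (100, 10_000, 1_000_000):
    # 50 independent replications of the centered sample mean at sample size n
    devs = np.array([abs(rng.uniform(0, 1, size=n).mean() - 0.5) for _ in range(50)])
    print(f"n={n:>7d}  typical |mean - mu| = {devs.mean():.5f}  "
          f"scaled by sqrt(n) = {np.sqrt(n) * devs.mean():.3f}")
```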


  1. G. Casella and R. Berger. Statistical Inference. Duxbury Advanced Series. Thomson Learning, 2nd edition, 2002.↩︎

  2. See http://faculty.washington.edu/yenchic/20A_stat512.html.↩︎

  3. M. Perlman. Probability and Mathematical Statistics I (STAT 512 Lecture Notes), 2020. URL https://sites.stat.washington.edu/people/mdperlma/STAT%20512%20MDP%20Notes.pdf.↩︎

  4. See https://www.ets.org/s/gre/pdf/gre_guide_table4.pdf for a breakdown by specific majors. For some reason, computer science is counted as part of the physical sciences, and not as engineering.↩︎

  5. A Borel set in R is a set that can be formed from open sets through the operations of countable unions/intersections and complements.↩︎

  6. See https://en.wikipedia.org/wiki/Laplace_distribution.↩︎

  7. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.↩︎

  8. See https://en.wikipedia.org/wiki/Big_O_notation.↩︎

  9. D. E. Knuth. Big omicron and big omega and big theta. ACM Sigact News, 8(2):18–24, 1976.↩︎

  10. See https://en.wikipedia.org/wiki/Big_O_in_probability_notation.↩︎