**Instructor:** Larry Ruzzo

**Notes:** Tushar Bhangale


__Sample Space:__ The set of all possible outcomes is the *sample space* (Ω), and P(Ω) = 1.

And the probability of any event A is

$$P(A) = \sum_{\omega \in A} P(\omega)$$


__Conditional Probability:__ The probability of an event given that another event has occurred is called a conditional probability. The conditional probability of *A* given *B* is denoted by P(*A*|*B*) and is computed as follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

P(A|B) is also called the *posterior* probability of A, i.e. the probability of A after observing that event B has occurred. In this case P(A) is called the *prior* probability.


__Bayes’ Rule:__

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

It is often easier to compute P(B|A) than P(A|B). Bayes' rule makes it possible to evaluate P(A|B).

__Coin problem:__ Consider 2 biased coins: one (H_{biased}) has P(Head) = 0.99, and the other (T_{biased}) has P(Tail) = 0.99. One of them is drawn at random (P(H_{biased}) = P(T_{biased}) = 0.5) and tossed. Thus the prior probability P(H_{biased}) = 0.5. What is the posterior probability of H_{biased} given that a Head occurred, P(H_{biased}|H)?

$$P(H_{\text{biased}} \mid H) = \frac{P(H \mid H_{\text{biased}})\,P(H_{\text{biased}})}{P(H \mid H_{\text{biased}})\,P(H_{\text{biased}}) + P(H \mid T_{\text{biased}})\,P(T_{\text{biased}})} = \frac{0.99 \times 0.5}{0.99 \times 0.5 + 0.01 \times 0.5} = 0.99$$

Thus the posterior probability P(H_{biased}|H) = 0.99, where the prior probability P(H_{biased}) was 0.5.
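The arithmetic of the coin example can be checked directly (a minimal sketch; variable names are illustrative):

```python
# Coin example: P(Head | H_biased) = 0.99, P(Head | T_biased) = 1 - 0.99 = 0.01,
# and each coin is drawn with prior probability 0.5.
p_head_given_hb = 0.99
p_head_given_tb = 0.01
prior_hb = prior_tb = 0.5

# Law of total probability: P(H) = P(H|H_biased)P(H_biased) + P(H|T_biased)P(T_biased)
p_head = p_head_given_hb * prior_hb + p_head_given_tb * prior_tb

# Bayes' rule: P(H_biased | H) = P(H | H_biased) P(H_biased) / P(H)
posterior_hb = p_head_given_hb * prior_hb / p_head
print(posterior_hb)  # 0.99
```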

Z_{ij} ∈ {0,1} is a binary variable such that Z_{ij} = 1 if X_{i} is drawn from the Gaussian with mean μ_{j}, and Z_{ij} = 0 otherwise.

Event A = sample X_{i} is drawn from N(μ_{1}, σ_{1}), P(A) = τ_{1}

Event B = sample X_{i} is drawn from N(μ_{2}, σ_{2}), P(B) = τ_{2}

Event D = X_{i} ∈ [x, x + dx]

P(D|A) can be calculated using:

$$P(D \mid A) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x - \mu_1)^2 / (2\sigma_1^2)}\, dx$$

And P(A|D) can be calculated using P(D|A) and applying Bayes’ rule as:

$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)}$$

where P(D) = P(D|A)P(A) + P(D|B)P(B) if A and B are mutually exclusive and exhaustive.

$$P(A \mid D) = \frac{\tau_1 \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x - \mu_1)^2/(2\sigma_1^2)}}{\tau_1 \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x - \mu_1)^2/(2\sigma_1^2)} + \tau_2 \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-(x - \mu_2)^2/(2\sigma_2^2)}}$$

(the dx factors cancel).

D is the observed data and A is the model. P(A|D) is the posterior probability after seeing the data D that it came from model A.

And E(Z_{ij}) = P(A|D).
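In code, E(Z_{ij}) is exactly this Bayes'-rule ratio. A minimal sketch for a two-component mixture, assuming the parameters are known (function names are illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def responsibility(x, mus, sigmas, taus):
    """P(component j | x) for each j via Bayes' rule, i.e. E(Z_ij) = P(A | D)."""
    weighted = [tau * gaussian_pdf(x, mu, sigma)
                for mu, sigma, tau in zip(mus, sigmas, taus)]
    total = sum(weighted)  # P(D) = P(D|A)P(A) + P(D|B)P(B)
    return [w / total for w in weighted]

# A point at 0.9 is more likely under the component centered at 1.0
r = responsibility(0.9, mus=[0.0, 1.0], sigmas=[1.0, 1.0], taus=[0.5, 0.5])
```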

Clustering can also be classified into hard clustering and soft clustering. Hard clustering is where every data point is assumed to belong to only one cluster. Soft clustering involves assigning a certain probability for the data point belonging to each cluster.

If the τ_{j}s are unknown but the Zs are known, the μs and τs can be calculated using maximum likelihood estimation. If the Zs are unknown, Bayesian estimation has to be used to calculate the Z_{i}s.
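When the Zs are known, the maximum likelihood estimates are just per-cluster sample statistics. A minimal sketch (hypothetical helper; `zs[i]` is the known component index of `xs[i]`):

```python
def mle_given_z(xs, zs, n_components=2):
    """ML estimates of the means and mixing weights when assignments Z are known."""
    mus, taus = [], []
    for j in range(n_components):
        members = [x for x, z in zip(xs, zs) if z == j]
        mus.append(sum(members) / len(members))  # mu_j: mean of cluster j's points
        taus.append(len(members) / len(xs))      # tau_j: fraction of points in cluster j
    return mus, taus
```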


**EM Algorithms**

EM stands for expectation-maximization. There are two types of EM algorithms.


__Classification EM Algorithm:__ (hard clustering)

Steps:

- Given μs and τs, estimate Z_{i}
- Assign each x_{i} to the best cluster
- Re-estimate μs and τs
- Reiterate
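The loop above can be sketched for two equal-variance, equal-weight 1-D Gaussians, in which case "best cluster" reduces to nearest mean (a minimal illustrative sketch, essentially k-means with k = 2):

```python
def classification_em(xs, mus, n_iters=20):
    """Hard-EM loop: assign each point to its best cluster, then re-estimate means."""
    mus = list(mus)
    for _ in range(n_iters):
        # Estimate Z_i: index of the best (nearest-mean) cluster for each x_i
        clusters = [[], []]
        for x in xs:
            j = 0 if abs(x - mus[0]) <= abs(x - mus[1]) else 1
            clusters[j].append(x)
        # Re-estimate each mean from its hard assignments
        for j in (0, 1):
            if clusters[j]:
                mus[j] = sum(clusters[j]) / len(clusters[j])
    return mus
```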

__(General) EM Algorithm:__ (soft clustering)

Steps:

- Random initialization of μs and τs
- Using these values of μs and τs, estimate Zs
- Given the distribution of Zs, re-estimate μs and τs
- Reiterate
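These steps can be sketched for the two-Gaussian setting discussed next: equal known σ and τ₁ = τ₂, so the normalizing constants cancel in the E-step (a minimal illustrative sketch, not a general implementation):

```python
import math

def soft_em(xs, init_mus, sigma=1.0, n_iters=50):
    """Soft-EM for the means of two equal-variance, equal-weight 1-D Gaussians."""
    mu1, mu2 = init_mus
    for _ in range(n_iters):
        # E-step: E(Z_i1) = posterior that x_i came from component 1
        w1 = []
        for x in xs:
            a = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            b = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            w1.append(a / (a + b))
        # M-step: means become responsibility-weighted averages
        mu1 = sum(w * x for w, x in zip(w1, xs)) / sum(w1)
        mu2 = sum((1 - w) * x for w, x in zip(w1, xs)) / (len(xs) - sum(w1))
    return mu1, mu2
```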

Consider that the data points belong to a mixture of two Gaussians with means μ_{1} and μ_{2} and variance σ^{2}. Assuming equal likelihood of a data point belonging to each cluster, i.e. τ_{1} = τ_{2}, for any data point the posterior probability (given the μs) of it belonging to any cluster is given by:

$$E[Z_{i1}] = P(Z_{i1} = 1 \mid x_i) = \frac{e^{-(x_i - \mu_1)^2/(2\sigma^2)}}{e^{-(x_i - \mu_1)^2/(2\sigma^2)} + e^{-(x_i - \mu_2)^2/(2\sigma^2)}}$$

The joint probability for all the points is:

$$P = \prod_{i=1}^{n} \prod_{j=1}^{2} \left[ \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu_j)^2/(2\sigma^2)} \right]^{Z_{ij}}$$

The goal is to maximize this probability, which is equivalent to maximizing the log of the function.

$$\log P = \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \left[ \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x_i - \mu_j)^2}{2\sigma^2} \right]$$

Now, maximize the expected value of log P, i.e. max E(log P), treating Z_{i} as a random variable drawn from the distribution defined by μ_{1}^{t}, μ_{2}^{t}:

$$E[\log P] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \left[ \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x_i - \mu_j)^2}{2\sigma^2} \right]$$

Finding μ_{1} and μ_{2} that maximize E(log P) is equivalent to finding μ_{1} and μ_{2} that minimize

$$\sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}]\, (x_i - \mu_j)^2$$
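Minimizing this expression with respect to each μ_{j} (set the partial derivative to zero) yields the familiar responsibility-weighted mean, a standard M-step update stated here for completeness:

$$\mu_j^{t+1} = \frac{\sum_{i=1}^{n} E[Z_{ij}]\, x_i}{\sum_{i=1}^{n} E[Z_{ij}]}$$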

The same technique can be used to estimate unknown τs and σs if they are not the same for each cluster.


__EM Algorithm (proof of convergence):__

Let X be the visible data,

Y the hidden data, and

θ, θ^{t} the parameters, where θ^{t} is the value of the parameters at time t.


The E-step forms the expected complete-data log likelihood under the current parameters:

$$Q(\theta \mid \theta^t) = E\left[\log P(X, Y \mid \theta) \mid X, \theta^t\right]$$

and the M-step chooses

$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$$