Clustering (contd.) EM Algorithm

October 6, 2001

Instructor: Larry Ruzzo

Notes: Tushar Bhangale

Probability Review


Sample Space: The set of all possible outcomes is sample space (W) P(W) = 1.

And probability of any event A:


Conditional Probability: The probability of an event given that another event has occurred is called a conditional probability. The conditional probability of A given B is denoted by P(A|B) ans is computed as follows:

P(A|B) is also called as the posterior probability of A i.e. probability of A after observing that event B has occurred. In this case P(A) is also called as prior probability.


Bayes’ Rule:

It is often easier to compute P(B|A) than P(A|B). Bayes’ rules makes it possible to evaluate P(A|B).

Coin problem: Consider 2 biased coins, one (Hbiased)  has P(Head) = 0.99 and the other (Tbiased) has P(Tail) = 0.99.  One of them is drawn randomly ( PHbiased = PTbiased = 0.5) and tossed. Thus the prior probability of PHbiased = 0.5.  What is the posterior probability of Hbiased given the fact that a Head occurred P(H­biased|H) ?

Thus the posterior probability P(H­biased|H)= 0.99 where the prior probability of P(Hbiased) was 0.5.


Notations used:

Zij {0,1} is a binary variable such that Zij=1 if XiÎ Gaussian with mj and Zij= 0 otherwise.

Event A = sample Xi is drawn from N(m1, s1),  P(A) = t1

Event B = sample Xi is drawn from N(m2, s2), P(B) = t2

Event D = Xi Î [X,  X + dx]


Calculating E(Zij):

P(D|A) can be calculated using:

And P(A|D) can be calculated using P(D|A) and applying Bayes’ rule as:

where, P(D) = P(D|A)P(A) + P(D|B)P(B) if A and B are mutually exclusive and exhaustive.


D is the observed data and A is the model. P(A|D) is the posterior probability after seeing the data D that it came from model A.

And E(Z­ij) = P(A|D).

            Clustering can also be classified into hard clustering and soft clustering. Hard clustering is where every data point is assumed to belong to only one cluster. Soft clustering involves assigning a certain probability for the data point belonging to each cluster.

If tjs are unknown but Zs are known, ms and ts can be calculated by using maximum likelihood estimation.  If Zs are unknown, bayesian estimation has to be used to calculate Zi.


EM Algorithms

EM stands for estimation-maximization. There are two types of EM algorithms.


Classification Em Algorithms: (Hard clustering)


  1. Given ms and ts, estimate Zi
  2. Assign each xi to the best cluster
  3. Re-estimate ms and ts
  4. Reiterate

(General) EM Algorithm: (soft clustering)


  1. Random initialization of ms and ts
  2. Using these values of ms and ts, estimate Zs
  3. Given distribution of Zs, re-estimate ms and ts
  4. Reiterate


Consider that the data points belong to a mixture of two Gaussians with means m1 and m2 and variance s2. Assuming equal likelihood of the data point belonging to each cluster i.e. t1=t2, for any data point, the posterior probability (given the ms) of it belonging to any cluster, is given by,

The joint probability for all the points is:

The goal is to maximize this probability, which is equivalent to maximizing the log of the function.

now, maximizing expected value of log P i.e. max E(log P), treating Zi as a random variable drawn from distributions defined by m1t, m2t

Finding m1 and m2 that maximize E(log P) is equivalent to finding m1 and m2 that minimize

Same technique can be used to estimate unknown ts and ss if they are not the same for each cluster.


EM Algorithm ( proof of convergence):


Let       X         be the visible data

            Y         the hidden data

            q, qt     the parameters where qt is the value of the parameters at time t