Clustering (contd.) EM Algorithm

October 6, 2001

Instructor: Larry Ruzzo

Notes: Tushar Bhangale

Probability Review

Sample Space: The set of all possible outcomes is the sample space (Ω), and P(Ω) = 1.

And the probability of any event A satisfies: 0 ≤ P(A) ≤ 1

Conditional Probability: The probability of an event given that another event has occurred is called a conditional probability. The conditional probability of A given B is denoted by P(A|B) and is computed as follows:

P(A|B) = P(A ∩ B) / P(B)

P(A|B) is also called the posterior probability of A, i.e. the probability of A after observing that event B has occurred. In this case P(A) is called the prior probability.

Bayes’ Rule:

P(A|B) = P(B|A) P(A) / P(B)

It is often easier to compute P(B|A) than P(A|B). Bayes’ rule makes it possible to evaluate P(A|B) from P(B|A).

Coin problem: Consider 2 biased coins: one (H_biased) has P(Head) = 0.99 and the other (T_biased) has P(Tail) = 0.99. One of them is drawn at random (P(H_biased) = P(T_biased) = 0.5) and tossed. Thus the prior probability P(H_biased) = 0.5. What is the posterior probability of H_biased given that a Head occurred, P(H_biased|H)?

P(H_biased|H) = P(H|H_biased) P(H_biased) / [P(H|H_biased) P(H_biased) + P(H|T_biased) P(T_biased)] = (0.99 × 0.5) / (0.99 × 0.5 + 0.01 × 0.5) = 0.99

Thus the posterior probability P(H_biased|H) = 0.99, whereas the prior probability P(H_biased) was 0.5.
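The coin calculation can be checked numerically. The following is a minimal sketch; the function name `posterior` and its parameter names are illustrative choices, not part of the notes.

```python
def posterior(prior_a, lik_a, prior_b, lik_b):
    """P(A | evidence) for two mutually exclusive, exhaustive hypotheses A and B."""
    # Total probability of the evidence: P(E) = P(E|A)P(A) + P(E|B)P(B)
    evidence = lik_a * prior_a + lik_b * prior_b
    # Bayes' rule: P(A|E) = P(E|A)P(A) / P(E)
    return lik_a * prior_a / evidence

# Coin problem: A = H_biased, B = T_biased, evidence = a Head was observed.
p_hbiased_given_head = posterior(prior_a=0.5, lik_a=0.99, prior_b=0.5, lik_b=0.01)
print(round(p_hbiased_given_head, 4))  # 0.99
```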

Notations used:

Z_ij ∈ {0,1} is a binary variable such that Z_ij = 1 if X_i ∈ Gaussian with mean μ_j, and Z_ij = 0 otherwise.

Event A = sample X_i is drawn from N(μ₁, σ₁), P(A) = τ₁

Event B = sample X_i is drawn from N(μ₂, σ₂), P(B) = τ₂

Event D = X_i ∈ [x, x + dx]

Calculating E(Z_ij):

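P(D|A) can be calculated from the density of the first Gaussian evaluated at x (μ₁ and σ₁ are its mean and standard deviation):

P(D|A) = (1/(σ₁√(2π))) exp(-(x - μ₁)² / (2σ₁²)) dx

And P(A|D) can be calculated using P(D|A) and applying Bayes’ rule as:

P(A|D) = P(D|A) P(A) / P(D)

where P(D) = P(D|A)P(A) + P(D|B)P(B), since A and B are mutually exclusive and exhaustive.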

D is the observed data and A is the model. P(A|D) is the posterior probability after seeing the data D that it came from model A.

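Since Z_i1 is a 0/1 indicator, its expected value equals the probability that it is 1, so E(Z_i1) = P(A|D); likewise E(Z_i2) = P(B|D).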
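The calculation of E(Z_ij) can be sketched in a few lines. This is an illustration only: the helper names `normal_pdf` and `expected_z` and the example parameters are assumptions, and the second argument of each Gaussian is treated as a standard deviation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def expected_z(x, mus, sigmas, taus):
    """E(Z_ij) for one sample x: the posterior probability of each component j."""
    # P(D | component j) * P(component j); the common factor dx cancels in the ratio.
    weighted = [tau * normal_pdf(x, mu, sigma)
                for mu, sigma, tau in zip(mus, sigmas, taus)]
    total = sum(weighted)  # P(D), by the law of total probability
    return [w / total for w in weighted]

# Hypothetical parameters: Gaussians centred at 0 and 4, unit standard deviation, equal weights.
print(expected_z(2.0, mus=[0.0, 4.0], sigmas=[1.0, 1.0], taus=[0.5, 0.5]))  # midpoint: [0.5, 0.5]
print(expected_z(0.0, mus=[0.0, 4.0], sigmas=[1.0, 1.0], taus=[0.5, 0.5]))
```

A point midway between the two means gets equal expected membership in both components, while a point sitting on one mean is assigned almost entirely to that component.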

Clustering can also be classified into hard clustering and soft clustering. Hard clustering is where every data point is assumed to belong to only one cluster. Soft clustering involves assigning a certain probability for the data point belonging to each cluster.

If the τ_j are unknown but the Zs are known, the μs and τs can be calculated using maximum likelihood estimation. If the Zs are unknown, Bayesian estimation has to be used to estimate each Z_i.
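When every Z is known, the maximum likelihood estimates reduce to per-cluster averages and counts. A minimal sketch, assuming labels are given as component indices and every component has at least one sample (the function name `mle_from_assignments` is hypothetical):

```python
def mle_from_assignments(xs, zs, k):
    """Maximum likelihood estimates of the means and mixing weights when labels are known.

    xs: list of samples; zs: component index (0..k-1) of each sample.
    Assumes every component has at least one sample.
    """
    n = len(xs)
    mus, taus = [], []
    for j in range(k):
        members = [x for x, z in zip(xs, zs) if z == j]
        taus.append(len(members) / n)            # fraction of samples in component j
        mus.append(sum(members) / len(members))  # sample mean of component j
    return mus, taus

mus, taus = mle_from_assignments([1.0, 2.0, 3.0, 10.0], [0, 0, 0, 1], k=2)
print(mus, taus)  # [2.0, 10.0] [0.75, 0.25]
```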

EM Algorithms

EM stands for expectation-maximization. There are two types of EM algorithms.

Classification EM Algorithm: (hard clustering)

Steps:

Given μs and τs, estimate Z_i
Assign each x_i to the best cluster
Re-estimate μs and τs
Reiterate
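The hard-clustering steps above can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes 1-D data and a known standard deviation `sigma` shared by all components, and the function name `classification_em` and the stopping rule (assignments stop changing) are illustrative choices.

```python
import math

def classification_em(xs, mus, taus, sigma=1.0, max_iter=100):
    """Hard-assignment EM for a 1-D Gaussian mixture.

    Assumes a known standard deviation `sigma` shared by all components
    (an illustrative simplification, not something the notes specify).
    """
    mus, taus = list(mus), list(taus)  # work on copies of the initial guesses
    k, labels = len(mus), None
    for _ in range(max_iter):
        # Steps 1-2: score every component for each x_i and pick the best one.
        # The score is log(tau_j * density_j(x)) up to an additive constant.
        new_labels = []
        for x in xs:
            scores = [
                (math.log(taus[j]) if taus[j] > 0 else -math.inf)
                - (x - mus[j]) ** 2 / (2 * sigma ** 2)
                for j in range(k)
            ]
            new_labels.append(max(range(k), key=scores.__getitem__))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Step 3: re-estimate the means and mixing weights from the hard assignments.
        for j in range(k):
            members = [x for x, z in zip(xs, labels) if z == j]
            taus[j] = len(members) / len(xs)
            if members:  # leave the mean unchanged if the cluster emptied
                mus[j] = sum(members) / len(members)
    return mus, taus, labels

data = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
print(classification_em(data, mus=[0.0, 6.0], taus=[0.5, 0.5]))
```

With the shared `sigma` and equal weights, the assignment step reduces to choosing the nearest mean, which is why this variant behaves much like k-means.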

(General) EM Algorithm: (soft clustering)

Steps:

Random initialization of μs and τs
Using these values of μs and τs, estimate Zs
Given distribution of Zs, re-estimate μs and τs
Reiterate
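The soft-clustering steps above can be sketched in the same style. Again this is a sketch under stated assumptions: 1-D data, a known shared standard deviation `sigma`, a fixed number of iterations instead of a convergence test, and means initialized by sampling data points unless initial values are passed in. The function name `em_gaussian_mixture` is hypothetical.

```python
import math
import random

def em_gaussian_mixture(xs, k=2, mus=None, sigma=1.0, n_iter=50, seed=0):
    """Soft-assignment EM for a 1-D Gaussian mixture with a known, shared sigma."""
    rng = random.Random(seed)
    # Step 1: random initialization (means sampled from the data, equal weights),
    # unless the caller supplies initial means.
    mus = list(mus) if mus is not None else rng.sample(xs, k)
    taus = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # Step 2 (E-step): E(Z_ij) for every point, via Bayes' rule.
        # The Gaussian normalizing constant is shared and cancels in the ratio.
        # Note: points very far from every mean can underflow to total == 0.
        resp = []
        for x in xs:
            w = [taus[j] * math.exp(-(x - mus[j]) ** 2 / (2 * sigma ** 2))
                 for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # Step 3 (M-step): re-estimate means and weights using E(Z_ij) as soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            taus[j] = nj / len(xs)
    return mus, taus, resp

data = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
mus, taus, resp = em_gaussian_mixture(data)
print(sorted(mus), taus)
```

Unlike the classification variant, every point contributes to every mean, weighted by its expected membership.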

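Consider data points drawn from a mixture of two Gaussians with means μ₁ and μ₂ and a common variance σ². Assume each cluster is equally likely a priori, i.e. τ₁ = τ₂ = 1/2. For any data point x_i, the probability (given the μs) of the point together with its cluster indicators Z_i1, Z_i2 is given by

P(x_i, Z_i1, Z_i2 | μ₁, μ₂) = (1/2) (1/√(2πσ²)) exp( -(1/(2σ²)) Σ_j Z_ij (x_i - μ_j)² )

The joint probability for all the points is:

P = ∏_i P(x_i, Z_i1, Z_i2 | μ₁, μ₂)

The goal is to maximize this probability, which is equivalent to maximizing the log of the function:

log P = Σ_i [ log(1/(2√(2πσ²))) - (1/(2σ²)) Σ_j Z_ij (x_i - μ_j)² ]

Since the Z_ij are not observed, we instead maximize the expected value of log P, i.e. max E(log P), treating Z_i as a random variable drawn from the distribution defined by the current estimates μ₁^t, μ₂^t. Because log P is linear in the Z_ij:

E(log P) = Σ_i [ log(1/(2√(2πσ²))) - (1/(2σ²)) Σ_j E(Z_ij) (x_i - μ_j)² ]

Finding μ₁ and μ₂ that maximize E(log P) is equivalent to finding μ₁ and μ₂ that minimize

Σ_i Σ_j E(Z_ij) (x_i - μ_j)²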

The same technique can be used to estimate unknown τs and σs if they are not the same for each cluster.
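For the equal-variance case, the minimization over the means can be carried out in closed form. The following sketch assumes the quantity being minimized is the weighted sum of squares with the expectations E[Z_ij] held fixed at the values computed from the current means; the symbol Q is introduced here only for illustration.

```latex
\begin{align*}
Q(\mu_1,\mu_2) &= \sum_{i}\sum_{j=1}^{2} E[Z_{ij}]\,(x_i-\mu_j)^2 \\
\frac{\partial Q}{\partial \mu_j} &= -2\sum_{i} E[Z_{ij}]\,(x_i-\mu_j) = 0 \\
\mu_j^{t+1} &= \frac{\sum_i E[Z_{ij}]\,x_i}{\sum_i E[Z_{ij}]}
\end{align*}
```

Each updated mean is therefore a weighted average of the data, with each point weighted by its expected membership in that cluster.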

EM Algorithm (proof of convergence):

Let X be the visible data

Y the hidden data

θ, θ^t the parameters, where θ^t is the value of the parameters at time t
