Thursday November 1, 2001

**Lecturer**:
Larry Ruzzo

**Notes**:
Shabnam Erfani

Before we discuss model-based clustering today, we review some probability theory.

Suppose we have a model that indicates how the data are generated according to a certain probability distribution, say the normal distribution, denoted by:

Normal(μ, σ²), where μ is the mean and σ² is the variance.

The probability density function (pdf) of the normal distribution is:

f(x) = (1 / √(2πσ²)) e^(−(x−μ)²/(2σ²))

The pdf gives the probability that a sample falls in a small interval of width dx around x, namely f(x) dx. The normal pdf describes a bell-shaped curve that is centered at μ and has variance σ². These two parameters completely determine what the normal distribution looks like. Our problem is the reverse of this: given a set of data drawn from a normal distribution, what can we tell about the shape of the curve? In other words, what are the values of the parameters μ and σ² based on the observed data? The process of calculating μ and σ² from the sample data is called parameter estimation. Parameter estimation uses a function called the likelihood function:

L(X₁, X₂, …, Xₙ | θ₁, θ₂): the probability of the n samples assuming values X₁, X₂, …, Xₙ given parameters θ₁, θ₂; in this case θ₁ = μ and θ₂ = σ². In other words, this is the joint pdf of the random variables generating the sample points.

Assuming we have independent samples with normal distribution, the likelihood function is:

L(X₁, …, Xₙ | θ₁, θ₂) = ∏_{i=1}^{n} f(Xᵢ)

To estimate the parameters, we essentially pick candidate parameter pairs and try to maximize the likelihood function. The pair that produces the maximum likelihood is the best candidate for the solution.
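As a minimal sketch of this pick-and-maximize idea (the sample values and grid are made up for illustration), we can scan a grid of candidate means with the variance fixed at 1 and keep the candidate with the highest log-likelihood; the grid maximum lands at the sample mean:

```python
import math

# Hypothetical sample (illustrative values, not from the lecture).
xs = [4.8, 5.1, 5.3, 4.9, 5.2]

def log_likelihood(xs, mu, var):
    """Log-likelihood of i.i.d. normal samples given mean mu and variance var."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (x - mu) ** 2 / (2 * var) for x in xs)

# Try candidate means on a grid (variance fixed at 1) and keep the best.
candidates = [mu / 100 for mu in range(400, 601)]   # 4.00 .. 6.00
best_mu = max(candidates, key=lambda mu: log_likelihood(xs, mu, 1.0))

sample_mean = sum(xs) / len(xs)
print(best_mu, sample_mean)   # the grid maximum sits at the sample mean
```

In practice the grid search is replaced by the calculus below, but it shows the idea: each candidate parameter value is scored by the likelihood of the observed data.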

Let us try some simple examples to illustrate the point:

Suppose we have a normal distribution where we know the variance σ² = 1, but we do not know the mean. So the likelihood function becomes:

L(X | θ₁, θ₂ = 1) = ∏_{i=1}^{n} (1/√(2π)) exp( −(Xᵢ − θ₁)²/2 )

where X is the vector of random variables representing the samples.

To find the value that maximizes this function, we take its derivative and set it to zero. This may get us stuck at a local maximum unless the function is concave. Or we may have multiple values as solutions of the resulting equations; in that case we need to plug in the values one by one and see which one gives the maximum value of the likelihood function.

Since the likelihood function is an exponential, it is easier to take the natural log first and then the derivative:

(a)  ln L(X | θ₁) = Σ [ −(1/2) ln 2π − (Xᵢ − θ₁)²/2 ]

Taking the derivative of this function and equating to 0 results in:

d/dθ₁ ln L(X | θ₁) = 0 + Σ (Xᵢ − θ₁) = 0   (all summations are over i = 1…n)

⇒ Σ Xᵢ − n θ₁ = 0  ⇒  θ₁ = Σ Xᵢ / n

We see that the value that maximizes the likelihood function is the sample mean. Since E(sample mean) = population mean, this is an unbiased estimator: the expected value of the estimate equals the parameter being estimated.

Now let’s consider the same example, but with both θ₁ and θ₂ unknown. Using equation (a) above, with variance θ₂ in place of 1, we get:

ln L(X | θ₁, θ₂) = Σ [ −(1/2) ln(2π θ₂) − (Xᵢ − θ₁)²/(2θ₂) ]

Now we take the partial derivative with respect to each unknown and set to zero:

∂/∂θ₁ ln L(X | θ₁, θ₂) = 0 + Σ (Xᵢ − θ₁)/θ₂ = 0

∂/∂θ₂ ln L(X | θ₁, θ₂) = Σ [ −1/(2θ₂) + (Xᵢ − θ₁)²/(2θ₂²) ] = 0

⇒ θ₁ = Σ Xᵢ / n = X̄  and  θ₂ = Σ (Xᵢ − X̄)² / n = s²

Note: This is a biased estimator of θ₂, since E(s²) = ((n−1)/n) σ², which is not the population variance. However, as n → ∞ this estimate approaches the unbiased value, the population variance.
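A quick simulation illustrates the bias (the population parameters and sample size here are made-up numbers): averaging the estimate Σ(Xᵢ − X̄)²/n over many samples of size n = 5 gives roughly (n−1)/n · σ², not σ².

```python
import random

random.seed(0)
true_var = 4.0   # illustrative population variance (sigma^2)
n = 5

# Draw many samples of size n and average the MLE variance estimate
# theta_2 = sum((x - xbar)^2) / n  (the biased estimator derived above).
trials = 20000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(10.0, true_var ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n
avg_mle_var = total / trials

# Expected value is (n-1)/n * sigma^2 = 0.8 * 4.0 = 3.2, below the true 4.0.
print(avg_mle_var)
```

Dividing by n − 1 instead of n would remove the bias, which is why the "sample variance" is usually defined that way.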

Suppose we have a sample that is distributed as shown below:

[Sketch: sample points along a line, forming two separated clusters — a small one and a larger one.]

If we try to fit a single normal distribution to this data, we get a broad curve, since the data has high variance. Another possibility is that the data come from two Gaussian distributions, which gives two curves with lower variance, “mixed” according to some ratio.

**Mixture Model**:

We have two parameters, τ₁ and τ₂, known as the mixing parameters, where τ₁ + τ₂ = 1. We have two normal distributions: N(μ₁, σ₁²) and N(μ₂, σ₂²). A sample is drawn from the first distribution with probability τ₁ and from the second distribution with probability τ₂. This easily generalizes to k distributions.
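The two-step sampling process just described can be sketched in a few lines of Python (all parameter values below are illustrative assumptions, not from the lecture):

```python
import random

random.seed(1)

# Illustrative mixture parameters (assumptions for the sketch).
tau = [0.3, 0.7]                    # mixing parameters, tau_1 + tau_2 = 1
params = [(0.0, 1.0), (5.0, 1.5)]   # (mean, std dev) of each component

def draw():
    """Pick a component with probability tau_i, then sample from its normal."""
    i = 0 if random.random() < tau[0] else 1
    mu, sigma = params[i]
    return i, random.gauss(mu, sigma)

draws = [draw() for _ in range(10000)]
frac_first = sum(1 for i, _ in draws if i == 0) / len(draws)
print(frac_first)   # close to tau[0] = 0.3
```

Note that the component label i is known here only because we generated the data ourselves; with real data that membership is hidden, which is exactly the difficulty EM addresses.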

Now to perform parameter estimation, we need to know how to classify the sample points into the different distributions first and then calculate the mean and variance for each. However given the sample data we do not really know sample membership. We can use an algorithm called Expectation Maximization (EM) to estimate the parameters in this model, as we will see later.

Formally a mixture model is used to define a likelihood function as follows:

Parameters: τᵢ and fᵢ define the mixing probability and the pdf of each component distribution contributing to the sample space.

L(X₁, X₂, …, Xₙ | τ, θ) = ∏_{j=1}^{n} Σ_{i=1}^{k} τᵢ fᵢ(Xⱼ | θᵢ)

The EM algorithm provides a way to estimate the parameters of this likelihood function. We will discuss EM and model-based clustering in more detail in the next lecture.
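Given parameter values, the mixture likelihood above can be evaluated directly; here is a minimal sketch (the data points and both parameter settings are illustrative) computing the log of the product-of-sums form, and checking that parameters matching the two clusters score higher than parameters that do not:

```python
import math

def normal_pdf(x, mu, var):
    """Normal density f(x) for mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_log_likelihood(xs, tau, params):
    """ln L = sum over j of ln( sum over i of tau_i * f_i(x_j | theta_i) ),
    the log of the product-of-sums mixture likelihood defined above."""
    total = 0.0
    for x in xs:
        total += math.log(sum(t * normal_pdf(x, mu, var)
                              for t, (mu, var) in zip(tau, params)))
    return total

# Illustrative data near two centers (0 and 5); parameters are assumptions.
xs = [-0.2, 0.1, 0.3, 4.8, 5.1, 5.4]
good = mixture_log_likelihood(xs, [0.5, 0.5], [(0.0, 1.0), (5.0, 1.0)])
bad = mixture_log_likelihood(xs, [0.5, 0.5], [(2.0, 1.0), (3.0, 1.0)])
print(good > bad)   # parameters matching the clusters score higher
```

EM works by repeatedly improving the parameters so that this quantity increases.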

For now, let’s look at an example of applying model-based clustering to gene expression data.

Note: The example discussed in class came from Dr. Ruzzo’s paper: Model-Based Clustering and Data Transformations for Gene Expression Data, which can be found at http://www.cs.washington.edu/homes/ruzzo/

The presentation used in class can be found here

Also, there is supplemental material at http://www.cs.washington.edu/homes/kayee/model/