Thursday November 1, 2001
Lecturer: Larry Ruzzo
Notes: Shabnam Erfani
Before we discuss model-based clustering today, we review some probability theory.
Supposed we have a model that indicates how data is generated according to a certain probability distribution, say normal, denoted by:
Normal (μ, σ2): where μ is the mean, and σ2 is the variance.
The probability distribution function (PDF) for normal distribution is:
ƒ (x) = (1 ∕ √(2πσ2 ) ) e -(x- μ) 2/2σ2
The pdf gives the probability that a sample is in interval dx (or has value dx). The normal pdf describes a bell shaped curve that is centered at μ and has variance σ2. These 2 parameters exactly define what the normal distribution looks like. Our problem is reverse of this, given a set of data that have normal distribution, what can we tell about the shape of the curve? In other words, what are the values for parameters μ and σ2 based on the observed data? The process of calculating μ and σ2 from the sample data is called parameter estimation. Parameter estimation uses a function called the likelihood function:
L(X1, X2, …,Xn | θ1 , θ2): is the probability of n samples assuming values X1, X2, …,Xn given parameters θ1 , θ2 , in this case θ1 = μ , θ2 = σ2 , in other words, this is the joint pdf of the random variables generating the sample points.
Assuming we have independent samples with normal distribution, the likelihood function is: n
Π ƒ (Xi)
To estimate the parameters we essentially pick pairs of parameters and try to maximize the likelihood function. The pair that produces the max likelihood is the best candidate for the solution
Let us try some simple examples to illustrate the point:
Suppose we have a normal distribution where we know that variance σ2=1, but we do not know the mean. So the likelihood function becomes:
L(X | θ1 , θ2 =1 ) = Π (1/√2π) exp( -(X- θ1 )2/2 )
X here is the vector of random variables representing the samples
To find the value that maximizes this function, we take its derivative and set it to zero. This may get us stuck at local maxima unless we have a convex function. Or we may have multiple values as solutions of the resulting equations. In this case we need to plug in the values one by one and see which one gives the max value for the likelihood function.
Since the likelohood function is an exponential, it is easier to take the natural log and then the derivative:
a) ln ( L(X| θ1 ) ) = ∑ [ -(1/2)ln 2π – (Xi - θ1 )2/2 ]
Taking the derivative of this function and equating to 0 results in:
d/d θ1 (ln (L(X| θ1 )) ) = 0 - ∑ (Xi - θ1 ) =0 (all summations are over i=1…n)
Þ ∑ Xi - n θ1 = 0 Þ θ1 = ∑ Xi ¤ n
We see that the value that actually maximized the likelihood function is the same as the population mean, i.e. E(sample mean) = population mean. This is an unbiased estimation since the sample mean is the same as the population mean.
Now let’s consider the same example, but with both θ1 and θ2 unknown. Using equation a) above we get:
ln ( L(X| θ1, θ2 ) ) = ∑ [-(1/2) ln 2π θ2 – (Xi - θ1 )2/2 θ2 ]
Now we take the partial derivative with respect to each unknown and set to zero:
¶ ¤ ¶ θ1 ln ( L(X| θ1, θ2 ) ) = 0 + ∑ (Xi - θ1 ) / θ2 = 0
¶ ¤ ¶ θ2 ln ( L(X| θ1, θ2 ) ) = ∑ [ -1 / 2θ2 + (Xi - θ1 )2/2 θ2 ]
Þ θ1 = ∑ Xi ¤ n = `X and θ2 = ∑ (Xi - `X) 2 / n = s2
Note: This is a biased estimation for θ2, since it is not the expected value for sample variance. However, as n®µ this estimation moves toward the unbiased function, which is the population variance.
Suppose we have a sample that is distributed as shown below:
* ******* * * **********************
If we try to fit this data into a normal distribution, we get a broad curve since the data has high variance. Another possibility is that we have two Gaussian distributions, which results in two curves with lower variance, “mixed” according to some ratio.
We have two parameters: t1 and t2 known as the mixing parameters, where t1 + t2 =1. We have two normal distributions: N( m1, s12) and N( m2, s22). A sample is drawn from the first distribution with probability t1 and from the second distribution with probability t2. This can easily be scaled to n distributions.
Now to perform parameter estimation, we need to know how to classify the sample points into the different distributions first and then calculate the mean and variance for each. However given the sample data we do not really know sample membership. We can use an algorithm called Expectation Maximization (EM) to estimate the parameters in this model, as we will see later.
Formally a mixture model is used to define a likelihood function as follows:
Parameters: ti , ¦i define the probability and pdf of each random variable contributing to the sample space.
L(X1, X2, …,Xn | ti , ¦i (q) ) = P S ti ¦i (Xj | qi )
The EM function provides a way to estimate parameters for this likelihood function. We will discuss EM and model-based clustering in more detail in the next lecture.
For now let’s look at an example of application of model based clustering to gene expression.
Note: The example discussed in class came from Dr. Ruzzo’s paper: Model-Based Clustering and Data Transformations for Gene Expression Data, which can be found at http://www.cs.washington.edu/homes/ruzzo/
The presentation used in class can be found here
Also, there is supplemental material at http://www.cs.washington.edu/homes/kayee/model/