Lecture 11

Course: CSE 527 (Computational Biology)

Instructor: Larry Ruzzo

Thurday November 8, 2001

Notes: Kai Wang

1. cont. from last lecture

Definition of the relative entropy (also called “Kullback-Leibler divergence” or “information content”) for distributions P, Q:

This is a summary for how similar the two distributions are. Suppose P and Q are identical, then,

The more different P and Q are, the bigger is.

For finite sample space, the function will never be negative (). Here is how to prove this:

"x, we have lnx £ x –1

\ - lnx ³ 1 – x

\ ln(1/x) ³ 1 –x

Bear this in mind, now we come back to prove that :

Now, remember from last lecture:

For a general EM algorithm, we have the following:

Visible factor – x

Hidden factor – y

Parameters – q

The goal is to maximize likelihood estimate of q, that is, to find the q that maximize Pr(x|q), or log Pr(x|q)

, so we have

For "y,

So, given the parameter optimization of t iteration q^t

|_______________________|

What the EM algorithm does is that it try to find a q which can maximize the

The key trick here is: it is not hard to find that the latter part in the above formula

So, logP(x|q) –logP(x|q^t) ³ 0 as long as Q(q|q^t) - Q(q^t|q^t) ³ 0

Now we only need to find a q to maximize Q(q|q^t).

2. DNA sequences pattern model

It has been found that, DNA sequence can form certain amount of patterns – four nucleotides are arranged into similar ways to facility some cellular function, such as DNA-DNA interaction or DNA protein binding. For example, in most of genes, the TATA box is a common parttern lying in the promoter region:

A C G T A T A A T C G A …

G T C T A T A C T G T A

G C G T T T A A T A C T

alwaysT T A T

usually a a

This pattern is not absolutely strict - every gene may have some exception, but most genes have most of the pattern intact.

Question: How could we model a pattern?

One solution to this question is to look for other patterns that differ at only on character

- This turns out to be not good, because some positions may be more tolerant than the other.

The way people propose to do this is as follows:

	T	a	T	A	A	T
A	2	80	1
C	1	10	1
G	0	5	8
T	97	5	90

Left is a summary how similar are the patterns in different instance (although this kind of summary sometimes masks some possible associations).

We define each of these patterns a motif, which has following parameters:

l: motif length

q_ib: frequency of occurrence of base b in the ith position of a motif

For above table:

q_1A = 2%; q_1C = 1%; q_2A = 80%

Assuming that no correlation between positions,

For the sequence T . G . . . .

We can calculate the probability to get such a sequence: * ….

The conserved sequence will be the one has the highest probability. This gives us the probability of each string that the motif produces.

Also, we can compare the probabilities where the motif begins:

0.97*0.80*…

TATAAT

0.02*0.05*…

But what is the motif against?

- we need to setup a background model: the frequency of occurrence of each four nucleotide within the whole genome.

Lecture 11

To be continued…