Course:
CSE 527 (Computational Biology)
Instructor:
Larry Ruzzo
Thurday
November 8, 2001
Notes:
Kai Wang
1.
cont. from last lecture
Definition of the relative entropy (also called “KullbackLeibler divergence” or “information content”) for distributions P, Q:
_{}
This is a summary for how similar the two distributions are. Suppose P and Q are identical, then,
_{}
The more different P and Q are, the bigger _{}is.
For finite sample space, the function will never be negative (_{}). Here is how to prove this:
"x, we have lnx £ x –1
\  lnx ³ 1 – x
\ ln(1/x) ³ 1 –x
Bear this in mind, now we come back to prove that _{}:
_{}
Now, remember from last lecture:
For a general EM algorithm, we have the following:
Visible factor – x
Hidden factor – y
Parameters – q
The goal is to maximize likelihood estimate of q, that is, to find the q that maximize Pr(xq), or log Pr(xq)
_{} , so we have _{}
For "y,
_{}
So, given the parameter optimization of t iteration q^{t}
_{}
_______________________
_{}
\ _{}
What the EM algorithm does is that it try to find a q which can maximize the
_{}
The key trick here is: it is not hard to find that the latter part in the above formula
_{}
So, logP(xq) –logP(xq^{t}) ³ 0 as long as Q(qq^{t})  Q(q^{t}q^{t}) ³ 0
Now we only need to find a q to maximize Q(qq^{t}).
2.
DNA sequences pattern model
It has been found that, DNA sequence can form certain amount of patterns – four nucleotides are arranged into similar ways to facility some cellular function, such as DNADNA interaction or DNA protein binding. For example, in most of genes, the TATA box is a common parttern lying in the promoter region:
A C G T A T A A T C G A …
G T C T A T A C T G T A
G C G T T T A A T A C T
.
.
alwaysT T A T
usually a a
This pattern is not absolutely strict  every gene may have some exception, but most genes have most of the pattern intact.
Question: How could we model a pattern?
One solution to this question is to look for other patterns that differ at only on character
 This turns out to be not good, because some positions may be more tolerant than the other.
The way people propose to do this is as follows:

T 
a 
T 
A 
A 
T 
A 
2 
80 
1 



C 
1 
10 
1 



G 
0 
5 
8 



T 
97 
5 
90 



Left is a summary how similar are the patterns in different instance (although this kind of summary sometimes masks some possible associations).
We define each of these patterns a motif, which has following parameters:
l: motif length
q_{ib}: frequency of occurrence of base b in the ith position of a motif
For above table:
q_{1A} = 2%; q_{1C} = 1%; q_{2A} = 80%
Assuming that no correlation between positions,
For the sequence T . G . . . .
We can calculate the probability to get such a sequence: _{}*_{} ….
The conserved sequence will be the one has the highest probability. This gives us the probability of each string that the motif produces.
Also, we can compare the probabilities where the motif begins:
0.97*0.80*…
¯
TATAAT
0.02*0.05*…
But what is the motif against?
 we need to setup a background model: the frequency of occurrence of each four nucleotide within the whole genome.