Lecture n


Tuesday, December 4

Lecturer: Larry Ruzzo

Notes: Tobias Mann


Gene Prediction




A great deal of sequence data is available, and there is a need for automated sequence annotation. Gene prediction algorithms seek to determine which parts of a DNA sequence will be expressed as a protein.


Biological Background:


DNA is transcribed into mRNA, and the mRNA is then translated into proteins. Each amino acid in a protein is coded by a triplet of DNA bases (bi bi+1 bi+2), where each bj denotes an ‘A’, ‘T’, ‘G’, or ‘C’. These triplets are called codons. There are also codons that indicate that translation should start or stop. There are 64 distinct codons, which map to 20 different amino acids and the stop signal.
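As a small illustration of the codon count, the sketch below enumerates every triplet over the four bases and splits a sequence into codons. The names ALL_CODONS and split_codons are hypothetical helpers introduced here, not part of the lecture.

```python
from itertools import product

BASES = "ATGC"

# Every ordered triplet over four bases: 4**3 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def split_codons(dna, start=0):
    """Split dna into consecutive triplets beginning at index start.

    Trailing bases that do not fill a complete codon are dropped.
    """
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]


if __name__ == "__main__":
    print(len(ALL_CODONS))
    print(split_codons("ATGGCCTAA"))
```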


Reading Frame:

A DNA sequence S = [b1 b2 … bn] has three possible translations into a protein sequence, each of which is called a reading frame. (The complementary strand contributes three more.) The first translation starts at b1, and is the amino acid sequence determined by: [(b1 b2 b3) (b4 b5 b6) …]. The second translation starts at b2, and is the amino acid sequence determined by: [(b2 b3 b4) (b5 b6 b7) …]. The third translation starts at b3 and is the amino acid sequence determined by: [(b3 b4 b5) (b6 b7 b8) …]. An open reading frame (ORF) is a stretch of a reading frame with no stop codons.
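The three reading frames can be sketched as follows. This is a minimal illustration, assuming the standard stop codons TAA, TAG, and TGA and looking only at the given strand; reading_frames and longest_open_run are hypothetical names chosen for this example.

```python
# Standard stop codons, written as DNA triplets.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reading_frames(dna):
    """Return the three codon lists obtained by starting at b1, b2, and b3."""
    dna = dna.upper()
    return [
        [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        for offset in range(3)
    ]


def longest_open_run(codons):
    """Length, in codons, of the longest stretch containing no stop codon."""
    best = current = 0
    for codon in codons:
        if codon in STOP_CODONS:
            current = 0
        else:
            current += 1
            best = max(best, current)
    return best


if __name__ == "__main__":
    for offset, codons in enumerate(reading_frames("ATGAAATAGC")):
        print(offset, codons, longest_open_run(codons))
```

Scanning the complementary strand would require reverse-complementing the sequence first, which this sketch leaves out.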


Statistics of Sequences:


Random DNA should have a stop codon about every 21 codons (roughly 64 bases) on average, since 3 of the 64 codons are stops, but the average protein-coding region is much longer, on the order of 1,000 bases. This suggests that searching for long stretches of DNA with no stop codons might yield DNA sequences that are likely to be translated into proteins.
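The spacing of stop codons in random DNA can be worked out directly. The sketch below assumes that all 64 codons are equally likely and independent, so that the distance between stops in a fixed frame follows a geometric distribution; prob_open is a hypothetical helper name.

```python
N_CODONS = 64  # 4**3 possible triplets
N_STOPS = 3    # TAA, TAG, TGA

# Probability that a uniformly random codon is a stop.
p_stop = N_STOPS / N_CODONS

# Mean of a geometric distribution: expected codons per stop.
mean_codons_between_stops = 1 / p_stop


def prob_open(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - p_stop) ** n_codons


if __name__ == "__main__":
    print(mean_codons_between_stops)
    for n in (21, 100, 300):
        print(n, prob_open(n))
```

Under these assumptions the chance of a stop-free stretch falls off exponentially with its length, which is why long ORFs are unlikely to arise by chance.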


In random DNA, the ratio of Leucine:Alanine:Tryptophan should be about 6:4:1, because these amino acids are encoded by 6, 4, and 1 codons respectively, whereas proteins made by organisms have ratios that can differ significantly from this. In addition, some species are biased in their use of codons, using some of the synonymous codons for an amino acid more often than others.
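The 6:4:1 ratio follows from the number of codons assigned to each amino acid. The sketch below builds the standard genetic code from its usual compact 64-letter listing (bases ordered T, C, A, G, with '*' marking a stop) and counts the codons per amino acid; the variable names are choices made for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code in one-letter amino acid codes, codons enumerated
# with bases in the order T, C, A, G (third base varying fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons for each amino acid (and for stop).
DEGENERACY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    # L = Leucine, A = Alanine, W = Tryptophan
    print(DEGENERACY["L"], DEGENERACY["A"], DEGENERACY["W"])
```

Comparing these counts with the amino acid composition observed in real proteins, or comparing uniform usage of synonymous codons with observed codon counts, is one way to quantify the biases described above.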


An Algorithm For Predicting Reading Frames


Assume that codons are i.i.d., and that codon (b_i b_j b_k) has frequency f(b_i b_j b_k). Then a scheme to predict reading frames is to compute


P_i = f(b_{i+1} b_{i+2} b_{i+3}) * f(b_{i+4} b_{i+5} b_{i+6}) * f(b_{i+7} b_{i+8} b_{i+9}) * f(b_{i+10} b_{i+11} b_{i+12}) * …,   for i = 0, 1, 2,


where P_i can be interpreted as the probability of observing the sequence of codons read in frame i.


The most probable reading frame is the one whose codon usage most closely matches the distribution of codons in the organism whose DNA is being analyzed.
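The frame-scoring scheme above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: it sums log frequencies rather than multiplying probabilities (to avoid numerical underflow on long sequences), and it assigns a small floor value to codons missing from the table. The function names, the floor parameter, and the toy frequency table are all hypothetical.

```python
import math


def frame_log_prob(dna, offset, codon_freq, floor=1e-6):
    """Log of the product of codon frequencies for the frame starting at offset.

    codon_freq maps codon strings to frequencies; codons not in the table
    get the value floor (an assumption made for this sketch).
    """
    dna = dna.upper()
    total = 0.0
    for i in range(offset, len(dna) - 2, 3):
        total += math.log(codon_freq.get(dna[i:i + 3], floor))
    return total


def best_frame(dna, codon_freq):
    """Return (offset of the highest-scoring frame, list of the three scores)."""
    scores = [frame_log_prob(dna, offset, codon_freq) for offset in range(3)]
    return max(range(3), key=lambda k: scores[k]), scores


if __name__ == "__main__":
    # Toy, made-up frequencies for three codons; not real organism data
    # and not normalized over all 64 codons.
    toy_freq = {"GCT": 0.3, "AAA": 0.3, "CTG": 0.3}
    seq = "C" + "GCTAAACTG" * 3 + "AA"
    print(best_frame(seq, toy_freq))
```

The three frames can contain slightly different numbers of codons, so on short sequences the scores are not strictly comparable; dividing each score by its codon count is one possible adjustment.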


This works fairly well in prokaryotes, where most of the DNA codes for proteins and the ORFs are relatively long. However, note that not every ORF is expressed, and the situation is considerably more complicated for eukaryotes.




Another way to identify genes is to find promoters, which are sequences that are upstream of genes and can cause the genes to be expressed. A classic example is the consensus sequence ‘TATAAT’, which is about 10 bp upstream of transcription start sites in E. coli. The ‘TATAAT’ sequence is not perfectly conserved, and there are other promoter sequences as well.


Weight matrices, which express the probability of each base occurring at each position in the sequence, can be used to identify ‘TATAAT’ sequences. The similarity scores that weight matrices assign relative to the ‘TATAAT’ sequence have some correlation with the binding energy of RNA polymerase to that sequence. One reason for variation in the ‘TATAAT’ sequence may be that varying the binding affinity to RNA polymerase allows the level of expression of a gene to vary over several orders of magnitude: sequences closer to ‘TATAAT’ bind RNA polymerase more readily and are thus expressed more often.
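A weight matrix scan can be sketched as below. This is a minimal illustration under several assumptions: the aligned example sites are made up for this sketch and are not real E. coli promoters, the background base frequency is taken as uniform (0.25), a pseudocount of 1 avoids taking the log of zero, and the input contains only the letters A, C, G, and T. All function names are hypothetical.

```python
import math

# Made-up aligned sites resembling the 'TATAAT' consensus (illustration only).
EXAMPLE_SITES = [
    "TATAAT", "TATGAT", "TACAAT", "TATACT",
    "GATAAT", "TATAAT", "TAAAAT", "TATATT",
]


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping each base to a log2-odds score."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            p = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(p / background)
        matrix.append(scores)
    return matrix


def score_site(matrix, site):
    """Sum the per-position scores of a candidate site of matching width."""
    return sum(column[base] for column, base in zip(matrix, site.upper()))


def best_hit(matrix, dna):
    """Return (score, start index) of the highest-scoring window in dna."""
    width = len(matrix)
    windows = [
        (score_site(matrix, dna[i:i + width]), i)
        for i in range(len(dna) - width + 1)
    ]
    return max(windows)


if __name__ == "__main__":
    wm = build_weight_matrix(EXAMPLE_SITES)
    print(best_hit(wm, "GGCGCTATAATGCGC"))
```

Because the score is a sum of per-position log-odds terms, a site that differs from the consensus at one position still receives a fairly high score, which fits the observation that the ‘TATAAT’ sequence is not perfectly conserved.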