Lecture 16

Dec 6, 2001

Lecturer: Walter L. Ruzzo

Notes: Hao Mei


Gene Prediction in Eukaryotes


Gene expression is the biological process by which a DNA sequence gives rise to its product, a protein. It involves two steps: transcription and translation. In transcription, RNA polymerase uses one strand of DNA as a template and produces mRNA (messenger RNA). The mRNA sequence is complementary to the DNA strand that was used as the template. The subsequent process, translation, synthesizes the protein according to the information encoded in the mRNA. Translation is carried out by subcellular structures called ribosomes.
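The complementarity between template strand and mRNA can be illustrated with a minimal sketch. The function name and the example sequence are hypothetical, and the sketch ignores everything except base pairing (no promoter, no termination).

```python
# Map each DNA base on the template strand to the RNA base it pairs with.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_5to3: str) -> str:
    """Return the mRNA (written 5'->3') made from a template strand written 5'->3'."""
    # RNA polymerase reads the template 3'->5', so walk the string in reverse.
    return "".join(PAIRING[base] for base in reversed(template_5to3.upper()))

template = "TTACATGCCAT"  # hypothetical template strand, 5'->3'
mrna = transcribe(template)
print(mrna)
```

The resulting mRNA has the same sequence as the other (non-template) DNA strand, with U in place of T.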



In the mature mRNA, a triplet of bases, called a codon, represents an amino acid. Three codons (UAA, UAG and UGA) indicate the end of translation. One codon, AUG, indicates the start of translation and also codes for an amino acid (Met). Any given nucleotide sequence (a single DNA strand or an mRNA) can be read in three possible ways, depending on where the first codon begins. These three ways are called reading frames. An open reading frame (ORF) is a sequence of codons with no stop codon. Coding regions are the key features to look for when exploring a long, uncharacterized DNA sequence. Statistical signals such as codon frequency are important for identifying functional genes. In the last class, we discussed methods such as Markov models, weight matrices and codon frequencies for finding coding regions in prokaryotes.
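As a rough illustration of reading frames and ORFs as defined above, the following sketch lists the maximal stop-free runs of codons in each of the three frames of an mRNA string. The function name, the `min_codons` threshold and the example sequence are assumptions made for illustration; a real ORF finder would typically also require a start codon and scan both DNA strands.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def open_reading_frames(mrna: str, min_codons: int = 2):
    """Yield (frame, start, end) for maximal stop-free codon runs in each frame."""
    mrna = mrna.upper()
    for frame in range(3):
        run_start = frame
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        # Handle the stop-free run that reaches the end of the sequence.
        last_full = frame + ((len(mrna) - frame) // 3) * 3
        if (last_full - run_start) // 3 >= min_codons:
            yield frame, run_start, last_full

for frame, start, end in open_reading_frames("AUGAAAUAGCCC"):
    print(frame, start, end, "AUGAAAUAGCCC"[start:end])
```

The `end` coordinate is exclusive, so `mrna[start:end]` is the stop-free stretch itself, without the stop codon that terminates it.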


Eukaryote gene structure:


Today we continue our discussion of gene finding, now in eukaryotes. We will try to integrate multiple types of signal information in order to predict entire gene structures in genomic sequences. We will focus on the paper by Burge & Karlin. As a case study, we look at a computer program called GENSCAN, which uses a general probabilistic model for gene identification.


The gene structure and the gene expression mechanism in eukaryotes are far more complicated than in prokaryotes. As in prokaryotes, eukaryotic genes have signal sequences such as promoters and start/stop signals for transcription and translation, but these sequences may be more variable. Gene expression in eukaryotes also has new features. Transcription of DNA into pre-mRNA by RNA polymerase II takes place in the cell nucleus. After the nascent RNA molecule is produced by RNA polymerase II, the 5′ cap (7-methylguanosine) is added. Transcription by RNA polymerase II terminates at any one of multiple termination sites downstream from the poly(A) site, which is located at the 3′ end of the final exon. After the primary transcript is cleaved at the poly(A) site, a string of adenine (A) residues is added. During the final step in the formation of a mature, functional mRNA, the introns are removed and the exons are spliced together. The mature mRNA must then be transported into the cytoplasm, where translation takes place.


In typical eukaryotes, the region of DNA coding for a protein is usually not continuous. It is composed of alternating stretches of exons and introns. During transcription, both exons and introns are transcribed into RNA, in their linear order. Thereafter, a process called splicing takes place, in which the intron sequences are excised from the RNA and discarded. The cell recognizes introns largely by the GT and AG splice signals, which almost always occur as the first and last dinucleotides of an intron. The remaining RNA segments, the ones corresponding to the exons, are ligated to form the mature RNA strand.

A typical multi-exon gene has the following structure. It starts with the promoter region, which is followed by a transcribed but non-coding region called the 5' untranslated region (5' UTR). Then follows the initial exon, which contains the start codon. Following the initial exon there is an alternating series of introns and internal exons, followed by the terminal exon, which contains the stop codon. It is followed by another non-coding region called the 3' UTR. At the end of the gene there is a polyadenylation (polyA) signal, usually AATAAA. The transcript is cut 10-30 nucleotides after the polyadenylation signal, the downstream region is discarded, and a string of several hundred A's, called the poly-A tail, is added in its place.

The exon-intron boundaries (i.e., the splice sites) are signaled by specific short (2 bp long) sequences. The 5' end of an intron (the 3' end of the preceding exon) is called the donor site (splicing signal: GT), and the 3' end of an intron (the 5' end of the following exon) is called the acceptor site (splicing signal: AG). The branch point is an anchor point inside the intron, upstream of the acceptor site. Another statistical characteristic is a pyrimidine-rich (bases C, T) area that appears between the branch point and the acceptor site.

Probably about 35% of genes are alternatively spliced, which means that under different circumstances, different combinations of exons are selected.
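To make the splicing step concrete, here is a minimal sketch that removes introns from a transcript, given their coordinates, after checking the GT...AG boundary dinucleotides. The function name, the coordinate convention (half-open intervals) and the toy sequences are assumptions; the sketch works on the DNA alphabet used in these notes and does not try to find the introns itself.

```python
def splice(transcript: str, introns):
    """Remove introns given as (start, end) half-open intervals; return joined exons."""
    seq = transcript.upper()
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        intron = seq[start:end]
        # Check the canonical donor (GT) and acceptor (AG) dinucleotides.
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} lacks GT...AG boundaries")
        exons.append(seq[cursor:start])
        cursor = end
    exons.append(seq[cursor:])  # everything after the last intron
    return "".join(exons)

# Hypothetical two-exon gene: exon, intron (GT ... pyrimidine-rich ... AG), exon.
exon1, intron, exon2 = "ATGGCC", "GTAAGTCTAACTTTTCCCAG", "GATTAA"
gene = exon1 + intron + exon2
print(splice(gene, [(len(exon1), len(exon1) + len(intron))]))
```

Alternative splicing corresponds to calling such a function with different intron lists for the same transcript, so that different exon combinations survive.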


Some Statistical Features:

Example: vertebrate genes

On average, about 6 exons span a 30 kb long vertebrate gene. The average coding region is only about 1 kb long. Each exon is about 150 bp long. The promoter signal (the TATA box) is about 6 bp long and appears about 30 bp upstream of the transcription start site (TSS).
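A quick back-of-the-envelope check of these averages shows how small the coding part of a typical vertebrate gene is. The variable names are illustrative, and the calculation treats all exon sequence as coding, which slightly overstates the coding fraction because some exon sequence is UTR.

```python
exons_per_gene = 6
exon_length_bp = 150
gene_length_bp = 30_000

coding_bp = exons_per_gene * exon_length_bp   # 900 bp, i.e. roughly 1 kb
coding_fraction = coding_bp / gene_length_bp  # share of the gene covered by exons
print(coding_bp, f"{coding_fraction:.0%}")
```

So only about 3% of an average gene is exon sequence; the rest is mostly intron, which is why exon/intron structure dominates the eukaryotic gene-finding problem.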


High Variance:

There are huge deviations from the average eukaryotic gene structure. For example, the dystrophin gene is about 2.4 Mb long. The 26 exons of the blood coagulation factor VIII gene vary in size from 69 bp to 3106 bp. The gene spans about 186 kb in total. Its introns are up to 32.4 kb long. Intron 22 produces 2 transcripts unrelated to this gene, one from each strand.


More statistical features:

An average 5' UTR is 750 bp long, but it can be longer and span several exons (for example, in the MAGE family). On average, the 3' UTR is about 450 bp long, but examples exist where its length exceeds 4 kb (e.g., the gene for Kallmann syndrome).


Variation in overall gene size and intron size:


There is considerable variation in overall gene size and intron size. Many genes are over 100 kb long. The largest known example is the dystrophin gene (DMD), at 2.4 Mb. The variation in the size distribution of coding sequences and exons is less extreme, although there are some remarkable outliers. The titin gene has the longest currently known coding sequence at 80,780 bp; it also has the largest number of exons (178) and the longest single exon (17,106 bp).


Comparison of human, worm and fly genes:

-      Similar length of coding sequences.

-      Most internal exons fall within a common peak between 50 and 200 bp.

-      Intron size distributions differ substantially:

       o       worm and fly each have a reasonably tight distribution

       o       humans have a much broader distribution

-      Variation in intron size results in great variation in gene size.








GENSCAN: A Case Study

GENSCAN is a computer program for gene identification. The program uses a set of completely sequenced genes from GenBank as its training set.


The training data of GENSCAN include:

238 multi-exon genes

142 single exon genes

a total of 1492 exons

a total of 1254 introns

2.5 million base pairs


Important features of GENSCAN include:

GENSCAN identifies complete exon/intron structures of genes in genomic DNA.

It can predict multiple genes in a sequence and deal with partial as well as complete genes.

It can predict consistent sets of genes occurring on either or both DNA strands.

It can report both the optimal annotation and sub-optimal exons.


GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes. The program is also capable of indicating fairly accurately the reliability of each predicted exon.


Signal models

GENSCAN uses different signal models to model different functional units. One of these is the weight matrix model (WMM), in which every position of the signal has its own distribution over nucleotides, independent of the other positions. It is used to model polyadenylation signals, translation initiation signals, translation termination signals and promoters.
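A weight matrix of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not GENSCAN's implementation: the function names, the pseudocount, the uniform background and the tiny set of aligned example sites are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_wmm(sites, pseudocount=1.0):
    """Estimate one independent base distribution per position from aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    wmm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        wmm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wmm

def log_odds(wmm, window, background=0.25):
    """Score a window as log2 of signal probability over a uniform background."""
    return sum(math.log2(wmm[i][b] / background) for i, b in enumerate(window))

# Hypothetical aligned signal instances, all of length 6.
sites = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA", "AGTAAA"]
wmm = train_wmm(sites)
print(log_odds(wmm, "AATAAA") > log_odds(wmm, "CCGCGC"))
```

Because positions are treated as independent, the score of a window is just a sum of per-position terms, which makes it cheap to slide the matrix along a long genomic sequence.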


Weight Matrix Models (WMMs)

Polyadenylation signals are modeled as a 6 bp WMM (consensus: AATAAA).

A 12 bp WMM, beginning 6 bp prior to the initiation codon, is used for the translation initiation signal.

For the translation termination signal, one of the three stop codons is generated according to its observed frequency in the learning set and the next three nucleotides are generated according to a WMM.

Promoter: Since about 30% of eukaryotic promoters lack an apparent TATA signal, GENSCAN uses a split model in which a TATA-containing promoter is generated with probability 0.7 and a TATA-less promoter with probability 0.3. The TATA-containing promoter is modeled using a 15 bp TATA-box WMM and an 8 bp cap signal WMM. The length of the spacer between the two WMMs is generated uniformly from the range of 14 to 20 nucleotides, corresponding to a TATA-cap site distance of 30 to 36 bp, measured from the first T of the TATA-box matrix to the cap site (start of transcription). TATA-less promoters are simply modeled as intergenic null regions of 40 bp length.
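The split promoter model is generative, so it can be sketched as a sampling procedure. The 0.7/0.3 split, the 15 bp and 8 bp matrix lengths, the 14 to 20 nt spacer and the 40 bp TATA-less region come from the description above; everything else is an assumption made for illustration. In particular, the two WMMs here are uniform placeholders rather than trained matrices, and the spacer and null regions are drawn from a uniform base distribution.

```python
import random

BASES = "ACGT"

def sample_from_wmm(wmm, rng):
    """Draw one base per position from a list of per-position distributions."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in wmm
    )

def sample_promoter(tata_wmm, cap_wmm, rng):
    """Generate one (kind, sequence) pair under the split promoter model."""
    if rng.random() < 0.7:
        spacer_len = rng.randint(14, 20)  # uniform, both ends inclusive
        spacer = "".join(rng.choice(BASES) for _ in range(spacer_len))
        seq = sample_from_wmm(tata_wmm, rng) + spacer + sample_from_wmm(cap_wmm, rng)
        return "TATA", seq
    # TATA-less promoter: a 40 bp stretch from the (here uniform) null model.
    return "TATA-less", "".join(rng.choice(BASES) for _ in range(40))

uniform = {b: 0.25 for b in BASES}
tata_wmm = [uniform] * 15  # placeholder for a trained 15 bp TATA-box WMM
cap_wmm = [uniform] * 8    # placeholder for a trained 8 bp cap signal WMM
rng = random.Random(0)
kind, seq = sample_promoter(tata_wmm, cap_wmm, rng)
print(kind, len(seq))
```

A TATA-containing promoter sampled this way is always between 37 and 43 bp long (15 + spacer + 8). For prediction, the same model is used in the other direction: the probability of an observed sequence under each branch is computed instead of sampled.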


Maximal Dependence Decomposition (MDD)

Donor splice sites are modeled by maximal dependence decomposition. It is commonly observed that strong dependencies exist between non-adjacent as well as adjacent positions in the donor splice signal. The maximal dependence decomposition is designed to capture exactly these kinds of dependencies. (To be continued.)
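One standard way to quantify a dependency between two positions of an aligned signal is a chi-square statistic on the table of base pairs observed at those positions. The sketch below shows only that pairwise measure, on a made-up set of aligned sites; it is not the full MDD procedure, and the function name and data are hypothetical.

```python
from collections import Counter

def chi_square(sites, i, j):
    """Chi-square statistic for independence of the bases at positions i and j."""
    n = len(sites)
    row = Counter(s[i] for s in sites)           # base counts at position i
    col = Counter(s[j] for s in sites)           # base counts at position j
    pair = Counter((s[i], s[j]) for s in sites)  # joint counts
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n       # count expected under independence
            stat += (pair[(a, b)] - expected) ** 2 / expected
    return stat

# Hypothetical aligned sites: positions 0 and 2 always agree, position 1 varies freely.
sites = ["GAG", "GTG", "TAT", "TTT"]
print(chi_square(sites, 0, 2), chi_square(sites, 0, 1))
```

A large value for a pair of non-adjacent positions is exactly the kind of dependency that a weight matrix, with its independent positions, cannot represent.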



1.      Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997 Apr; 268(1):78-94

2.      Lander ES, et al. Initial sequencing and analysis of the human genome. Nature 2001 Feb 15; 409(6822):860-921

3.      http://linkage.rockefeller.edu/wli/gene/