Lecture 16
Dec 6, 2001
Lecturer: Walter L. Ruzzo
Notes: Hao Mei
Gene Prediction in Eukaryotes
Review:
Gene expression is the biological process by which a DNA sequence generates its
product, a protein. It involves two steps: transcription and translation. In
transcription, RNA polymerase uses one strand of DNA as a template and produces
mRNA (messenger RNA). The mRNA sequence produced is complementary to the DNA
strand that was used as the template. The subsequent process, translation,
synthesizes the protein according to the information coded in the mRNA. This
process is performed by subcellular structures called ribosomes.
In the mature mRNA, a triplet of bases, called a codon, represents an amino
acid. Three codons (UAA, UAG and UGA) indicate the end of translation. One
codon, AUG, indicates the start of translation and also codes for an amino acid
(Met). Any given nucleotide sequence (a single DNA strand or an mRNA) can be
read as codons in three possible ways, depending on where the first codon
begins. These three ways are called reading frames. An open reading frame (ORF)
is a sequence of codons with no stop codon. Coding regions are the key features
to look for in a long, otherwise featureless DNA sequence. Statistical signals
in the sequence (e.g., codon frequency) are important features for identifying
functional genes. In the last class, we also discussed methods such as Markov
models, weight matrices and codon frequencies for finding coding regions in
prokaryotes.
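The reading-frame idea from the review can be sketched in a few lines of Python. This is only an illustration, not a method from the lecture: the names `find_orfs` and `min_codons` are made up here, the sequence is assumed to be one DNA strand (so AUG, UAA, UAG, UGA appear as ATG, TAA, TAG, TGA), and an ORF is reported only when it runs from an ATG to the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, UGA
START_CODON = "ATG"                  # DNA form of AUG

def find_orfs(seq, min_codons=2):
    """Return (frame, start, end) for ATG...stop stretches on one strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons is the minimum number of codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):               # the three reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None             # look for the next ATG in this frame
    return orfs

# Toy sequence: the only ATG...stop stretch lies in frame 1.
print(find_orfs("CATGCCCTGAT"))
```

Scanning the reverse strand as well would give the remaining three frames of a double-stranded sequence; that step is left out to keep the sketch short.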
Eukaryote gene structure:
Today we continue our discussion of gene finding, now in eukaryotes. We will
try to integrate multiple types of signal information in order to predict
entire gene structures in genomic sequences. We will focus on the paper by
Burge & Karlin (reference 1). As a case study, we look at a computer program
called GENSCAN, which uses a general probabilistic model for gene
identification.
The gene structure and the gene expression mechanism in eukaryotes are far more
complicated than in prokaryotes. As in prokaryotes, eukaryotic genes have
signal sequences such as promoters and start/stop signals for transcription and
translation, but these sequences may be more variable. Gene expression in
eukaryotes also has some additional features. Transcription of DNA into
pre-mRNA by RNA polymerase II takes place in the cell nucleus. Once RNA
polymerase II has begun producing the nascent RNA molecule, a 5′ cap
(7-methylguanosine) is added to it. Transcription by RNA polymerase II
terminates at any one of multiple termination sites downstream from the poly(A)
site, which is located at the 3′ end of the final exon. After the primary
transcript is cleaved at the poly(A) site, a string of adenine (A) residues is
added. In the final step in the formation of a mature, functional mRNA, the
introns are removed and the exons are spliced together. The mature mRNA must
then be transported into the cytoplasm for translation.
In typical eukaryotes, the region of DNA coding for a protein is usually not
continuous. It is composed of alternating stretches of exons and introns.
During transcription, both exons and introns are transcribed into RNA, in their
linear order. Thereafter, a process called splicing takes place, in which the
intron sequences are excised from the RNA and discarded. The cell identifies
introns largely by the GT and AG splice signals that almost always occur as the
first and last dinucleotides of an intron. The remaining RNA segments, the ones
corresponding to the exons, are ligated to form the mature RNA strand.
A typical multi-exon gene has the following structure. It starts with the
promoter region, which is followed by a transcribed but non-coding region
called the 5' untranslated region (5' UTR). Then follows the initial exon,
which contains the start codon. Following the initial exon, there is an
alternating series of introns and internal exons, followed by the terminal
exon, which contains the stop codon. It is followed by another non-coding
region called the 3' UTR. Near the end of the gene there is a polyadenylation
(polyA) signal, usually AATAAA. The transcript is cleaved 10-30 nucleotides
downstream of this signal, the remainder is discarded, and a string of several
hundred A's, called the poly-A tail, is added in its place.
The exon-intron boundaries (i.e., the splice sites) are signaled by specific
short (2 bp long) sequences. The 5' end of an intron (equivalently, the 3' end
of the preceding exon) is called the donor site (splicing signal: GT), and the
3' end of an intron (the 5' end of the following exon) is called the acceptor
site (splicing signal: AG). The branch point is another signal, located inside
the intron upstream of the acceptor site. A further statistical characteristic
is a pyrimidine-rich (bases C, T) region that appears between the branch point
and the acceptor site. An estimated 35% of genes are alternatively spliced,
which means that under different circumstances different combinations of exons
are selected.
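The splicing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how the cell or GENSCAN works: the function name `splice` and the half-open `(start, end)` intron coordinates are conventions chosen here, the intron positions are assumed to be known in advance, and the sequence is written in the DNA alphabet so the donor and acceptor signals appear as GT and AG.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open intervals and join the exons.

    Raises ValueError if an intron does not begin with GT and end with AG.
    """
    seq = pre_mrna.upper()
    exons = []
    pos = 0
    for start, end in sorted(introns):
        intron = seq[start:end]
        # Donor site: first dinucleotide of the intron; acceptor site: last one.
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} lacks GT...AG boundaries")
        exons.append(seq[pos:start])   # exon preceding this intron
        pos = end
    exons.append(seq[pos:])            # final exon
    return "".join(exons)

# Toy transcript (made up): exon, one 12 bp intron, exon.
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
print(splice(toy, [(6, 18)]))
```

Passing different intron lists for the same transcript is one crude way to picture alternative splicing: different combinations of exons end up in the mature sequence.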
Some Statistical Features:
The example of vertebrate genes:
On average, about 6 exons span a 30 kb long vertebrate gene. The average
coding region is only about 1 kb long. Each exon is about 150 bp long. The
core promoter signal (the TATA box) is about 6 bp long and appears about 30 bp
upstream of the transcription start site (TSS).
High Variance:
There are huge deviations from the average eukaryotic gene structure. For
example, the dystrophin gene is about 2.4 Mb long. The 26 exons of the blood
coagulation factor VIII gene vary in size from 69 bp to 3106 bp. The gene as a
whole spans about 186 kb, and its introns are up to 32.4 kb long. Intron 22
produces 2 transcripts unrelated to this gene, one from each strand.
More statistical features:
An average 5' UTR is 750 bp long, but it can be longer and span several exons
(for example, in the MAGE family). On average, the 3' UTR is about 450 bp long,
but examples exist where its length exceeds 4 kb (e.g., the gene for Kallmann
syndrome).
Variation in overall gene size and intron size:
There is considerable variation in overall gene size and intron size. Many
genes are over 100 kb long. The largest known example is the dystrophin gene
(DMD), at 2.4 Mb. The variation in the size distribution of coding sequences
and exons is less extreme, although there are some remarkable outliers. The
titin gene has the longest currently known coding sequence at 80,780 bp; it
also has the largest number of exons (178) and the longest single exon
(17,106 bp).
Comparison of human, worm and fly genes:
- Similar length of coding sequences.
- Most internal exons fall within a common peak between 50 and 200 bp.
- Intron size distributions differ substantially:
    o Worm and fly each have a reasonably tight distribution.
    o Humans have a much broader distribution.
- Variation in intron size results in great variation in gene size.
GENSCAN: A Case Study
GENSCAN is a computer program for gene identification. Its parameters were
estimated from a learning set of completely sequenced genes from GenBank.
The training data for GENSCAN include:
- 238 multi-exon genes
- 142 single-exon genes
- a total of 1492 exons
- a total of 1254 introns
- 2.5 million base pairs
Important features of GENSCAN include:
- It identifies complete exon/intron structures of genes in genomic DNA.
- It can predict multiple genes in a sequence and can deal with partial as
  well as complete genes.
- It can predict consistent sets of genes occurring on either or both DNA
  strands.
- It can predict both the optimal annotation and sub-optimal exons.
GENSCAN is shown to have substantially higher accuracy than existing methods
when tested on standardized sets of human and vertebrate genes. The program is
also capable of indicating fairly accurately the reliability of each predicted
exon.
Signal models
GENSCAN uses different signal models for different functional units. One of
them is the weight matrix model (WMM), in which every position of the signal
has its own distribution over nucleotides, independent of the other positions.
It is used for modeling polyadenylation signals, translation initiation
signals, translation termination signals and promoters.
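To make the weight matrix idea concrete, here is a small sketch of how a WMM could be estimated from aligned example sites and used to score windows of a sequence. It is not GENSCAN's code or parameters: the function names, the pseudocount, the uniform background of 0.25 per base, and the toy training sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_wmm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites."""
    length = len(sites[0])
    wmm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}   # pseudocount avoids zeros
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        wmm.append({b: counts[b] / total for b in BASES})
    return wmm

def log_odds(wmm, window, background=0.25):
    """Sum of per-position log2(p_signal / p_background); positions are independent."""
    return sum(math.log2(col[base] / background) for col, base in zip(wmm, window))

def best_hit(wmm, seq):
    """Return (score, offset) of the highest-scoring window in seq."""
    w = len(wmm)
    return max((log_odds(wmm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))

# Toy 6 bp sites (made up for illustration, not real training data).
toy_sites = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA"]
wmm = build_wmm(toy_sites)
score, offset = best_hit(wmm, "GGCCAATAAAGGCC")
print(offset, round(score, 2))
```

Because the score is a plain sum over positions, the model cannot express a preference that depends on what occurs at another position, which is the limitation that motivates richer signal models.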
Weight Matrix Models (WMMs)
- Polyadenylation signals are modeled as a 6 bp WMM (consensus: AATAAA).
- A 12 bp WMM, beginning 6 bp prior to the initiation codon, is used for the
  translation initiation signal.
- For the translation termination signal, one of the three stop codons is
  generated according to its observed frequency in the learning set, and the
  next three nucleotides are generated according to a WMM.
- Promoter: About 30% of eukaryotic promoters lack an apparent TATA signal, so
  GENSCAN uses a split model in which a TATA-containing promoter is generated
  with probability 0.7 and a TATA-less promoter with probability 0.3. The
  TATA-containing promoter is modeled using a 15 bp TATA-box WMM and an 8 bp
  cap signal WMM. The length between the two WMMs is generated uniformly from
  the range of 14 to 20 nucleotides, corresponding to a TATA-cap site distance
  of 30 to 36 bp, measured from the first T of the TATA-box matrix to the cap
  site (start of transcription). TATA-less promoters are simply modeled as
  intergenic null regions of 40 bp length.
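The split promoter model can be pictured as a small generative procedure. The sketch below samples only the layout (segment lengths), since the actual matrix parameters are not given in these notes. The function name `sample_promoter_layout` and the dictionary keys are made up for illustration. The fixed offset of 16 is inferred from the correspondence stated above (spacer 14 gives distance 30, spacer 20 gives distance 36) and is not a separately sourced value.

```python
import random

TATA_WMM_LEN = 15        # bp, TATA-box weight matrix
CAP_WMM_LEN = 8          # bp, cap signal weight matrix
SPACER_RANGE = (14, 20)  # bp between the two matrices, inclusive
TATA_LESS_LEN = 40       # bp, intergenic null region
P_TATA = 0.7             # probability of a TATA-containing promoter

def sample_promoter_layout(rng):
    """Sample segment lengths for one promoter under the split model."""
    if rng.random() < P_TATA:
        spacer = rng.randint(*SPACER_RANGE)   # uniform over 14..20 inclusive
        return {
            "kind": "TATA",
            "spacer": spacer,
            "length": TATA_WMM_LEN + spacer + CAP_WMM_LEN,
            # 14 -> 30 and 20 -> 36 in the text imply a constant offset of 16
            "tata_to_cap": spacer + 16,
        }
    return {"kind": "TATA-less", "spacer": None,
            "length": TATA_LESS_LEN, "tata_to_cap": None}

rng = random.Random(0)   # fixed seed so the run is repeatable
for _ in range(3):
    print(sample_promoter_layout(rng))
```

Under these lengths a TATA-containing promoter occupies 37 to 43 bp in total, while a TATA-less one always occupies 40 bp.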
Maximal dependence decomposition (MDD)
Donor splice sites are modeled by maximal dependence decomposition. A common
observation is that there are strong dependencies between non-adjacent as well
as adjacent positions in the donor splice signal, which a WMM, with its
independent positions, cannot represent. Maximal dependence decomposition is
designed to capture exactly these kinds of dependencies. (to be continued)
References:
1. Chris Burge, Samuel Karlin. Prediction of Complete Gene Structures in Human
Genomic DNA. Journal of Molecular Biology, Vol. 268, No. 1, Apr 1997, pp. 78-94.
2. Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature, Vol. 409, No. 6822, Feb 15 2001, pp. 860-921.
3. http://linkage.rockefeller.edu/wli/gene/