CSE527, Lecture 14, Nov 13, 2006
Zach Frazier

Administration: The homework grades will be emailed soon. There will be a new homework soon.

Gene splicing: This is a complex process, with many different RNA molecules and proteins involved. The data given is from human, but is characteristic of most mammals. Splicing is common in most eukaryotes, but the statistics vary across genera. Gene finding is a hard problem, like looking for a needle in a haystack. Human introns can be very long, and their length distribution is more skewed than in other species. There is a lot of speculation about the origins of intronic DNA. One hypothesis, the exon-shuffling hypothesis, suggests a link between protein domains or modules and exons: separating small functional or useful regions into exons makes it more likely that random rearrangements produce functional proteins. Scientists speculate that introns arose over one billion years ago, based on homologous exon/intron segments found in both plants and humans, suggesting their presence in the common ancestor. Bacteria do not have introns, but they are under very strict evolutionary pressure, and may have had introns and lost them to gain efficiency. The self-splicing introns seen in Tetrahymena (and elsewhere) are believed to be precursors to modern intron splicing systems. Tetrahymena is a very advanced species with many interesting features: it can reproduce sexually or asexually, and it maintains two distinct nuclei.

GC content varies between exons and introns. G and C frequencies are often very similar on a single strand; one proposed reason is flipping (inversion) of segments during mutation, which keeps the two pairing bases at similar frequencies. Most of the human genome is about 38% GC, and genes most often have higher GC content. Megabase-scale regions of relatively uniform GC content are known as isochores. These are observed mostly in vertebrates, and have been reported in mammals and birds, but not fish. One hypothesis suggests that the genome used to be AT-rich, except for the ends of the chromosomes, which were GC-rich, and that over time the current pattern emerged through chromosomal rearrangement. A piece of supporting evidence can be found in the rodent genome, which is more homogeneous than the human genome; rodents also reproduce more rapidly. There is also debate about what the evolutionary pressure would be. One theory is that DNA damage repair is biased toward converting T back to C when mismatched pairs are detected, since C-to-T is a common mutation caused by methylation, and that this repair bias has overcompensated. From the figures: GC-rich regions tend to hold more genes; intron lengths are correlated with GC content, while exon lengths are not.

The GenScan program is a gene finding program. There used to be many gene finding systems that used AI or expert systems, but these have fallen out of favor as statistical methods proved more accurate. The training set used was very limited, but very good for the time. In their results, sensitivity is measured as (true positives)/(actual positives) = TP/(TP+FN), and specificity is measured as (true positives)/(predicted positives) = TP/(TP+FP). ME: missed exons. WE: wrong exons (mismatched boundaries?). They used a Generalized Hidden Markov Model (GHMM). This differs from the standard HMMs that we have seen in that, instead of emitting single symbols, the states can emit arbitrary strings (which need not all be the same length). The standard HMM algorithms, such as Viterbi, can still be used with some modifications.
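To make the GHMM idea more concrete, here is a minimal sketch (mine, in Python, not from the paper) of a generalized Viterbi decoder in which each state emits a whole variable-length segment. The function name, the seg_logprob scoring callback, and the max_len bound are simplifying assumptions for illustration; GenScan's actual parser also handles signal vs. content states, both strands, and reading-frame constraints, which are omitted here.

```python
from math import inf

def generalized_viterbi(x, states, pi, trans, seg_logprob, max_len):
    """Sketch of Viterbi decoding for a generalized HMM: each state emits a
    variable-length segment, so the recurrence also maximizes over the
    segment length d (hence the extra inner loop and the max_len bound).

    x                   -- observed sequence (a string)
    states              -- list of state names
    pi[s]               -- log probability of starting in state s
    trans[r][s]         -- log probability of a transition r -> s
    seg_logprob(s, seg) -- log prob that state s emits segment seg
                           (length distribution times sequence model)
    max_len             -- longest segment any state may emit
    """
    n = len(x)
    # V[t][s]: best log prob of a parse of x[:t] whose last segment is
    # emitted by state s; back[t][s] remembers (previous state, length).
    V = [{s: -inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for t in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_len, t) + 1):
                emit = seg_logprob(s, x[t - d:t])
                if t - d == 0:            # first segment of the parse
                    prev, score = None, pi[s] + emit
                else:                     # best predecessor ending at t - d
                    prev = max(states, key=lambda r: V[t - d][r] + trans[r][s])
                    score = V[t - d][prev] + trans[prev][s] + emit
                if score > V[t][s]:
                    V[t][s], back[t][s] = score, (prev, d)

    # Trace back the best-scoring parse as a list of (state, segment) pairs.
    s, t, parse = max(V[n], key=V[n].get), n, []
    while t > 0:
        prev, d = back[t][s]
        parse.append((s, x[t - d:t]))
        t, s = t - d, prev
    return list(reversed(parse))
```

The extra maximization over the segment length d is what makes GHMM decoding more expensive than ordinary Viterbi; bounding segment lengths, or using geometric length distributions (e.g. for introns, as discussed below), keeps it manageable.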
An example trace through the model: Start at N, the intergenic region. We emit a typical intergenic region drawn from the length distribution observed in the training data. Next we move to the promoter state, where we might emit a TATA-box. From there we move to the 5' UTR, where we emit a typical UTR, again based on the training data. Then we might pass through one or more exons (and introns), and eventually the polyadenylation signal. There are three exon and three intron states; these correspond to the three phases of the reading frame, and ensure that the reading frame is preserved when intron regions are modeled. If the gene is on the opposite strand, then we have to recognize everything backwards: we would begin with the reverse complement of the polyadenylation signal, followed by a reversed 3' UTR, etc.

Sub-models: The intron length distribution is approximately geometric. The initial exon is distinct enough from the others that it is modeled separately. For the exon length distributions they use a smoothed distribution of observed lengths. The 3' UTR is modeled with a 5th-order Markov model; exon sequence is also modeled with a 5th-order Markov model. Weight matrices (WMMs) are used for some sub-models, such as the polyadenylation signal, where the consensus is AATAAA but other sequences have non-zero probability. The translation start signal covers 12 base pairs, starting 6 before the start codon, and is also modeled with a WMM. Translation stop is modeled only as the stop codon plus a few additional base pairs. Promoters are broken down into categories. Since about 30% of human genes appeared to lack a TATA-box, there is a 30% probability that the model generates a 40 bp sequence according to the background distribution. The other 70% corresponds to typical promoter regions: a 15 bp WMM for the TATA-box, followed by a spacer whose length is uniform between 14 and 20 bp, and then an 8 bp WMM corresponding to the cap site (transcription start).

Intron "Sequence Logos": The vertical black lines represent the start and end of the intron. The relative heights of the letters represent their frequencies, while the total height of the column represents the information content of that position (tall where the entropy is low). The beginning of the intron, for example, has several low-entropy positions where the nucleotide is almost certain to be present, as represented by the large G and T. At the other end we can see the polypyrimidine tract of C's and T's. The information present does not appear to be enough of a signal to precisely identify splice sites; there are probably more signals, they just have not been recovered yet. For the 5' (donor) splice site there was too much dependence between non-adjacent columns to use a first-order model, so they use a decision tree. In this model the splice site is at position 0, and the column numbers are given relative to the splice site. Positions +1 and +2 are not given, since they are essentially always G and T, respectively. The \chi^2 test is used to measure the independence of two different positions. In the table, high numbers indicate non-independence. (More precisely, high values are increasingly unlikely if the positions are independent.) Similarly, if the \chi^2 is near 0, the two positions are assumed to be independent. Most values in the table are interesting, in that they are significant according to the p-value.
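As a small concrete illustration of this \chi^2 test (in the consensus-vs-column form described in the next paragraph), here is a Python sketch. The toy sequences, the chosen positions, and the helper name are made up for the example; scipy's chi2_contingency performs the actual test.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def consensus_vs_column(sites, i, consensus_i, j):
    """Hypothetical helper: test whether matching the consensus base at
    position i is independent of the base observed at position j.
    `sites` is a list of aligned donor-site strings (0-based indexing)."""
    rows = {True: Counter(), False: Counter()}
    for s in sites:
        rows[s[i] == consensus_i][s[j]] += 1
    # 2 x k contingency table: rows = matches consensus at i (yes / no),
    # columns = the bases actually observed at position j.
    bases = sorted({s[j] for s in sites})
    table = [[rows[m][b] for b in bases] for m in (True, False)]
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p

# Toy usage with a few made-up 9 bp donor sites (3 exonic bases, then the
# intron starting with the invariant GT).  Index 7 is intron position +5
# (the "most important" column in the notes); index 2 is exon position -1.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CCGGTAAGT", "GAGGTAAGA",
         "CACGTAATT", "TACGTGACT", "AACGTAAAT", "TTGGTAAAT"]
chi2, p = consensus_vs_column(sites, i=7, consensus_i="G", j=2)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # large chi2 (small p) suggests dependence
```

Summing such statistics for one position against all the others gives the row sums that rank the columns by importance, as described next.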
The calculation was not done in an all-vs-all way for all nucleotides in all positions, because there was not enough data. Instead, the consensus value for each column was compared against all values in the other columns; the consensus values that were used are given in the first column of the table. The sums on the right side provide a rough measure of which columns are more important. The U1 small nuclear RNA is aligned below. It is critical in splicing, and it is assumed that there must be a strong base-pairing match to (the complement of) this molecule. Intuitively, it seems plausible that a mismatch in one part of the pairing must be made up elsewhere, and the non-independence data in the table roughly supports this view. The most important columns appear first in the decision tree. Since the 5th column is the most important, we branch on it first. After the split on this column, we recalculate the column importance within each subset and continue down the tree. We stop when all values are below the \chi^2 cutoff, or when the counts are too small to continue. (A rough sketch of this splitting procedure is given at the end of these notes.)

The overall success of GenScan is due to its thorough and careful construction. The authors were careful not to cut corners, and used the most complex model they could justify while avoiding over-fitting. The GHMM is a very powerful and generic framework for developing such applications.

Issues with Training Data: Single-exon genes are over-represented. Highly expressed genes are over-represented. Pseudo-genes are a problem: mouse has 5000 olfactory receptors; humans have 100, but pseudo-genes can be found for 1000 more, still recognizable after 65 million years. There may also be 10,000 or more non-coding RNA genes which these methods will not recognize.
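Returning to the donor-site decision tree: below is a rough sketch of that greedy splitting procedure (my own Python reconstruction, in the spirit of GenScan's maximal dependence decomposition, not the authors' code). The \chi^2 cutoff, the minimum subset size, and the function names are placeholder assumptions; positions are 0-based indices into aligned site strings, and consensus is a string of the same length as each site.

```python
from collections import Counter
from scipy.stats import chi2_contingency

CHI2_CUTOFF = 16.3   # placeholder significance cutoff
MIN_SITES   = 200    # placeholder: stop splitting when the subset gets too small

def chi2_consensus_vs_column(sites, i, j, consensus):
    """Chi-square statistic between 'matches the consensus at i' and the
    base at position j (same consensus-vs-column test sketched earlier)."""
    rows = {True: Counter(), False: Counter()}
    for s in sites:
        rows[s[i] == consensus[i]][s[j]] += 1
    bases = sorted({s[j] for s in sites})
    if len(bases) < 2 or not rows[True] or not rows[False]:
        return 0.0                      # nothing to test on this subset
    table = [[rows[m][b] for b in bases] for m in (True, False)]
    return chi2_contingency(table)[0]

def build_tree(sites, consensus, positions):
    """Greedy splitting: branch on the position whose consensus match is most
    dependent on the rest, then recurse on the two subsets; stop when no
    dependence passes the cutoff or the data get too thin."""
    scores = {i: sum(chi2_consensus_vs_column(sites, i, j, consensus)
                     for j in positions if j != i)
              for i in positions}
    if (not scores or max(scores.values()) < CHI2_CUTOFF
            or len(sites) < MIN_SITES):
        return {"leaf": sites}          # this subset gets its own WMM
    i = max(scores, key=scores.get)     # the most "important" column
    rest = [p for p in positions if p != i]
    return {"split_on": i,
            "consensus": build_tree([s for s in sites if s[i] == consensus[i]],
                                    consensus, rest),
            "other":     build_tree([s for s in sites if s[i] != consensus[i]],
                                    consensus, rest)}
```

Each leaf would then be fit with its own weight matrix, giving splice-site models conditioned on the consensus matches along the path from the root; with real donor-site data the first split falls on position +5, as noted above.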