CSE P590A, Su '06: Assignment #5, Due: 8/17/06

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Reading: See Schedule page.

Homework:

You have a choice this week: do any one of the following problems. Tackle two or more for extra credit.

Prediction of Protein Coding Genes: In lecture 7, I described 2 basic ideas that are exploited in nearly all gene finding programs. First, look for long ORFs; second, look for ORFs having sequence statistics (e.g., codon usage frequencies) that are more typical of known genes than of non-coding DNA. An ORF is just a (maximal length) sequence of consecutive nucleotide triplets, only the last of which is a stop codon (TAA, TAG, or TGA). ORF's are easy to find. Starting at the leftmost triplet, scan triplet by non-overlapping triplet until you find a stop; then continue until you find another stop, etc. Then repeat the whole process from the left end again, but omitting he very first or first two nucleotides (to account for the other two possible reading frames. [One should also do it from the right end on the Watson-Crick complement of the sequence, but for this assignment, in the interest of simplicity, we're going to completely ignore that. There are several other over-simplifications as well, so that, hopefully, the key ideas will shine through.]
The first part of this problem is to write a program to do the above process. Let's again use the M. jannaschii sequence as an example. Also, download the Genbank annotation file NC_000909.gbk (from the same place as in last week's assignment). In it you will find the coordinates of all the presumed protein-coding genes, each identified by a line like
```
     CDS             complement(2216..3343)
```
or
```
     CDS             4911..5381
```
"CDS" is shorthand for "coding sequence", and the numbers are its start and end positions, up to and including the last nucleotide of the stop codon. The numbering scheme counts the first position of the genome as 1, not zero. The "complement" entries are on the reverse strand; you may ignore them, as well as a handful of other odd cases.
Perhaps the simplest plausible gene prediction approach is to guess that every ORF longer than some fixed threshold is a gene. How good it this? What threshold? Use your ORF finder to examine these questions, by comparing ORFs to the Genbank annotations. Correctly predicting the exact starting positions of genes is difficult, but matching the other end is easier - it will be a stop codon. Again to (over-) simplify the problem, we will declare an ORF to be a successful gene prediction if its stop codon exactly matches the position of a stop codon in one of the Genbank genes. Except in rare cases, which we will again ignore, each such ORF will be at least as long as the real gene.
You should print out a simple "histogram" with one line for each ORF length showing the number of ORFs of that length in the genome, and the number of them that match "real" genes, as above. Only print lines having non-zero counts.
I think you will see that this simple method is quite accurate for long genes (given my simplified definitions, it cannot fail to find every gene, and above some length threshold it rarely makes false predictions). For shorter genes, however, its rate of false predictions rises. Can we do better by exploiting sequence statistics? The answer is yes, and here's one simple approach (which is a little more flexible than simply looking at codon bias). A k-th order Markov model predicts successive letters in a sequence based on the previous k letters (or fewer, at the beginning of the sdequence). Specifically, for k=3, say, the probability of sequence x = x₁ x₂ ... x_n is
```
  P(x) = P(x₁) * P(x₂ | x₁) * P(x₃ | x₁x₂) * P(x₄ | x₁x₂x₃) * P(x₅ | x₂x₃x₄) ... P(x_n | x_n-3x_n-2x_n-1)
```
Compare to the definition of 1st order Markov models on page 48 in the text. Note that all but the first 3 terms in the product are conditional on just the previous 3 letters. Also note that this is a plain Markov model, not a hidden Markov model. If you know the various probabilities, it is simple to evaluate the probability that a given sequence was generated by this model: take the first letter x₁ of the sequence, and use it to look up P(x₁); take the first two letters x₁ x₂ and use them to look up P(x₂ | x₁); etc. Then multiply them all together. (Or, more conveniently, add their logs.) Getting back to the gene prediction problem, suppose I have a set of probabilities like this for real genes, and a similar set of probabilities, say Q( . | . ), for "background" sequences that are not part of genes. Then I could predict whether a given ORF sequence x was a real gene by comparing P(x) to Q(x), e.g. log(P(x)/Q(x)) > 0 implies that the sequence is more likely to have been generated by the gene model than the background model.
OK, so where do the probabilities P come from? Simply take a "trusted" set of example genes and count the frequencies of all (k+1)-letter (contiguous) subsequences. E.g., if k=2 and your training set is just the sequence AAAAGCTA, then AAA occurs twice, and AAG, AGC, GCT, and CTA each occur once, so P(A | AA) = 2/3, P(G | AA) = 1/3, P(A | CT) = 1/1, etc. Estimate probabilities for the shorter conditioning sequences in the same way. E.g., 5/8 of the training set letters are A so P(A) = 5/8, P(A | A) = 3/4, P(A | G) = 0, etc. As usual, with a sparse training set like this, adding pseudo counts would probably be a good idea, but I think there's enough training data in the M. jannaschii example that pseudo counts are unnecessary: take as your trusted set all ORFs longer than 1400 nucleotides; most or all of them are real genes. Similarly, you can use all ORFs shorter than 50 nt as a trusted set of non-genes. Treat non-ACGT letters as T (arbitrarily).
Calculate the P and Q tables as above for k=3, then compute log(P(x)/Q(x)) for each ORF. Extend your "histogram" printout to show the average log ratio for ORFs of a given length as well as the number of them having positive log ratio, and the number of positive ones that are annotated as real in the Genbank records. Does this additional information increase the accuracy of gene predictions for shorter ORFs?
There are various directions you could pursue if you want some extra credit. Handle the reverse strand, too. I think you'll find a few very long ORFs that don't exactly correspond to the Genbank annotations, and other anomalies; what's up with that? My negative training set includes short ORFs on the reverse strand of real genes; is this a good idea? Explore how performance depends on k. Go read about "ROC curves" and do a full ROC analysis of this problem. Could this approach be extended to make better predictions of the start codon, too? (Most or all genes in M. jannaschii start with ATG, but there can be multiple ATG's in an ORF; which it the right one?) I'm open to other ideas, too; send me email.
Modeling Non-Coding RNA Gene Families: This is a reasonably straight-forward problem, but requires reading ahead to look at material I won't lecture on until the last class. Details here.
De Novo Discovery of Non-Coding RNA Genes: Given the success of the simple approach we used in HW#4 for finding non-coding RNA genes, it is natural to wonder whether any of the GC rich patches we found which were not previously annotated as tRNAs or rRNAs are in fact real non-coding RNA genes. Short of wet-lab experiments (as in [1]), how might we tell? One approach would be to look for similar sequences in related organisms. Here's a sketch of one possible approach. Take a look at the taxonomy information at the NCBI web site, and select some organisms related to M. jannaschii, probably AT-rich ones, perhaps also hyperthermophiles. Run your HW#4 algorithm on them as well. Filter out the tRNA and rRNA hits and any shorter than, say, 50 nucleotides. Perhaps extend each hit by 25-50 nucleotides in each direction, in case the Viterbi boundaries were somewhat off. Try to match each hit in one organism to its (putative) orthologs in the others, based perhaps on length, BLAST matches (feel free to download & install it locally, either the NCBI or WU versions) or Smith-Waterman alignments (modify your HW#2 code, or perhaps use the ssearch component of W. Pearson's fasta package). Even matching them by eye is OK, although that obviously won't scale very well... Do any of the putative ortholog groups appear to have conserved secondary structures? Perform secondary structure predictions using the Vienna RNA package, Pfold, CMfinder or other tools. What do you find?
This problem is deliberately open-ended, and not certain of success. I'm open to just about anything you want to try this side of ouija boards; just describe what you try and how it works (or doesn't), and perhaps your thoughts on better alternatives.

Code: As before, you may use C, C++, C#, Java, Scheme, Perl, Python or Ruby; ask if you prefer something else. It's ok to make your code relatively special-purpose if you prefer, e.g., array sizes and initial parameters as compiled-in constants, minimal error checking, etc.

Turn in: For any of these problems, turn in your (commented) source code, output logs of any data requested and a brief write-up (.pdf, .doc, .rtf or plain text) describing what you did/what you observed. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup; If you tackled more than one of the problems above, please put each in a separate subdirectory. If there's anything the least bit tricky about building your program, explain it; you're dealing with an ivory-tower pinhead here, so don't assume I'm competent...

If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.

Bundle everything into a single .zip or .tgz file and UPLOAD IT HERE.

After turning it in, please complete this quick (non-anonymous) SURVEY. (You may have been prompted to do this when you turned in your assignment; no need to do it twice.)

References:

[1] RJ Klein, Z Misulovin, SR Eddy, "Noncoding RNA genes identified in AT-rich hyperthermophiles." Proc. Natl. Acad. Sci. U.S.A., 99, #11 (2002) 7542-7. [offcampus]

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to csep590a-webmaster at cs.washington.edu]


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX [comments to csep590a-webmaster at cs.washington.edu]