University of Washington Computer Science & Engineering
CSE 427 Wi '08: Assignment #2, Due: 2/19/2008

Reading: See Schedule page.

Homework: Prediction of Protein Coding Genes

Background: In lecture, I described 2 basic ideas that are exploited in nearly all gene finding programs. First, look for long ORFs; second, look for ORFs having sequence statistics (e.g., codon usage frequencies) that are more typical of known genes than of non-coding DNA. In the last assignment, you looked for long ORFs. This time, we'll see whether sequence statistics help.

Where do the statistics come from? I.e., how do we get a reliable set of gene sequences (aka "positive training examples") vs. non-gene sequences ("negative training examples") from which to calculate the statistics? If your genome has already been annotated by the experts, you could just use their data from GenBank. But that's really cheating---in practice, you'd want to do this on a genome that has not been annotated yet.

Here's one idea for doing it "from scratch." As in the previous assignment, an ORF is just a (maximal length) sequence of consecutive nucleotide triplets, the first of which is a start codon (ATG or GTG) and only the last of which is a stop codon (TAA, TAG, or TGA). Based on the data you saw last time, long ORFs are almost always genes, but short ORFs are almost never genes. ORFs with intermediate lengths are mixed. So, a reasonable starting point in a newly sequenced genome would be to take as your trusted set of genes all ORFs longer than, say, 1000 nucleotides. Similarly, as a trusted set of non-genes, you can use all ORFs shorter than, say, 50 nt, together with all "intergenic" sequence, i.e., sequence that is not part of any ORF.
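
For concreteness, here's a minimal sketch of "+"-strand ORF finding as defined above, in Python (one of the allowed languages); the function name and slice conventions are mine, not part of the assignment:

  START_CODONS = {"ATG", "GTG"}
  STOP_CODONS = {"TAA", "TAG", "TGA"}

  def find_orfs(genome):
      """Yield (start, end) pairs of maximal "+"-strand ORFs; end is one
      past the stop codon, Python slice style."""
      for frame in range(3):
          start = None  # earliest unmatched start codon in this frame
          for i in range(frame, len(genome) - 2, 3):
              codon = genome[i:i + 3]
              if codon in STOP_CODONS:
                  if start is not None:
                      yield (start, i + 3)  # maximal: earliest start to this stop
                  start = None
              elif start is None and codon in START_CODONS:
                  start = i

The trusted gene set is then just [genome[s:e] for s, e in find_orfs(genome) if e - s > 1000].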

Markov Models: Here is a simple approach to exploiting sequence statistics that is a little more flexible than simply looking at codon bias. A k-th order Markov model predicts successive letters in a sequence based on the previous k letters (or fewer, at the beginning of the sequence). Specifically, for k=3, say, the probability of sequence x = x_1 x_2 ... x_n is

  P(x) = P(x_1) * P(x_2 | x_1) * P(x_3 | x_1 x_2) * P(x_4 | x_1 x_2 x_3) * P(x_5 | x_2 x_3 x_4) * ... * P(x_n | x_{n-3} x_{n-2} x_{n-1})
Compare this to the definition of 1st order Markov models on page 48 in Durbin et al., and to the CpG example in the 3rd online slide packet. Note that all but the first 3 terms in the product are conditioned on just the previous 3 letters. Also note that this is a plain Markov model, not a hidden Markov model. If you know the various probabilities, it is simple to evaluate the probability that a given sequence was generated by this model: take the first letter x_1 of the sequence, and use it to look up P(x_1); take the first two letters x_1 x_2 and use them to look up P(x_2 | x_1); etc. Then multiply them all together. (Or, more conveniently, add their logs.)

Getting back to the gene prediction problem, suppose I have a set of probabilities like this for real genes, and a similar set of probabilities, say Q( . | . ), for "background" sequences that are not part of genes. Then I could predict whether a given ORF sequence x is a real gene by comparing P(x) to Q(x); e.g., log(P(x)/Q(x)) > 0 implies that the sequence is more likely to have been generated by the gene model than by the background model.
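
In code, evaluating a sequence is just table lookups plus a sum of logs. A minimal Python sketch, assuming the model is stored as a dict mapping (context, letter) pairs to probabilities (that representation, and the tiny probability floor for pairs never seen in training, are my choices, not part of the assignment):

  import math

  def log_prob(x, model, k=3):
      """log P(x) under a k-th order Markov model, where `model` maps
      (context, letter) to a probability. The context is the previous
      min(i, k) letters, so the first k positions automatically use the
      shorter, left-end contexts described above."""
      total = 0.0
      for i, letter in enumerate(x):
          context = x[max(0, i - k):i]
          # Floor unseen (context, letter) pairs in lieu of pseudocounts.
          total += math.log(model.get((context, letter), 1e-9))
      return total

  def is_gene(x, P, Q, k=3):
      """Predict "gene" iff log(P(x)/Q(x)) > 0."""
      return log_prob(x, P, k) - log_prob(x, Q, k) > 0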

Training: OK, so where do the probabilities P come from? Simply take a "trusted" set of example genes and count the frequencies of all (k+1)-letter (contiguous) subsequences. E.g., if k=2 and your training set is just the sequence AAAAGCTA, then AAA occurs twice, and AAG, AGC, GCT, and CTA each occur once, so P(A | AA) = 2/3, P(G | AA) = 1/3, P(A | CT) = 1/1, etc. Estimate probabilities for the shorter conditioning sequences similarly, but restricted to sequences at the left ends of your training set. E.g., with only the one training sequence above, 100% of the sequences start with AA, so P(A) = P(A | A) = 1, and the other probabilities are zero. (More generally, with more training sequences, P(A) will be the fraction of training sequences starting with A, and P(A | A) will be the fraction starting with AA, etc.) As usual, with a sparse training set like this, adding pseudocounts would probably be a good idea, but I think there's enough training data in your bacterial genomes that pseudocounts are unnecessary.

Take as your trusted set all ORFs longer than 1000 nucleotides; most or all of them are likely to be real genes. Similarly, you can use all sequence not in ORFs longer than 50 nt (i.e., short ORFs plus non-ORFs) as a trusted set of non-genes.

Treat non-ACGT letters, if any, as T (arbitrarily). (They reflect ambiguities in the readout of the DNA sequencers.)
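
Here is a matching sketch of the training pass, using the same (context, letter) dict representation assumed above; note that contexts shorter than k arise only at the left ends of the training sequences, exactly as prescribed, and that non-ACGT letters are mapped to T first:

  from collections import defaultdict

  def sanitize(seq):
      """Map non-ACGT letters (sequencing ambiguities) to T."""
      return "".join(c if c in "ACGT" else "T" for c in seq.upper())

  def train(sequences, k=3):
      """Estimate P(letter | context) from a trusted training set; returns
      {(context, letter): probability}. Contexts have length k in the
      interior and lengths 0..k-1 at sequence starts."""
      counts = defaultdict(lambda: defaultdict(int))
      for seq in sequences:
          seq = sanitize(seq)
          for i, letter in enumerate(seq):
              counts[seq[max(0, i - k):i]][letter] += 1
      model = {}
      for context, letter_counts in counts.items():
          total = sum(letter_counts.values())
          for letter, n in letter_counts.items():
              model[(context, letter)] = n / total
      return model

As a sanity check, train(["AAAAGCTA"], k=2) reproduces the hand counts above: P(A | AA) = 2/3, P(G | AA) = 1/3, and P(A | CT) = 1.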

To simplify this assignment a little and focus on the essentials, you are only required to identify and process ORFs on the "+" strand. This has the undesired effect that you need to alter your previous code to calculate statistics (TP, sTP, etc.) for just the "+" strand. I think this will be easier than generalizing your new code to handle both strands, but if you feel otherwise, you are free to handle both strands; see notes under "extra credit" below. (This also has the undesired effect that much of your non-gene training data will actually be from long ORFs on the "-" strand, but I think it will still work reasonably well.)

Thus, your "gene" training data consists of long ORFs on the "+" strand. It is unlikely that any two long ORFs overlap, but if they do, process each separately. (Thus, nucleotides in the overlap region get counted twice in the statistics. It is debatable whether this is a good idea, but the "obvious" alternative of processing the union of the two as one long ORF has the disadvantage that the second ORF is in a different reading frame from the first, which is likely to distort the statistics more than double-counting the small overlap does.)

The "background" (non-gene) training data needs to be handled slightly differently. In particular, after conceptually removing all "+" strand ORFs longer than 50 nt, you will have a number of contiguous segments of genome left. Each such segment could be subdivided into possibly overlapping short ORFs and (potentially long) non-ORFs, but breaking it up this way is artificial, since "start" and "stop" and "codon" are all meaningless if it's not coding for a protein. So, you shouldn't process the negative examples ORF by ORF; instead, for your statistics-gathering, process each contiguous piece as a unit.

Gene Prediction via Markov Models: Calculate the P and Q tables as above for k=3, then compute log(P(x)/Q(x)) for each ORF x. Those with positive log ratios are predicted to be genes. Repeat your calculation of accuracy statistics (false positive rate, etc.) from the last assignment for just the "+" strand ORFs, and compare them to the analogous statistics for gene predictions (again of "+" strand ORFs) based on log(P(x)/Q(x)) > 0. With your turnin, include a short written discussion of whether this additional information increases the accuracy of gene predictions, especially for shorter ORFs. You may want to include some additional visualizations of the data to clearly support your conclusions. E.g., some histograms or scatter plots of accuracy versus length for the two methods might be more informative than the genome-wide aggregate statistics.
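
Putting the sketches above together, a hypothetical end-to-end driver might look like the following ("genome.txt" stands in for however you load your sequence; real input parsing, the 10 accuracy statistics, and the plots are left out):

  genome = sanitize(open("genome.txt").read().replace("\n", ""))

  orfs = list(find_orfs(genome))
  P = train([genome[s:e] for s, e in orfs if e - s > 1000], k=3)
  Q = train(background_segments(genome), k=3)

  for s, e in orfs:
      score = log_prob(genome[s:e], P, 3) - log_prob(genome[s:e], Q, 3)
      print(s + 1, e, e - s, "gene" if score > 0 else "non-gene", round(score, 2))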

Extra Credit: There are various directions you could pursue if you want some extra credit, including the following. (a) My negative training set includes short ORFs on the reverse strand of real genes; is this a good idea? If not, what's the right way? Even if you don't fully handle the "-" strand, explore whether inclusion/exclusion of long "-" strand ORFs from the negative training set hurts accuracy. (b) Handle the reverse strand, too. (Be careful to collect statistics for the long ORFs from the correct strand.) (c) How should the negative training data be associated with the "+" or "-" strand? (d) Unlike ORFs, the negative examples don't have distinct "starts"; is it reasonable to learn their lower-order dependencies like Q(x_2 | x_1) from the starts of the training examples? What would be a better way? (e) I think you'll find a few very long ORFs that don't exactly correspond to the GenBank annotations, and other anomalies such as in-frame stop codons; what's up with that? (f) Explore how performance depends on k. (g) Is it better to use different k for the positive and negative models? (h) Go read about "ROC curves" and do a full ROC analysis of this problem. (i) Could this approach be extended to make better predictions of the start codon, too? (Most bacterial genes start with ATG/GTG, but there can be multiple ATG/GTGs in an ORF; which is the right one?) I'm open to other ideas, too; send me email.

Code: As before, you may use C, C++, C#, Java, Scheme, Perl, Python, or Ruby; ask if you prefer something else. It's OK to make your code relatively special-purpose if you prefer, e.g., array sizes and initial parameters as compiled-in constants, minimal error checking, etc. Using external libraries (sorting, whatever) is OK. You should avoid, say, unnecessary quadratic-time algorithms, but I'm not unduly worried about efficiency; if your code runs in under 10 minutes or so, that's fine.

Turn in: Please follow the turnin instructions from assignment 1, with the exception that you should separately give the 10 performance statistics (TP, sTP, ...) for the "long ORF" method from last time and the new Markov model method, just for "+" strand ORFs, and include the brief write-up (.pdf, .doc, .rtf or plain text) describing what you did/what you observed/how they compare. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup, and preferably give output from both the extra-credit and plain versions.

As before, you will actually run your program on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mycobacterium leprae. For each of these individually, you will produce the list of long open reading frames, and the 10 statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR, once for the old "long ORF" method, and again for the new Markov model method.

Zip these files together and upload them via the UW Catalyst system, here. If there's anything the least bit tricky about building/running your program, explain it; don't assume I'm competent...

