CSE 527, Au '07: Assignment #5

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Prediction of Protein Coding Genes: In lecture, I described 2 basic ideas that are exploited in nearly all gene finding programs. First, look for long ORFs; second, look for ORFs having sequence statistics (e.g., codon usage frequencies) that are more typical of known genes than of non-coding DNA. An ORF is just a (maximal length) sequence of consecutive nucleotide triplets, only the last of which is a stop codon (TAA, TAG, or TGA). ORF's are easy to find. Starting at the leftmost triplet, scan triplet by non-overlapping triplet until you find a stop; then continue until you find another stop, etc. Then repeat the whole process from the left end again, but omitting he very first or first two nucleotides (to account for the other two possible reading frames. [One should also do it from the right end on the Watson-Crick complement of the sequence, but for this assignment, in the interest of simplicity, we're going to completely ignore that. There are several other over-simplifications as well, so that, hopefully, the key ideas will shine through.]
The first part of this problem is to write a program to do the above process. Let's again use the M. jannaschii sequence as an example. Also, download the Genbank annotation file NC_000909.gbk (from the same place as in assignment #4). In it you will find the coordinates of all the presumed protein-coding genes, each identified by a line like
```
     CDS             complement(2216..3343)
```
or
```
     CDS             4911..5381
```
"CDS" is shorthand for "coding sequence", and the numbers are its start and end positions, up to and including the last nucleotide of the stop codon. The numbering scheme counts the first position of the genome as 1, not zero. The "complement" entries are on the reverse strand; you may ignore them, as well as a handful of other odd cases.
You will use the .gbk file to evaluate the quality of your predictions, but keep in mind that the real goal is to envisage procedures that would help annotate a new genome, one for which no .gbk file exists. I.e., can we come up with a procedure that would more or less automatically produce reasonably accurate identification of the protein-coding genes in a newly sequenced organism.
Perhaps the simplest plausible gene prediction approach is to guess that every ORF longer than some fixed threshold is a gene. How good it this? What threshold? Use your ORF finder to examine these questions, by comparing ORFs to the Genbank annotations. Correctly predicting the exact starting positions of genes is difficult, but matching the other end is easier - it will be a stop codon. Again to (over-) simplify the problem, we will declare an ORF to be a successful gene prediction if its stop codon exactly matches the position of a stop codon in one of the Genbank genes. Except in rare cases, which we will again ignore, each such ORF will be at least as long as the real gene.
You should print out a simple "histogram" with one line for each ORF length showing the number of ORFs of that length in the genome, and the number of them that match "real" genes, as above. Only print lines having non-zero counts.
I think you will see that this simple method is quite accurate for long genes (given my simplified definitions, it cannot fail to find every gene, and above some length threshold it rarely makes false predictions). For shorter genes, however, its rate of false predictions rises. Can we do better by exploiting sequence statistics? The answer is yes, and here's one simple approach (which is a little more flexible than simply looking at codon bias). A k-th order Markov model predicts successive letters in a sequence based on the previous k letters (or fewer, at the beginning of the sequence). Specifically, for k=3, say, the probability of sequence x = x₁ x₂ ... x_n is
```
  P(x) = P(x₁) * P(x₂ | x₁) * P(x₃ | x₁x₂) * P(x₄ | x₁x₂x₃) * P(x₅ | x₂x₃x₄) ... P(x_n | x_n-3x_n-2x_n-1)
```
Compare to the definition of 1st order Markov models on page 48 in the text. Note that all but the first 3 terms in the product are conditional on just the previous 3 letters. Also note that this is a plain Markov model, not a hidden Markov model. If you know the various probabilities, it is simple to evaluate the probability that a given sequence was generated by this model: take the first letter x₁ of the sequence, and use it to look up P(x₁); take the first two letters x₁ x₂ and use them to look up P(x₂ | x₁); etc. Then multiply them all together. (Or, more conveniently, add their logs.) Getting back to the gene prediction problem, suppose I have a set of probabilities like this for real genes, and a similar set of probabilities, say Q( . | . ), for "background" sequences that are not part of genes. Then I could predict whether a given ORF sequence x was a real gene by comparing P(x) to Q(x), e.g. log(P(x)/Q(x)) > 0 implies that the sequence is more likely to have been generated by the gene model than the background model.
OK, so where do the probabilities P come from? Simply take a "trusted" set of example genes and count the frequencies of all (k+1)-letter (contiguous) subsequences. E.g., if k=2 and your training set is just the sequence AAAAGCTA, then AAA occurs twice, and AAG, AGC, GCT, and CTA each occur once, so P(A | AA) = 2/3, P(G | AA) = 1/3, P(A | CT) = 1/1, etc. Estimate probabilities for the shorter conditioning sequences in the same way. E.g., 5/8 of the training set letters are A so P(A) = 5/8, P(A | A) = 3/4, P(A | G) = 0, etc. As usual, with a sparse training set like this, adding pseudo counts would probably be a good idea, but I think there's enough training data in the M. jannaschii example that pseudo counts are unnecessary. Take as your trusted set all ORFs longer than 1400 nucleotides; most or all of them are real genes. Similarly, you can use all ORFs shorter than 50 nt as a trusted set of non-genes. Treat non-ACGT letters as T (arbitrarily).
Calculate the P and Q tables as above for k=3, then compute log(P(x)/Q(x)) for each ORF. Extend your "histogram" printout to show the average log ratio for ORFs of a given length as well as the number of them having positive log ratio, and the number of positive ones that are annotated as real in the Genbank records. Does this additional information increase the accuracy of gene predictions for shorter ORFs?
There are various directions you could pursue if you want some extra credit. Handle the reverse strand, too. I think you'll find a few very long ORFs that don't exactly correspond to the Genbank annotations, and other anomalies; what's up with that? My negative training set includes short ORFs on the reverse strand of real genes; is this a good idea? Explore how performance depends on k. Go read about "ROC curves" and do a full ROC analysis of this problem. Could this approach be extended to make better predictions of the start codon, too? (Most or all genes in M. jannaschii start with ATG, but there can be multiple ATG's in an ORF; which it the right one?) I'm open to other ideas, too; send me email.

Code: As before, you may use C, C++, C#, Java, Scheme, Perl, Python or Ruby; ask if you prefer something else. It's ok to make your code relatively special-purpose if you prefer, e.g., array sizes and initial parameters as compiled-in constants, minimal error checking, etc.

Turn in: For any of these problems, turn in your (commented) source code, output logs of any data requested and a brief write-up (.pdf, .doc, .rtf or plain text) describing what you did/what you observed. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup; If you tackled more than one of the problems above, please put each in a separate subdirectory. If there's anything the least bit tricky about building your program, explain it; you're dealing with an ivory-tower pinhead here, so don't assume I'm competent...

If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.

Bundle everything into a single .zip or .tgz file and email it to me.

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX