image University of Washington Computer Science & Engineering
  CSE 527Au '07:  Assignment #4
  CSE Home   About Us    Search    Contact Info 

Reading: See Schedule page.

Homework:

For reasons that are not well-understood, many genomes have well under 50% GC content. However, in these genomes, the functional non-coding RNA molecules (e.g., tRNAs) often have GC content well above 50%. This effect is especially noticeable in hyperthermophiles---organisms that live in high temperature environments like deep sea hydrothermal vents---possibly because of the greater stability provided by G-C base pairs (3 hydrogen bonds vs 2 for A-T pairs). These facts suggested an approach for finding non-coding RNA genes [1]: scan the genome for GC-rich patches.

How? One idea is to use a simple 2-state HMM, much like in the "occasionally dishonest casino" example in the text. Your HMM will have a state 1 corresponding to the low GC genomic background, and a state 2 corresponding to high GC patches. Initialize its emission/transition probabilities in the following rather naive way:
EmissionsACGT
State 1.25.25.25.25
State 2.20.30.30.20
   
TransitionsState 1State 2
Begin.9999.0001
State 1.9999.0001
State 2.01.99
The state 2 emissions just reflect a small bias towards GC compared to state 1. The transitions say that these GC patches average about 100 nucleotides in length, with about 10k between such patches. These guesses aren't terribly accurate, but we'll see whether they matter much.

Implictly I'm using the version of the HMM from the text with a begin state but no end state. The begin state does not need to be explicitly represented in your data structures, but you do need to include the appropriate initial probabilities in your algorithms. E.g., the Viterbi probability vstate 1(position 1) = (.9999)*(.25).

Data: Download the complete genome sequence data for Methanocaldococcus jannaschii. You can find it by going to the NCBI Genome page, click "Microbial" in right column, then scroll through the list of available microbial genomes to find "Methanocaldococcus jannaschii DSM 2661". Click its name. Read the brief description of it. Note the size and GC percentage in its genome (do the parameters I gave above seem off?). To get the genome sequence itself, follow the ftp link and download two files: NC_000909.fna and NC_000909.rnt.

NC_000909.fna is in the popular "FASTA" format. It starts with a comment line signaled by a ">" in column 1, followed by many lines just containing ACGT. Parsing is easy: ignore lines starting with ">", and all whitespace; treat every other character as a nucleotide. You will occasionally see letters other than just ACGT, reflecting various ambiguities in the sequencing process. E.g., "N" means "could be any of ACGT". For simplicity, just treat anything other than ACG as a T. (There are better ways to handle this, but this will suffice for the homework.)

Algorithm: Implement and run the HMM Viterbi algorithm on this data. Print the lengths and locations (starting and ending positions, where the first nucleotide in the genome is at position 1) of all "hits," i.e., all those (contiguous) subsequences that are assigned to state 2 in the Viterbi path. Also print the log probability of the overall Viterbi path.

Based on the specified initial parameters, my implementation finds 2 such subsequences, each with length about 5k. Perhaps somewhat disappointing if we're hunting for interesting RNAs. But don't despair yet; maybe the parameters were ill-chosen. We could fiddle with them in various ways, but instead let's see whether the good Professor Viterbi can help. Implement "Viterbi training" to refine the transition/emission parameter estimates (but not the initial-state probabilities; with only one sequence, there's not enough training data to estimate them).

Deliverables: Iterate this 10 times. Print the parameters, the log probability of the Viterbi path, and the number of hits found in each iteration. (In part, this serves as a crude check on convergence.) Print the hits found during the last iteration. For the first 10 hits having length at least 50 nucleotides, look at the genome annotation to see what you can learn about these sequences. The .rnt file will be a good start for this. It annotates the most common noncoding RNAs, but isn't necessarily complete or up to date. Other data from the NCBI pages may be helpful, and there are other internet sites devoted to RNA. (Rfam is generally recommended, although I don't happen to know whether they have much on M. jannaschii.) By the way, as with the dice example, I would expect a technique like this at best to be a little sloppy in marking the exact ends of known RNA genes, so be happy if your hits merely significantly overlap real genes.

Optional Extra Credit: Repeat this process using posterior decoding and Baum-Welch training instead. A "hit" is now a (contiguous) subsequence with state 2 posterior probability above one half. With each iteration, print out your transition/emission parameters, number of hits, and log P(x|θ), the probability of the given input x given the current parameters θ; this quantity should be increasing (otherwise you've probably got a bug.) Again, give me a list of hits in pass 10, and analyze the first 10 of them over 50 nucleotides. (I expect many of them will overlap the Viterbi hits, so this shouldn't be much extra work.)

Numerics: Use log probabilities for both versions. In Viterbi this is completely straightforward: products of probabilities become sums of log probs; max is max. In the forward/backward algorithms, however, you need sums of products of probabilities. Mathematically, log(x+y) = log(exp(log x) + exp(log y)), but there's some danger of underflow/loss of precision. Instead, do something like this (C syntax):

    /* 
     * Given two probabilities x and y, represented by their logs lx, ly,
     * return the log of their sum log(x+y) = log(exp(lx) + exp(ly)).
     *
     * Assume log(0) is represented by NaN.
     *
     * The "lx > ly" trick is some protection from underflow:
     *   log(a+b) = log(a(1+b/a)) = log(a)+log(1+b/a), 
     * which will be most accurate when b/a < 1.
     */
    double log_of_sum_of_logs(double lx, double ly){
      if(isnan(lx)) return ly;
      if(isnan(ly)) return lx;
      if(lx > ly) {
        return lx + log(1+exp(ly-lx));
      } else {
        return ly + log(1+exp(lx-ly));
      }
    }
(Thanks to Tobias Mann for these tricks.)

More Extra Credit: Explore convergence of Viterbi and/or Baum-Welch on this data. Do the answers get better with substantially more iterations? How sloppy can we be with starting parameter values and still have it converge to an interesting solution? I picked "10 iterations" by eye; how might you automate convergence (and divergence) detection? As a more biological direction, does this whole approach work in other AT-rich hyperthermophiles? Non-hyperthermophiles? Cryophiles? Eukaryotes?

Code: You may use C, C++, C#, Java, Scheme, Perl, Python or Ruby, etc.; ask if you prefer something more obscure. (The Baum-Welch version is a little heavy numerically, so if you pick an interpreted language, run some quick timing tests before you get in too deep to verify that performance will be acceptable. My C implementation takes 2-3 minutes on my .8 Ghz laptop.) Although it's not too difficult to write a fairly general HMM package, it's ok to make your code relatively special-purpose if you prefer, e.g., just 2-state HMMs, just ACGT alphabet, just input sequence files of <2 megabases, initial HMM parameters as compiled-in constants, etc.

Testing: Obviously you want to test your code on some very small hand-verifiable examples. You may also be interested in trying the loaded die example from chapter 3. I have typed in the 300-roll sequence from the book; see comments in the file for more detail (and caveats). Incidentally, the Viterbi path given in figure 3.5 appears to be completely correct.

Turn in: Your (commented) source code, output logs of the data requested above, namely, (for both the Viterbi and Baum-Welch versions, if you did that extra credit piece):

and a brief write-up (.pdf, .doc, .rtf or plain text) of what you learned about your first 10 hits, if anything. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup. If there's anything the least bit tricky about building your program, explain it; you're dealing with an ivory-tower pinhead here, so don't assume I'm competent...

If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.

Bundle everything, excluding the genome sequences, into a single .zip or .tgz file and email it to me.

References:

[1] RJ Klein, Z Misulovin, SR Eddy, "Noncoding RNA genes identified in AT-rich hyperthermophiles." Proc. Natl. Acad. Sci. U.S.A., 99, #11 (2002) 7542-7. [offcampus]

CSE logo Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX