
CSE Home  About Us  Search  Contact Info 
Reading: See Schedule page.
Homework:
For reasons that are not wellunderstood, many genomes have well under 50% GC content. However, in these genomes, the functional noncoding RNA molecules (e.g., tRNAs) often have GC content well above 50%. This effect is especially noticeable in hyperthermophilesorganisms that live in high temperature environments like deep sea hydrothermal ventspossibly because of the greater stability provided by GC base pairs (3 hydrogen bonds vs 2 for AT pairs). These facts suggested an approach for finding noncoding RNA genes [1]: scan the genome for GCrich patches.
How? One idea is to use a simple 2state HMM, much like in the "occasionally dishonest casino" example in the text. Your HMM will have a state 1 corresponding to the low GC genomic background, and a state 2 corresponding to high GC patches. Initialize its emission/transition probabilities in the following rather naive way:


Implictly I'm using the version of the HMM from the text with a begin state but no end state. The begin state does not need to be explicitly represented in your data structures, but you do need to include the appropriate initial probabilities in your algorithms. E.g., the Viterbi probability v_{state 1}(position 1) = (.9999)*(.25).
Data: Download the complete genome sequence data for Methanocaldococcus jannaschii. You can find it by going to the NCBI Genome page, click "Microbial" in right column, then scroll through the list of available microbial genomes to find "Methanocaldococcus jannaschii DSM 2661". Click its name. Read the brief description of it. Note the size and GC percentage in its genome (do the parameters I gave above seem off?). To get the genome sequence itself, follow the ftp link and download two files: NC_000909.fna and NC_000909.rnt.
NC_000909.fna is in the popular "FASTA" format. It starts with a comment line signaled by a ">" in column 1, followed by many lines just containing ACGT. Parsing is easy: ignore lines starting with ">", and all whitespace; treat every other character as a nucleotide. You will occasionally see letters other than just ACGT, reflecting various ambiguities in the sequencing process. E.g., "N" means "could be any of ACGT". For simplicity, just treat anything other than ACG as a T. (There are better ways to handle this, but this will suffice for the homework.)
Algorithm: Implement and run the HMM Viterbi algorithm on this data. Print the lengths and locations (starting and ending positions, where the first nucleotide in the genome is at position 1) of all "hits," i.e., all those (contiguous) subsequences that are assigned to state 2 in the Viterbi path. Also print the log probability of the overall Viterbi path.
Based on the specified initial parameters, my implementation finds 2 such subsequences, each with length about 5k. Perhaps somewhat disappointing if we're hunting for interesting RNAs. But don't despair yet; maybe the parameters were illchosen. We could fiddle with them in various ways, but instead let's see whether the good Professor Viterbi can help. Implement "Viterbi training" to refine the transition/emission parameter estimates (but not the initialstate probabilities; with only one sequence, there's not enough training data to estimate them).
Deliverables: Iterate this 10 times. Print the parameters, the log probability of the Viterbi path, and the number of hits found in each iteration. (In part, this serves as a crude check on convergence.) Print the hits found during the last iteration. For the first 10 hits having length at least 50 nucleotides, look at the genome annotation to see what you can learn about these sequences. The .rnt file will be a good start for this. It annotates the most common noncoding RNAs, but isn't necessarily complete or up to date. Other data from the NCBI pages may be helpful, and there are other internet sites devoted to RNA. (Rfam is generally recommended, although I don't happen to know whether they have much on M. jannaschii.) By the way, as with the dice example, I would expect a technique like this at best to be a little sloppy in marking the exact ends of known RNA genes, so be happy if your hits merely significantly overlap real genes.
Optional Extra Credit: Repeat this process using posterior decoding and BaumWelch training instead. A "hit" is now a (contiguous) subsequence with state 2 posterior probability above one half. With each iteration, print out your transition/emission parameters, number of hits, and log P(xθ), the probability of the given input x given the current parameters θ; this quantity should be increasing (otherwise you've probably got a bug.) Again, give me a list of hits in pass 10, and analyze the first 10 of them over 50 nucleotides. (I expect many of them will overlap the Viterbi hits, so this shouldn't be much extra work.)
Numerics: Use log probabilities for both versions. In Viterbi this is completely straightforward: products of probabilities become sums of log probs; max is max. In the forward/backward algorithms, however, you need sums of products of probabilities. Mathematically, log(x+y) = log(exp(log x) + exp(log y)), but there's some danger of underflow/loss of precision. Instead, do something like this (C syntax):
/* * Given two probabilities x and y, represented by their logs lx, ly, * return the log of their sum log(x+y) = log(exp(lx) + exp(ly)). * * Assume log(0) is represented by NaN. * * The "lx > ly" trick is some protection from underflow: * log(a+b) = log(a(1+b/a)) = log(a)+log(1+b/a), * which will be most accurate when b/a < 1. */ double log_of_sum_of_logs(double lx, double ly){ if(isnan(lx)) return ly; if(isnan(ly)) return lx; if(lx > ly) { return lx + log(1+exp(lylx)); } else { return ly + log(1+exp(lxly)); } }(Thanks to Tobias Mann for these tricks.)
More Extra Credit: Explore convergence of Viterbi and/or BaumWelch on this data. Do the answers get better with substantially more iterations? How sloppy can we be with starting parameter values and still have it converge to an interesting solution? I picked "10 iterations" by eye; how might you automate convergence (and divergence) detection? As a more biological direction, does this whole approach work in other ATrich hyperthermophiles? Nonhyperthermophiles? Cryophiles? Eukaryotes?
Code: You may use C, C++, C#, Java, Scheme, Perl, Python or Ruby, etc.; ask if you prefer something more obscure. (The BaumWelch version is a little heavy numerically, so if you pick an interpreted language, run some quick timing tests before you get in too deep to verify that performance will be acceptable. My C implementation takes 23 minutes on my .8 Ghz laptop.) Although it's not too difficult to write a fairly general HMM package, it's ok to make your code relatively specialpurpose if you prefer, e.g., just 2state HMMs, just ACGT alphabet, just input sequence files of <2 megabases, initial HMM parameters as compiledin constants, etc.
Testing: Obviously you want to test your code on some very small handverifiable examples. You may also be interested in trying the loaded die example from chapter 3. I have typed in the 300roll sequence from the book; see comments in the file for more detail (and caveats). Incidentally, the Viterbi path given in figure 3.5 appears to be completely correct.
Turn in: Your (commented) source code, output logs of the data requested above, namely, (for both the Viterbi and BaumWelch versions, if you did that extra credit piece):
If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.
Bundle everything, excluding the genome sequences, into a single .zip or .tgz file and email it to me.
References:
[1] RJ Klein, Z Misulovin, SR Eddy, "Noncoding RNA genes identified in ATrich hyperthermophiles." Proc. Natl. Acad. Sci. U.S.A., 99, #11 (2002) 75427. [offcampus]
Computer Science & Engineering University of Washington Box 352350 Seattle, WA 981952350 (206) 5431695 voice, (206) 5432969 FAX 