CSE 427, Wi '21: Assignment #3, Due: Wednesday, 2/24/21

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Homework:

Background: For reasons that are not well-understood, many genomes have well under 50% "G+C content" (i.e., the fraction of nucleotides that are G or C). However, in these genomes, the functional non-coding RNA molecules (e.g., tRNAs) often have G+C content well above 50%. This effect is especially noticeable in hyperthermophiles---organisms that live in high temperature environments like deep sea hydrothermal vents---possibly because of the greater stability provided by G:C base pairs (3 hydrogen bonds vs 2 for A:U pairs). These facts suggested an approach for finding non-coding RNA genes [1]: scan the genome for patches that are unusually rich in G and/or C nucleotides. [Note: these are not CpG islands; the later are largely characterized by enrichment for C immediately followed by G; this assignment focuses on C and/or G, in any ratio, in any order, adjacent or not.]

Approach: How might we find G+C-rich patches? One idea is to use a simple 2-state HMM, much like in the "occasionally dishonest casino" example presented in lecture and in DEKM Chapter 3. In your HMM, state 1 will correspond to the low G+C genomic background, and state 2 will correspond to high G+C patches. Initialize its emission/transition probabilities in the following rather naive way:

Emissions A C G T

State 1 .25 .25 .25 .25

State 2 .20 .30 .30 .20

Transitions State 1 State 2

Begin .9999 .0001

State 1 .9999 .0001

State 2 .01 .99

The state 2 emissions just reflect a small bias towards G+C compared to state 1. The transitions say that these G+C patches average about 100 nucleotides in length, with about 10k between such patches.

Implictly I'm using an HMM with a begin state but no end state (see slides or DEKM)---whichever state has higher probability at the end of the genome is where you end (and where you start the Viterbi traceback). The begin state does not need to be explicitly represented in your data structures, but you do need to include the appropriate initial probabilities in your algorithms. E.g., the Viterbi probability v_{state 1}(position 1) = (.9999)*(.25).

Data: We will use the genome sequence of Methanocaldococcus jannaschii (sometimes also called Methanococcus jannaschii). For an overview, have a look at the first few paragraphs of narrative description at MicrobeWiki.

Download the complete genome sequence data for M. jannaschii from the NCBI Genomes web site. Click "Browse by Organism," then paste Methanocaldococcus jannaschii into the search box. (An alternate approach, expecially relevant to some of the extra credit suggestions, is to click the "Filters" button, then select Kingdom = "Archaea," Group = "Euryarchaeota," SubGroup = "Methanomada". M. jannaschii should be approximately the 43rd entry.) In either case, click its name in column 1. This should take you to to the "organism overview" page. Read the brief description of it. Note the size and GC percentage in its genome. (Do the parameters I gave above seem off?)

The genome sequence itself can be found by clicking the "genome" link in the phrase "Download sequences in FASTA format for genome, protein" near the top of the page. Also, download the associated genome annotation file by clicking the GFF link in "Download genome annotation in GFF, ... format"; more on the use of this file later. (Note: these ftp links don't work in all browsers; try a different browser, command line ftp client, or curl if you have trouble.)

The FASTA download yields GCF_000091665.1_ASM9166v1_genomic.fna.gz, which is compressed (gzipped) text file in the popular "FASTA" format. This genome effectively has 3 chromosomes: one large one of about 1.7 megabases, and 2 very small ones (technically called "plasmids" rather than chromosomes). Each of the three sequences starts with an identifier/comment line signaled by a ">" in column 1, followed by many lines just containing ACGT. Parsing is easy: ignore the first line (the one starting with ">"), and all whitespace; treat every other character as a nucleotide. Treat the sequence as case-insensitive. You may occasionally see letters other than just ACGT, reflecting various ambiguities in the sequencing process. E.g., "N" means "could be any of A, C, G or T". For simplicity, just treat anything other than aAcCgG as a T. (There are better ways to handle this, but this will suffice for the homework.) Again in the interest of simplicity, we are only going to look at the long chromosome, so also ignore everything in the file starting at the 2nd ">" line.

Note: The FASTA file downloaded from the above link should be 1,761,929 bytes long (uncompressed), and should have exactly 3 lines begining with ">". Especially if you downloaded it from a source other than the link above, check this, as there are other slightly different versions of the genome "out there."

Algorithm: Implement the HMM Viterbi algorithm, including traceback, so that you can run it on this data for various choices of the HMM parameters. Arrange it so that it prints

the HMM emission/transition parameters used for this pass (e.g., from the tables above for pass 1),
the log probability (natural log, base-e) of the overall Viterbi path,
the total number of "hits" found, where a hit is (contiguous) subsequence assigned to state 2 in the Viterbi path, and
the lengths and locations (starting and ending positions) of the first k (defined below) "hits." Print all hits, if there are fewer than k of them. (By convention, genomic positions are 1-based, not 0-based, indices.)

Numerics: To avoid underflow/loss of precision, your algorithms should store log probabilities, not the probabilities themselves. In Viterbi this is completely straightforward: products of probabilities become sums of log probs; log of a max is max of the logs.

Viterbi Training: Based on the specified initial parameters, my implementation finds two hits, each with length about 5 kilobases. This is disappointing if we're hunting for interesting RNAs. But don't despair yet; maybe the parameters were ill-chosen. We could fiddle with them in various ways, but instead let's see whether the good Professor Viterbi can help. Also implement "Viterbi training" to refine the transition/emission parameter estimates (but not the initial-state probabilities; with only one sequence, there's not enough training data to estimate them). Then, like EM, we can iterate the algorithm, using the parameters estimated during one pass in the next.

Deliverables:

Iterate your scan/Viterbi parameter estimation 10 times, with k=5 during the first 9 passes, then k=infinity in the last pass. So, your printout should capture the changing HMM parameters, the log probability of the Viterbi path, and the number of hits found in each of the ten iterations, a few hits from the first 9 passes, and all hits from the last pass. (Path probability, hits and such from the early passes are not of much interest per se, but printing them serves as a crude visual check on convergence. I think you will find that the results are relatively stable after fewer than 10 passes. See extra credit suggestion below for more on this.)
For the first 10 hits from the 10th pass, look at the genome annotation (next paragraph) to see what you can learn about these sequences. At a minimum, say whether each of your 10 hits is "known", i.e., "matches" a known noncoding RNA. (Note: as with the dice example, expect your HMM to be somewhat inaccurate in marking the exact ends of RNA genes, so you should declare your hits to "match" real (noncoding RNA) genes if they significantly overlap, as opposed to exactly matching their annotated end points. Likewise, a long RNA gene might yield several shorter predictions that overlap most of it, and conversely several closely-spaced RNAs might be covered by one prediction. Also, note that the G+C content of any chunk of the genome is the same on both strands, so our HMM is equally good (or equally bad) at finding RNAs on the reverse strand, even when run on the forward strand. Consequently, don't be surprised if many of your hits are annotated as being on the minus strand.)

Use the .gff file downloaded earlier as the "gold standard" for known noncoding RNA genes. It is a compressed, tab-delimited text file. It contains (among other things) the coordinates of presumed protein-coding genes, and some known noncoding RNAs. The format is described here, but I think most of what you need to know will be evident from an example:
```
  NC_000909.1  RefSeq       gene  97426  97537  .  -  .   ID=gene102;Dbxref=GeneID:1450942;Name=MJ_RS00515;gbkey=Gene;gene_biotype=tRNA;locus_tag=MJ_RS00515;old_locus_tag=MJ_t01%2CMJt01%2CtRNA-Met-1
  NC_000909.1  tRNAscan-SE  tRNA  97426  97537  .  -  .   ID=rna0;Parent=gene102;Dbxref=GeneID:1450942;anticodon=(pos:complement(97501..97503));gbkey=tRNA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Met
  NC_000909.1  tRNAscan-SE  exon  97500  97537  .  -  .   ID=id2;Parent=rna0;Dbxref=GeneID:1450942;anticodon=(pos:complement(97501..97503));gbkey=tRNA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Met
  NC_000909.1  tRNAscan-SE  exon  97426  97464  .  -  .   ID=id3;Parent=rna0;Dbxref=GeneID:1450942;anticodon=(pos:complement(97501..97503));gbkey=tRNA;inference=COORDINATES: profile:tRNAscan-SE:1.23;product=tRNA-Met
```
The first line says that there is a methionine tRNA gene on the "minus" strand of the genome (i.e., the reverse complement of the sequence in your FASTA file) between positions 97426 and 97537. The next three lines elaborate a bit, especially in saying that it has a short intron at 97464-97500. (I think initial lines of all such groups relevant for the assignment can be found by searching for the pattern "biotype=.*RNA", but I may have missed some; please let me know of you think you have found others.) By convention, the first position of the genome is 1, not zero.

Optionally, you might enjoy participating in the HMM Language Olympics: also print (a) the language in which you coded, and (b) the total CPU time taken by your algorithm for the first 9 Viterbi iterations. Do NOT include the time taken to read the genome sequence, nor the time taken by the 10th iteration, which may be dominated by printing hits. Include basic info about the computer on which you ran the timing test ("6 bazigahertz Intelerola winter olympathalon with 33 bytes of RAM"). See the course FAQ for info on timing programs; please post to the class discussion board if you know how to do this for languages not mentioned in the FAQ. Also (c) please feel free to post this information, and the first 10 hits from your final iteration, to the discussion board, so you can (partially) check your work against each other and see whether/how much language matters. If any of you choose to try the Baum-Welch/posterior extra credit outlined below, feel free to post analogous data from these runs. This timing info is for fun and curiousity only; speed is not one of my grading criteria.

Optional Extra Credit: Especially for those of you who have seen or worked with HMMs in other courses, I recommend trying one or more of the following extra credit steps to deepen your understanding of this important computational model.

Repeat the foregoing process using posterior decoding and Baum-Welch training instead of Viterbi. A "hit" is now a (contiguous) subsequence with state 2 posterior probability above one half. Similar to the above, each iteration should print your transition/emission parameters, number of hits, coordinates of a few, and log P(x|θ) = the natural log of the probability of the genomic input x given the current parameters θ. (Barring numerical issues like roundoff errors, theory predicts that log P(x|θ) should increase with successive iterations. If it doesn't, you've probably got a bug.) Again, give me a list of hits in pass 10, and analyze the first 10 of them over 50 nucleotides. (I expect many of them will overlap the Viterbi hits, so this shouldn't be much extra work.) Optionally, compare your timings to the Viterbi version.
More Numerics: Unlike Viterbi, to implement the forward/backward algorithms for this extra credit part, you need sums of products of probabilities. Again, to avoid underflow/loss of precision, your algorithms should store log probabilities, not the probabilities themselves. Mathematically, log(x+y) = log(exp(log x) + exp(log y)), but there's some danger of underflow/loss of precision when doing this calculation. Instead, do something like this (C syntax):
```
    /* 
     * Given two probabilities x and y, represented by their logs lx, ly,
     * return the log of their sum log(x+y) = log(exp(lx) + exp(ly)).
     *
     * Assume log(0) is represented by NaN.
     *
     * The "lx > ly" trick below is some protection from underflow:
     *   log(a+b) = log(a(1+b/a)) = log(a)+log(1+b/a), 
     * which will be most accurate when b/a < 1.
     */
    double log_of_sum_of_logs(double lx, double ly){
      if(isnan(lx)) return ly;
      if(isnan(ly)) return lx;
      if(lx > ly) {
        return lx + log(1+exp(ly-lx));
      } else {
        return ly + log(1+exp(lx-ly));
      }
    }
```
(Thanks to Tobias Mann for this trick.)
Explore convergence of Viterbi and/or Baum-Welch on this data. Do the answers get better with substantially more iterations? E.g., if you run 100 passes or 1000, do you get the same answers, or does the solution slowly drift to something very different? How sloppy can we be with starting parameter values and still have it converge to an interesting solution? I picked "10 iterations" by eye; how might you automate convergence (and divergence) detection?
NCBI now has genome sequences for several other Methanocaldococcus species. Apply the same process (Viterbi and/or Baum-Welch) to them as well and see how well the hits match across species. E.g., align corresponding hits with a nucleotide-oriented version of your Smith-Waterman code or using public pairwise- or multiple sequence alignment tools.
More generally, does this whole approach work in other AT-rich hyperthermophiles? Non-hyperthermophiles? Cryophiles? Eukaryotes?
The model considered above is a zero-th order model: emission probabilities depend on the HMM's state, but not on previous letters (except insofar as they influence state). It is very possible that a first-order model, similar to the CpG model discussed in lecture and in the text, could do better. Try it and compare to the zero-th order model. Is it better? How much more memory/time does it require? Is convergence faster or slower? Does using extra training data as in the previous point help or hurt? Note that this model may NOT be symmetric with respect to strand, so you may need to run it separately on the reverse-complement of the genome, and training becomes trickier... Think about it a bit and talk to me before you dig in too deep.
If anyone is interested, you could try to duplicate the CpG island classification experiments described in the book on a much larger scale -- i.e., I can point you to genome-wide human CpG data. Time, memory, and numerical accuracy may become more critical on this much larger data set.
If your genome scan predicted RNAs that are not annotated in the GFF file, that may be because it is incomplete and/or out of date. See whether you can locate other resources that lend additional credibility to your predictions. There are several internet sites devoted to RNA. (Rfam is generally recommended, although I don't happen to know whether they have much on M. jannaschii. RNAcentral is another good source.)
Do a more formal analysis of the "accuracy" of this method for its designated purpose of discovering RNA genes. I.e., count True Positives, False Negatives, etc., perhaps including an ROC curve. The discussion in Tompa, et al. [2] may be helpful also.

Code:

Testing: Obviously you want to test your code on some very small hand-verifiable examples. You may also be interested in trying the loaded die example from chapter 3 of DEKM. I have typed in the 300-roll sequence from the book; see comments in the file for more detail (and caveats). Incidentally, the Viterbi path given in DEKM Figure 3.5 and on my slides appears to be completely correct.

Turn in: Your (commented) source code, output logs of the data requested above, and a brief write-up (.pdf, .doc, .rtf or plain text) of what you learned about your first 10 hits, if anything, and other thoughts about this assignment.

Bundle everything into a single .zip or .tgz file and UPLOAD IT TO CANVAS. Make sure your name, email, and student number are included in the writeup and in comments in your code, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial. Please do not include easily rebuilt object or binary files, core dumps, etc.; they are bulky... Please do NOT turn in the large genome files you've downloaded (.fna, .gbff, .gbk, .gff, etc.).

References:

[1] RJ Klein, Z Misulovin, SR Eddy, "Noncoding RNA genes identified in AT-rich hyperthermophiles." Proc. Natl. Acad. Sci. U.S.A., 99, #11 (2002) 7542-7. [offcampus]

[2] M Tompa, N Li, TL Bailey, GM Church, B De Moor, E Eskin, AV Favorov, MC Frith, Y Fu, WJ Kent, VJ Makeev, AA Mironov, WS Noble, G Pavesi, G Pesole, M Régnier, N Simonis, S Sinha, G Thijs, J van Helden, M Vandenbogaert, Z Weng, C Workman, C Ye, Z Zhu, "Assessing computational tools for the discovery of transcription factor binding sites." Nat. Biotechnol., 23, #1 (2005) 137-44. [offcampus]


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX