CSE 427, Wi '08: Assignment #4

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Homework: RNA Search--sequence or structure?

I've said in class that secondary sturcture is important in many RNA genes, often more important than sequence. In this exercise we'll try to see how important this is. In part I, you'll see whether Smith-Waterman is effective in finding tRNA's in your (or other - see part II) pet bacterium. In part II, you can try a covariance-model-based method (incorporating secondary structure information) to see whether it's better. The assignment is deliberately somewhat open ended and/or ill-defined. Feel free to talk to me, or each other, to innovate and/or further clarify what makes sense. Working in a small group is also ok. Feel free to use the class mailing list to discuss the project, find a partner, etc.

Part I: Modify your Smith-Waterman code from the last assignment to look at nucleotide sequences. Start with a score of +5 for exact nucleotide matches, and -4 for each insertion or deletion (i.e., the alignment of the gap character with a nucleotide). Feel free to try tweaking the scoring if desired.

A few classes of RNA genes in your organism are annotated in the .rnt file you downloaded from NCBIs Microbial FTP page in assignment #1. Use it to find the sequence of the first tRNA in your organism (i.e. the tRNA nearest to position 1 in your genome, a completely arbitrary choice), and use that as a query to search for all tRNAs. Ideally, you could just run Smith-Waterman on the one tRNA (a sequence of length about 75) against the entire genome sequence (of length 1 million or so), and report the top, say, 10000 hits. Plot a histogram of these scores, highlighting which are "true" hits versus false positives (most of them). (For the histogram, I'm thinking of something like slide #11 in my CM slides. If 10000 doesn't include all the "true" tRNAs, try to dig out more until you get them all, or at least estimate what their scores are/how low you need to go to see them all.)

As you saw in assignment #1, defining success can be a bit problematic. It's fine to try to gather the kinds of detailed statistics we did in that assignment, but it's also OK to take a simpler approach: say, a high-scoring match is a "hit" if it is roughly the correct length (say, give or take 20%) and mostly overlaps the true tRNA (say, overlaps 80%).

You may run into serious speed or memory problems trying to handle such a big sequence. If so, break your genome up into big chunks that contain all the tRNAs documented in the .rnt file. (They sometimes come in clusters, so one of your chunks might contain several tRNAs.) You can throw in other big (non-overlapping, non-tRNA-containing) chunks, too. The more sequence you can process the better - the goal is to get a sense for how likely it is for random non-tRNAs to score high by chance. Also, I think with most programming systems it will be better to set up your matrix with 75-ish columns and thousands or millions of rows rather than the reverse, and to fill it in row-by-row. This should maximize the benefit of the cache.

Another thing you should do to save time is to only compute the scores of the local alignments, not reconstruct the alignments themselves (traceback) except for those near true tRNAs (so you can see whether the alignment really overlaps the true tRNA).

Part II: As an alternative that exploits secondary structure to (hopefully) improve accuracy, look at the "RNA family data base" http://www.sanger.ac.uk/Software/Rfam/. I described Rfam briefly in lecture. In a nutshell, for a particular RNA family like tRNAs, their experts build a high-quality "seed alignment," including secondary structure annotation, train a Covariance Model (CM) from it, and, use it to search sequence databases for additional instances of the family (resulting in an automatically produced "full alignment"). tRNAs are family RF00005.

If your pet organism is one that they have already processed, you can simply download their full alignment, extract the subset of it drawn from your organism, (or look under the "genomes" tab on the Rfam web site) and compare it to the .rnt file. Compare these results to the results from Part I, using the same definition of accuracy that you used in that part.

If they did not include your organism, you may either (a) pick a different organism that they did do, then proceed as above, (b) wimp out and only look at M. leprae, or (c) do the scan yourself by downloading the RF00005 CM from Rfam, plus the Infernal software package for CM scanning from http://infernal.janelia.org/. I'd recommend using CMSearch's --hmmfilter option for speed.

(One slight complication is that Rfam is built on top of the EMBL database, the European analog of NCBIs Genbank. It is similar in scope and content, but the accession numbers are different, so you may need to do a little digging to find your organism. Furthermore, their sequence assemblies are sometimes different, so you may also need to do some work to compare or convert the tRNA coordinates, e.g. maybe BLASTing parts of one against the other. Probably they have an analog of the .rnt files, too. It's fine to use that if you can find it. If so, please tell me where you found it...)

Extra Extra credit:

Your ideas here.

Turn-in Instructions

Again, I'd prefer that you run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mycobacterium leprae.

Give me a brief writeup (2-4 pages) summarizing what you learned,
Source files for your program, with appropriate filenames,
Output files from your run(s) on your genome and M. Leprae, and
README: a short text file explaining how to compile and run your program.

Zip these files together and upload them via the UW Catalyst system, here.

If desired, you may work in small groups on this project. Please clearly identify all group members in the files you turn in. Enjoy!

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX