CSE 527, Au '07: Course Project, Due: Finals Week (12/10-12/13)

University of Washington Computer Science & Engineering


Projects can be done individually, or in small groups. Large groups are also okay, provided you have a thoughtful plan to organize and divide the work. Groups combining people from different fields are particularly encouraged. Feel free to use the class email list cse527a_au07@u.washington.edu to brainstorm project ideas, try to round up partners, etc. Choices for the project include, but are not limited to:

Literature review: Read 3-4 papers on a coherent topic, and report on them.
Implementation: Read 2-3 background papers, implement the algorithms (or find existing software that implements the algorithms), find some test data, and report on the results.

I'd like each group to send me a paragraph or drop by to tell me who's in the group, describe your topic, the initial papers, and the implementations and test data (if applicable). Maybe I can give you some pointers. Please try to do this by 11/7.

By the middle of finals week, hand in a paper (approximately 5-10 pages) describing the project, and give a 20-30 minute presentation.

Students consistently impress me with creative, cogent project ideas, so by all means feel free to come up with your own ideas. Here are a few of mine to get you started:

The idea I presented in class: evaluating the significance of RNA predictions relative to a mono- vs. di-nucleotide background.
Does motif detection improve if you use a more elaborate background model than the 0-th order model we typically considered? I've seen a couple of articles recently that discuss this.
Evaluating significance of scores from, say, matches to an HMM can be very slow, involving running it on thousands of random sequences. I think we can do better...
What's the state of the art for modeling and/or predicting RNA structure if pseudoknots are allowed? What about searching for motifs involving pseudoknots?
What's the state of the art for RNA tertiary structure prediction?
In riboswitches and some other noncoding RNAs, existence of two alternate stable structures is functionally important. I don't think there has been much work on computational modeling, prediction or search for RNAs having such features. What can you find? (I know of one paper that might be useful.)
De Novo Discovery of Non-Coding RNA Genes: If you did HW#4, here's an extension that might be of interest. Given the success of the simple approach we used in HW#4 for finding non-coding RNA genes, it is natural to wonder whether any of the GC rich patches we found which were not previously annotated as tRNAs or rRNAs are in fact real non-coding RNA genes. Short of wet-lab experiments (as in [1]), how might we tell? One approach would be to look for similar sequences in related organisms. Here's a sketch of one possible approach. Take a look at the taxonomy information at the NCBI web site, and select some organisms related to M. jannaschii, probably AT-rich ones, perhaps also hyperthermophiles. Run your HW#4 algorithm on them as well. Filter out the tRNA and rRNA hits and any shorter than, say, 50 nucleotides. Perhaps extend each hit by 25-50 nucleotides in each direction, in case the Viterbi boundaries were somewhat off. Try to match each hit in one organism to its (putative) orthologs in the others, based perhaps on length, BLAST matches (feel free to download & install it locally, either the NCBI or (faster) WU versions) or Smith-Waterman alignments (modify your HW#2 code, or perhaps use the ssearch component of W. Pearson's fasta package). Even matching them by eye is OK, although that obviously won't scale very well... Do any of the putative ortholog groups appear to have conserved secondary structures? Perform secondary structure predictions using the Vienna RNA package, Pfold, CMfinder or other tools. What do you find?
This problem is obviously open-ended, and not certain of success. I'm open to just about anything you want to try this side of ouija boards; just describe what you try and how it works (or doesn't), and perhaps your thoughts on better alternatives.
Try your favorite algorithm on your favorite organism.
Sequence data is now available from 28 or more vertebrates. What can you learn, especially about conserved non-coding regions?
There are big swaths of computational biology that I haven't had time to touch, e.g., microarrays, proteomics, protein structure prediction, systems biology, inference of regulatory networks, simulation of cells, phylogenetic inference, ... See if you can dig up some good material about any of these topics.
Many others possible, too ...
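
For the de novo ncRNA discovery idea above, the filter-and-extend step could be sketched roughly as follows. This is only an illustrative sketch in Python, not HW#4 code: the Hit type, the function names, the half-open coordinate convention, and the assumption that annotated tRNA/rRNA locations are available as intervals are all hypothetical. The 50-nucleotide minimum and the 25-nucleotide flank are taken from the suggestions above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A candidate region on the genome (hypothetical representation)."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def filter_and_extend(hits: List[Hit], known: List[Hit], genome_len: int,
                      min_len: int = 50, flank: int = 25) -> List[Hit]:
    """Drop short hits and hits overlapping annotated tRNAs/rRNAs,
    then pad each survivor by `flank` nucleotides on both sides."""
    kept = []
    for h in hits:
        if h.end - h.start < min_len:
            continue  # shorter than the minimum length
        if any(overlaps(h, k) for k in known):
            continue  # already annotated as tRNA or rRNA
        # Extend in case the Viterbi boundaries were somewhat off,
        # clipping so the interval stays inside the genome.
        kept.append(Hit(max(0, h.start - flank),
                        min(genome_len, h.end + flank)))
    return kept


# Toy example with made-up coordinates.
example_hits = [Hit(10, 40), Hit(100, 180), Hit(300, 400), Hit(950, 1000)]
example_known = [Hit(150, 220)]
example_result = filter_and_extend(example_hits, example_known, genome_len=1000)
```

In the toy example, the first hit is dropped for length and the second for overlapping an annotated gene; the other two are kept and padded. The padded sequences could then be handed to BLAST, ssearch, or a structure predictor for the ortholog-matching and folding steps.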

A Nobel prize is great; so is a result of the form "I tried x to solve y, but it didn't work" (but give some thought to why it failed and maybe suggest a next step to try).

References:

[1] RJ Klein, Z Misulovin, SR Eddy, "Noncoding RNA genes identified in AT-rich hyperthermophiles." Proc. Natl. Acad. Sci. U.S.A., 99, #11 (2002) 7542-7.

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX

