CSE 527, Au '09: Course Project, Due: Finals Week (12/14-12/18)

University of Washington Computer Science & Engineering

Projects can be done individually, or in small groups. Large groups are also okay, provided you have a thoughtful plan to organize and divide the work. Groups combining people from different fields are particularly encouraged. Feel free to use the class email list cse527a_au09@u.washington.edu to brainstorm project ideas, try to round up partners, etc. Choices for the project include, but are not limited to:

Literature review: Read 3-4 papers on a coherent topic, and report on them.
Implementation: Read 2-3 background papers, implement the algorithms (or find existing software that implements the algorithms, or make up your own algorithms), find some test data, and report on the results.
HMMs: Re-read the slides and section of the text motivating HMMs for CpG island detection and implement it. More details below.

I'd like each individual/group to send me a paragraph or drop by to tell me who's in the group, describe your topic, the initial papers, and the implementations and test data (if applicable). Maybe I can give you some pointers. Please try to do this by early next week...

Deliverables: During finals week, hand in a paper (approximately 5-10 pages) describing your project, and give a 20-30 minute presentation. (These will be open to everyone in the class.) See the schedule page for details. I would like to get electronic versions of any code you write, together with associated specialized input files, make files, sample outputs, etc. (Please do NOT include vast swaths of genomic sequence, but do tell me what you used and where you got it.) Electronic copies of your report and presentation are also welcome, but I would also like your report on paper, if possible. Use the Catalyst dropbox for electronic turnin.

Students consistently impress me with creative, cogent project ideas, so by all means feel free to come up with your own ideas. Here are a few of mine to get you started:

Does motif detection improve if you use a more elaborate background model than the 0-th order model we typically considered? I've seen a couple of articles recently that discuss this.
Evaluating significance of scores from, say, matches to an HMM can be very slow, involving running it on thousands of random sequences. I think we can do better...
What's the state of the art for modeling and/or predicting RNA structure if pseudoknots are allowed? What about searching for motifs involving pseudoknots? If that's too slow to be practical, what about generating alignments of sequences that (presumably) contain pseudoknots?
What's the state of the art for RNA tertiary structure prediction?
In riboswitches and some other noncoding RNAs, existence of two alternate stable structures is functionally important. I don't think there has been much work on computational modeling, prediction or search for RNAs having such features. What can you find? (I know of one paper that might be useful; ask.)
Try your favorite algorithm on your favorite organism.
Sequence data is now available from 40 or more vertebrates. What can you learn, especially about conserved non-coding regions?
DNA sequencing technology is changing rapidly. How do the new methods work? What new biological discoveries are being enabled thereby and how? What computational problems are inherent?
There's a lot of fuss recently about "personal genomics." What is it? Is it all hype, or are there serious prospects? What are the computational problems therein?
There are big swaths of computational biology that I haven't had time to touch, e.g., microarrays, proteomics, protein structure prediction, systems biology, inference of regulatory networks, simulation of cells, phylogenetic inference, ... See if you can dig up some good material about any of these topics.
Many others possible, too ...
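
For the first idea above (richer background models), one possible starting point is an order-k Markov background, in which each nucleotide's probability depends on the k preceding nucleotides; k = 0 recovers the usual 0-th order model. The sketch below is only an illustration under stated assumptions: the function names `train_background` and `log_likelihood`, the add-one pseudocount, and the toy sequences are all made up here and are not taken from any of the articles mentioned.

```python
import math
from collections import defaultdict


def train_background(seqs, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate an order-k Markov background model from training sequences.

    Returns a function prob(context, symbol) giving P(symbol | context),
    where context is the k preceding symbols (the empty string when k == 0).
    The pseudocount is an assumption made for this sketch; it keeps unseen
    contexts and symbols from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, symbol):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(symbol, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Log-probability of seq under the model, conditioning on its first k symbols."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


# Toy comparison of a 0-th order and a 1st order background on made-up data.
training = ["ACACACACGTGTGTGT", "ACGTACGTACGT"]
query = "ACACGTGT"
for order in (0, 1):
    model = train_background(training, order)
    print(order, round(log_likelihood(query, model, order), 3))
```

Note that the two log-likelihoods printed at the end are not directly comparable as written, because the higher-order model skips the first k symbols; a real comparison would have to handle the sequence start consistently and evaluate on held-out data.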

If none of these excite you, here's a more concrete option that I think will be instructive: implement HMM training/inference to predict CpG islands in the human genome. My slides and the Durbin, Eddy text, chapter 3, provide background on this problem. I'd suggest you use data from human chromosome 21, downloaded from the UCSC genome browser, as your training and test data. Use the "Feb 2009 Human Assembly (hg19)." The CpG island track should be visible by default; if not, select it in the "Regulation" section at the bottom of a typical browser page. You can look at stuff in the browser, but in general data is downloaded through the "table browser" interface. Get the chromosome 21 sequence, and the chromosome 21 'CpG' track. (I recommend the ".bed" file format for the latter, but suit yourself.) Ideally, you can discover the difference between CpG islands and non-islands from the sequence alone, but it is also OK to use "labeled" training data, i.e., exploit the CpG track for training. In either case, use just the first half of chromosome 21 for training your HMM, and the other half for testing. Any combination of Viterbi/Baum-Welch training with Viterbi/posterior decoding is fine. Compare different combinations if you're feeling ambitious. As I said in class, I haven't tried this, so it may fail spectacularly. If so, give thought to why, and to how it might be resurrected. If you want some simple data to test your HMM implementations, dice.txt contains the loaded dice example from Durbin Chapter 3.
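
As a sanity check before tackling chromosome 21, it may help to get Viterbi decoding working on the loaded dice example first. The following is a minimal sketch, not a reference solution: it uses the occasionally dishonest casino parameters from Durbin Chapter 3 (fair die uniform; loaded die rolls a 6 with probability 0.5; fair-to-loaded 0.05, loaded-to-fair 0.1), while the uniform initial distribution, the function and variable names, and the short roll string are assumptions made here. The format of dice.txt is not assumed, so the example does not read it.

```python
import math


def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most probable state path for obs, computed in log space."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][symbol]
        back.append(pointers)
        score = new_score
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {a: {b: math.log(p) for b, p in row.items()} for a, row in table.items()}


STATES = ("F", "L")  # F = fair die, L = loaded die
LOG_INIT = {"F": math.log(0.5), "L": math.log(0.5)}  # assumed uniform start
LOG_TRANS = log_table({"F": {"F": 0.95, "L": 0.05},
                       "L": {"F": 0.10, "L": 0.90}})
LOG_EMIT = log_table({
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
})

rolls = "3151" + "6" * 12 + "2413"  # made-up toy rolls, not from dice.txt
print(rolls)
print("".join(viterbi(rolls, STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)))
```

The same function should carry over to the CpG problem by swapping in nucleotide symbols and island/non-island states (for example, the eight-state A+, C+, G+, T+, A-, C-, G-, T- model in Durbin), with parameters estimated from the first half of the chromosome. Working in log space matters there, since products of probabilities over millions of bases would underflow.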

Whatever your project, a Nobel prize is great; so is a result of the form "I tried x to solve y, but it didn't work," or "I didn't have time to finish, but here's how far I got." (In the latter cases, give some thought to why it failed and/or outline the next steps to try.)
