Projects can be done individually, or in small groups.
Large groups are also okay, provided you have a thoughtful plan to
organize and divide the work.
Groups combining people from different
fields are particularly encouraged. Feel free to use the class
email list
cse527a_au09@u.washington.edu
to brainstorm project ideas, try to round
up partners, etc.
Choices for the project include, but are not limited to:
- Literature review: Read 3-4 papers on a coherent topic,
and report on them.
- Implementation: Read 2-3 background papers, implement the
algorithms (or find existing software that implements the
algorithms, or make up your own algorithms), find some test data,
and report on the results.
- HMMs: Re-read the slides and section of the text
motivating HMMs for CpG island detection and implement it. More
details below.
I'd like each individual/group to send me a paragraph or drop by to
tell me who's in the group, describe your topic, the initial papers,
and the implementations and test data (if applicable). Maybe I can
give you some pointers. Please try to do this by early next week...
Deliverables: During finals week, hand in a paper
(approximately 5-10 pages) describing your project, and give a 20-30
minute presentation. (These will be open to everyone in the class.)
See schedule page for details. I would like
to get electronic versions of any code you write, together with
associated specialized input files, make files, sample outputs, etc.
(Please do NOT include vast swaths of genomic sequence, but do tell me
what you used and where you got it.) Electronic copies of your report
and presentation are also welcome, but I would also like your report
on paper, if possible. Use the Catalyst dropbox
for electronic turnin.
Students consistently impress me with creative, cogent project ideas,
so by all means feel free to come up with your own ideas. Here are a
few of mine to get you started:
- Does motif detection improve if you use a more elaborate
background model than the 0-th order model we typically
considered? I've seen a couple of articles recently that discuss
this.
- Evaluating significance of scores from, say, matches to an HMM
can be very slow, involving running it on thousands of random
sequences. I think we can do better...
- What's the state of the art for modeling and/or predicting RNA
structure if pseudoknots are allowed? What about searching for
motifs involving pseudoknots? If that's too slow to be practical,
what about generating alignments of sequences that
(presumably) contain pseudoknots?
- What's the state of the art for RNA tertiary structure prediction?
- In riboswitches and some other noncoding RNAs, existence of two
alternate stable structures is functionally important. I don't
think there has been much work on computational modeling,
prediction or search for RNAs having such features. What can you
find? (I know of one paper that might be useful; ask.)
- Try your favorite algorithm on your favorite organism.
- Sequence data is now available from 40 or more vertebrates.
What can you learn, especially about conserved non-coding regions?
- DNA sequencing technology is changing rapidly. How do the new
methods work? What new biological discoveries are being enabled
thereby and how? What computational problems are inherent?
- There's a lot of fuss recently about "personal genomics." What
is it? Is it all hype, or are there serious prospects? What are
the computational problems therein?
- There are big swaths of computational biology that I haven't
had time to touch, e.g., microarrays, proteomics, protein
structure prediction, systems biology, inference of regulatory
networks, simulation of cells, phylogenetic inference, ... See if
you can dig up some good material about any of these topics.
- Many others possible, too ...
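For reference, the slow baseline mentioned in the score-significance idea above might look like the following minimal sketch. It assumes a null model made by shuffling the query sequence (which preserves base composition); the function name `empirical_p_value` and the toy CG-counting score are hypothetical stand-ins, and a real HMM score would be plugged in as `score`.

```python
import random

def empirical_p_value(score, seq, n_random=1000, seed=0):
    """Estimate P(random score >= observed score) by brute-force sampling.

    `score` is any function mapping a sequence string to a number.
    The null model here (an assumption) is a shuffle of `seq` itself.
    """
    rng = random.Random(seed)
    observed = score(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(letters)  # in-place shuffle keeps base composition fixed
        if score("".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)

def cg_count(s):
    """Toy score (hypothetical): number of CG dinucleotides."""
    return s.count("CG")

if __name__ == "__main__":
    query = "CGCGCGCGCG" + "A" * 10 + "T" * 10
    print(empirical_p_value(cg_count, query))
```

The cost is one full scoring pass per random sequence, which is why thousands of samples become slow for an expensive score, and why the smallest p-value this can resolve is about 1/(n_random + 1).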
If none of these excite you, here's a more
concrete option that I think will be instructive: implement HMM
training/inference to predict CpG islands in the human genome. My
slides and the Durbin, Eddy text, chapter 3, provide background on this
problem. I'd suggest you use data from human chromosome 21,
downloaded from the UCSC genome
browser as your training and test data. Use the "Feb 2009 Human
Assembly (hg19)." The CpG island track should be visible by default;
if not, select it in the "Regulation" section at the bottom of a
typical browser page. You can look at stuff in the browser,
but in general data is downloaded through the "table browser"
interface. Get the chromosome 21 sequence, and the chromosome 21
'CpG' track. (I recommend the ".bed" file format for the latter, but
suit yourself.) Ideally, you can discover the difference between CpG
islands and non-islands from the sequence alone, but it is also OK to
use "labeled" training data, i.e., exploit the CpG track for training.
In either case, use just the first half of chromosome 21 for training your
HMM, and the other half for testing. Any combination of
Viterbi/Baum-Welch training with Viterbi/posterior decoding is fine.
Compare different combinations if you're feeling ambitious. As I said
in class, I haven't tried this, so it may fail spectacularly. If so,
give thought to why, and to how it might be resurrected. If you want
some simple data to test your HMM implementations,
dice.txt contains the loaded dice example from Durbin Chapter 3.
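As a starting point for testing, here is a minimal sketch of Viterbi decoding on the loaded dice example. The transition and emission probabilities are the commonly quoted values for that example and should be checked against the text; the uniform start distribution is an assumption. The layout of dice.txt is not described here, so the sketch simply takes a string of rolls '1' through '6'.

```python
import math

# Assumed parameters for the fair/loaded dice HMM; verify against the text.
STATES = ("F", "L")                      # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}             # assumption: uniform start
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def viterbi(rolls):
    """Return the most probable state path for a string of rolls."""
    if not rolls:
        return ""
    # score[s] = log-probability of the best path so far that ends in s.
    # Working in log space avoids underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]])
             for s in STATES}
    back = []  # one dict per later position: state -> best predecessor
    for r in rolls[1:]:
        ptr, new = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            ptr[s] = prev
            new[s] = (score[prev] + math.log(TRANS[prev][s])
                      + math.log(EMIT[s][r]))
        back.append(ptr)
        score = new
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

if __name__ == "__main__":
    print(viterbi("315116246446644245311321631164"))
```

The same recurrence carries over to the CpG problem by swapping in island and non-island states with nucleotide emissions; for a whole chromosome, the back-pointer list would likely need a more compact representation than one dict per position.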
Whatever your project, a Nobel prize is great; so is a result of the
form "I tried x to solve y, but it didn't work," or "I
didn't have time to finish, but here's how far I got." (In the
latter cases, give some thought to why it failed and/or outline the
next steps to try.)
Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX