CSE 490MT: Phase 2 Goals

University of Washington Computer Science & Engineering

There are two things to tackle in phase 2. Both involve some design decisions, meaning that I'm not sure myself exactly how to solve them.

Automatically choosing good FootPrinter parameters.
Choosing FootPrinter parameters by hand is an art, so the first thing I'd like you to do is come up with a list of criteria for comparing two FootPrinter outputs and deciding which is better. Obvious symptoms of bad output include too few or too many motifs, but you have to decide what counts as too few or too many. You should prefer motifs that span more of the tree over ones that span less. You should prefer sets of motifs that occur in (roughly) the same order in many species. Take a look at the examples on the project FootPrinter page to get a feel for how different outputs compare.
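One way to turn the criteria above into something a program can compare is a single numeric score per output. The following is only a minimal sketch under stated assumptions: the `Motif` record, the count bounds `low` and `high`, the equal weighting of the terms, and the use of "fraction of species containing the motif" as a stand-in for tree span are all hypothetical choices, not anything FootPrinter defines. Parsing FootPrinter's actual output into this form is left out.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Motif:
    """One reported motif (hypothetical record, not FootPrinter's own format)."""
    name: str
    positions: dict  # species name -> start position of the motif in that species


def count_penalty(n_motifs, low=3, high=15):
    """Return 0 inside [low, high], growing linearly outside it.

    The bounds are placeholders; deciding them is part of the design.
    """
    if n_motifs < low:
        return low - n_motifs
    if n_motifs > high:
        return n_motifs - high
    return 0


def mean_span(motifs, n_species):
    """Average fraction of species containing each motif (crude proxy for tree span)."""
    if not motifs:
        return 0.0
    return sum(len(m.positions) / n_species for m in motifs) / len(motifs)


def order_agreement(motifs):
    """Fraction of (motif pair, species) cases that follow the pair's majority order."""
    agree = total = 0
    for a, b in combinations(motifs, 2):
        shared = set(a.positions) & set(b.positions)
        if len(shared) < 2:
            continue  # order cannot agree or disagree across fewer than 2 species
        a_first = sum(a.positions[s] < b.positions[s] for s in shared)
        agree += max(a_first, len(shared) - a_first)
        total += len(shared)
    return agree / total if total else 0.0


def score(motifs, n_species):
    """Higher is better: reward span and consistent order, penalize bad counts."""
    return (mean_span(motifs, n_species)
            + order_agreement(motifs)
            - count_penalty(len(motifs)))
```

A real criterion for span would probably use the phylogeny itself (for example, the total branch length of the subtree a motif covers) rather than a simple species count, and the terms would likely need different weights.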
Once you have the list of criteria, what I expect you'll want your program to do is automatically try various settings of FootPrinter's parameters, selecting the one or two settings that lead to the best output according to your criteria. You might want to try motif sizes 10 and 8. You might want to try subregion change costs 1 and 0. You'll want to allow for losses, but the biggest challenge may be to get the config file right. One thing I'd try is a simple rescaling of FootPrinter's 3 "universal" config files. (From the manual: "For a motif of size X, three files are provided: universalXloose.config, universalX.config and universalXtight.config, which will respectively report motifs that are somewhat significant, significant or very significant, approximatively corresponding to p-values of 0.2, 0.1 and 0.05 respectively.") My first guess would be to scale all the spans in these files by C/F, where C is the sum of all ClustalW's branch lengths and F is the sum of all FootPrinter's branch lengths if you were to use the -compute_branch_lengths option (which you won't use in your real program). Just these few suggestions lead to 2 x 2 x 3 = 12 settings of FootPrinter's parameters to be compared according to your criteria. You may want to try more.
Use your team's wiki to post good FootPrinter outputs for the folC data set, and say whether each was found manually or automatically by your phase 2 program.
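The suggestions above amount to a small parameter grid plus one scale factor. As a rough sketch, assuming you have already extracted the branch lengths and the spans into plain Python lists (the config file layout is not reproduced here, and the dictionary keys are hypothetical names, not FootPrinter option names), the enumeration and rescaling could look like this:

```python
from itertools import product

MOTIF_SIZES = (10, 8)
SUBREGION_CHANGE_COSTS = (1, 0)
# Suffixes of the three "universal" config files: universalX{suffix}.config
STRINGENCIES = ("loose", "", "tight")


def scale_factor(clustalw_branch_lengths, footprinter_branch_lengths):
    """Return C/F: total ClustalW branch length over total FootPrinter branch length."""
    c = sum(clustalw_branch_lengths)
    f = sum(footprinter_branch_lengths)
    if f == 0:
        raise ValueError("FootPrinter branch lengths sum to zero")
    return c / f


def rescale_spans(spans, factor):
    """Multiply every span by the same factor (rounding, if needed, is not handled)."""
    return [span * factor for span in spans]


def parameter_grid():
    """Enumerate the 2 x 2 x 3 = 12 candidate settings as plain dictionaries."""
    return [
        {
            "motif_size": size,
            "subregion_change_cost": cost,
            "config_template": f"universal{size}{suffix}.config",
        }
        for size, cost, suffix in product(
            MOTIF_SIZES, SUBREGION_CHANGE_COSTS, STRINGENCIES
        )
    ]
```

Each dictionary from `parameter_grid()` would then be turned into one FootPrinter run, with the named template's spans passed through `rescale_spans` first. Whether the config files accept non-integer spans is something to check against the manual.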

Starting from a set of genes instead of a single gene.
If you go to the BioCyc E. coli folate biosynthesis page, you'll find a list of 11 genes involved in the folate biosynthesis pathway of E. coli. One of these genes is folC, the gene you've been working on in B. subtilis. There is good reason to believe that some of these genes may share the same motifs, because they are regulated by the same mechanism. Start thinking about how you might find these common motifs. You could do 11 runs of your program and compare the 11 FootPrinter outputs for similarities. Or you could combine (some of) these 11 E. coli genes with a few of each of their best BLAST hits, resulting in a single FootPrinter output showing any common motifs. (One reason to be skeptical of this latter suggestion is that the amino acid sequences of these 11 genes are likely so different from each other that the phylogeny you get out of ClustalW may be meaningless.)
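For the first option (one run per gene, then comparing outputs), a simple starting point is to look for motif sequences that recur, allowing a few mismatches, in the outputs for several genes. The sketch below assumes each run has already been reduced to a list of motif strings keyed by gene name; that input format, the function names, and the default thresholds are all hypothetical, and Hamming distance is just one possible similarity measure.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shared_motifs(motifs_by_gene, max_mismatches=1, min_genes=2):
    """Map each motif to the sorted genes whose output contains a near match.

    motifs_by_gene: dict of gene name -> list of motif strings from that
    gene's run. Only motifs matched in at least min_genes genes are returned.
    Motifs of different lengths are never considered a match.
    """
    result = {}
    for motifs in motifs_by_gene.values():
        for motif in motifs:
            if motif in result:
                continue  # already evaluated this exact sequence
            genes = sorted(
                gene
                for gene, candidates in motifs_by_gene.items()
                if any(
                    len(c) == len(motif) and hamming(c, motif) <= max_mismatches
                    for c in candidates
                )
            )
            if len(genes) >= min_genes:
                result[motif] = genes
    return result
```

This ignores where each motif sits relative to the gene and on which strand, both of which may matter when deciding whether two genes really share a regulatory element.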

[comments to tompa]