Computational Biology Capstone: Project

University of Washington Computer Science & Engineering

Cataloguing the Prokaryotic Regulatory Elements

Ultimate Project steps

1. Cluster all prokaryotic genes by homology (using protein sequence similarity as a proxy for homology).
2. For each cluster of homologous genes, do the following:
   a. Extract the upstream regions of each gene, using MicroFootPrinter (after masking annotated tRNA and rRNA genes).
   b. Discard upstream regions that are near duplicates of others in the set.
   c. Run MicroFootPrinter on the remaining genes to find conserved motifs.
3. Cluster the entire collection of motifs by sequence similarity.
4. For each motif cluster, check for functional enrichment of genes in the operons immediately downstream.
5. Compare the resulting motif clusters to the known binding sites from RegulonDB, RegTransBase, and PRODORIC, and known noncoding RNAs from RFAM and RibEx.
6. Classify the motifs into likely binding sites and likely RNA secondary structure elements.

Phase Alpha steps

1. Write a simple program that parses a file that contains clusters of homologous proteins. For each cluster, write just the PIDs (protein IDs) of that cluster into a file in the format required by microfootprinter -grouped. From this large collection of cluster files select ten or so clusters that each contain between 10 and 50 proteins, to be used for a trial run below.
2. For each of these ten small clusters, do the following:
   a. Extract the upstream regions of each gene, using MicroFootPrinter. The appropriate parameters include -grouped -skipduplicatechecks 1 -auxfilesonly 1 -phylogeny 0. The upstream regions will be in the file up.out.
   b. Discard upstream regions that are near duplicates of others in the set.
   c. Run MicroFootPrinter on the remaining genes to find conserved motifs. The appropriate parameters include -grouped -skipduplicatechecks 1 -basichtml 0. The motifs will be in the file up.out.motifs.html.
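The cluster-parsing step can be sketched as follows. This is only an illustration under stated assumptions: the page does not specify the cluster file layout or the exact format that microfootprinter -grouped expects, so the code assumes one cluster per line with whitespace-separated PIDs as input and one PID per line as output. The function names and the .pids file extension are hypothetical.

```python
from pathlib import Path


def parse_clusters(text):
    """Return a list of clusters, each a list of PIDs.

    Assumed input format (not specified by the course page): one cluster
    per line, PIDs separated by whitespace; blank lines and lines that
    start with '#' are ignored.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        clusters.append(line.split())
    return clusters


def select_trial_clusters(clusters, min_size=10, max_size=50, how_many=10):
    """Pick the first `how_many` clusters whose size lies within the bounds."""
    eligible = [c for c in clusters if min_size <= len(c) <= max_size]
    return eligible[:how_many]


def write_pid_files(clusters, out_dir):
    """Write one file per cluster, one PID per line (assumed layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, pids in enumerate(clusters, start=1):
        path = out_dir / f"cluster_{i:05d}.pids"
        path.write_text("\n".join(pids) + "\n")
        paths.append(path)
    return paths
```

A trial run would then call parse_clusters on the contents of the cluster file, pass the result through select_trial_clusters, and hand the files returned by write_pid_files to MicroFootPrinter.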

Phase Alpha Teams

Team Cluster will be responsible for step 1.
Team Hash and Team Align will each do step 2b. Team Hash will make sequence similarity decisions based on a threshold on the number of n-mers shared by the sequences. Team Align will make these decisions based on a threshold on the length-normalized alignment score. As test input, these teams can choose any cluster from the cluster file and run the simple MicroFootPrinter call in step 2a.
The Coordinators will be responsible for putting the pieces of Phase Alpha together in a computational pipeline. This entails iterating through the output files from Team Cluster, performing step 2a, passing the upstream regions to Team Hash or Team Align, performing step 2c on the result, and collecting all the motif files. The Coordinators will therefore be responsible for coordinating the I/O specifications of the other teams.
All teams should feel free to use the class mailing list for communicating.
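The two near-duplicate criteria for step 2b might look like the sketch below. It is not the teams' actual code: the n-mer length, the scoring scheme (match 1, mismatch -1, gap -2), the choice of a global Needleman-Wunsch alignment, and normalizing by the longer sequence length are all assumptions made for illustration, as are the function names.

```python
def shared_nmers(a, b, n=8):
    """Number of distinct n-mers that occur in both sequences."""
    nmers_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    nmers_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(nmers_a & nmers_b)


def normalized_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score divided by the longer length."""
    if not a or not b:
        return 0.0
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def drop_near_duplicates(regions, is_duplicate):
    """Greedy filter: keep a region only if it duplicates no kept region.

    `regions` maps PID -> upstream sequence. Insertion order decides
    which member of a near-duplicate group survives.
    """
    kept = {}
    for pid, seq in regions.items():
        if not any(is_duplicate(seq, other) for other in kept.values()):
            kept[pid] = seq
    return kept
```

Either criterion plugs into the same filter, for example drop_near_duplicates(regions, lambda a, b: shared_nmers(a, b) >= 50) for Team Hash or drop_near_duplicates(regions, lambda a, b: normalized_alignment_score(a, b) >= 0.9) for Team Align. Both thresholds here are placeholders that would need tuning on real upstream regions.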

Phase Beta steps

1. Write a simple program that parses a file that contains clusters of homologous proteins. For each cluster, write just the PIDs (protein IDs) of that cluster into a file in the format required by microfootprinter -grouped.
2. For each of these clusters, do the following:
   a. Extract the upstream regions of each PID's gene, using MicroFootPrinter. The appropriate parameters include -grouped -skipduplicatechecks 1 -auxfilesonly 1 -phylogeny 0. The upstream regions will be in the file up.out.
   b. Discard PIDs whose upstream regions are near duplicates of others in the set.
   c. If the number of remaining PIDs in the cluster exceeds 100 (FootPrinter's current limit), use the prokaryotic taxonomy to break the cluster into subclusters of size at most 100. Rerun MicroFootPrinter on each of these new subclusters, using -grouped -skipduplicatechecks 1 -auxfilesonly 1 -phylogeny 0.
   d. From each remaining upstream region, mask out annotated tRNA and rRNA genes.
   e. Run MicroFootPrinter on the masked upstream regions to find conserved motifs. The appropriate parameters include -grouped -skipduplicatechecks 1 -basichtml 0 -sequences 1. The motifs will be in the file up.out.motifs.html. There is a script to extract the motifs and their scores from this HTML file.
   f. Use FootPrinter's motif significance scores to determine which motifs are significant enough to retain.
3. Use Tomtom to cluster the entire collection of motifs by sequence similarity.
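One possible way to break an oversized cluster along the taxonomy (step 2c) is sketched below. The page does not say how the taxonomy is represented or how ties should be resolved, so this assumes a hypothetical mapping from each PID to a lineage tuple ordered from broadest to narrowest rank, and it falls back to fixed-size chunks once the lineage gives no further way to split.

```python
from collections import defaultdict


def split_by_taxonomy(pids, lineage, max_size=100, depth=0):
    """Recursively split `pids` into subclusters of at most `max_size`.

    `lineage[pid]` is a tuple of taxon names from broadest to narrowest
    (hypothetical representation). PIDs are grouped by the taxon at
    `depth`; any group that is still too large is split one rank deeper.
    """
    if len(pids) <= max_size:
        return [list(pids)]
    groups = defaultdict(list)
    exhausted = []  # PIDs whose lineage has no rank at this depth
    for pid in pids:
        ranks = lineage[pid]
        if depth < len(ranks):
            groups[ranks[depth]].append(pid)
        else:
            exhausted.append(pid)
    subclusters = []
    for members in groups.values():
        subclusters.extend(
            split_by_taxonomy(members, lineage, max_size, depth + 1)
        )
    # No taxonomic information left: fall back to fixed-size chunks.
    for start in range(0, len(exhausted), max_size):
        subclusters.append(exhausted[start:start + max_size])
    return subclusters
```

This version never merges small sibling groups, so it can produce many tiny subclusters. Whether to merge them back up to the limit of 100 is a design decision the sketch leaves open.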

Phase Beta Teams

Team Cluster has completed step 1 and is responsible for step 3.
Team Align is responsible for steps 2b and 2d.
Team Hash is responsible for step 2f.
The Coordinators are responsible for putting the pieces of Phase Beta together in a computational pipeline. This entails iterating through the output files from Team Cluster, performing step 2a, passing the upstream regions to Team Align for step 2b, performing step 2c on the result, passing the resulting upstream regions back to Team Align for step 2d, performing step 2e on the result, collecting all the motif files, passing them to Team Hash for step 2f, and then passing the retained motifs to Team Cluster for step 3. The Coordinators will therefore be responsible for coordinating the I/O specifications of the other teams.
All teams should feel free to use the class mailing list for communicating.
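The hand-offs described for the Coordinators can be sketched as a driver that calls one function per step. Everything about the interface is an assumption for illustration: the stage names, the idea of passing stages as a dictionary of callables, and the data each stage returns are hypothetical, and the real I/O specifications are for the teams to agree on.

```python
STAGE_NAMES = (
    "extract_upstream",      # step 2a (Coordinators, via MicroFootPrinter)
    "drop_near_duplicates",  # step 2b (Team Align)
    "split_large_cluster",   # step 2c (Coordinators)
    "mask_rna_genes",        # step 2d (Team Align)
    "find_motifs",           # step 2e (Coordinators, via MicroFootPrinter)
    "filter_significant",    # step 2f (Team Hash)
    "cluster_motifs",        # step 3  (Team Cluster)
)


def run_pipeline(pid_files, stages):
    """Run Phase Beta over every cluster file and return the motif clusters.

    `stages` maps each name in STAGE_NAMES to a callable supplied by the
    responsible team (hypothetical interface).
    """
    missing = [name for name in STAGE_NAMES if name not in stages]
    if missing:
        raise KeyError(f"missing pipeline stages: {missing}")

    collected = []
    for pid_file in pid_files:
        regions = stages["extract_upstream"](pid_file)
        regions = stages["drop_near_duplicates"](regions)
        for subcluster in stages["split_large_cluster"](regions):
            masked = stages["mask_rna_genes"](subcluster)
            collected.extend(stages["find_motifs"](masked))
    retained = stages["filter_significant"](collected)
    return stages["cluster_motifs"](retained)
```

Keeping each step behind a plain callable lets a team swap in a new implementation without touching the driver, provided the agreed input and output types stay the same.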

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to tompa]