Cataloguing the Prokaryotic Regulatory Elements
Ultimate Project steps
-
Cluster all prokaryotic genes by homology (using protein sequence
similarity as a proxy for homology).
-
For each cluster of homologous genes, do the following:
-
Extract the upstream regions of each gene, using MicroFootPrinter
(after masking annotated tRNA and rRNA genes).
-
Discard upstream regions that are near duplicates of others in the
set.
-
Run MicroFootPrinter on the remaining genes to find conserved motifs.
-
Cluster the entire collection of motifs by sequence similarity.
-
For each motif cluster, check for functional enrichment of genes
in the operons immediately downstream.
-
Compare the resulting motif clusters to the known binding sites from
RegulonDB,
RegTransBase,
and PRODORIC,
and known noncoding RNAs from RFAM and RibEx.
-
Classify the motifs into likely binding sites and likely RNA secondary
structure elements.
Phase Alpha steps
-
Write a simple program that parses a file containing
clusters of homologous proteins. For each cluster,
write just the PIDs (protein IDs) of that cluster into a file in the
format required by microfootprinter -grouped. From this large
collection of cluster files, select about ten clusters that each
contain between 10 and 50 proteins, to be used for the trial run below.
-
For each of these ten small clusters, do the following:
-
Extract the upstream regions of each gene, using MicroFootPrinter. The appropriate
parameters include -grouped -skipduplicatechecks 1 -auxfilesonly 1
-phylogeny 0. The upstream regions will be in the file up.out.
-
Discard upstream regions that are near
duplicates of others in the set.
-
Run MicroFootPrinter on the remaining
genes to find conserved motifs. The appropriate parameters include
-grouped -skipduplicatechecks 1 -basichtml 0. The motifs will be
in the file up.out.motifs.html.
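Step 1 above can be sketched roughly as follows. This is a minimal illustration, not the required implementation: it assumes the input file holds one cluster per line as whitespace-separated PIDs, and that the file format expected by microfootprinter -grouped is one PID per line. Neither format is specified here, so both should be checked against the real data and the MicroFootPrinter documentation. The function and file names are hypothetical.

```python
from pathlib import Path


def read_clusters(path):
    """Read clusters, ASSUMING one cluster per line of whitespace-separated PIDs."""
    clusters = []
    with open(path) as handle:
        for line in handle:
            pids = line.split()
            if pids:  # skip blank lines
                clusters.append(pids)
    return clusters


def write_cluster_files(clusters, out_dir):
    """Write each cluster to its own file, one PID per line (ASSUMED format)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, pids in enumerate(clusters, start=1):
        path = out_dir / f"cluster_{index:05d}.txt"  # hypothetical naming scheme
        path.write_text("\n".join(pids) + "\n")
        paths.append(path)
    return paths


def select_trial_clusters(clusters, min_size=10, max_size=50, how_many=10):
    """Return the first `how_many` clusters with min_size <= size <= max_size."""
    eligible = [c for c in clusters if min_size <= len(c) <= max_size]
    return eligible[:how_many]
```

Taking the first ten eligible clusters is only one possible choice; sampling them at random would give a more representative trial set.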
Phase Alpha Teams
-
Team Cluster will be responsible for step 1.
-
Team Hash and Team Align will each do step 2b. Team Hash will
make sequence similarity decisions based on a threshold on
the number of n-mers shared by the sequences. Team Align will make
these decisions based on a threshold on the length-normalized
alignment score. As test input, these teams can choose any cluster from
the cluster file and run the simple MicroFootPrinter call in step 2a.
-
The Coordinators will be responsible for putting the pieces of
Phase Alpha together in a computational pipeline. This entails
iterating through the output files from Team Cluster, performing step
2a, passing the upstream regions to Team Hash or Team Align,
performing step 2c on the result, and collecting all the motif files.
The Coordinators will therefore be responsible for coordinating the
I/O specifications of the other teams.
-
All teams should feel free to use the class mailing list for communicating.
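Team Hash's approach to step 2b can be illustrated with a short sketch. The text above specifies only "a threshold on the number of n-mers shared"; the version below normalizes that count by the size of the smaller n-mer set so that the threshold does not depend on sequence length, which is an assumption rather than part of the assignment. The values n=8 and threshold=0.9 are placeholders, and all function names are hypothetical.

```python
def nmers(sequence, n):
    """Return the set of all length-n substrings of the upper-cased sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def shared_fraction(seq_a, seq_b, n=8):
    """Fraction of the smaller n-mer set that also occurs in the other sequence."""
    a, b = nmers(seq_a, n), nmers(seq_b, n)
    if not a or not b:  # a sequence shorter than n has no n-mers
        return 0.0
    return len(a & b) / min(len(a), len(b))


def drop_near_duplicates(regions, n=8, threshold=0.9):
    """Greedily keep regions that are not near duplicates of one already kept.

    `regions` maps PID -> upstream sequence; input order decides which
    member of a near-duplicate pair survives.
    """
    kept = {}
    for pid, seq in regions.items():
        if all(shared_fraction(seq, other, n) < threshold
               for other in kept.values()):
            kept[pid] = seq
    return kept
```

The greedy pass compares each region against every region already kept, so it is quadratic in cluster size; that should be acceptable for clusters of 10 to 50 proteins. Team Align's variant would replace `shared_fraction` with a length-normalized alignment score and leave the rest unchanged.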
Phase Beta steps
-
Write a simple program that parses a file containing
clusters of homologous proteins. For each cluster,
write just the PIDs (protein IDs) of that cluster into a file in the
format required by microfootprinter -grouped.
-
For each of these clusters, do the following:
-
Extract the upstream regions of each PID's gene, using MicroFootPrinter. The appropriate
parameters include -grouped -skipduplicatechecks 1 -auxfilesonly 1
-phylogeny 0. The upstream regions will be in the file up.out.
-
Discard PIDs whose upstream regions are near
duplicates of others in the set.
-
If the number of remaining PIDs in the cluster exceeds 100 (FootPrinter's
current limit), use the prokaryotic
taxonomy to break the cluster into subclusters of size at most 100.
Rerun MicroFootPrinter on each of these
new subclusters, using -grouped -skipduplicatechecks 1
-auxfilesonly 1 -phylogeny 0.
-
From each remaining upstream region, mask out annotated tRNA and rRNA genes.
-
Run MicroFootPrinter on the masked
upstream regions to find conserved motifs. The appropriate parameters
include -grouped -skipduplicatechecks 1 -basichtml 0 -sequences 1.
The motifs will be in the file up.out.motifs.html.
There is a script to extract the motifs
and their scores from this HTML file.
-
Use FootPrinter's motif significance scores to determine which motifs
are significant enough to retain.
-
Use Tomtom to
cluster the entire collection of motifs by sequence similarity.
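The taxonomy-based splitting in step 2c could be approached as in the sketch below. It assumes each PID comes with a lineage, given as a tuple of taxon names ordered from root to leaf; how lineages are obtained and represented is not specified above, so that input is hypothetical. The cluster is split at successively deeper taxonomic ranks until every subcluster has at most 100 members, with fixed-size chunks as a fallback once the lineages are exhausted.

```python
from collections import defaultdict


def split_by_taxonomy(pids, lineage, max_size=100, depth=0):
    """Recursively split `pids` along their lineages until each group fits.

    `lineage` maps PID -> tuple of taxon names ordered from root to leaf
    (a hypothetical input format, not one defined by the project).
    """
    pids = list(pids)
    if len(pids) <= max_size:
        return [pids]
    groups = defaultdict(list)
    for pid in pids:
        taxa = lineage[pid]
        # None marks a PID whose lineage has no rank at this depth
        key = taxa[depth] if depth < len(taxa) else None
        groups[key].append(pid)
    if len(groups) == 1 and None in groups:
        # Every lineage is exhausted: fall back to fixed-size chunks
        return [pids[i:i + max_size] for i in range(0, len(pids), max_size)]
    subclusters = []
    for members in groups.values():
        subclusters.extend(
            split_by_taxonomy(members, lineage, max_size, depth + 1))
    return subclusters
```

This sketch never merges small sibling groups, so it can produce subclusters with very few members; whether to merge those back together up to the limit of 100 is a design decision left open here.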
Phase Beta Teams
-
Team Cluster has completed step 1 and is responsible for step 3.
-
Team Align is responsible for steps 2b and 2d.
-
Team Hash is responsible for step 2f.
-
The Coordinators are responsible for putting the pieces of
Phase Beta together in a computational pipeline. This entails
iterating through the output files from Team Cluster, performing step
2a, passing the upstream regions to Team Align, performing step 2c on
the result, passing the resulting upstream regions back to Team Align,
performing step 2e on the result, collecting all the motif files,
passing them to Team Hash for step 2f, and then to Team Cluster for
step 3. The Coordinators will therefore be responsible for
coordinating the I/O specifications of the other teams.
-
All teams should feel free to use the class mailing list for communicating.
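The hand-offs in the Coordinators' pipeline can be made explicit with a small driver that takes each team's contribution as a function argument. This is only a sketch of the control flow described above: every stage function and its input and output types are hypothetical placeholders for whatever I/O specification the teams agree on, and nothing here calls MicroFootPrinter or Tomtom directly.

```python
def run_phase_beta(cluster_files, extract, dedupe, split, mask, find_motifs,
                   keep_significant, cluster_motifs, max_size=100):
    """Chain the Phase Beta steps; every stage is a caller-supplied function.

    Assumed (hypothetical) contracts:
      extract(cluster_file)     -> dict PID -> upstream region   (step 2a)
      dedupe(regions)           -> dict with near duplicates removed (2b)
      split(regions, max_size)  -> list of dicts, each <= max_size (2c)
      mask(regions)             -> dict with tRNA/rRNA masked     (2d)
      find_motifs(regions)      -> list of motifs                 (2e)
      keep_significant(motifs)  -> filtered list of motifs        (2f)
      cluster_motifs(motifs)    -> motif clusters                 (step 3)
    """
    all_motifs = []
    for cluster_file in cluster_files:       # output of step 1
        regions = dedupe(extract(cluster_file))
        if len(regions) > max_size:
            groups = split(regions, max_size)
        else:
            groups = [regions]
        for group in groups:
            motifs = find_motifs(mask(group))
            all_motifs.extend(keep_significant(motifs))
    return cluster_motifs(all_motifs)
```

Passing the stages in as functions lets each team develop and test its step against stubs before the real implementations are ready.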