Computational Biology Capstone: Project Software

University of Washington Computer Science & Engineering

Note on Instructional Project machines as servers
The machines at your disposal for capstone computing are the Instructional Project machines kiska and umnak. These are distinct from the Instructional Lab Linux machines, which should not be treated as servers. If you need to work from a Windows machine or from home, log in to kiska or umnak (not to an Instructional Lab Linux machine) to do your work.

MicroFootPrinter

Where to find MicroFootPrinter
MicroFootPrinter has been kindly installed by its author, Shane Neph, on the Instructional Lab Linux systems and the Instructional Project machines (kiska and umnak). The path for its latest version is
/projects/cse/courses/cse481f/bin/default/microfootprinter

Where to find MicroFootPrinter documentation
There is a short paper on the web version of MicroFootPrinter:
Shane Neph and Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, vol. 34, July 2006, W366-W368.
There are slides for a lecture on MicroFootPrinter.
Documentation on the command-line version of MicroFootPrinter is available at
/projects/cse/courses/cse481f/bin/v1.14/bin/MFP-README

MicroFootPrinter parameters
You will always need to use the parameter
-host cubist.cs.washington.edu
It is likely you will want to use the following parameters:
-grouped [filename] -skipduplicatechecks 1 -basichtml 0
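
As a convenience, the path and parameters above can be assembled into a single command line. The following is only a sketch: the function name mfp_command, the extra_args placeholder, and the file name my_groups.txt are hypothetical, and any further arguments MicroFootPrinter requires (see MFP-README) are not shown here.

```python
# Sketch: build a MicroFootPrinter argument list from the parameters listed above.
# Only the path and flags given on this page are used; everything else is a placeholder.
MFP_PATH = "/projects/cse/courses/cse481f/bin/default/microfootprinter"


def mfp_command(grouped_file, extra_args=()):
    """Return an argument list suitable for subprocess.run().

    grouped_file is the file passed to -grouped.
    extra_args is a hypothetical slot for whatever other arguments
    MFP-README says your run needs.
    """
    return [
        MFP_PATH,
        "-host", "cubist.cs.washington.edu",  # always required
        "-grouped", grouped_file,
        "-skipduplicatechecks", "1",
        "-basichtml", "0",
        *extra_args,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(mfp_command("my_groups.txt")))
```

Printing the command before running it makes it easy to check the parameters against MFP-README.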

A script for extracting MicroFootPrinter's motifs
In the directory /projects/cse/courses/cse481f/bin there is a small script called parse-motifs.csh. It takes one input (MicroFootPrinter's HTML motif file) and extracts the text corresponding to the motifs in that file. For instance:
parse-motifs.csh up.out.motifs.html > parsed.results

A program for detecting near-duplicate sequences
See the files in the following directory:
/projects/cse/courses/cse481f/07sp/filter
An example of this program's results.
Issues that are not yet handled by this program:

In the hashing module, I don't believe MicroFootPrinter's NNNNNNNNNN sequences are handled correctly. I've marked the place in the program with a comment.
The global alignment score is not length-normalized, so longer sequences will tend to have higher scores.
The program makes no decision about which of the similar sequences to discard.
    Some threshold of similarity should be chosen, possibly based on a length-normalized global alignment score. (But if a cluster contains many similar sequences, it may be too expensive to run all these pairwise alignments and we may need to revert to a threshold based just on the hashing method.)
    If A and B are reported as similar, and B and C are reported as similar, but A and C are not reported as similar, what should be done? More general cases of intransitivity must be handled.
    If two sequences are similar, the shorter one should be discarded.
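
One possible way to address these open issues is sketched below. It is not the program in the filter directory, and every name and constant in it is an assumption made for illustration: the scoring values, the default threshold of 0.8, and the choice to divide the score by the longer sequence's length. Intransitivity is handled here by treating "similar" as a graph edge and keeping only the longest sequence in each connected component, which is just one of several reasonable policies. The sketch aligns every pair, so it does not address the cost concern raised above for large clusters.

```python
from itertools import combinations

# Illustrative scoring scheme (an assumption, not the course program's values).
MATCH, MISMATCH, GAP = 1, -1, -2


def global_alignment_score(a, b):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]


def normalized_score(a, b):
    """Alignment score divided by the longer length; identical sequences give 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return global_alignment_score(a, b) / (longest * MATCH)


def filter_near_duplicates(seqs, threshold=0.8):
    """Keep the longest sequence from each group of transitively similar sequences."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose normalized score reaches the threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if normalized_score(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    # Within each group, remember the index of the longest sequence.
    best = {}
    for i, s in enumerate(seqs):
        root = find(i)
        if root not in best or len(s) > len(seqs[best[root]]):
            best[root] = i
    return [seqs[i] for i in sorted(best.values())]


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTA", "TTTTTGGGGG"]
    print(filter_near_duplicates(example, threshold=0.6))
```

With this policy, if A and B are similar and B and C are similar, all three fall into one group even when A and C are not directly similar, and only the longest of the three is kept.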

[comments to tompa]