This is a list of possible course project ideas, to get you started thinking. More ideas will be added as the course progresses, so check back. You are welcome to propose your own project idea to the instructor.
With each project idea a contact person is listed. This person has graciously supplied the idea, and expressed willingness to discuss more details with any project group interested in pursuing the project. (If the contact is the instructor, then it may not have been gracious, but still willing. Thanks to the other contacts for their help.)
This type of data was popular before routine sequencing was possible, and is still being generated in many instances. There has not been much research on this in the last 25 years, so a group could potentially make some very interesting contributions. Furthermore, there are some simple questions which could be addressed with interesting results in within 2-3 weeks. A recent paper: Karlin, Blaisdell, & Bucher. Protein Engineering. 5(8):729-738. 1992.
In general, composition alone is not sufficient to say much about a protein. For example, it cannot distinguish between chymotrypsin and trypsin. However, coupled with additional information, composition can be useful. For example, if you are told that a protein is a trypsin, then, given the composition, you can say what kind of trypsin it is. An unaddressed question: Given that a protein is a lactate dehydrogenase, can composition be used to determine whther it is an LDH A, LDH B, or LDH C? (I have the data; you are welcome to it; I haven't even looked at it yet).
Other interesting questions: Is there a relationship between body temperature and average protein composition? What about genome GC content? What about mitochondrial- versus nuclear-encoded proteins? Etc...
These questions are very statistical in nature, and might be addressed with techniques such as principal components, discriminant analysis, or support-vector machines. It is not clear to me what sort of algorithmic issues maight arise if one looks at this data set, but there may very well be some interesting ones, depending on what aspect of the data one chooses to bite into.
Contact: Jared Roach (roach@u.washington.edu).
This problem relates to establishing synteny of stretches of
chromosomes from different species, as well as locating global
duplications within a genome. More generally, it relates to
deciphering
the evolutionary history of genomic organization. This task is useful
in
setting the stage for sequence alignment, as it is helpful to know
that
two sequences are orthologous before attempting a sequence alignment.
A particular example might come into play if one were comparing
two long sequences (i.e., ~1 Mb) from orthologous regions of the human
and
mouse genomes. Each of these loci is derived from a common ancestral
locus
which had some (unknown) annotation such as:
Repeat1-Gene1-Gene2-Gene3-Repeat4-Gene4
After the species divereged, each locus evolved independently, perhaps
giving rise to annotations that look like:
Repeat5-Repeat6-Gene1-Gene2-Repeat7-Gene3-Repeat8-Gene4
and
Gene1-Repeat9-Gene2-Gene3-GeneThatIsVerySimilarToGene3-Repeat10-Gene4
It might initially appear that the alignment problem is exactly
identical to aligning DNA or protein sequences, but with a new
alphabet. However, there are some interesting novel aspects that may
be considered. For example, if there is a gap in the final alignment,
the score of that alignment might depend on what is in that gap. So,
for example, if a repeat is known to have spread only in the mouse
genome after the species divergence, then one might not penalize that
particular gap in the human sequence at all.
Furthermore, the length of "annotation sequences" is much shorter than
the
corresponding DNA sequences, so algorithms that are less efficient can
possiby be considered.
Contact: Jared Roach (roach@u.washington.edu).
This sequence presents two challenges. One challenge is that the "x"s (meaning any residue) CANNOT be Cysteine. The second challenge is that the actual number of "x"s in the two longer substrings is flexible, within a range of 5-20. I know of one sequence in the SWISS-PROT database that should be recognized as containing this motif, but hasn't been.
Contact: Sandra Pennington (spenning@genetics.washington.edu).
The superfamily of phosphodiesterases contains 11 known families; each family has 1-6 genes and each gene can be alternatively spliced in several different ways. Several new genes and four gene families have been identified in the last three years by simple homology searching: using known protein sequences to query large databases of ESTs. ESTs are anonymously cloned cDNAs (known sequence but unidentified function).
These searches have a major problem in formatting: most "hits" in the database are uninteresting in that they match perfectly to the query string. If they have a region which does not match, this region is not reported in the search output. Another problem is redundancy. Each query can match hundreds of ESTs perfectly and it is easy to miss the one imperfect match stuck in with these hits.
The NCBI has a group working on clustering algorithms to reduce this redundancy in a project called UniGene. It is the goal of UniGene to cluster all ESTs and non-anonymous DNA sequences from each gene into "UniGenes." They distribute their builds of this dataset monthly or so and it can be searched locally using BLAST (or other string comparison algorithms).
I have written a perl script to search the UniGene databases with the 120 or so known phosphodiesterase protein sequences and report back the UniGene sequences which match. This has already identified one potential new enzyme.
I now need to find a way to align the 1-200 sequences within each UniGene cluster to check if they all should really be clustered together. Questions to be answered are: Are there new splice varients of known genes? Are there new genes? Do the new sequences look like cloning artifacts or authentic cDNAs? How can one identify the full-length cDNA sequence of any new sequences?
Contact: Jonathan Schaefer (schaefer@u.washington.edu).
In most cases these errors are picked up early in processing the sequence. In particular, when contigs are constructed from multiple overlapping raw sequence reads (possibly in both directions), an indel in a single raw sequence read will be detected and corrected in the consensus of the contig. However, despite this, a few slip through and are published in Genbank.
Even after contig construction, there is a final trick that can be used to catch indels. If the DNA sequence codes for a protein (assume for a moment that it is a cDNA), then there should be no stop codons until the end of the protein. If there is a premature stop codon, then the sequence analyst will tend to look very hard for an error (either mis-read or indel) to explain this stop codon. If, for example, this stop codon occurs right after a stretch of 4 As late in a sequence read in a region of poor raw data coverage, then the analyst might pull up the raw chromatograms and convince themselves that there was an analysis error, and thus explain the premature stop codon. In addition to unexpected stop codons, there may be other conserved features that might signal an analysis error. A typical example would be a canonical active site in an enzyme. In general such heuristics are very good at picking up indels, because an indel alters the entire downstream coding sequence.
However, there are situations where these heuristics will not work well. These occur when an insertion occurs, and then a few bp afterwards a compensatory deletion occurs (or vice versa). In this case, only a few predicted residues are altered, and if none of these becomes a stop codon, and none of them are in a region predicted to be conserved, then the analyst is likely to miss the paired errors. Even if the analyst notices a difference in a conserved region, it might be difficult to determine whether this is because there is a true underlying difference (e.g., due to selective pressure and a change in function) or whether the difference is due to a sequencing error.
Enter the algorithm. Given a new sequence with potential frameshift errors, compare it to a reference sequence(s), and suggest whether or not errors are present, and where they might be. One can also imagine a couple simple variations on the description of the problem, but this description should catch the gist of the necessary algorithm. Such an algorithm might work by testing all possible pairs of indels in the DNA alignment and seeing if they increase the global alignment score of the protein alignment. There might be some clever dynamic tricks here.
One could make some assumptions if necessary. For example, one might assume that the indels are at least q bp apart. We can talk about what q might be set to.
One would presumably be able to find a long frameshifted stretch by noting that correcting a small portion of the stretch increased the protein alignment score. One could then extend the region until the protein score failed to increase. One might want to iterate the algorithm, as it is possible that there could be a second or third pair of indels. Conceivably even nested, but I have never observed the nested case. One might also modify the algorithm slightly to look for premature termination due to a frameshift (e.g., the real protein sequence extends farther than predicted).
A tool to do this could become very useful. If used without care for diplomacy, it could get one into a lot of trouble. Most major labs probably have a bunch of frameshifted sequences in Genbank. Rubbing people's noses in their frameshifts could get messy...
I have a couple of sequences that such a program could be trained on.
Note that I have been assuming the availability of reference sequence(s) for this algorithm. One might also design an algorithm to work in the absence of a reference sequence, and rely instead on a statistical description of what sequences should look like (e.g., the length distribution of sequences in C. elegans). Also note that one is only asking the algorithm to propose paired corrective indels. Following this proposal, one would assume the analyst would look at the original raw data, or produce more, before jumping to any conclusions.
Contact: Jared Roach (roach@u.washington.edu).