CSE 527: Computational Biology
Project Ideas

This is a list of possible course project ideas, to get you started thinking. More ideas will be added as the course progresses, so check back. You are welcome to propose your own project idea to the instructor.

With each project idea a contact person is listed. This person has graciously supplied the idea, and expressed willingness to discuss more details with any project group interested in pursuing the project. (If the contact is the instructor, then it may not have been gracious, but still willing. Thanks to the other contacts for their help.)

  1. In organisms without introns in their genes, possibly the most difficult part of identifying the genes, given the genomic sequence, is locating the true start codon. The instructor has some evidence that one or more of the bacterial genomes has many start codons misannotated. Run an up-to-date gene finder on this genome, and see if the reannotated genes pass the instructor's test. Contact: Martin Tompa.

  2. One of the bacterial genomes in the public databases has a mysterious but pervasive motif in the regions just upstream from its translation start sites. This motif does not seem to be part of the standard ribosome binding site. Explore whether this motif is pervasive in other parts of the genome as well, and try to find clues as to its meaning. Contact: Martin Tompa.

  3. It is not too hard to extend the optimal alignment algorithm to allow "don't care" regions, which would allow poorly matching substrings to be aligned at a small penalty (e.g., using an affine linear penalty function). Explore possible applications of this idea. Contact: Martin Tompa.

  4. For the cDNA matching problem discussed in the lecture notes, investigate what is known about the distribution of intron and exon lengths in a eukaryote of your choice, and devise a gap penalty function that models this distribution well. Design and analyze an efficient algorithm to find optimal alignments with this gap penalty function, and try it out on cDNA data. Contact: Martin Tompa.

  5. Investigate what is known about any of the following types of sites in your favorite organism, and experiment with an algorithm to find these sites in genomic data: Contact: Martin Tompa.

  6. Consider proteins that have been characterized only in terms of their amino acid compositions, unordered. This data can be obtained by sequencing them, and then discarding the ordering of the residues, for example. What can be said about the proteins?

    This type of data was popular before routine sequencing was possible, and is still being generated in many instances. There has not been much research on this in the last 25 years, so a group could potentially make some very interesting contributions. Furthermore, there are some simple questions which could be addressed with interesting results in within 2-3 weeks. A recent paper: Karlin, Blaisdell, & Bucher. Protein Engineering. 5(8):729-738. 1992.

    In general, composition alone is not sufficient to say much about a protein. For example, it cannot distinguish between chymotrypsin and trypsin. However, coupled with additional information, composition can be useful. For example, if you are told that a protein is a trypsin, then, given the composition, you can say what kind of trypsin it is. An unaddressed question: Given that a protein is a lactate dehydrogenase, can composition be used to determine whther it is an LDH A, LDH B, or LDH C? (I have the data; you are welcome to it; I haven't even looked at it yet).

    Other interesting questions: Is there a relationship between body temperature and average protein composition? What about genome GC content? What about mitochondrial- versus nuclear-encoded proteins? Etc...

    These questions are very statistical in nature, and might be addressed with techniques such as principal components, discriminant analysis, or support-vector machines. It is not clear to me what sort of algorithmic issues maight arise if one looks at this data set, but there may very well be some interesting ones, depending on what aspect of the data one chooses to bite into.

    Contact: Jared Roach (roach@u.washington.edu).

  7. The information in a long DNA sequence can be summarized as a series of annotations, where each annotation corresponds to a specific gene or specific repeat element (there are some other types of annotations as well). Consider alignment of annotations and the associated issues.

    This problem relates to establishing synteny of stretches of chromosomes from different species, as well as locating global duplications within a genome. More generally, it relates to deciphering the evolutionary history of genomic organization. This task is useful in setting the stage for sequence alignment, as it is helpful to know that two sequences are orthologous before attempting a sequence alignment. A particular example might come into play if one were comparing two long sequences (i.e., ~1 Mb) from orthologous regions of the human and mouse genomes. Each of these loci is derived from a common ancestral locus which had some (unknown) annotation such as:
    Repeat1-Gene1-Gene2-Gene3-Repeat4-Gene4
    After the species divereged, each locus evolved independently, perhaps giving rise to annotations that look like:
    Repeat5-Repeat6-Gene1-Gene2-Repeat7-Gene3-Repeat8-Gene4
    and
    Gene1-Repeat9-Gene2-Gene3-GeneThatIsVerySimilarToGene3-Repeat10-Gene4
    It might initially appear that the alignment problem is exactly identical to aligning DNA or protein sequences, but with a new alphabet. However, there are some interesting novel aspects that may be considered. For example, if there is a gap in the final alignment, the score of that alignment might depend on what is in that gap. So, for example, if a repeat is known to have spread only in the mouse genome after the species divergence, then one might not penalize that particular gap in the human sequence at all. Furthermore, the length of "annotation sequences" is much shorter than the corresponding DNA sequences, so algorithms that are less efficient can possiby be considered.

    Contact: Jared Roach (roach@u.washington.edu).

  8. Devise a way to search protein sequence databases for a particular motif, one that the more commonly used programs fail to recognize. The motif is an atypical zinc finger of general form: CxxCxxxxxxxCxxxxxxxxxxxCxxC

    This sequence presents two challenges. One challenge is that the "x"s (meaning any residue) CANNOT be Cysteine. The second challenge is that the actual number of "x"s in the two longer substrings is flexible, within a range of 5-20. I know of one sequence in the SWISS-PROT database that should be recognized as containing this motif, but hasn't been.

    Contact: Sandra Pennington (spenning@genetics.washington.edu).

  9. Perform a comprehensive search for all phosphodiesterases. Phosphodiesterases are enzymes which degrade cyclic AMP and/or cyclic GMP, important small molecule "second messengers." It is the local concentration of these second messengers which determines the reaction of cells to various stimuli (e.g. hormones, other proteins). As an example, vision depends critically on phosphodiesterase 6. Different cells have different forms of this enzyme to allow a variety of reactions to a variety of stimuli.

    The superfamily of phosphodiesterases contains 11 known families; each family has 1-6 genes and each gene can be alternatively spliced in several different ways. Several new genes and four gene families have been identified in the last three years by simple homology searching: using known protein sequences to query large databases of ESTs. ESTs are anonymously cloned cDNAs (known sequence but unidentified function).

    These searches have a major problem in formatting: most "hits" in the database are uninteresting in that they match perfectly to the query string. If they have a region which does not match, this region is not reported in the search output. Another problem is redundancy. Each query can match hundreds of ESTs perfectly and it is easy to miss the one imperfect match stuck in with these hits.

    The NCBI has a group working on clustering algorithms to reduce this redundancy in a project called UniGene. It is the goal of UniGene to cluster all ESTs and non-anonymous DNA sequences from each gene into "UniGenes." They distribute their builds of this dataset monthly or so and it can be searched locally using BLAST (or other string comparison algorithms).

    I have written a perl script to search the UniGene databases with the 120 or so known phosphodiesterase protein sequences and report back the UniGene sequences which match. This has already identified one potential new enzyme.

    I now need to find a way to align the 1-200 sequences within each UniGene cluster to check if they all should really be clustered together. Questions to be answered are: Are there new splice varients of known genes? Are there new genes? Do the new sequences look like cloning artifacts or authentic cDNAs? How can one identify the full-length cDNA sequence of any new sequences?

    Contact: Jonathan Schaefer (schaefer@u.washington.edu).

  10. The prokaryotic ribosome binds to the mRNA it is about to translate by RNA-RNA hybridization: the 3' end of the 16S rRNA is roughly complementary to the 5' untranlated region of the mRNA just upstream from the start codon. Given the 5' untranslated region of the mRNA and the 3' end of the 16S rRNA, run dynamic programming alignment to find the best local alignment, thus identifying where the ribosome binds on the mRNA. You will have to use well known RNA binding energy rules, which require the alignment to look at neighboring bases and also affect the gap penalties. Contact: Martin Tompa.

  11. Consider alignment of a protein sequence with another protein sequence (or more generally, another profile) allowing a limited number of insertions and deletions in the underlying nucleic acid sequence of the first protein sequence. Frameshift errors (as opposed to frameshift mutations) occur during manual or automated sequencing of DNA. Much as a base can be "mis-called" during sequencing, an extra base can be slipped in, or one missed. Such "indels" often occur near the end of reads, and particularly in short simple sequence repeats. So, "CATGAAAATCTT" might be mis-read as "CATGAAATCTT".

    In most cases these errors are picked up early in processing the sequence. In particular, when contigs are constructed from multiple overlapping raw sequence reads (possibly in both directions), an indel in a single raw sequence read will be detected and corrected in the consensus of the contig. However, despite this, a few slip through and are published in Genbank.

    Even after contig construction, there is a final trick that can be used to catch indels. If the DNA sequence codes for a protein (assume for a moment that it is a cDNA), then there should be no stop codons until the end of the protein. If there is a premature stop codon, then the sequence analyst will tend to look very hard for an error (either mis-read or indel) to explain this stop codon. If, for example, this stop codon occurs right after a stretch of 4 As late in a sequence read in a region of poor raw data coverage, then the analyst might pull up the raw chromatograms and convince themselves that there was an analysis error, and thus explain the premature stop codon. In addition to unexpected stop codons, there may be other conserved features that might signal an analysis error. A typical example would be a canonical active site in an enzyme. In general such heuristics are very good at picking up indels, because an indel alters the entire downstream coding sequence.

    However, there are situations where these heuristics will not work well. These occur when an insertion occurs, and then a few bp afterwards a compensatory deletion occurs (or vice versa). In this case, only a few predicted residues are altered, and if none of these becomes a stop codon, and none of them are in a region predicted to be conserved, then the analyst is likely to miss the paired errors. Even if the analyst notices a difference in a conserved region, it might be difficult to determine whether this is because there is a true underlying difference (e.g., due to selective pressure and a change in function) or whether the difference is due to a sequencing error.

    Enter the algorithm. Given a new sequence with potential frameshift errors, compare it to a reference sequence(s), and suggest whether or not errors are present, and where they might be. One can also imagine a couple simple variations on the description of the problem, but this description should catch the gist of the necessary algorithm. Such an algorithm might work by testing all possible pairs of indels in the DNA alignment and seeing if they increase the global alignment score of the protein alignment. There might be some clever dynamic tricks here.

    One could make some assumptions if necessary. For example, one might assume that the indels are at least q bp apart. We can talk about what q might be set to.

    One would presumably be able to find a long frameshifted stretch by noting that correcting a small portion of the stretch increased the protein alignment score. One could then extend the region until the protein score failed to increase. One might want to iterate the algorithm, as it is possible that there could be a second or third pair of indels. Conceivably even nested, but I have never observed the nested case. One might also modify the algorithm slightly to look for premature termination due to a frameshift (e.g., the real protein sequence extends farther than predicted).

    A tool to do this could become very useful. If used without care for diplomacy, it could get one into a lot of trouble. Most major labs probably have a bunch of frameshifted sequences in Genbank. Rubbing people's noses in their frameshifts could get messy...

    I have a couple of sequences that such a program could be trained on.

    Note that I have been assuming the availability of reference sequence(s) for this algorithm. One might also design an algorithm to work in the absence of a reference sequence, and rely instead on a statistical description of what sequences should look like (e.g., the length distribution of sequences in C. elegans). Also note that one is only asking the algorithm to propose paired corrective indels. Following this proposal, one would assume the analyst would look at the original raw data, or produce more, before jumping to any conclusions.

    Contact: Jared Roach (roach@u.washington.edu).

  12. Locate genes in genomic sequences using LLR scores and the Ruzzo-Tompa linear time algorithm for finding all high-scoring regions. Contact: Martin Tompa.

  13. Many gene-finding algorithms use long ORFs as a training set (to learn codon bias, a Markov model, or an Interpolated Context Model) in order to identify some of the shorter coding regions. Compare the codon bias, 6-mer frequencies, etc. of long ORFs and short coding regions to see if the former comprise a reasonably accurate training set for the latter. Contact: Martin Tompa.

  14. The gene-finding algorithms based on Markov chains and Interpolated Context Models discussed in lecture do not seem to take the background frequency into account. Do so by using relative entropy instead of simple probability. Contact: Martin Tompa.

  15. Can the relative entropy site selection problem (see Theorem 9.1) be solved by an algorithm that runs in time exponential in n (the length of the site), but polynomial in the number and lengths of the input sequences? Contact: Martin Tompa


owner-cse527@cs.washington.edu (Last Update: )