DNA-Sequitur: An Algorithm to Compress and Identify Hierarchical Structures in DNA

by
Ray Fortna

Sequitur is a compression algorithm, developed by Neville-Manning and Witten, which takes an arbitrary sequence of symbols and produces a context free grammar that generates that sequence. The grammar is formed by taking repeated phrases in the string and substituting a non-terminal symbol that represents that phrase. This process continues recursively until all redundancy is removed and a grammar has been produced. As a by-product of the compression, a hierarchical model of the sequence can be inferred by the final grammar. Our research took the original sequitur algorithm and specialized it to recognize unique characteristics of DNA that separate DNA from an arbitrary sequence of characters. DNA Sequitur recognizes patterns that both appear in a normal forward direction, as well the reverse complement form. Since a pattern and its reverse complement occurs more often than would be expected by chance, we believe that this will result in the removal of more redundant information and higher compression rates. In addition, the hierarchical by-product of the grammar produced might hold valuable information to aid in the study and interpretation of DNA. Joint project with Bao-Hoang Nguyen.

Advised by Richard Ladner

EE1 037
January 10, 2001
3:30 pm