DNA-Sequitur: An Algorithm to Compress and Identify Hierarchical Structures in DNA
by
Ray Fortna
Sequitur is a compression algorithm, developed by Neville-Manning and
Witten, which takes an arbitrary sequence of symbols and produces a context
free grammar that generates that sequence. The grammar is formed by taking
repeated phrases in the string and substituting a non-terminal symbol that
represents that phrase. This process continues recursively until all
redundancy is removed and a grammar has been produced. As a by-product of
the compression, a hierarchical model of the sequence can be inferred by the
final grammar. Our research took the original sequitur algorithm and
specialized it to recognize unique characteristics of DNA that separate DNA
from an arbitrary sequence of characters. DNA Sequitur recognizes patterns
that both appear in a normal forward direction, as well the reverse
complement form. Since a pattern and its reverse complement occurs more
often than would be expected by chance, we believe that this will result in
the removal of more redundant information and higher compression rates. In
addition, the hierarchical by-product of the grammar produced might hold
valuable information to aid in the study and interpretation of DNA.
Joint project with Bao-Hoang Nguyen.
Advised by Richard Ladner
EE1 037
January 10, 2001
3:30 pm