DNA Sequence Alignment
A dynamic programming algorithm

DNA Sequence Alignment (aka “Longest Common Subsequence”)
The problem
What is a DNA sequence?
DNA similarity
What is DNA sequence alignment?
Using English words
The Naïve algorithm
The Dynamic Programming algorithm
Idea of Dynamic Programming

What is a DNA sequence
DNA: string using letters A,C,G,T
Letter = DNA “base”
e.g. AGATGGGCAAGATA
DNA makes up your “genetic code”

DNA similarity
DNA can mutate.
Change a letter
AACCGGTT à ATCCGGTT
Insert a letter
AACCGGTT à ATAACCGGTT
Delete a letter
AACCGGTT à ACCGGTT
A few mutations makes sequences different, but “similar”

Why is DNA similarity important
New sequences compared to existing sequences
Similar sequences often have similar function
Most widely used algorithm in computational biology tools
e.g. BLAST at http://www.ncbi.nlm.nih.gov/BLAST/

What is DNA sequence alignment?
Match 2 sequences, with underscore ( _ ) wildcards.
Best Alignment à minimum underscores (slight simplification, but okay for 326)
e.g.

Moving to English words

Naïve algorithm
Try every way to put in underscores
If it works, and is best so far, record it.
At end, return best solution.

Naïve Algorithm – Running Time
Strings size M,N:

Dynamic Approach – A table
Table(x,y): best alignment for first x letters of string 1, and first y letters of string 2
Decide what to do with the end of string, then look up best alignment of remainder in Table.

e.g. ‘a’ vs. ‘s’
“zasha” vs. “ashes”.  2 possibilities for last letters:
(1) match ‘a’ with ‘_’:
best_alignment(“zash”,”ashes”)+1
(2) match ‘s’ with ‘_’:
best_alignment(“zasha”,”ashe”)+1
à best_alignment(“zasha”,”ashes”)
=min(best_alignment(“zash”,”ashes”)+1,
         best_alignment(“zasha”,”ashe”)+1)

An example

Example with solution

Pseudocode (bottom-up)

Pseudocode (top-down)

Running time
Every square in table is filled in once
Filling it in is constant time
Q(n2) squares
à alg is Q(n2)

Idea of dynamic
programming
Re-use expensive computations
Identify critical input to problem (e.g. best alignment of prefixes of strings)
Store results in table, indexed by critical input
Solve cells in table of other cells
Top-down often easier to program