Overview of progressive alignment:
A heuristic approach to multiple sequence alignment


Based on Section 6.4 from Durbin et al.
Martin Tompa
CSE 427: Computational Biology
January 27, 2010

Overview of progressive alignment

Progressive alignment (Feng and Doolittle, 1987) is a heuristic for multiple sequence alignment that does not optimize any obvious alignment score. The idea is to do a succession of pairwise alignments, starting with the most similar pairs of sequences and proceeding to less similar ones. Here is an outline of the steps:

Inputs: N sequences

  1. Using standard pairwise alignment, calculate a matrix of distances (alignment scores) between each pair of sequences. Consider this as an N-clique G, where edge {i,j} is labeled with the score of an optimal alignment of the i-th and j-th sequences.
  2. Use Kruskal's algorithm to find a minimum spanning tree of G. Whenever a minimum spanning tree edge would connect two components, instead add a new root node with directed edges to the roots of the two components. This is the "guide tree".
  3. Do pairwise alignments according to the guide tree, working from the leaves to the root. A node u with children v and w corresponds to an alignment of the leaves of v's subtree (already aligned inductively) with the leaves of w's subtree (already aligned).

Details of guide tree construction

  1. Initially, each node is the root of its own tree.
  2. Consider edges in increasing order of edge label.
  3. If the next edge e connects nodes {a,b} in the same tree, discard e.
  4. Otherwise, find the root v of the tree containing a, and the root w of the tree containing b. Add a new root u with children v and w, thus merging the trees containing a and b into a single tree.

Details of pairwise alignment

Suppose V is an alignment of the sequences at the leaves of v's subtree, and W is an alignment of the sequences at the leaves of w's subtree. Let {a,b} be the pair of sequences that caused these subtrees to be merged, and let A be the optimal alignment of a and b. Use A to guide the alignment of the two alignments V and W.

References

  1. Durbin, R., Eddy, S. R., Krogh, A. and Mitchison, G., Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
  2. Feng, D. F. and Doolittle, R. F., "Progressive Alignment of Amino Acid Sequences and Construction of Phylogenetic Trees from Them". Methods in Enzymology, vol. 266, 1996, 368-382,