Overview of the BLAST algorithm:
Fast approximation of optimal local alignment
Martin Tompa
CSE 427: Computational Biology
January 18, 2010
Outline of the BLAST algorithm
In what follows, w, T, S, E, Xu, A, and Xg are
parameters that can be set by the user.
Inputs: query sequence Q, database D of sequences
- Find exact (or nearly exact, for protein sequences) "word
matches" or "hits" between substrings of Q and D, each of length w
(and alignment score ≥ T, for protein sequences).
- For each word match, do ungapped extension
to the left and right to construct a "high-scoring segment pair"
(HSP) of alignment score ≥ S.
- For each HSP, do gapped extension to the left
and right, and output those final alignments with alignment score x
and E-value ≤ E, where the "E-value" is the expected number of
sequences in a randomly constructed database that have alignment
score ≥ x.
- For nucleotide sequences, these are exact matches of w
consecutive residues. The default for NCBI BLAST is w=28.
- For protein sequences, construct a list of all possible words of
length w and alignment score ≥ T when aligned with some length w
substring of Q. The default for NCBI BLAST is w=3. The word list
can be constructed in time "essentially proportional" to its length
by branch and bound: Explore the complete tree with 20w
leaves, but terminate any partial branch for which completing the
word by perfect matches to the query does not yield score ≥ T.
To search for all occurrences of the word list in D, employ the
finite automaton for string matching with multiple patterns.
- There is a speed/sensitivity tradeoff in the choice of w (and T):
increasing w decreases the number of word matches dramatically, but
increases the chance of missing good alignments.
- Heuristic: continue to extend only if the alignment score is at
most Xu (the "ungapped dropoff") below the best score
seen in the
extension up until this point.
- For protein sequences, only extend if there are two word matches
on the same dynamic programming matrix diagonal within distance A.
- Heuristic:
standard dynamic programming alignment, but only expand entries if the
alignment score is at most Xg (the "gapped dropoff") below
the best score seen in the extension up until this point.
References
-
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman,
D.J. (1990)
"Basic local alignment search tool". Journal of Molecular Biology
215:403-410.
-
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.,
Lipman D.J. (1997)
"Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs". Nucleic Acids Research, 25:3389-3402.