CSE 417 Assignment #3
Winter 2002
Due: Friday, February 1, 2002.
Reading:
Chapter 4.
Problems:
- Simulate the execution of the Approximate String Matching
algorithm (section 3.1.3) on the strings
S1 = A C G A G T T
S2 = A A C G A T G T
Show both the D matrix and the "backtrace" arrows
in it for this input. How many optimal edit sequences are
there? List it/them.
- A common mistake when typing English text is the
transposition: tow consecutiev lettesr typde ni hte
wrogn roder. When comparing two such strings, the
Approximate String Matching algorithm will charge 2 for each
transposition, but only 1 for equally common mistakes like
inserted or deleted letters. Give a variant of the
algorithm that also allows a "Flip" operation, which
interchanges two consecutive letters in one of the strings,
also at a cost of one unit. Give a brief justification of
why your algorithm is correct, and analyze its run time.
Show an example of its execution on the sequences
"hte" vs "the".
- Skiena's text page 78, Problem 3-3.
- (Extra Credit)
For many applications of Approximate String Matching
(including biological sequence comparison), it is not
appropriate to assume the cost of a gap (several consecutive
inserts or consecutive deletes) is simply a multiple of the
length of the gap. I.e., a gap of length 10 may not be
exactly half as common as a gap of length 5. One attractive
alternative (usually called "affine gap
penalties") is to charge a*x+b units for a gap
of length x, for some appropriate constants a
and b. (Although it doesn't matter for this problem,
you would typically have a << b, so that gaps
are relatively expensive to start, but once started, are not
so expensive to extend.) Give a variant of the string
matching algorithm that handles this scoring model for
insert/delete, in addition to the usual one for
matching/substituting letters. Hint: One solution I know
uses an additional n by m matrix.