CSE 417 Assignment #3
Winter 2002

Due: Friday, February 1, 2002.

Reading:

Chapter 4.

Problems:

  1. Simulate the execution of the Approximate String Matching algorithm (section 3.1.3) on the strings
    S1 = A C G A G T T
    S2 = A A C G A T G T
    Show both the D matrix and the "backtrace" arrows in it for this input. How many optimal edit sequences are there? List it/them.

  2. A common mistake when typing English text is the transposition: tow consecutiev lettesr typde ni hte wrogn roder. When comparing two such strings, the Approximate String Matching algorithm will charge 2 for each transposition, but only 1 for equally common mistakes like inserted or deleted letters. Give a variant of the algorithm that also allows a "Flip" operation, which interchanges two consecutive letters in one of the strings, also at a cost of one unit. Give a brief justification of why your algorithm is correct, and analyze its run time. Show an example of its execution on the sequences "hte" vs "the".

  3. Skiena's text page 78, Problem 3-3.

  4. (Extra Credit) For many applications of Approximate String Matching (including biological sequence comparison), it is not appropriate to assume the cost of a gap (several consecutive inserts or consecutive deletes) is simply a multiple of the length of the gap. I.e., a gap of length 10 may not be exactly half as common as a gap of length 5. One attractive alternative (usually called "affine gap penalties") is to charge a*x+b units for a gap of length x, for some appropriate constants a and b. (Although it doesn't matter for this problem, you would typically have a << b, so that gaps are relatively expensive to start, but once started, are not so expensive to extend.) Give a variant of the string matching algorithm that handles this scoring model for insert/delete, in addition to the usual one for matching/substituting letters. Hint: One solution I know uses an additional n by m matrix.