An earlier section described the dynamic programming algorithm (DPA)
which calculates the
edit-distance
of two strings s1 and s2.
This algorithm takes O(|s1|*|s2|) time.
An optimal alignment which displays an actual sequence of operations
editing s1 into s2
can be recovered from the distance matrix `m' using O(|s1|*|s2|) space.
Hirschberg (CACM 18(6) 341-343 1975)
showed that an optimal alignment can
be recovered in O(|s1|*|s2|) time and only O(|s2|) space
using binary-recursion.
If A contains zero characters or one character, an alignment is easily found.
Otherwise the approach is to cut string s1 in the middle to form s1a and s1b
and to find the corresponding place to cut string s2 into s2a and s2b.
Note, s2a and s2b may have quite different lengths.
The alignment problem is then solved recursively
on s1a and s2a and on s1b and s2b.
This is an example of the divide and conquer paradigm.
Recall that one row of `m' can be calculated from just the previous row.
This allows the last row of `m' to be calculated in O(|s2|) space.
By adding a boundary row m(|s1|+1, ) and a boundary column m( ,|s2|+1),
the DPA can also be run in reverse from m(|s1|+1,|s2|+1) to m(1,1).
Note that D(reverse(s1), reverse(s2)) = D(s1,s2).
This allows the DPA to be run both forwards and backwards.
The DPA is run forward on s1a and on all of s2 with a step of +1 and
in reverse on s1b and all of s2 with a step of -1.
An optimal alignment of s1 and s2 could be partitioned between s1a and s1b.
This would give s2a and s2b.
The edit-distances of s1a and all possible s2a's are contained in the
result of comparing s1a and s2.
Similarly, the edit-distances of s1b and all possible s2b's
are contained in the result of comparing s1b and s2.
The last or bottom row for the forward calculation
and last or top row for the reverse calculation
are found in O(|s2|) space.
Adding corresponding elements of these two rows together
gives the total cost of editing s1 into s2
for each possible way of dividing s2 into s2a and s2b.
Choosing a minimum sum gives an optimal s2a and s2b.
The sub-problems of aligning s1a with s2a and s1b with s2b
are solved recursively.
1 1
0 1 2 3 4 5 6 7 8 9 0 1
A C G T A C G T A C G T
* . . . . . . . . . . . .
0 A
. * . . . . . . . . . . .
1 C
. . * * . . . . . . . . . forward DPA
2 T
. . . . * . . . . . . . . down and right
3 A
. . . . . * . . . . . . .
4 C
. . . . . . * . . . . . .
5 C
. . . . . . . * . . . . .
| an optimal
| path & split
. . . . . . . * . . . . .
6 T
. . . . . . . . * . . . .
7 A
. . . . . . . . . * . . .
8 C
. . . . . . . . . . * . . reverse DPA
9 A
. . . . . . . . . . * . . up and left
10 G
. . . . . . . . . . . * .
11 T
. . . . . . . . . . . . *
A C G T A C G T A C G T
The base cases
when the length of s1 is zero or one are easily dealt with.
Complexity
The space used is dominated by the rows of the edit matrix and is O(|s2|).
The time used to calculate a value for D(s1,s2) is O(|s1|*|s2|)
but the recursion must be considered.
The two, top-level recursive calls each solve a smaller problem.
Together these add up to half of the size of the original problem,
and take O(|s1|*|s2|/2) time in total
(think about the area of matrix).
At the next recursive level, four problems
in total one quarter of the original size are solved, and so on.
Therefore the total time taken is O(|s1|*|s2|*(1+1/2+1/4+...))
which is O(2*|s1|*|s2|)
which is O(|s1|*|s2|).
This use of recursion
reduces the space used from O(|s1|*|s2|) to O(|s2|)
and roughly doubles the time taken.
A decision to use the technique can only be made
with regard to the length of typical strings
and the memory size and speed of the computer.
Strings from the molecular-biology application are typically a few hundred
to a few thousand characters long which can make its use worthwhile.
L . A l l i s o n
Try `go', change the strings in the HTML FORM below and experiment:
The trace above shows the sequence of recursive calls
with indentation used to indicate the depth of recursion.
Notes
D. S. Hirschberg.
A linear space algorithm for computing maximal common subsequences.
Comm. A.C.M. 18(6) p341-343, 1975.
The original algorithm was given in terms of the
longest common subsequence (LCS) problem, but it is easily adapted
for the edit-distance problem, as above.
D. R. Powell, L. Allison, T. I. Dix.
A versatile divide and conquer technique for optimal string alignment.
Inf. Proc. Lett. 70 p127-139, 1999.
A variation on the theme allows
divide and conquer to be applied to more difficult problems,
involving complex gap-costs, and even to
Ukkonen's greedy, fast algorithm.