An Information Measure for the String to String Correction Problem with Applications

Algorithms for approximate string matching are widely used to infer a relation between two strings in situations where strings mutate, evolve or contain noise, eg. DNA or protein strings. Such a relation is (just) a hypothesis. It has been shown how to calculate a probability for such hypotheses. Unfortunately the matching algorithms do not directly extend to more than two strings because of rapidly increasing complexity. However, a stochastic method is one way around this difficulty. The probabilities of competing hypotheses are used to generate random alignments of two or more strings. The alignments are sampled from the correct probability distribution. Averaging over many such alignments gives good estimates of how closely the strings are related and in what way. (Mediocrity is useful.) In addition, sampling in an increasingly selective way gives a simulated annealing search for an optimal multiple-alignment.

Keywords and Phrases: approximate string matching, edit distance, evolutionary tree, hidden Markov model, inductive inference, Gibbs sampling, Monte Carlo method, multiple alignment, simulated annealing, stochastic, string to string correction problem.

Introduction

Algorithms on strings are much studied in computer science. Exact matching algorithms are ruled out in many real-world problems by the properties of the data: DNA strings evolve and mutate, speech is uttered at different rates, computer images often contain noise, there may be experimental error and so on. Approximate matching algorithms based on Levenshtein's (1966) metric, Sellers' (1974) edit-distance, Wagner and Fischer's (1974) string to string correction problem, the longest common subsequence and other related notions can cope with this reality (Sankoff & Kruskall 1983). The idea is that characters can be changed, inserted or deleted and that a "good" relation between two or more strings will involve a "small" number of such character mutations. Note that other models of string mutation are possible, for example Tichy's (1984) block-moves model, and may be more appropriate in certain circumstances.

string A: ACGTAATACGT
string B: ACGTATAACCT

Some possible alignments:
 (i) ACGTAATA-CGT  (ii) ACGTAATACGT  (iii) ACGTAAT-ACGT
     ||||| || | |       |||||  || |        |||| || || |
     ACGTA-TAACCT       ACGTATAACCT        ACGT-ATAACCT
          |  | |
          |  | mismatch
          |  ins_B
          ins_A

Generation instruction sequence equivalent to alignment (i):
   match(A); match(C); match(G); match(T); match(A); ins_A(A);
   match(T); match(A); ins_B(A); match(C); mismatch(G,C); match(T)

Mutation instruction sequence equivalent to alignment (i):
   copy; copy; copy; copy; copy; del;
   copy; copy; ins(A); copy; change(C); copy

Figure 1: Alignment Examples.

Given K>=2 strings, we are interested in the following kinds of question: are the strings related, and if so are they related by a given evolutionary tree, and if so which parts of the strings are related to each other? Many programs exist for comparing multiple strings but most use more or less ad hoc weights or penalties to score relations, and employ heuristics to search for one good relation. The aim here is to treat the problems as inductive inference problems: an answer is just a hypothesis and an estimate of a parameter of a model is just that, an estimate. Weights or parameters should be inferred from the data. Probabilities, error bounds or other figures of merit should be given for hypotheses. There are efficient inference algorithms for K=2 strings. Bishop et al (1987) first cast the two string problem as a maximum likelihood problem. Thorne et al (1991, 1992) extended it to models that were more realistic for DNA sequences. Allison and Yee (1990) used a minimum message length (MML) approach and Allison et al (1992a) examined finite-state models or hidden Markov models of string evolution. Wallace and Freeman (1987) gave the statistical foundations of the MML approach to inductive inference.

The next section summarises the inductive inference approach to two strings as a basis for the remainder of the paper. The naive extensions of algorithms for the K=2 case to several strings are infeasible. Instead, they are used as the basis of a stochastic method for comparing K>=2 strings. Note that if the "correct" evolutionary tree to use is not known, various methods, such as Hein's (1989) or Thorne and Kishino's (1992), could be used to suggest one or more candidate trees to be evaluated in turn. The paper is framed in terms of DNA strings over the alphabet of four bases, {A, C, G, T} but the general stochastic technique is applicable to other strings and alphabets.

Two Strings

An alignment is a way of displaying a relation between the parts of two or more strings. Some or all of the strings are padded out with a special null character `-' so that all strings have the same length. The resulting strings are written out one above another. The columns of the alignment are tuples of proper characters and `-', one component coming from each padded string. The only restriction is that the all-null tuple is not allowed. Some examples of alignments are given in figure 1. The number of possible alignments grows exponentially with string length and one often wants a "good" alignment. Each tuple can be given a cost (or a score) in various ways and the cost (score) of an alignment is the sum of the costs (scores) of its tuples. Tuples with matching characters generally have a low cost (high score). Attention is restricted to K=2 in this section.

If we are given two strings, A and B, it is sometimes convenient to think of A as the ancestor and B as the descendant. In that case we relate A and B by a mutation machine which reads A and executes a sequence of mutation instructions to output B:

We can just as well consider B to be the ancestor and A the descendant by interchanging insert and delete and reversing changes. Insertions and deletions are called indels.

It is sometimes convenient to think of A and B as having equal standing, particularly if they are both descended from an unknown ancestor. They can then be related by a generation machine which outputs A and B by executing a sequence of generation instructions:

It is clear that there is a correspondence between an alignment, a mutation sequence and a generation sequence and they are used interchangeably as convenient. For example, the pair corresponds to the character plus the mutation instruction {x,change(y)} and to the generation instruction mismatch(x,y); see also figure 1. The longest common subsequence (LCS) problem is to find a sequence of generation-instructions (or an alignment etc.) which maximises the number of matches. That number can be found by the dynamic programming algorithm (DPA), figure 2(i), in O(|A|*|B|) time and in O(|B|) space if only two rows of the matrix D[,] are kept. An alignment can be recovered in O(|A|*|B|) space by a trace-back through D[,] if all of D[,] is kept, or in just O(|B|) space if Hirschberg's (1975) divide and conquer technique is used. The edit-distance problem is to minimise the number of mutations and can be solved by a second version of the DPA, figure 2(ii). There are faster algorithms for the LCS and edit-distance problems in the important special case of similar strings (eg. Ukkonen 1983) but they are not suitable for what follows.

Boundary Conditions:
   D[0, 0] = v
   D[i, 0] = f( D[i-1, 0], g(A[i], `-') ),  i = 1..|A|
   D[0, j] = f( D[0, j-1], g(`-', B[j]) ),  j = 1..|B|

General Step, i = 1..|A|, j = 1..|B|:
   D[i, j] = h( f(D[i-1, j-1], g(A[i], B[j])),   -- match/mismatch
                f(D[i-1, j],   g(A[i], `-' )),   -- delete A[i]
                f(D[i, j-1],   g(`-' , B[j])) )  -- insert B[j]

Return: D[|A|,|B|]

(i) longest common subsequence (LCS):
       v = 0
       f = +
       g(x,x)=1, g(x,y)=g(x,`-')=g(`-',y)=0
       h = max

(ii) edit-distance:
       v = 0
       f = +
       g(x,x)=0, g(x,y)=g(x,`-')=g(`-',y)=1
       h = min

(iii) most probable alignment:
       v = 1
       f = *
       g(x,x)   = P(<x,x>),   g(x,y)   = P(<x,y>),
       g(x,`-') = P(<x,`-'>), g(`-',y) = P(<y>)
       h = max

(iv) r-theory, P(A&B):
       v, f and g as for (iii)
       h = +

Figure 2: Versions of the Dynamic Programming Algorithm.

It turns out that we cannot easily answer the question, are A and B related? We can however answer the nearly equivalent question, are A and B related by a generation machine M? The machine is defined by the probabilities of its instructions. Given M, the probability of A and B being generated by the sequence of instructions S₁; S₂; ...; S_L is

The probability of the length, P(L), presents a nasty technical problem. If any good prior for lengths is known then it should certainly be used. Failing that, Rissanen's (1983) universal prior for integers or similar can be used. In any case, most reasonable priors on length behave as ~1/L^1+delta (delta>=0) and, as the lengths of all plausible alignments are similar, the probabilities of those lengths are very similar. Now, one instruction sequence represents one hypothetical relation between A and B. If the question of its length is ignored, a most-probable alignment, conditional on M, can be found by a third version of the dynamic programming algorithm, figure 2(iii), which deals in probabilities.

M's instruction probabilities are called its parameters. They can be estimated from an alignment if they are not known apriori. This leads to an iterative method of estimating the parameters: first, initial estimates are guessed and an initial alignment is found, then the parameters are reestimated from the alignment, a new alignment is found and the process repeated until convergence. Convergence is guaranteed, and four to eight iterations are usually sufficient, but it is possible to get stuck in local optima for distantly related strings. Machine instructions are drawn from a multi-state distribution and Wallace and Boulton (1968) show with what accuracy such probabilities can be legitimately inferred.

We do sometimes want a most-probable alignment, but it is too detailed an answer for the question, are A and B related by M, ie. in some unspecified way? This is called the r-theory. To calculate P(A&B|M) we need to sum the probability over all of M's instruction sequences that give A and B. Fortunately this can be done by yet another version of the DPA, figure 2(iv). If the parameters are to be estimated from the strings, the previous iterative method can be used and a weighted-average of the observed frequencies of machine-instructions is calculated as the DPA proceeds. Simulation results from Yee and Allison (1993) confirm that a single optimal alignment gives biased estimates of machine parameters and that the r-theory gives unbiased estimates. The reason is clear: mutation is random and does not follow an optimal path. Mutations can compensate for each other so estimates based on a single optimal alignment are literally over-optimistic and tend to underestimate systematically the amount of change that occurred. (We do not yet know how to model any pressure of selection.)

It is convenient for the last two DPAs to deal with -log₂ of probabilities, also known as message lengths, which are measured in bits, rather than with probabilities directly, as the final values of the latter can be very small.

Lastly, there is a natural null-theory, that the strings are not related. When the -log₂ of its probability is taken, this amounts to approximately two bits per character for the case of DNA. Comparison with the null-theory gives a hypothesis test. Note it can easily be the case that no single alignment is an acceptable hypothesis yet the r-theory, that the strings are related by M, is acceptable.

Several Strings

An alignment of K>=2 strings is formed as above by padding out the strings with `-' so that they have the same length and by writing the resulting strings one above another. This gives a formal language of multiple-alignments but its semantics must also be specified. A very natural interpretation of a multiple-alignment is conditional upon an evolutionary tree over the strings: Each tuple of the alignment represents the descendants, over all possible histories consistent with the tree, of an ancestral character or an inserted character. We consider unrooted binary trees here, the position of the ancestral string being arbitrary. Each edge of the tree is modelled by a mutation-machine. Unrooted trees are natural unless a nearly constant rate of evolution can be assumed and binary trees are sufficient for most purposes, although the general methods could be applied easily to rooted trees and to n-ary trees. The main problems considered here are to find an optimal multiple-alignment given a tree and to estimate the edges (machines) of the tree.

An unrooted tree over K strings has 2K-3 edges or machines. If all their parameters are known, the machines can be combined into an equivalent generation machine for the K strings. The probability of a K-tuple of characters can be calculated during an O(K)-time traversal of the tree; the details are given by Felsenstein (1981, 1983) who considered inferring an evolutionary tree from a fixed alignment which is the complementary problem to the multiple-alignment problem considered here. For example, given the tree ((1 2) (3 4)) and assuming that all edges are similar and that P(change) is small, the 4-tuple is less probable than . Conversely, if is more common than and in alignments, this provides evidence to support the tree ((1 3) (2 4)). Unfortunately, if the strings' lengths are about `n', the naive extensions of the DPA of the previous section would take O(K*n^K) time which is infeasible for typical values of K and n. (The practical limit for the simple edit-distance and using various additional short-cuts seems to be about six similar, short sequences - see Carrillo and Lipman (1988), Altschul and Lipman (1989), and Allison (1993).) However, just as two strings can be aligned, so two alignments of disjoint sets of strings can be aligned in O(K*n²) time to give an alignment of all of the strings. This has been used as a deterministic heuristic to improve a K-way alignment: projecting the K-way alignment onto two subsets of strings formed by breaking the tree at an edge and then realigning the subalignments. The process is iterated until it converges (Allison et al 1992b). The results may not be optimal, particularly if the strings are distantly related. In what follows, a simulated annealing search is described for an optimal multiple-alignment which can jump out of troublesome local optima. It may also avoid the problem identified by Lake (1991) where combining sequences deterministically can bias the search for an (unknown) tree topology.

An optimal multiple-alignment can be useful in its own right but it is likely to give biased estimates of the tree edges as does optimal pairwise-alignment. The solution used for K=2 of averaging over all alignments is infeasible for larger K, but a Gibbs sampling or Monte Carlo method (Hastings 1970) is described which averages over many, if not all, multiple-alignments to give a good approximation.

The key to the stochastic methods to be described is the generation of random alignments sampled from the correct probability distribution. The general step of the DPA (figure 2(iv)) calculates P(A[1..i]&B[1..j]|M) by adding the probabilities of its three immediate predecessors:

although A and B now represent alignments of sets of strings. If at the same time a random choice is made of one of these directions, in proportion to their probabilities, then a traceback of these choices will give a random alignment sampled from the probability distribution of alignments. This requires O(|A|*|B|) space which can be reduced to O(|B|) by using Hirschberg's (1975) technique.

The process to sample random multiple-alignments repeatedly chooses a random edge to break the tree on. The current multiple-alignment is projected onto two subalignments for the resulting partition of the strings. A new random K-way realignment is then generated as above. The initial multiple-alignment is formed by a suboptimal process. The search space has many degrees of freedom. The use of subalignments amounts to keeping most of them fixed at each step, but the realignment is chosen from the conditional posterior probability distribution which is sufficient for a Gibbs sampling or Monte Carlo method. Parameters are estimated from each multiple-alignment and averaged to give final estimates. Standard deviations are calculated and give an indication of accuracy. A working estimate from "recent" alignments is used during the sampling process.

          s8                      s15
s9        .                        .        s14
   .      .                        .      .
      .   .                        .   .
         ..                        ..
          s4                      s7
            .                    .
             .                  .
              .                .
               s2---------s1=s3
              .                .
             .                  .
            .                    .
          s5                      s6
         ..                        ..
      .   .                        .   .
   .      .                        .      .
s10       .                        .        s13
         s11                      s12

During Evolution:
   Ancestor, s1=s3, length(s1)=500
   and for each edge P(copy)=0.8; P(change)=0.1;
   P(insert)=P(delete)=0.05

Figure 3: Full Tree.

The Gibbs sampling process generates random alignments by the repeated choice of directions in proportion to their probabilities. The choice can be influenced in a useful way if the probabilities are raised to a power, q>=0. When q=0 the choice is uniform. When q=1 the choice is in proportion to probability. When q>1 the choice is biased towards more probable alignments and when q is large the bias is to optimal alignments. Increasing q can therefore be used to implement the cooling of a simulated annealing search for an optimal multiple-alignment. Recall that the programs actually deal with -log₂ probabilities, so multiplying these quantities by q is equivalent to raising probabilities to the power q. This strategy is very different from that of Ishikawa et al (1992) which repeatedly makes a small perturbation to the multiple-alignment and either accepts the result or rejects it.

Implementation

A Pascal program has been written to analyse DNA strings using the methods described above. A hash-table is used to store the probabilities of tuples when they are met for the first time in the DPA so that later meetings do not need recalculation. The program also uses Hirschberg's technique. Two alignments of four strings of length 500 characters can be aligned to give an alignment of eight strings in 15 to 30 seconds on a Sparc-station. 1000 random alignments of eight strings can be sampled in a few hours. The running time is affected by the similarity of the strings which influences both the lengths of alignments and how many different kinds of tuple are evaluated. The present implementation embodies the simplest one-state mutation-machines and assumes that all characters are equally likely, and that for each edge all insertions are equally likely, all changes are equally likely and P(ins) = P(del). These assumptions are not unreasonable for DNA and they could be relaxed. Each edge of the tree has a different mutation machine.


edge         Simulated Ann'    Gibbs Sampling 1000x
------       --------------    --------------------------------
s2-s3 : 40chng 28ins 27del (55indel) ED=85
             .816 .169 .015    .799(.023) .107(.020) .094(.024)

s2-s4 : 51chng 29ins 25del (54indel) ED=96
             .825 .119 .056    .748(.018) .092(.015) .160(.026)
s2-s5 : 50chng 22ins 18del (40indel) ED=85
             .845 .119 .036    .817(.019) .083(.017) .100(.026)
s3-s6 : 60chng 29ins 23del (52indel) ED=103
             .828 .164 .008    .816(.014) .141(.015) .043(.013)
s3-s7 : 50chng 26ins 31del (57indel) ED=97
             .860 .080 .060    .759(.021) .109(.018) .132(.032)

s4-s8 : 48chng 25ins 24del (49indel) ED=90
             .842 .116 .042    .810(.018) .099(.009) .092(.019)
s4-s9 : 37chng 22ins 26del (48indel) ED=80
             .852 .084 .064    .837(.020) .079(.010) .085(.020)
s5-s10: 51chng 29ins 35del (64indel) ED=108
             .827 .062 .111    .763(.023) .068(.013) .169(.032)
s5-s11: 53chng 23ins 17del (40indel) ED=92
             .830 .139 .031    .817(.015) .111(.013) .071(.012)
s6-s12: 60chng 23ins 29del (52indel) ED=99
             .795 .145 .060    .773(.014) .138(.012) .089(.019)
s6-s13: 53chng 21ins 20del (41indel) ED=92
             .861 .078 .061    .835(.011) .096(.014) .069(.017)
s7-s14: 54chng 18ins 23del (41indel) ED=89
             .825 .122 .053    .798(.012) .109(.009) .093(.013)
s7-s15: 58chng 30ins 28del (58indel) ED=108
             .826 .097 .077    .785(.013) .100(.010) .115(.015)
             --------------    --------------------------------
      Mean:  .833 .115 .052    .797       .102       .101

-log₂ probabilities or message lengths (bits):
             7705              9336(+-274) mean(SD)
                   Null=8088

Key:
edge: actual #changes #inserts #deletes (#indels) Edit-Distance
     P(copy) P(chng) P(indel)  P(copy)(SD) P(chng)(SD) P(indel)(SD)

Evolution, each edge: P(copy)=0.8; P(chng)=0.1; P(ins)=P(del)=0.05
Length(s1=s3) = 500

Table 1: Inferred Parameters.

Results

The current program has been used on both artificial and real DNA data, the evolutionary history of artificial strings being known.

Figure 3 shows a tree used in one test. The ancestral string is s1, the first generation descendants s2 and s3, and so on. S1 and s3 are identical because we are dealing with unrooted trees. The machines on all edges were similar during evolution, causing 20% mutation on average: P(copy) = 0.8, P(change) = 0.1 and P(ins) = P(del) = 0.05. The process was random so there was variation around these figures. These parameter values were not known to the analysis program which was given only the tree topology and the third generation descendants s8 to s15. 20% mutation per edge may seem a small amount but brothers, such as s8 and s9 say, are separated by two edges, cousins such as s8 and s10 by four edges, and s8 and s12 by five edges. This data is at quite a high level of mutation for DNA where the small character set causes many spurious matches.

Table 1 summarises the results of the test. It shows the edge parameters estimated from the putative optimal alignment found by simulated annealing and those estimated by averaging over 1000 alignments from Gibbs sampling. The following values from the evolution of the strings are also given for each edge: actual numbers of changes, inserts and deletes, and the simple edit-distance of the two strings at the ends of the edge. Note that every one of these edit-distances is less than the sum of the corresponding mutations as a "better" explanation than the true events can be found. The optimal(?) multiple-alignment underestimates P(indel) in particular and this is a general tendency that is greatest for internal edges of the tree and increases with mutation level.

       CATQ                 CATB
           .               .
             .           .
               .       .
                 .   .
                   .
                   .
                   .
                   .
    (.81,.19,.00)  .            <--- (P(copy),P(change),P(indel))
CCOLI--------------.
    +              .
       +           .
          +        .
(.67,.32,.00)+     .
                +.   .
               .       .
             .           .
         CATP             CATD

-log₂ probabilities or message lengths (bits):
 Tree                                   optimal  Gibbs mean(SD)
 ----                                   ----     --------------
 (((CATB CATQ)(CATD CATP))CCOLI) --- :  4369     4555(+-40)
 ((CATB CATQ)(CATD(CATP CCOLI))) +++ :  4376     4502(+-26)
 Null: 6449


Data:
 Genbank-id     organism        designation
 m55620 Clostridium perfringens CATQ  chloramphenicol acetyltransf...
 m28717 Clostridium perfringens CATP  chloramphenicol-resistance p...
 x15100 Clostridium difficile   CATD  gene for chloramphenicol ace...
 m93113 Clostridium butyricum   CATB  chloramphenicol acetyltransf...
 m35190 Campylobacter coli      CCOLI chloramphenicol acetyltransf...


Figure 4: Two Alternative Trees for Five Bacterial Sequences.

The mean machine from simulated annealing further illustrates this bias. The estimates from Gibbs sampling do not show bias and the discrepancy between the "true" and estimated values of P(indel) is less than two standard deviations, except for the edge s3-s6, and is often much less. The standard deviations of the estimates are large, as is to be expected for this level of mutation, but a single (optimal) alignment gives no indication of this inherent uncertainty. The optimal alignment beats the null-theory on message length and so is an acceptable hypothesis but an "average" alignment fails to be acceptable by a wide margin. However the latter collectively dominate the search space and the total probability which gives an indication of how many of them there are; only a tiny fraction is sampled by Gibbs sampling.

Many similar tests were done, with balanced and unbalanced trees and with different levels of mutation, some being recorded by Allison and Wallace (1993). At low levels of mutation, <10% per edge, the deterministic heuristic is rarely beaten by simulated annealing in the search for an optimal alignment and such an alignment can give reasonable edge estimates. At greater levels of mutation, >=15% per edge, simulated annealing invariably beats the deterministic heuristic and optimal alignments give increasingly biased edge estimates. These effects would be modified for a different alphabet. Much experimentation remains to be done; in particular the tuning of simulated annealing algorithms is something of a black art.

Five bacterial DNA sequences for chloramphenicol acetyltransferase (CAT)-encoding resistance determinant (Huggins et al 1992) provide an interesting case summarised in figure 4. The coding regions or open reading frames were compared; these are about 640 characters long. It is clear from comparisons of four sequences that the (unrooted) tree topology ((CATB CATQ) (CATD CATP)) is strongly preferred over the two other possible topologies. Where CCOLI fits in is less clear. The tree topology (((CATB CATQ) (CATD CATP)) CCOLI) is slightly preferred on the basis of an optimal alignment, but the tree topology ((CATB CATQ) (CATD (CATP CCOLI))) is preferred on the basis of average alignment probabilities. This suggests some genuine uncertainty. It is resolved with more information: Huggins et al performed an analysis using protein sequences, including sequences from more organisms, and bringing experts' knowledge to the problem, and (((CATB CATQ) (CATD CATP)) CCOLI) is a subtree of their preferred tree.

Conclusions

Deterministic algorithms to estimate the edges of an evolutionary tree and to find an optimal alignment of more than two strings are infeasible in general. Stochastic processes provide one way around this problem. Averaging over many random alignments, sampled from the correct probability distribution, gives good estimates of the tree's edges. Secondly, sampling in an increasingly selective way gives a simulated annealing search for an optimal alignment. Although valuable for some purposes, an optimal alignment gives biased estimates of tree edges, the effect increasing with the level of mutation and being worst for internal edges of the tree. A program based on these ideas has been implemented for DNA strings using a simple 1-state model or mutation machine. Current work includes generalising the model of mutation that is used.

The stochastic techniques could be applied to other kinds of strings and to related problems. It is not essential for probabilities to be calculated exactly for the simulated annealing method (only) to work; it is enough that the rank-ordering of the better alignments be correct at low temperature. This would allow simulated annealing to be applied to star and all-pairs costs (Altschul & Lipman 1989) and to multiple LCS and edit-distance problems in a direct way.

The processes described are good candidates for both fine-grain and coarse-grain parallel implementations. Collins and Coulson (1984) and Dautricourt et al (1991) have implemented fine-grain parallel versions of DPAs for two strings and this could be done for the central step of our stochastic processes. Dautricourt et al report that, for some algorithms, 1 hour's processing on a sparc-station could be done in 27 seconds on a Maspar. Independent copies of the stochastic processes could also be run as a coarse-grain parallel program.

References

Note, 17th Annual Computer Science Conference, ACSC-17, Christchurch, New Zealand, Proceedings are published in Australian Computer Science Communications Vol 16 (!) No 1, 1994, isbn:0-473-02313-X, ed. G. Gupta.