^Bioinformatics^ ^#Alignment^   [program source]

Minimum-Message Length Alignment Based on 1, 3 and 5-State Finite State Machines/ Hidden Markov Models.

C. N. Yee, L. Allison,
Department of Computer Science,
Monash University,
Australia 3168.

http://www.csse.monash.edu.au/~lloyd/tildeStrings/

April 1991

Introduction

This document consists of program documentation (only) for Minimum-Message Length (MML) alignment programs for DNA. The algorithms are based on one, three and five-state models of mutation or relation of DNA. The document describes `how' not `why'. For a gentle (?) introduction to `why' see TR90/148.

A simple driver routine is provided to allow the algorithms to be used. This is not intended to provide a practical sequence-analysis environment. The intention is that the alignment algorithms will be used for research into the alignment problem and may be extracted and built into sequence-analysis packages by others.

Conditions of Use:

If you do make use of these algorithms, please include these references in any subsequent publication:

  1. L.Allison, C. S. Wallace & C. N. Yee. Inductive inference over macro-molecules. TR 90/148 Dept. Computer Science, Monash University, Australia 3168 [HTML].

  2. L. Allison & C. N. Yee. Minimum message length encoding and the comparison of macro-molecules. Bull. Math. Biol. 52(3) 431-453 1990.

  3. L.Allison, C. S. Wallace & C. N. Yee. Finite-State Models in the Alignment of Macromolecules. J. Mol. Evol. (1992) 35:77-89 [paper]

The One-State Machine

The one-state machine has a single state and the transitions or arcs are named as follows:

M(atch)<char, char> same character in both strings,
C(hange)<char, char'> differing characters in the 2 strings,
V(insert in A)<char, null> and
H(insert in B)<null, char>.
(V from Vertical, H from Horizontal.)

The program forces P(I) = P(D) as this is a reasonable prior assumption. (It would be easy to let P(I) and P(D) differ; there would be a cost in stating an extra parameter but the data strings might be fitted better.)

The Three State Machine

The three-state machine corresponds to linear indel costs. There is a low probability of starting to insert (delete) and a higher probability of continuing a run of inserts (deletes).

           ----        ----
           |  |        |  |
        DM |  V        V  |DC
           |  ----------- |
           ---|         |--
        /---->|    D    |<----.
       /      |         |      .
      /    /->-----------<-.    .
     /    /    /       .    .    .
  HM/  HC/    /         .    .VC  .VM
   /    /    /           .    .    .
  |    /    /DH         DV.    .    .
  |   |    V               V   |    |
  |   --------    HV    --------    |
  ----|      |--------->|      |-----
      |  H   |          |  V   |
  --->|      |<---------|      |<---
HH|   --------    VH    --------   |VV
  |-----                     ------|


   Figure 1 : The Three State Machine.

The three states are called D (diagonal move, for matches and changes), H (horizontal move, or insert in B) and V (vertical move, or insert in A). There are four outgoing arcs in state D:

DM - a match, which leaves the machine in state D
DC - a change, which leaves the machine in state D
DH - an insert, which leaves the machine in state H, and
DV - a delete, which leaves the machine in state V.

Similarly the four outgoing arcs in state H are named: HM, HC, HH, and HV; and four for state V: VM, VC, VH, VV.

We make states H and V equivalent (i.e. DH = DV, VM = HM, VC = HC, VH = HV, VV = HH) as there is no evidence as yet to indicate that they should differ. This reduces the degrees of freedom and hence the number of parameters that need to be stated.

The Five State Machine

The five-state machine corresponds to piece-wise linear indel costs.
      -------                                 -------
  ----|     |-----.                     /-----|     |----
AA|   |  A  |      .AD               BD/      |  B  |   |BB
  --->|     |<-.    .                 /    /->|     |<---
      -------   .    .    DM   DC    /    /   -------
                 .    .   ---  ---  /    /
                DA.    |  | |  | | |    /DB
                   .   V  | V  V | V   /
                    .-----------------/
                     |               |
               /---->|       D       |<----.
              /      |               |      .
             /    /->-----------------<-.    .
            /    /    /             .    .    .
         HM/  HC/    /               .    .VC  .VM
          /    /    /                 .    .    .
         |    /    /DH               DV.    .    .
         |   |    V                     V   |    |
         |   --------       HV       --------    |
         ----|      |--------------->|      |-----
             |  H   |                |  V   |
         --->|      |<---------------|      |<---
       HH|   --------       VH       --------   |VV
         |-----                           ------|

         Figure 2 : The Five State Machine (Bugs Bunny).
               2001: This sort of finite state control machine (with many variations in "architecture") has subsequently also come to be known as a Pair Hidden Markov Model (PHMM), pair from the two sequences.

The five-state machine is an extension of the three-state machine with two extra states \A and B for long insert in string A and string B respectively. To reduce the number of arcs, and thus parameters, to estimate and state, we restrict the transition to and from A and B through state D. Note that the arcs AD and BD bring the machine back to state D but they do not generate any character.

For the reasons stated for the Three-State machine the H and V; A and B states are made equivalent. We also fix the probabilities of long indels as they do not occur frequently enough for them to be meaningfully infered from single pairs of strings of typical lengths. At present these parameters are fixed at:

   PDA = PDB = 0.001; PAD = ABD = 0.01; PAA = PBB = 0.99.

See file "5State.c" to change them.

The Iteration Procedure to Find Optimal Parameters

We describe here an iterative procedure that adjusts the parameter values to minimise the message length. We believe that from a `reasonable' set of starting parameters the procedure finds the global minimum except in exceptional, artificial cases. A similar technique is used for the three and five-state machines.

Each cell D[i,j] is associated with the ML field plus four frequency counters M,C,H,V. The ML stores the message length to transmit A[1..i] and B[1..j]. The four frequency counters store the number of times the machine makes transitions through the various arcs to get to D[i,j]. The general dynamic programming step is as follows:

We first compute the odds-ratios of entering D[i,j] from the three incoming neighbours; D[i-1,j](Phoriz), D[i-1,j-1](Pdiag) and D[i,j-1](Pvert).

  Pvert = exp2(-D[i,j-1].ML - CostIns);
  Phoriz = exp2(-D[i-1,j].ML - CostDel);
  if (A[i] = B[j]) then Pdiag = exp2(-D[i-1,j-1].ML - CostMatch);
                   else Pdiag = exp2(-D[i-1,j-1].ML - CostChange);

Arriving from the vertical direction involves a transition on the V arc so we increment D[i,j-1].V by 1. Similarly for the horizontal direction we increment the H field of D[i-1,j]. For D[i-1,j-1] on the diagonal direction we either increment the M or the C field dependent on whether the A[i] and B[j] is a match or a change. The frequency counters of D[i,j] are the sums of their counterparts from D[i,j-1], D[i-1,j] and D[i-1.j-1], weighted by the odds-ratios of entrance from each of the directions. Finally the message length D[i,j].ML = log2(Pvert + Phoriz + Pdiag). (NB. A fast routine is used to avoid calling exp and log.)

The boundary conditions are given by:

  D[0,0].ML = 0 ; D[0,0].M,C,H,V = 0 ;
  D[i,0].ML = CostDel*i; D[i,0].V = i; D[i,0].M,C,H = 0 ;
  D[0,j].ML = CostIns*j; D[0,j].H = j; D[0,j].M,C,V = 0 ;

We start the procedure with estimated costs (probabilities) for the arcs. However at the end of a run we can use the frequency counters of cell D[m,n] to compute what the parameter values should have been for the alignment that we have computed. Changing the parameters changes the optimality of alignments so we iterate with these new parameter values. The iteration converges quickly.

We can also extend the idea by recomputing the arc costs at each step from the frequency counters. This amounts to adjusting the cost of the arcs by what have encountered so far in the input strings. In this case the starting parameters for cell D[0,0] is not crucial as the adaptive process stabilises very quickly. A typical value is P(M) = P(C) = P(H) =P(V) = 0.25. Note that this method is similar to computing the message length of ONE alignment using adaptive coding. We only use this process as a heuristic to generate good starting parameters for subsequent iterations because it is not "sound".

The R-theory Driver Program

A simple driver program has been written to demonstrate the algorithms. It is not intended to provide a practical sequence-analysis environment.

An Example Output

The sample run below require a file called "examplestrings", texts beginning with "#" are explanatory notes:

>example 1                   # a short name must begin with '>'
example string 1             # a 1 line description
CAACCAAC TGCA                # the strings itself over
TTAGCT                       # 1 or more lines
>example 2                   # the second string, begin with '>'
example string 2
ACCAACCA TGCA
GGGG
TTAGCT

The output from a complete run is given below. Texts in Italics are user responses to prompts from the driver program.

#
#        The driver program
#
Input the file name where the strings are in
or type <RET> to input strings from terminal : examplestrings


example 1        18 bases
example string 1
CAACCAACTGCATTAGCT


example 2        22 bases
example string 2
ACCAACCATGCAGGGGTTAGCT


Do you want to run:
    1. One-State machine
    2. Three-State machine
    3. Five-State machine
    4. One-State machine one (best) alignment
    5. Three-State machine one (best) alignment
    6. All machines

  Input option : 4

File name to output density plot matrix for One-State machine? or type <RET> if density plot is not required : 1State.plot File name to output density plot matrix for Three-State machine? or type <RET> if density plot is not required : 3State.plot File name to output density plot matrix for Five-State machine? or type <RET> if density plot is not required : 5State.plot

Initial parameters -- do you want to :
    1. Do adaptive coding on 1st run,  or
    2. Use default parameters,  or
    3. Input your own
  Input option : 1

#
#        Output from the r-theory alignment routines
#
ONE-STATE MACHINE :
iteration  1: PM=0.426 PC=0.173 PID=0.401 ML=  92.864 (  79.064) SumFq:   25.1
iteration  2: PM=0.496 PC=0.129 PID=0.374 ML=  91.160 (  77.285) SumFq:   24.6
iteration  3: PM=0.531 PC=0.101 PID=0.368 ML=  90.396 (  76.394) SumFq:   24.5
iteration  4: PM=0.548 PC=0.078 PID=0.373 ML=  90.102 (  75.942) SumFq:   24.6
iteration  5: PM=0.558 PC=0.061 PID=0.381 ML=  89.919 (  75.590) SumFq:   24.7
iteration  6: PM=0.565 PC=0.049 PID=0.386 ML=  89.771 (  75.292) SumFq:   24.8


iteration  7:
 Fwd[i,j]+Rev[i+1,j+1]-MML
  1 avg algn= 102.674 ( - chars - len =   32.514 )
 null-theory=  94.15 =  10.87(sumlen) +   3.28(difflen) +  80.00(chars)  bits
    OneState=  89.66 =   4.88(pars) +  75.07(algn) +   9.71(len)
algn - chars=  14.62
probability related=  0.9572



   ACCAACCATGCAGGGGTTAGCT
  ##*-.
C:*-*#-..
A:++--#-..
A:.*-.-#-....
C:.-*---#-....
C: .-*-.-#-....
A:  .-*--**+--...
A:   .-*+-*++---..
C:    ..+*#--+---..
T:     ..--#--++--..
G:      ...-#---+--.
C:       ...-#++++-..
A:        ...-#####-..
T:         ...-++++#-.
T:          .....--+#-.
A:            ......-#-.
G:             ......-#-.
C:                ....-#-
T:                  ...-#

KEY:  #   *   +   -   .
   :    1   2   4   8   16  bits over mml

1-State Machine: 2.2443(bits/symb)  89.77(ML)  75.29(Matrix)



THREE-STATE MACHINE :
  Iteration  1 :
               M     C     H     V    sumfq
    D 0.5640 0.461 0.159 0.190 0.190  13.8
    H 0.4360 0.328 0.180 0.254 0.238   7.9
    V 0.4360 0.328 0.180 0.238 0.254   3.8
    ML=   95.8524 (matrix part=   79.0891)
  Iteration  2 :
               M     C     H     V    sumfq
    D 0.5967 0.589 0.110 0.151 0.151  14.5
    H 0.4033 0.313 0.154 0.312 0.221   7.3
    V 0.4033 0.313 0.154 0.221 0.312   3.3
    ML=   92.9009 (matrix part=   76.0451)
  Iteration  3 :
    ...
    ...
    ...
  Iteration  8 :
               M     C     H     V    sumfq
    D 0.6190 0.774 0.036 0.095 0.095  14.6
    H 0.3810 0.295 0.067 0.550 0.089   7.2
    V 0.3810 0.295 0.067 0.089 0.550   2.9
    ML=   87.5069 (matrix part=   69.0361)


   ACCAACCATGCAGGGGTTAGCT
  #**-...
C:*--*....
A:*---*....
A:-*-..*.....
C:..*...*......
C:...*..-*-.......
A:....*--+*--......
A: ....*++*--.......
C:  ...-**#-----....
T:   ...---#------...
G:    ..-..-#------..
C:     ..-..-#++++-...
A:     ...----#####-...
T:       ..-------+#-..
T:         ......---#...
A:         ..........#...
G:            ........#..
C:              .......#.
T:                 .....#

KEY:  #   *   +   -   .
   :    1   2   4   8   16  bits over mml

3-State Machine: 2.1877(bits/symb)  87.51(ML)  69.04(Matrix)




FIVE-STATE MACHINE :
Iteration  1 :
    D 0.6779 0.532(M) 0.182(C) 0.185(H) 0.100(V) 0.001(A) 0.001(B)  21.3(sumfq)
    H 0.1837 0.379(M) 0.278(C) 0.281(H) 0.061(V)   6.2(sumfq)
    V 0.0805 0.359(M) 0.333(C) 0.115(H) 0.193(V)   2.4(sumfq)
    A 0.0204 0.010(D) 0.990(A)   0.2(sumfq)
    B 0.0374 0.010(D) 0.990(B)   0.7(sumfq)
    ML=   98.7571 (matrix part=   77.9097)
Iteration  2 :
    ...
    ...
    ...
Iteration 11 :
    D 0.6522 0.725(M) 0.064(C) 0.126(H) 0.082(V) 0.001(A) 0.001(B)  19.0(sumfq)
    H 0.2147 0.274(M) 0.140(C) 0.527(H) 0.059(V)   6.5(sumfq)
    V 0.0830 0.421(M) 0.160(C) 0.115(H) 0.305(V)   2.4(sumfq)
    A 0.0263 0.010(D) 0.990(A)   0.3(sumfq)
    B 0.0239 0.010(D) 0.990(B)   0.2(sumfq)
    ML=   92.7831 (matrix part=   71.1250)
minimum MesgLength is   92.78


   ACCAACCATGCAGGGGTTAGCT
  ###-...
C:*-+#-...
A:*+--#....
A:.*-.-#.....
C:..*-.-#-....
C:...*-.-#-......
A: ...*--+*---....
A:  ...*++*----....
C:   ..-**#------...
T:    ..---#------..
G:     .-..-#------..
C:      .-..-#++++-..
A:       ....-#####-..
T:        ....---++#-..
T:         ......---#..
A:            .......#..
G:             .......#..
C:               ......#.
T:                 .....#

KEY:  #   *   +   -   .
   :    1   2   4   8   16  bits over mml

5-State Machine: 2.3196(bits/symb)  92.78(ML)  71.13(Matrix)




SUMMARY
-------
message length (bits/symbol) :
Null: 2.354   1-S: 2.244(1.882)   3-S: 2.188(1.726)   5-S: 2.320(1.778)

message length (bits) :
Null:   94    1-S:   90(  75)     3-S:   88(  69)     5-S:   93(  71)

Displaying a Density Plot on a Graphics Device

Two utilities are provided to display the density plot generated by the r-theory machines for display on a POSTSCRIPT graphic device.

The program scale transforms the density matrix into matrix of grey-scale of the range 0-255. It also provides a few transformation options that highlight the different features of the r-theory alignment.

The l option displays the matrix in a logarithmic scale of message length. The optional `contrast' factor controls the sharpness of the high probability region. The default value is 1. A high value (e.g. 4) gives a sharper picture while a low value (e.g. 0.2) includes more ML's in the high intensity region.

The p option displays the image in a linear scale in probability. The m option displays the image in a linear scale in ML. If the optional threshold value is supplied then all cells in excess of this value will be ignored, and the remaining values scaled accordingly.

The program pimg accepts the output from scale and generate a POSTSCRIPT program to display the matrix.

For example, to display 1State.plot generated by the sample run in last section in logarithmic scale:

scale l < 1State.img | pimg | lpr -Plaser

Resource Utilizations

The generation of density plots requires o(n^2) sapce. At present the program can generate plots for strings of length 300, and this requires 10000 kbytes of stacksapce. Make sure that you have allocated enough stackspace on your machine by the command:

limit stacksize 10000

You can adjust the constant MATRIXSIZE in "Strings.h" accordingly depending on your requirements and the memory capacity of your machine.

When density plot is not required the machine can handle strings of much greater length. It is controlled by the constant MAXSEQLENGTH in "Strings.h", which is currently set to 4000.


© C. N. Yee, L. Allison, 1991;   HTML 2000

Now: School of Computer Science and Software Engineering, Monash University, Australia 3168