program MMLAlign( input, output, LogSumData ); {L. Allison, Dept. Computer Science, Monash University, Australia 3800. (now Faculty of Information Technology, as of 2006) lloyd.allison AT infotech.monash.edu.au Prototype MML Evolutionary Tree and Multiple-Alignment 25 May 1993. Based on an underlying 1-state model of string mutation. This code available under GNU copyleft, General Public Licence, GPL. Definition of "Prototype": has some good ideas but should be thrown away and rewritten. ------------------------------------------------------------------------------- If you use this program, bits of the code, or ideas in it, please reference: [1] L.Allison, C.S.Wallace & C.N.Yee Minimum message length encoding, evolutionary trees and multiple-alignment. 27th Hawaii Int Conf Sys Sci (HICSS-27) Vol.1 pp.663-674 IEEE Jan 1992. [2] L. Allison & C. S. Wallace. The Posterior Probability Distribution of Alignments and its Application to Estimation of Evolutionary Trees and Optimisation of Multiple Alignments. Jrnl. Molec. Evol. Vol.39 pp.418-430, 1994 http://dx.doi.org/10.1007/BF00160274 definitely, and maybe also: [3] L.Allison, C.S.Wallace & C.N.Yee Finite-state models in the alignment of macro-molecules. Jrnl. Molec. Evol. Vol.35 pp.77-89, 1992. http://dx.doi.org/10.1007/BF00160262 [4] C. N. Yee & L. Allison. Reconstruction of strings past. CABIOS (now Bioinformatics) 9(1) pp.1-7, Feb 1993. ------------------------------------------------------------------------------- Output: ------- The output is moderately voluminous as information is given during the iterative alignment-search process. One Alignment is given per tree. The `bit' quantities are bits of information. And `params' refers to the number of bits spent in specifying the estimated parameters of the edges of the tree. This gives an indication of the number of significant figures in the estimates. The parameters are inferred from the data. Each edge has 3 parameters : p(match), p(change), p(indel). These sum to 1.0 so there are 2 degrees of freedom per edge. As a ROUGH guide ONLY, -log(1-p(match)) can be taken as a kind of distance on strings, especially if p(match) is high. See [1] for how to combine, or add, 2 edges properly. A difference in message lengths can be taken as a posterior -log odds-ratio on 2 hypotheses, eg. 2 trees. (I would not bet any $money$ on the strength of a few bits!) Limitations: ------------ The program has a certain statistical model[4] of evolution. The results must *** NOT *** be relied upon if the model is inappropriate: . No gap penalty [this is a hard problem in multiple alignment], so do NOT compare strings of very different lengths. . All substitutions are equally probable so the program is unsuitable for proteins. A single optimal alignment is used which can be expected to bias edge estimates, especially for distant strings, see [4]. We would prefer to average over all (or most) alignments, but this is hard for several strings. The alignment search-heuristic is not guaranteed to find the optimal alignment. It grows less reliable as the strings grow more dissimilar. The `hash-table' (see below) is really a lookup-table of size 5**K (for DNA/RNA). K <= 8 is possible, depending on the memory available. Beyond this a real hash-table would be needed [not too hard]. Input: ------ K -- number of strings >shortname -- each string has a short id | long description on one line -- | acgtacgtacgt -- sequence on 1 or more lines | 1st string acgtacgt -- bad characters give a warning | acgt acgt -- digits and spaces ignored. | -- terminated by blank line | >shortname2 -- description 2 -- acgt... -- ((1 2)(3 4)) -- say, i.e. one or more trees ((1 3)(2 4)) -- NB they are really UNrooted trees. -- end of file You must specify the trees that are to be tried. The intention is that any pair-wise distance matrix method be used to suggest a few likely trees. ------------------------------------------------------------------------------- K strings, average length N Time is O(K * N**2) - per tree. Space is O(N**2) bits - if packed obeyed except that the HashTable is (|Alphabet|+1)**K - could easily be rather more compact =============================================================================== } label 99; {finish} const AlphabetSize = 4; {DNA -> 4} Null = 0; {Null "character"} Infinity = 1.0E+8; {at least big! bigger can give "underflow"} Kmax = 8; {max # of real strings} HypMax = 14; {max real or hyp' string #, = 2*Kmax-2} McMax = 13; {max machine #, = 2*Kmax-3} TupleMax = 390624; {(AlphabetSize+1)**Kmax - 1} Nmax = 2000; {max string length} Nmax1 = 2001; {Nmax+1} Convergence = 1.0; {convergence limit (bits)} type alfa = packed array [1..10] of char; Alphabet = 0..AlphabetSize; {0 is Null} AlphabetSet = set of Alphabet; AlphaProb = array [Alphabet] of real; {a fuzzy char} StringIndex = 0..Nmax1; String = record name :alfa; description :packed array [1..80] of char; N :StringIndex; {string's length} c :array [StringIndex] of char {chars as input} end; StringNum = 0..Kmax; {real, present-day, leaf strings} StringSet = set of StringNum; Hypothetical = 0..HypMax; {real or ancestral strings} Tuple = array [StringNum] of Alphabet; {a column of an alignment} TupleNum = 0..TupleMax; MachineNum = 0..McMax; Operation = (NoOp, Copy, Change, Indel); {NB. NoOp == Non Op !!!} Machine = array [Operation] of real; Machines = array [MachineNum] of Machine; Tree = ^ node; {NB. "real" tree is UNrooted, top node is unused} node = record left, right :Tree; s :Hypothetical; m :MachineNum {machine "above" s} end; Direction = (Horizontal, Vertical, Diagonal); Alignment = record K :StringNum; {# of strings in the alignment} s :array [StringNum] of StringNum; {which ones in A'} N :StringIndex; {alignment length} t :array [StringIndex] of TupleNum; {"fuzzy" "chars"} i :array [StringNum, StringIndex] of StringIndex end; { A.K = the number of strings in an alignment A. A.s[x] is the string number of the x-th string in A, 1<=x<=A.K. A.N = the number of columns in A. A.t[x] is the x-th column of the alignment as number of K digits in base |Alphabet|+1, corresponding to A.K as most sig' digit down to 1. eg. for DNA, "AA_C" gives 1102 base 5. A.i[x, y] is an index into string A.s[x] corr' to column y of A, 1<=y<=A.N The index is zero if a Null is indicated. S[ A.s[x] ].c[ A.i[x, y] ] gets a char of the x-th string in alignment A corresponding to the y-th column of A. (Fake 0-th char if index 0.) } SearchMode = (Deterministic, NonDeterministic); {manner of (re)aligning} var T :Tree; K :StringNum; S :array [StringNum] of String; str :StringNum; Mc, Est :Machines; M :MachineNum; A :Alignment; StrML :real; {message length of characters of strings only} CharToAlpha :array [char] of Alphabet; ch :char; {conversion tables} AlphaToChar :array [Alphabet] of char; A2toOp :array [Alphabet, Alphabet] of Operation; x, y :Alphabet; ioffset, joffset :array [Direction] of integer; seed :record lo, hi :integer end; { seed needed by generator } LogSumArr :array[0..959] of real; {for LogSum} LogSumData :text; {-----------------------------------------------------------------------------} procedure Error( m :alfa ); begin writeln('Error: ', m); goto 99 end; procedure Assertion( b :boolean; m :alfa ); begin if not b then Error(m) end; function min( i, j :integer ) :integer; begin if i1 then begin x := log2(x); logstar := x+logstar(x) end else logstar := 0 end; function U( L :real ) :real; { Code length for integers L >= 1 ; Rissanen} begin U := logstar(L) + log2(2.865) end; procedure NullTheory; {P(len1, len2, ... lenK | len1+...+lenK) = K_omial distribution (len1+...+lenK)! / (len1!*...*lenK! K**(len1+..+lenK) } var str :StringNum; i, TotalLength :integer; TotLenML, LengthsML, StrML :real; begin TotalLength := 0; for str := 1 to K do TotalLength := TotalLength + S[str].N; TotLenML := U( TotalLength );{should this be the average instead ??? ***} LengthsML := TotalLength * log2( K ); for i := 2 to TotalLength do LengthsML := LengthsML - log2(i); for str :=1 to K do for i := 2 to S[str].N do LengthsML := LengthsML + log2(i); StrML := TotalLength * log2(AlphabetSize); {assume equi-probable} writeln( 'Null Theory=', U(K)+TotLenML+LengthsML+StrML :6:1, ' bits', '=', U(K):4:1, '(K)', '+', TotLenML :6:1, '(tot len)', '+', LengthsML:6:1, '(diff len''s)', '+', StrML :6:1, '(chars)' ) end {NullTheory}; function paramcost( fcopy, fchange, findel :real ) :real; { see: Boulton and Wallace or Wallace and Freeman } const NumOps = 3; {the "alphabet" of operations} var fsum, pcopy, pchange, pindel, V, deltaVsq :real; begin fsum := fcopy + fchange + findel; pcopy := fcopy / fsum; pchange := fchange / fsum; pindel := findel / fsum; deltaVsq := pcopy * pchange * pindel / exp((NumOps-1)*ln(fsum/12)); V := 1/2 { the param-space volume, = 1/(NumOps-1)! }; paramcost := ln( 1 + V*V/deltaVsq) / 2 / ln(2){make bits} end; function errorcost :real; { delta increase in msg length through using finite approx to param values} const NumOps = 3; begin errorcost := (NumOps - 1)/2 * log2(exp(1)) end; function Expn( x :real; n :integer ) :real; begin if n < 0 then Expn := 1/Expn( x, -n ) else if n = 0 then Expn := 1 else if odd(n) then Expn := x*sqr(Expn(x, n div 2)) else Expn := sqr(Expn(x, n div 2)) end; function sum( P :AlphaProb ) :real; var s :real; x :Alphabet; begin s := 0; for x := Null to AlphabetSize do s := s + P[x]; sum := s end; {-----------------------------------------------------------------------------} function LogSum(a, b :real):real; { Overall error is less than 10**-6 Table of polynomial coefficients for the piecewize approximation of the function log (1 + 2 ** -x) (log is to base 2) (x .ge. 0) To use the table, form y = 8 * x. If y .ge. 239.5, fun = 0 Otherwise, form k = nint(y), z = y-k. Then k is in 0 ... 239, z is in -0.5 ... 0.5. Fun is approx given by the poly in z whose coeffs are in d() d(4k+3) is the constant term, d(4k) is the coeff of z ** 3 Chris Wallace 1989 converted to Pascal, L.Allison 1993 } var x, y, abm, apr :real; i, k :integer; begin if LogSumArr[0] <> 0.0 then {must initialise LogSumArr[]} begin reset(LogSumData); for i := 0 to 959 do begin read(LogSumData, k); Assertion(i=k, 'LogSumInit'); readln(LogSumData, LogSumArr[i]) end end; x := (a - b); if x >= 0.0 then abm := a else begin abm := b; x := -x end; { abm is max (a,b), x is abs (a-b) } y := 8.0 * x; k := round(y); if k >= 240 then LogSum := abm else begin y := y - k; k := k * 4; apr := LogSumArr[k]; for i := 1 to 3 do begin k := k + 1; apr := apr * y + LogSumArr[k] end; LogSum := abm + apr end end {LogSum}; {-----------------------------------------------------------------------------} procedure NormaliseMachines( var Mc :Machines ); var M :MachineNum; Op :Operation; Optotal :real; begin for M := 1 to 2*K-3 do begin Optotal := 0.0; for Op := Copy to Indel do Optotal := Optotal+Mc[M, Op]; for Op := Copy to Indel do Mc[M, Op] := Mc[M, Op]/Optotal end end; procedure ShowMachines( Mc :Machines ); var M :MachineNum; Op :Operation; begin for M := 1 to 2*K-3 do begin write( 'mc', M:1, ': '); for Op := Copy to Indel do write(Mc[M, Op] :6:4, ' '); writeln end end; {-----------------------------------------------------------------------------} procedure ShowInfixTree( T :Tree ); begin if T^.left = nil then write(T^.s:1) else begin write('('); ShowInfixTree(T^.left); write(' '); ShowInfixTree(T^.right); write(')') end end; procedure ShowTree( T :Tree ); {print tree down the page} procedure ST(T :Tree; depth :integer); procedure PadOut; var i :integer; begin for i := 1 to depth do write(' ':14, '|') end; begin if T <> nil then begin ST(T^.right, depth+1); PadOut; writeln('<--(m', T^.m:2, ')-->s', T^.s:2, '|'); ST(T^.left, depth+1) end end {ST}; begin ST(T^.right, 0); writeln('^'); writeln('|'); writeln('v'); ST(T^.left, 0); writeln('NB. UNrooted tree') end {ShowTree}; {-----------------------------------------------------------------------------} function ParseTree :Tree; type symbol = (open, close, numeral, eoT); var sy :symbol; ch :char; n :integer; StrSeen :StringSet; procedure Error(m :alfa); begin writeln; writeln('Error: ', m, ' sy=', ord(sy):1, ' ch=', ch, ' n=', n:1); while not eof do begin while not eoln do begin read(ch); write(ch) end; readln; writeln end; writeln; goto 99 end; procedure getch; begin if eof then ch := '#' else if eoln then ch := '#' else read(ch) end; procedure insymbol; begin while ch = ' ' do getch; if ch in ['(', ')'] then begin case ch of '(': sy := open; ')': sy := close end; getch end else if ch in ['0'..'9'] then begin sy := numeral; n:=0; while ch in ['0'..'9'] do begin n := n*10+ord(ch)-ord('0'); getch end {ch not in ['0'..'9']} end else if ch='#' then sy:=eoT else Error('bad symbol') end {insymbol}; procedure check(s :symbol; m :alfa); begin if sy=s then insymbol else Error(m) end; function PT :Tree; var P :Tree; begin new(P); if sy = open then begin insymbol; P^.s := 0; P^.left := PT; P^.right:=PT; check(close, 'no ) ') end else if sy = numeral then begin Assertion((n>=1)and(n<=K)and not (n in StrSeen), 'string # '); P^.s := n; StrSeen := StrSeen+[n]; insymbol; P^.left:=nil; P^.right:=nil end else Error('bad Tree '); PT := P end {PT}; begin StrSeen := []; ch := ' '; insymbol; ParseTree := PT; check(eoT, 'end of T '); Assertion(StrSeen = [1..K], 'string #s '); readln end {ParseTree}; {. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .} procedure Name( T :Tree ); {Name internal nodes} var dontcare :integer; function N( T :Tree; newname :integer ) :integer; begin if T^.left <> nil then {not a leaf} begin T^.s := N(T^.right, N(T^.left, newname)); N := T^.s + 1 end else {leaf is already named} N := newname; T^.m := min(T^.s, 2*K-3) end; begin T^.m := 0;{unrooted} dontcare := N( T^.right, N(T^.left, K+1) ); T^.right^.m := min(T^.left^.m, T^.right^.m); T^.left^.m := T^.right^.m end {Name}; {-----------------------------------------------------------------------------} procedure ReadString( var S :String ); var ch :char; i :integer; finished :boolean; BadChars :set of char; NBadChars :integer; begin Assertion(input^ = '>', 'missing > '); read(ch); S.name := '? '; i:=0; while not eoln do {name} begin i:=i+1; read(ch); if i<=10 then S.name[i]:=ch end; readln; for i := 1 to 80 do S.description[i] := ' '; i := 0; while not eoln do {description} begin i:=i+1; read(ch); if i<=80 then S.description[i]:=ch end; readln; S.N := 0; S.c[0] := '?'; finished:=false; BadChars := []; NBadChars := 0; while not finished do begin while not eoln do begin read( ch ); if not( ch in ['0'..'9', ' '] ) then {skip numbering and space} begin if ch in ['a','c','g','t','u','A','C','G','T','U'] then begin Assertion( S.N < Nmax, 'too long '); S.N := S.N+1; S.c[S.N] := ch end else begin BadChars := BadChars + [ch]; NBadChars := NBadChars + 1 end end end;{eoln} readln; if eof then finished:=true {terminated with eof or} else if eoln then begin finished:=true; readln end {blank line or} else if input^ = '>' then finished:=true {next string} end;{finished} Assertion( S.N > 0, 'Str empty '); if NBadChars > 0 then begin write(' WARNING: ', NBadChars:1, ' bad chars skipped: ['); for ch := chr(0) to chr(127) do if ch in BadChars then write(ch); writeln(']') end end; procedure StrToAlign( str :StringNum; var A :Alignment ); {convert S[str] to a trivial alignment of 1 string} var i :StringIndex; begin A.N := S[str].N; A.K := 1; A.s[1] := str; A.t[0] := 0; A.i[1,0] := 0; for i := 1 to S[str].N do begin A.t[i] := CharToAlpha[ S[str].c[i] ]; A.i[1,i] := i end end; procedure Project( A :Alignment; ss :StringSet; var X, Y :Alignment ); {project an alignment A onto 2 subalignments X and Y as determined by} {the strings in ss. Those in ss -> X, others -> Y.} var i, inA :StringIndex; str, strN :StringNum; strX, strY :integer; procedure RemoveZeros( var A :Alignment );{remove all-null tuples} var i, N :StringIndex; str :StringNum; begin N := A.N; A.N := 0; for i := 1 to N do if A.t[i] <> 0 then {remove zero tuples} begin A.N := A.N + 1; A.t[A.N] := A.t[i]; for str := 1 to A.K do A.i[str, A.N] := A.i[str, i] end end; begin X.N := A.N; X.K := 0; X.t[0] := 0; Y.N := A.N; Y.K := 0; Y.t[0] := 0; for str := 1 to A.K do if A.s[str] in ss then begin X.K := X.K+1; X.s[X.K] := A.s[str]; X.i[X.K,0] := 0 end else begin Y.K := Y.K+1; Y.s[Y.K] := A.s[str]; Y.i[Y.K,0] := 0 end; Assertion(X.K+Y.K = A.K, 'project '); for i := 1 to A.N do begin strX := X.K+1; strY := Y.K+1; X.t[i] := 0; Y.t[i] := 0; for str := A.K downto 1 do {up a column of A} begin strN := A.s[str]; inA := A.i[str, i]; if strN in ss then {one for X} begin strX := strX - 1; X.i[strX,i] := inA; X.t[i] := X.t[i] * (AlphabetSize+1) + CharToAlpha[ S[strN].c[inA] ] end else {one for Y} begin strY := strY - 1; Y.i[strY,i] := inA; Y.t[i] := Y.t[i] * (AlphabetSize+1) + CharToAlpha[ S[strN].c[inA] ] end; end {str} end {i}; RemoveZeros( X ); RemoveZeros( Y ) end {Project}; procedure ShowNames( A :Alignment ); var str :StringNum; i :integer; begin for str := 1 to A.K do begin write( 's', A.s[str] :1, ': ', S[A.s[str]].name, ': ' ); for i := 1 to 60 do write(S[A.s[str]].description[i]); writeln end end; procedure ShowNumbers( A :Alignment ); var str :StringNum; begin for str := 1 to A.K do write(' s', A.s[str]:1 ) end; procedure ShowAlignment( A :Alignment ); const cols = 50; var i, j :StringIndex; str, strN, temp :StringNum; Perm :array [StringNum] of StringNum; col :1..cols; Letters :array [1..cols] of AlphabetSet; L :Alphabet; Letter :array [1..cols] of Alphabet; NLetters:array [1..cols] of integer; begin write('Alignment:'); for str := 1 to A.K do Perm[str] := str; for str := 1 to A.K-1 do {Bubblesort} {inv: Perm[1..str-1] sorted and before Perm[str..A.K]} for strN := A.K-1 downto str do if A.s[Perm[strN]] > A.s[Perm[strN+1]] then begin temp := Perm[strN]; Perm[strN] := Perm[strN+1]; Perm[strN+1] := temp end; for i := 0 to (A.N - 1) div cols do {each slab} begin writeln; for col := 1 to cols do begin Letters[col]:=[]; NLetters[col]:=0 end; for temp := 1 to A.K do {each string in A} begin str := Perm[temp]; strN := A.s[str]; j := i*cols + 1; {find next non-zero index for str} while (j < A.N) and (A.i[str, j] = 0) do j:=j+1; {j=A.N or A.i[str,j]<>0} write( 's', strN :2, ' ', S[strN].name, '[', A.i[str,j] :4, '.. ]: '); for j := i*cols+1 to min((i+1)*cols, A.N) do begin col := j-i*cols; L := CharToAlpha[ S[strN].c[ A.i[str,j] ] ]; if not(L in Letters[ col ]) then begin Letters[ col ] := Letters[ col ] + [L]; NLetters[ col ] :=NLetters[ col ] + 1; Letter[ col ] := L end; if A.i[str, j] > 0 then write( S[strN].c[A.i[str,j]] ) else write('-') end; writeln end; write(' CONSENSUS [', i*cols+1 :4, '.. ]: '); for col := 1 to cols do if NLetters[col] = 1 then write( AlphaToChar[ Letter[col] ] ) else if NLetters[col] = 2 then write('+') else write('.'); writeln end; writeln; ShowNames( A ); writeln('**********') end {ShowAlignment}; {-----------------------------------------------------------------------------} procedure Align( A, B :Alignment; var R :Alignment; Mode :SearchMode; Multiplier :real; T :Tree; Mc :Machines; var Freq :Machines; var StrML :real ); {Multiplier multiplies msg len's ie. -log(probs) ie. raise probs to that power, NonDet & Multiplier=1 => Gibbs Sampling Multiplier>1, increasing => Simulated Annealing} type TwoRows = array[0..1, StringIndex] of real; var EDir :packed array [StringIndex, StringIndex] of Direction; {N**2 space} EML :TwoRows; {EML[,] = values of Msg Len of partial alignments; EDir for traceback.} ET :real; {expected # tuples / char of (arbitrary) "ancestor"} leftshift :integer; HashTable :array [TupleNum] of record occurs :integer; ML :real end; HTentries :integer; Tn :TupleNum; Tp :Tuple; str :StringNum; TotalProb :real; Transition :array [MachineNum, Alphabet, Alphabet] of real; {Op probs} x, y :Alphabet; Pcopy, Pchange, Pinsert, Pdelete :real; M :MachineNum; Op :Operation; Optotal :real; Estimating {new mc params in Freq} :boolean; ParamsML : real; { . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . } procedure Explain( Tp :Tuple; Tn :TupleNum ); var P, Left, Right :array [Hypothetical] of AlphaProb; {global Up, Down} Sleft, Sright :Hypothetical; x :Alphabet; Downleft, Downright :AlphaProb; CharBelow :array [Hypothetical] of integer; {real chars} NRealChar :integer; i :StringNum; TupleProb :real; function Allowed( A1, A2 :Alphabet; NR :integer ) :boolean; { A1<---------------------->A2 can only insert A1->A2 if no real NRealChar-NR NR chars below A1 & similarly A2->A1 ... } begin if not(Null in [A1,A2]) then Allowed := true else if [A1,A2]=[Null] then Allowed := true else if (A1=Null) and (NRealChar-NR = 0) then Allowed := true else if (A2=Null) and (NR = 0) then Allowed := true else Allowed := false end; procedure Up( T :Tree ); {post-order} var str, Sleft, Sright :Hypothetical; Mleft, Mright :MachineNum; Tleft, Tright :Tree; x, y :Alphabet; begin str := T^.s; Tleft := T^.left; Tright := T^.right; if Tleft = nil then {leaf} begin for x := Null to AlphabetSize do P[str, x] := 0; P[str, Tp[str]] := 1.0; CharBelow[str] := ord( Tp[str] <> Null ) end else {not a leaf} begin Up( Tleft ); Up( Tright ); CharBelow[str] := CharBelow[Tleft^.s] + CharBelow[Tright^.s]; Sleft := Tleft^.s; Sright := Tright^.s; Mleft := Tleft^.m; Mright := Tright^.m; for x := Null to AlphabetSize do begin Left[str,x] := 0; Right[str,x] := 0 end; for x := Null to AlphabetSize do for y := Null to AlphabetSize do begin if Allowed(x, y, CharBelow[Tleft^.s]) then Left[str,x] := Left[str,x] + P[Sleft,y]*Transition[Mleft,x,y]; if Allowed(x, y, CharBelow[Tright^.s]) then Right[str,x] := Right[str,x] + P[Sright,y]*Transition[Mright,x,y] end; for x := Null to AlphabetSize do P[str,x] := Left[str,x]*Right[str,x] end end {Up}; { |above | | NB. real tree is unrooted so directions are "arbitrary" v T^:* P[T^.s] contains probs of value of char at this node ^ ^ . . . . left. .right } procedure Down( T :Tree; above :AlphaProb; TopRight :boolean );{preorder} var abv, DL, DR :AlphaProb; str :Hypothetical; mc :MachineNum; x, y :Alphabet; Mcne :Machine {the one "above" str, T^.m}; Op :Operation; delta:real; begin str := T^.s; mc := T^.m; for Op := NoOp to Indel do Mcne[Op] := 0.0; for x := Null to AlphabetSize do abv[x] := 0; for x {above} := Null to AlphabetSize do for y {me} := Null to AlphabetSize do begin if Allowed(x, y, CharBelow[str]) then delta := above[x]*Transition[mc, x, y] else delta := 0; abv[y] := abv[y] + delta; Op := A2toOp[ x, y ]; Mcne[Op] := Mcne[Op] + P[str, y] * delta end; for x := Null to AlphabetSize do P[str, x] := P[str,x] * abv[x]; if Estimating then begin if not TopRight then begin delta := 0.0; {normalise} for Op := NoOp to Indel do delta := delta + Mcne[Op]; for Op := NoOp to Indel do Freq[mc, Op] := Freq[mc, Op] + HashTable[Tn].occurs * Mcne[Op] / delta end; if T^.left <> nil then {not a leaf} begin for x := Null to AlphabetSize do begin DL[x] := Right[str,x]*abv[x]; DR[x] := Left [str,x]*abv[x] end; Down( T^.left, DL, false ); Down( T^.right, DR, false ) end end {Estimating} end {Down}; begin {Explain} NRealChar := 0; for i := 1 to K do if Tp[i]<>Null then NRealChar := NRealChar+1; Up( T^.left ); Up( T^.right ); Sleft := T^.left^.s; Sright := T^.right^.s; Downleft[Null] := P[Sright, Null]; Downright[Null]:= P[Sleft, Null]; {***following could be changed to reflect base (letter) frequencies} for x := 1 {sic} to AlphabetSize do begin Downleft[x] := P[Sright, x] / AlphabetSize; {assume equi-probable} Downright[x]:= P[Sleft, x] / AlphabetSize end; Down( T^.left, Downleft, false ); Down( T^.right, Downright, true ); if HashTable[Tn].occurs < 0 then begin HTentries := HTentries + 1; HashTable[Tn].occurs := 0 {present}; TupleProb := sum( P[Sleft] ) / ET{per tuple}; HashTable[Tn].ML := -log2( TupleProb ); TotalProb := TotalProb + TupleProb {total only over tuples seen} end end {Explain}; { . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . } function cost( i, j :StringIndex ) :real; {Msg Len of a single tuple} var Tn :TupleNum; Tp :Tuple; str, strN :StringNum; begin {cost} Tn := A.t[i] * leftshift + B.t[j]; {NB. A.t[0] = B.t[0] = 0} if HashTable[Tn].occurs < 0 then {absent} begin for str := 1 to K do Tp[str] := Null; for str := 1 to B.K do begin strN := B.s[str]; Tp[strN] := CharToAlpha[ S[strN].c[B.i[str,j]] ] end; for str := 1 to A.K do begin strN := A.s[str]; Tp[strN] := CharToAlpha[ S[strN].c[A.i[str,i]] ] end; Explain( Tp, Tn ) end; cost := HashTable[Tn].ML end {cost}; { . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . } procedure DPA(var EML :TwoRows; {The Dynamic Programming Algorithm} Alo,Astep{+-1},Ahi, Blo,Bstep{+-1},Bhi :StringIndex; rightTerminal :boolean); var ML :array[Direction] of real; choice :Direction; i, j :StringIndex; NewRow, OldRow :0..1; begin {DPA} {Top/Bottom Boundary Conditions:} NewRow := (Alo-Astep) mod 2; {0 or 1} EML [NewRow, Blo-Bstep] := 0.0; {NB. EDir[Alo-Astep, Blo-Bstep] is NOT assigned or changed!} j := Blo; while j*Bstep <= Bhi*Bstep do {Boundary conditions} begin if (Mode=NonDeterministic) and not rightTerminal and (Bstep<0) then EML[NewRow, j] := Infinity {NB. breaks after A[?]} else EML[NewRow, j] := EML[NewRow,j-Bstep] + cost(0,j); EDir[Alo-Astep,j] := Horizontal; j := j+Bstep end; {Main Loops:} i := Alo; while i*Astep <= Ahi*Astep do begin OldRow := NewRow; NewRow := i mod 2; {Left/Right Boundary Conditions:} EML [NewRow, Blo-Bstep] := EML[OldRow, Blo-Bstep] + cost(i,0); EDir[i, Blo-Bstep] := Vertical; j := Blo; while j*Bstep <= Bhi*Bstep do {along a row} begin {General Step:} if (Mode=NonDeterministic) and (Astep>0) and (i=Ahi) then ML[Horizontal] := Infinity{must end with } else ML[Horizontal] := EML[NewRow,j-Bstep] + cost(0,j); ML[Vertical] := EML[OldRow,j ] + cost(i,0); ML[Diagonal] := EML[OldRow,j-Bstep] + cost(i,j); case Mode of Deterministic: begin if ML[Horizontal] < ML[Vertical] then{min cost, max probability} if ML[Horizontal] < ML[Diagonal] then choice := Horizontal else choice := Diagonal else {MLvert<=MLhoriz} if ML[Vertical] < ML[Diagonal] then choice := Vertical else choice := Diagonal; EML [NewRow,j] := ML[choice]; EDir[i, j] := choice; end; NonDeterministic: EML [NewRow,j] := {-log2(sum of probabilities)} -LogSum(-ML[Diagonal],LogSum(-ML[Vertical],-ML[Horizontal])) end {case}; j := j+Bstep end {j}; i := i+Astep end {j} {Main Loop}; end {DPA}; { . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . } procedure ChooseAlignment( Alo,Ahi, Blo,Bhi :StringIndex ); const ten6 = 1000000; {Reminiscent of Hirschberg's divide and conquer, but has another purpose!} var FwdRows, RevRows :TwoRows; {for DPA fwd, DPA rev and Choose} str :StringNum; procedure Step( i, j :StringIndex ); var str :StringNum; begin {Step} R.N := R.N+1; {next column of alignment} Assertion(R.N <= Nmax, 'Aln length'); R.t[R.N] := 0;{init' column's tuple number} if i>0 then R.t[R.N] := R.t[R.N]+A.t[i]*leftshift; {A is most sig' posn} if j>0 then R.t[R.N] := R.t[R.N]+B.t[j]; {B is least sig' } HashTable[R.t[R.N]].occurs := HashTable[R.t[R.N]].occurs+1; StrML := StrML + HashTable[R.t[R.N]].ML {msg length of new alignment R}; for str := 1 to B.K do R.i[str, R.N] := B.i[str, j]*ord(j>0); for str := 1 to A.K do R.i[str+B.K, R.N] := A.i[str, i]*ord(i>0); end {Step}; procedure CA( Alo,Ahi, Blo,Bhi :StringIndex; rightTerminal :boolean ); var Amid, Blo2, Bmid, i, j :StringIndex; FwdR, RevR :0..1; insB, ML :real; function Choose( Blo, Bhi :StringIndex ) :StringIndex; var j, Bmid, Blast :StringIndex; rand, MinVal, ML, Prob, ProbTotal :real; {choose Bmid randomly with prob ~ exp2(-FwdRows[FwdR,Bmid])} {POST: Blo <= Bmid <= Bhi; return(Bmid)} begin {Choose} MinVal := FwdRows[FwdR,Blo]; for j := Blo+1 to Bhi do {find minimum value} begin ML := FwdRows[FwdR, j]; if ML < MinVal then MinVal := ML end; ProbTotal := 0.0; for j := Blo to Bhi do {for normalisation} begin ML := FwdRows[FwdR, j]-MinVal { >= 0 }; if ML <= 20 {bits, say} then begin Prob := exp2(-ML){0 0.0 do {make the "random" choice} if rand <= FwdRows[FwdR, Bmid] then rand := -1.0 {stop !} else begin rand := rand-FwdRows[FwdR, Bmid]; Bmid := Bmid+1 end; Assertion((Blo<=Bmid)and(Bmid<=Bhi), 'Choose '); Choose := Bmid { >= Blo & <= Bhi} end {Choose}; begin {CA} if Alo > Ahi then {A null} for j := Blo to Bhi do Step(0, j){insert B} else {A ~null} if Blo>Bhi then {B null} for i := Alo to Ahi do Step(i, 0){delete A} else {A ~null and B ~null} if Alo = Ahi then {A length = 1} begin insB := 0.0;{eventually cost to insert all of Blo..Bhi} for j := Blo to Bhi do begin ML := cost(0, j){insert}; FwdRows[0, j] := ML; insB := insB+ML end; if rightTerminal then Blo2:=Blo else Blo2:=Bhi{break alignment JUST after A[Alo]}; for j := Blo2 to Bhi do FwdRows[1, j] := (insB-FwdRows[0,j]+cost(Alo, j){1 match/change})*Multiplier; FwdRows[1,Blo2-1] := (insB+cost(Alo,0){del})*Multiplier; if rightTerminal then FwdRows[1,Blo2-1]:=FwdRows[1,Blo2-1]-log2(Bhi-Blo+2){all ways}; FwdR := 1{for Choose!}; Bmid := Choose(Blo2-1, Bhi); if Bmid=Blo2-1 then {choose a,b1,b2 | b1,a,b2 | b1,b2,a say} begin if rightTerminal then Bmid:=Blo+random(Bhi-Blo+2){uniform; Blo<=Bmid<=Bhi+1} else Bmid:=Bhi+1 {A[Alo] comes LAST}; for j := Blo to Bmid-1 do Step(0, j); {insert 0 or more} Step(Alo, 0); {delete} for j := Bmid to Bhi do Step(0, j) {insert 0 or more} end else {Blo2 <= Bmid <= Bhi; eg. -b1,a|b2,-b3 } begin for j := Blo to Bmid-1 do Step(0, j); {insert} Step(Alo, Bmid); {match/change} for j := Bmid+1 to Bhi do Step(0, j) {insert} end end else {A length >= 2 & B not null} begin Amid := (Alo+Ahi)div 2; DPA( FwdRows, Alo, 1,Amid, Blo, 1,Bhi, false); DPA( RevRows, Ahi,-1,Amid+1, Bhi,-1,Blo, rightTerminal); FwdR := Amid mod 2; RevR := 1-FwdR; for i := Blo-1 to Bhi do {MsgLengths of partitions of B} FwdRows[FwdR,i]:=(FwdRows[FwdR,i]+RevRows[RevR,i+1])*Multiplier; Bmid := Choose(Blo-1, Bhi); {NB. Alo < Amid < Ahi and Blo-1 <= Bmid <= Bhi} CA( Alo, Amid, Blo, Bmid, false ); {Divide and ...} CA( Amid+1,Ahi, Bmid+1,Bhi, rightTerminal ) {Conquer} end end {CA}; begin {ChooseAlignment} StrML := 0.0; {init' alignment msg length} R.N := 0; {clear alignment R} R.K := A.K+B.K; R.t[0] := 0; for str := 1 to B.K do R.s[str ] := B.s[str]; for str := 1 to A.K do R.s[str+B.K] := A.s[str]; for str := 1 to R.K do R.i[str,0] := 0; CA( Alo, Ahi, Blo, Bhi, true ) {does the work} end {ChooseAlignment}; { . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . } procedure ExtractAlignment; procedure path( i, j :StringIndex ); var dir :Direction; str :StringNum; begin if i+j > 0 {either > 0} then begin dir := EDir[i,j]; path( i-ioffset[dir], j-joffset[dir] ); {recursion} R.N := R.N + 1; {A forms most sig' digits of R.t[], B least sig' digits} R.t[R.N] := A.t[i] * leftshift * ioffset[dir] + B.t[j] * joffset[dir]; HashTable[R.t[R.N]].occurs := HashTable[R.t[R.N]].occurs + 1; for str := 1 to B.K do {B becomes least sig digits of R} R.i[str , R.N] := B.i[str,j] * joffset[dir]; for str := 1 to A.K do {A most sig digits of R} R.i[str+B.K, R.N] := A.i[str,i] * ioffset[dir]; end else { i = j = 0; arrived at origin } begin R.N := 0; R.K := A.K + B.K; R.t[0] := 0; for str := 1 to B.K do R.s[str ] := B.s[str]; for str := 1 to A.K do R.s[str+B.K] := A.s[str]; for str := 1 to R.K do R.i[str,0] := 0; end end {path}; begin path(A.N, B.N) end {ExtractAlignment}; {. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .} begin {Align} write('Align '); if Mode=NonDeterministic then write('Non'); write('Det Mult=', Multiplier:5:2); ShowNumbers(A); write(' with'); ShowNumbers(B); writeln; Estimating := false; leftshift := round( Expn( AlphabetSize + 1, B.K ) ); TotalProb := 0; {of seen tuples} ET := 1.0; {Expected # visible & invisible Tuples / char of (arb') ancestor} for M := 1 to 2*K-3 do begin Pcopy := Mc[M][Copy]; Pchange := Mc[M][Change]; Pinsert := Mc[M][Indel] / 2; Pdelete := Pinsert {could change this}; ET := ET + Pinsert/(1-Pinsert); {expected # tuples / char} {***Here is where one would alter the rates for transitions/transversions:} for x := 1 to AlphabetSize do for y :=1 to AlphabetSize do {NB. per char rates} Transition[M, x, y] := Pchange/(AlphabetSize - 1)/(1-Pinsert); {***Insert rate could be changed to reflect base (letter) frequencies:} for x := 1 to AlphabetSize do begin Transition[M, Null, x] := Pinsert/AlphabetSize/(1-Pinsert); Transition[M, x, Null] := Pdelete/(1-Pinsert); Transition[M, x, x] := Pcopy/(1-Pinsert) end; Transition[M, Null, Null] := 1.0 {!!!} end; HTentries := 0; for Tn := 0 to round( Expn( AlphabetSize+1, A.K+B.K ) ) - 1 do HashTable[Tn].occurs := - 1 {absent}; {now get alignment:} case Mode of Deterministic: begin DPA( EML, {A:}1,1{fwd},A.N, {B:}1,1{fwd},B.N, true); StrML := EML[ A.N mod 2, B.N ]; ExtractAlignment; end; NonDeterministic: ChooseAlignment( {A:}1,A.N, {B:}1,B.N ) end{case}; write('Hash Table = ', 100*HTentries/(Expn( AlphabetSize+1, R.K )-1) :7:2, '%, ', ' P(seen tuples)=', TotalProb:6:4); if (TotalProb<0) or (TotalProb>1.00001) then write(' warning ***'); writeln; ParamsML := 0; for M := 1 to 2*K-3 do ParamsML := ParamsML + {this may be a slight overestimate ***} paramcost( Mc[M,Copy]*R.N, Mc[M,Change]*R.N, Mc[M,Indel]*R.N ); write( 'MsgL = ', U(K)+log2(NTrees(K))+U(R.N)+ParamsML+errorcost+StrML :7:1, ' bits, '); ShowInfixTree( T ); writeln; writeln(' =', U(K) :4:1, '(K)', '+', log2(NTrees(K)) :4:1, '(tree)', '+', U(R.N) :5:1, '(length)', '+', ParamsML :5:1, '(params)', '+', errorcost :3:1,'(error)', '+', StrML :6:1, '(algn)'); {Estimate new parameters:} Estimating := R.K=K; if Estimating then begin for M := 1 to 2*K-3 do for Op := NoOp to Indel do Freq[M, Op] := 1/3; for Tn := 1 to round( Expn(AlphabetSize+1, R.K) ) - 1 do if HashTable[Tn].occurs > 0 then begin for str := 1 to R.K do Tp[ R.s[str] ] := Tn div round(Expn(AlphabetSize+1,str-1)) mod (AlphabetSize+1); Explain( Tp, Tn ) {set Freq} end; writeln; writeln('next estimate of parameters:'); for M := 1 to 2*K-3 do begin Optotal := 0.0; for Op := Copy {!} to Indel do Optotal := Optotal + Freq[M, Op]; write( 'mc', M:1, ': '); for Op := Copy {!} to Indel do begin Freq[M, Op] := Freq[M, Op] / Optotal; write(Freq[M,Op]:6:4, ' ') end; writeln('; ', paramcost(Freq[M,Copy]*R.N, Freq[M,Change]*R.N, Freq[M,Indel]*R.N) :5:1, ' bits') end end end {Align}; {-----------------------------------------------------------------------------} procedure RecAlign( var R :Alignment; Mc :Machines; var Est :Machines; var StrML :real ); {Get an initial alignment, not guaranteed optimal} {Time is (2*K-3)*Time(Align) ie. (2*K-3)*N**2 or O(K*N**2) } procedure RA( var R :Alignment; T :Tree ); { Bottom-up alignment } var Left, Right :Alignment; begin if T^.left = nil then StrToAlign( T^.s, R ) else begin RA( Left, T^.left ); RA( Right, T^.right ); Align(Right,Left,R, Deterministic,0, {local}T, Mc,Est, StrML ) end end {RA}; begin RA( R, T ) end {RecAlign}; {-----------------------------------------------------------------------------} procedure Improve( var R :Alignment; Mc :Machines; var Est :Machines; var StrML :real); {attempt to improve the alignment} var A, B :Alignment; ss :StringSet; {BEWARE: A, B, R are global to Im} StrML1 :real; procedure Im( Tlocal :Tree; var ss :StringSet ); var ss1, ss2 :StringSet; begin if Tlocal^.left <> nil then { Tlocal not a leaf } begin Im( Tlocal^.left, ss1 ); Im( Tlocal^.right, ss2 ); ss := ss1 + ss2; Project(R, ss1, A, B); Align(A, B, R, Deterministic,0, {whole}T, Mc, Est, StrML); Mc := Est; if ss <> [1..K] then begin Project(R, ss2, A, B); Align(A,B,R, Deterministic,0, T{whole}, Mc,Est, StrML); Mc := Est end end else {leaf} ss := [ Tlocal^.s ] end {Im}; begin repeat StrML1 := StrML; Im(T, ss); until StrML >= StrML1 - Convergence; ShowAlignment( R ) end {Improve}; {. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .} procedure Sample( var R :Alignment; Mc :Machines; var Avg :Machines; Nsamples :integer); var A, B :Alignment; M :MachineNum; Mc2, StdDev :Machines; Op :Operation; sample :integer; MinML, MaxML, AvgML, StdDevML, Sum1ML, Sum2ML, Optotal :real; function find( m:MachineNum; T:Tree ):Tree; var T1:Tree; begin if T=nil then find:=nil else if T^.m=m then find:=T else begin T1:=find(m, T^.left); if T1<>nil then find:=T1 else find:=find(m, T^.right) end end; function getstrings(T:Tree):StringSet; begin if T=nil then getstrings:=[] else getstrings:=getstrings(T^.left)+getstrings(T^.right)+[T^.s] end; begin {Sample} for M := 1 to 2*K-3 do for Op := NoOp to Indel do begin Avg[M,Op]:=0.0; StdDev[M,Op]:=0.0 end; AvgML := 0.0; StdDevML := 0.0; MinML := Infinity; MaxML := 0.0; Sum1ML := Infinity; Sum2ML := Infinity; for sample := 1 to Nsamples do begin M := random(2*K-3)+1; write('break tree on mc', M:1, '; '); Project(R, getstrings(find(M,T)), A, B); {break Tree at an edge} Align(A, B, R, NonDeterministic, {mul:}1.0 {+5*sample/Nsamples{>1=>Sim Ann'}, T, Mc, Mc2, StrML); {re-align} if StrML < MinML then MinML := StrML; if StrML > MaxML then MaxML := StrML; AvgML := AvgML+StrML; StdDevML := StdDevML+sqr(StrML); Sum1ML := -LogSum( -Sum1ML, -StrML); if sample >= Nsamples div 5 then Sum2ML := -LogSum(-Sum2ML,-StrML); {ShowAlignment( R );} for M := 1 to 2*K-3 do for Op := Copy to Indel do begin Avg[M, Op] := Avg[M, Op]+Mc2[M, Op]; StdDev[M, Op]:= StdDev[M,Op] + sqr(Mc2[M,Op]); Mc [M, Op] := Mc[M, Op] + 0.1 * Mc2[M, Op] {lagged}; NormaliseMachines( Mc ) end end; ShowAlignment( R ); AvgML := AvgML/Nsamples; StdDevML := StdDevML/Nsamples-sqr(AvgML); if StdDevML >= 0 then StdDevML := sqrt(StdDevML) else StdDevML := 0; writeln('After ', Nsamples:1, ' samples:', ' AvgMsgL (algn)=', AvgML:7:1, '(+-', StdDevML:6:1, ')', MinML:7:1, ' ..', MaxML:7:1); writeln('LogSum 100% sample (algn)=', Sum1ML :7:1, ', LogSum 80% sample =', Sum2ML :7:1); NormaliseMachines( Avg ); for M := 1 to 2*K-3 do begin write('mc', M:1, ': '); for Op := Copy to Indel do begin StdDev[M, Op] := StdDev[M, Op]/Nsamples-sqr(Avg[M,Op]); if StdDev[M,Op] >= 0 then StdDev[M,Op] := sqrt(StdDev[M,Op]) else StdDev[M,Op] := 0.0; write( Avg[M, Op]:6:4, '(+-', StdDev[M,Op]:5:3, ') ') end; writeln end; write('Avg: '); for Op := Copy to Indel do begin Optotal:=0.0; for M := 1 to 2*K-3 do Optotal:=Optotal+Avg[M,Op]; write(Optotal/(2*K-3):6:4, ' ':10) end; writeln end {Sample}; {-----------------------------------------------------------------------------} begin {main} writeln('(c) L.Allison, Dept. Comp. Sci, Monash University, Australia:'); writeln('Minimum Message Length Evolutionary-Tree and Multiple-Alignment'); writeln('Version: 1-state model, 25 May 1993'); Assertion(Nmax1 = Nmax+1, 'Nmax1 '); Assertion(HypMax = 2*Kmax-2, 'HypMax '); Assertion(McMax = 2*Kmax-3, 'McMax '); Assertion(TupleMax = Expn(AlphabetSize+1, Kmax)-1, 'TupleMax '); seed.lo := 13; seed.hi := 107;{for random number generator} LogSumArr[0] := -99; {flag it needs init'} {Initialise constants:} for ch := chr(0) to chr(127) do CharToAlpha[ch] := 0; CharToAlpha['a'] := 1; CharToAlpha['A'] := 1; CharToAlpha['c'] := 2; CharToAlpha['C'] := 2; CharToAlpha['g'] := 3; CharToAlpha['G'] := 3; CharToAlpha['t'] := 4; CharToAlpha['T'] := 4; CharToAlpha['u'] := 4; CharToAlpha['U'] := 4; {for RNA} AlphaToChar[0] := '-'; {standard output representation} AlphaToChar[1] := 'A'; AlphaToChar[2] := 'C'; AlphaToChar[3] := 'G'; AlphaToChar[4] := 'T'; for x := 1 to AlphabetSize do for y := 1 to AlphabetSize do A2toOp[x,y] := Change; for x := 1 to AlphabetSize do begin A2toOp[Null, x] := Indel; A2toOp[x, Null] := Indel; A2toOp[x, x] := Copy end; A2toOp[Null, Null] := NoOp; ioffset[Diagonal] := 1; joffset[Diagonal] := 1; ioffset[Horizontal] := 0; joffset[Horizontal] := 1; ioffset[Vertical] := 1; joffset[Vertical] := 0; {Default Machines:} for M := 1 to 2*Kmax-3 do begin Mc[M][Copy] := 0.9; Mc[M][Change] := 0.05; Mc[M][Indel] := 1.0 - Mc[M][Copy] - Mc[M][Change] end; {Get Strings:} writeln('Type # of strings >= 2 and <= ', Kmax:1); readln(K); Assertion( (K>=2) and (K<=Kmax), 'come on! '); writeln('Type ', K:1, ' strings'); for str := 1 to K do ReadString( S[str] ); {Do Business:} writeln('Type Tree, eg. ((1 2)(3 4))'); while not eof do begin T := ParseTree; Name(T); ShowTree(T); RecAlign( A, Mc, Est, StrML ); Improve( A, {initial} Est, {better} Est, StrML ); NullTheory; Sample( A, Est, Est, 1000{samples}); writeln('********** end of tree **********'); writeln('Type next tree? or ') end; 99: writeln('********** finish **********') {Lloyd ALLISON, Dept.Comp.Sci. (now FIT), Monash University, Australia 3800} end.