Modelling v. Shuffling

A seminar to the Bioinformatics Group & Department of Computer Science, University of Wales Aberystwyth, B20, 4pm, Wednesday, 11 August 2004.

We Define:: Global and local, optimal and summed, affine (linear) gap-costs (i.e. 3-state), modelling-alignment (M-alignment).
Compare it to:: Standard Smith-Waterman local-alignment with; significance-test based on shuffling and re-alignment.
Results:: Both solve easy problems equally well.; M-alignment performs better on compressible populations, mixed populations, and populations of mixed sequences.

This file is at users.monash.edu/~lloyd/Seminars/200408-Align/index.shtml and includes links to extra material.

Some Different Questions on Sequences S1 & S2

1. Globally Related?: I.e. Is (all of) S1 related to (all of) S2?; Sum probabilities of all alignments (an alignment ~ a sed script).
2. Global optimal-alignment:: I.e. How is (all of) S1 best related to (all of) S2?
3. Locally Related?: I.e. Is part of S1 related to part of S2?; Sum prob's of all possible ways this can be so.
4. Local optimal-alignment:: I.e. How is part of S1 best related to part of S2?

S1 and S2 are related if S1 tells us something new & useful about S2, and v.v., i.e. if: Pr(S1&S2 | unrelated) = Pr(S1) . Pr(S2) < Pr(S1&S2 | related) = Pr(S1) . Pr(S2 | S1&related) = Pr(S2) . Pr(S1 | S2&related); ---> logs; think compression; use [MML].
`Compressible', `non-random', & `low to medium information content' all mean the same thing.

Local v. global.

3-State model (machine): Affine (linear) gap-costs, summed- or optimal-alignment

e.g. LCS ~ max; Edit D' ~ min; summed ~ Σ probs.

Characters (bases, amino acids) are not all equal in all contexts, hence model in context (red):

generic (global, 1-state) DPA	(global, optimal) M-alignment
m[0,0] = z m[i,0] = f(m[i-1,0], c(a[i],`_')), i=1..\|a\| m[0,j] = f(m[0,j-1], c(`_',b[j])), j=1..\|b\| m[i,j] = g(f(m[i-1,j-1], c(a[i],b[j])), f(m[i-1,j ], c(a[i],`_')), f(m[i, j-1], c(`_',b[j]))), i=1..\|a\|, j=1..\|b\|	g = max f = × c(x, y) = Pr(x, y) z = 1	r e a d o f f l i n e
	At m[i,j]: c(x,x)= Pr(match) × Pr(x \| a[1..i-1], b[1..j-1]) c(x,y)= Pr(mismatch) × Pr(x,y \| x<>y, a[1..i-1], b[1..j-1]) c(x,`_')= Pr(delete) × Pr(x \| a[1..i-1]) c(`_',y)= Pr(insert) × Pr(y \| b[1..j-1])

Generalizes to k-states, local & global, optimal & summed.

ROC curves often used in this area.

Want high coverage (few false -ves) and few errors (few false +ves), i.e. near bottom-right of graph is "good".

(Printer warning: May lose graph lines if page is shrunk or ¬colour.)

Easiest problem, new method ~ shuffling.
green: PRSS p-value (blue: S-W raw score); red: (summed) M-align (Markov=1).	Uniform random pop'n (2 bits/base): All methods should do well, & they do.

(Printer warning: May lose graph lines if page is shrunk or ¬colour.)

Easy problem, new method ~ shuffling
green: PRSS p-value (blue: S-W raw score); red: (summed) M-align (Markov=1).	0-order data, biased composition: PRSS good, M-align best.

(Printer warning: May lose graph lines if page is shrunk or ¬colour.)

Harder problem, new M-alignment method (red) best
green: PRSS p-value (blue: S-W raw score); red: (summed) M-align (Markov=1).	Mixed pop'n, high entropy 0-order seq's and low entropy 0-order seq's

(Printer warning: May lose graph lines if page is shrunk or ¬colour.)

Hard problem, new M-alignment method (red & purple) best
green: PRSS p-value; (blue: S-W raw score); red: (summed) M-align (Markov=1); purple: M-align (blended seq' model).	Pop'n of mixed seq's of high (2-bit/base) & low entropy 1st-order regions

Modelling-alignment (M-alignment) can (& should(!)) change (i) optimal alignments & (ii) rank-order of matches against a data-base.
Can use it with almost any population model.
On easy problems (0-order data): M-alignment ~ Smith Waterman + shuffling sig' test.
On hard problems: M-alignment is much more accurate (fewer false +ves, fewer false -ves).

Reading:

L. Allison, D. Powell & T. I. Dix (1999). Compression and approximate matching. The Computer Journal, 42(1) pp1-10 -- Describes principle, compression. Implements 1-state, global, optimal version.
L. Allison, D. Powell & T. I. Dix (2000). Modelling is more versatile than shuffling. TR2000/83, CSSE, Monash University.
D. R. Powell, L. Allison & T. I. Dix (2004). Modelling-Alignment for Non-Random Sequences[*]. 17th Australian Computer Soc. (ACS) Australian Joint Conf. on Artificial Intelligence (AI2004), pp.203-214, Springer-Verlag, LNCS/LNAI 3339. [*] Link also leads to software.
LA: [Bioinformatics], [MML], [pub'ns].

[*] Like many things out of Computer Science, Monash, this work owes a debt to Chris Wallace (1933 - 7/8/2004).