-----JAVASCRIPT IS OFF-----
A Sequence on Sequences
1. Information content of two related(?) sequences.
2. Information content of multiple alignment given a phylogenetic tree.
3. Information content (compression) of one sequence.
4. Modelling-alignment, information content of non-random sequences.
5. Another compression model of biological sequences.
users.monash.edu/~lloyd/Seminars/2008-Sequences/index.shtml
1. Two related sequences [AWY92]
Global alignment, S1 || S2 :
(a) optimal: ... (A : A)
... (C : G)
... (- : G)
... (T : - ) ...,
(b) sum probabilities (exclusive hypotheses!).
NB. Don't forget lengths, |S1|, |S2| and |(S1||S2)|.
Gap costs:
Point costs :
1-state mutation (or generation) machine.
Linear gap costs :
3-state mutation machine.
Piecewise-linear :
5-state mutation machine.
Local alignment:
S1 = α; β; γ
S2 = δ; β'; ε
(effectively special gaps, at the ends)
total = α; γ; δ; ε; (β||β')
Null hypothesis:
I (S1 & S2) = I (S1) + I (S2).
2. Multiple alignment given a tree [AW94]
Algorithmic complexity a big concern.
Gibbs sampling is possible.
3. Compression of one sequence [AED98]
The Approximate Repeats Model
(ARM ) [AED98] is based on
biological principles:
Subsequences can be duplicated
as forward or reverse-complement,
then diverge (mutate) --
as in alignment models!
Gives good compression, and is sensitive ,
but algorithm is rather slow (but see XM ).
4. ‘Modelling-alignment’ for non-random sequences [APD99, PAD04]
Most alignment algs. consider S1 to be uniform random, ditto S2.
Low-info-content ≡ non-random ≡ compressible,
& cause problems.
Quick fixes, poor ones:
(a) masking out of repeats,
(b) shuffling and realignment,
but
(a) throws information away, and
(b) asks, is a bad alignment significant?!
A better fix:
model (e.g., MMk , ARM, XM, etc.)
each sequence within the alignment alg. --
modelling-alignment .
Changes what is an optimal alignment, and
asks, is a good alignment significant?
E.g., beats PRSS on ROC curves
(but v.hard to get it accepted).
5. Another compression model (XM) [CDAM07]
Very simple model:
committee of “ex perts”,
“base”-model, e.g., a MMk ,
forward and reverse-complement copy-experts,
weighted by Bayesian averaging.
Cannot consider all possible copy-experts (time).
Key sub-problem:
An index of some kind proposes likely candidates.
XM's compression
v. nearly as good as the ARM, but
XM is a much faster algorithm.
Summary
Interesting chicken and egg situation:
(a) Alignment of ≥2 sequences should incorporate
a population-model for an individual sequence, and
(b) compression (modelling) one sequence is like
local alignment of a sequence against itself.
How can you say what is ‘related’ (a) ,
unless you know what is ‘typical’ (b) ,
and v.v.?
Reading
[AED98]
L. Allison, T. Edgoose & T. I. Dix,
Compression of strings with approximate repeats ,
Intell. Sys. in Mol. Biol., pp.8-16, 1998.
[APD99]
L. Allison, D. Powell & T. I. Dix,
Compression and approximate matching ,
Comp. J., 42(1), pp.1-10, 1999,
doi:/10.1093/comjnl/42.1.1 .
[AWY92]
L. Allison, C. S. Wallace & C. N. Yee,
Finite-state models in the alignment of macro-molecules ,
J. Mol. Evol., 35(1), pp.77-89, 1992,
doi:10.1007/BF00160262 .
[AW94]
L. Allison & C. S. Wallace,
The posterior probability distribution of alignments and its application
to parameter estimation of evolutionary trees and to optimization of
multiple alignments ,
J. Mol. Evol., 39(4), pp.418-430, 1994,
doi:10.1007/BF00160274 .
[AY90]
L. Allison & C. N. Yee,
Minimum message length encoding and the comparison of macromolecules ,
Bull. of Math. Biol., 52(3), pp.431-453, 1990,
doi:10.1007/BF02458580 .
[BW69]
D. M. Boulton & C. S. Wallace,
The information content of a multistate distribution ,
J. Theor. Biol., 23, pp.269-278, 1969.
[CDAM07]
M. D. Cao, T. I. Dix, L. Allison, C. Mears,
A simple statistical algorithm for biological sequence compression ,
DCC, pp.43-52, March 2007,
doi:10.1109/DCC.2007.7 .
[PAD04]
D. R. Powell, L. Allison & T. I. Dix,
Modelling alignment for non-random sequences ,
Springer Verlag, LNCS/LNAI 3339, pp.203-214, 2004.
[W05]
C. S. Wallace ,
Statistical and Inductive Inference by Minimum Message Length ,
Springer-Verlag,
isbn:038723795X ,
2005.
©
Lloyd Allison
Faculty of Information Technology (Clayton),
Monash University, Australia 3800.
Created with "vi (Linux & Solaris)", charset=iso-8859-1