|
A C G T
.-------------------- P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
|
C| 9/20 1/20 1/20 9/10
|
G| 9/20 1/20 1/20 9/10
|
T| 9/12 1/12 1/12 1/12
MMg: an AT-rich, order-1 Markov model.
|
The two sequences initially in the FORM below
are generated from the AT-rich model MMg (right).
They are completely unrelated,
apart from both being generated from MMg,
and thus sharing some general statistical biases.
If the sequences are aligned under the usual
uniform model (try `Run'),
i.e. under the assumption that all characters are equally likely
then you will conclude that they are related by an
acceptable optimal alignment,
but the high number of matches is only
due to their both coming from MMg.
If
they are aligned under the correct model (here, MMg)
or under a model that can be fitted to their biases
(e.g. an order-1 Markov model),
then it is revealed that they are not related
by an acceptable alignment.
You can replace the sequences with others of similar length
and have them aligned.
The most interesting behaviour comes with sequences that are distantly related
and of medium or low information content.
In such cases,
aligning with knowledge of the true model (of the sequences) or
at least with knowledge of the model class,
gives more reliable results.
Demonstration program requires 2 seqs.
separated by at least one blank line: |
S1:
S2:
|
[precomputed]
|
NB.
It is assumed that the lengths of the sequences
are "common knowledge",
otherwise the lengths should also be included in the encodings,
using your favourite prior for a pair of lengths.
The formula used to estimate the cost
of a Hypothesis (H) and its parameters
assumes that any sequence is much longer than
the number of parameters
(12 for an order-1 MM over DNA).
This demonstration is based on the simple
point-mutation model of sequence mutation;
it is inappropriate to use it if the sequence lengths differ greatly,
for example.
But note that a similar treatment can be given to
linear
(affine) gap-costs,
piecewise linear gap costs,
global alignment,
local alignment,
optimal aligment,
summed alignment, etc.
[Powell et al 2004].
Limits
The web-demonstration above is
limited to sequences of up to a couple of hundred characters,
this is only because it is run on our web-server.
Models
The new alignment method can in general incorporate almost any model
of a "sequence"
-- see Allison et al (1998, 1999), Powell et al (2004) for details.
It is not limited to (hidden) Markov models (HMM).
References
- L. Allison.
Information-Theoretic Sequence Alignment, TR98/14,
School of Computer Science and Software Engineering, Monash University,
Australia, June 1998.
- L. Allison, D. Powell & T. I. Dix.
Compression and Approximate Matching.
Computer Journal, Oxford University Press,
42(1), pp.1-10, 1999.
- L. Allison, D. Powell and T. I. Dix.
Modelling Is More Versatile than Shuffling.
TR2000/98,
School of Computer Science and Software Engineering, Monash University,
Australia 3800, 2000.
-
D. R. Powell, L. Allison & T. I. Dix.
Modelling alignment for non-random sequences.
17th ACS Australian Joint Conf. on
Artificial Intelligence (AI2004),
Springer Verlag, LNCS/LNAI, vol.3339, pp.203-214, isbn:3-540-24059-4,
2004.
(Inc. [link] to software.)
|
- Seminars:
Sequence Complexity and Alignment,
and
Modelling v. Shuffling in Sequence Alignment.
- Uptodate: [#Alignment].
|
|