DNA Compression and DNA Complexity

Below are (i) a repetitive, low-information-content DNA sequence and (ii) a random, high-information-content sequence.

When a sequence is analysed, its complexity (also known as its information content or entropy) is computed under

  1. a zero-order Markov model,
  2. a first-order Markov model, and
  3. the [AED] model (see refs)

The parameters of a particular model (zero-order MM, etc.) are fitted to the given sequence, and the calculations include an estimate of the cost, in bits, of stating the parameters of the hypothetical model (H) to optimum accuracy. The cost of encoding the data given the model (D|H) is also given. The sum, H+(D|H), is the total cost of encoding the sequence under a particular model; the model that gives the lowest total cost has the greatest posterior probability. Costing H prevents over-fitting: although a more complex model may fit the data better (lower D|H), it has to "pay" for its own greater complexity, H. There is a natural null hypothesis, that the sequence consists simply of random characters; in this case H = 0 and (D|H) = 2 bits/base for DNA.
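The comparison above can be sketched in code. The following is a minimal illustration only, not the program's actual formulae: it assumes a simplified parameter cost of (k/2)·log2(N) bits for k free parameters, a common MML/MDL-style approximation.

```python
from math import log2

def null_cost(seq):
    """Null hypothesis: H = 0, D|H = 2 bits/base for DNA."""
    return 2.0 * len(seq)

def zero_order_cost(seq):
    """Total MML-style cost, H + (D|H), under a zero-order Markov model.
    H is approximated as (k/2)*log2(N) bits for the k = 3 free parameters."""
    n = len(seq)
    counts = {c: seq.count(c) for c in "ACGT"}
    # D|H: encode each base under its fitted (maximum-likelihood) probability.
    d_given_h = -sum(k * log2(k / n) for k in counts.values() if k)
    h = 1.5 * log2(n)  # (3/2) * log2(N) bits to state the parameters
    return h + d_given_h

low = "AT" * 50  # a repetitive 100-base sequence
print(null_cost(low), zero_order_cost(low))
```

On the repetitive sequence the fitted zero-order model beats the null model (roughly 110 bits versus 200), because its D|H savings outweigh the cost H of stating its parameters.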


The demonstration program is a cut-down version. It implements forward approximate repeats only, and is limited to sequences of a few hundred characters because it is run on our web server for demonstration purposes only. The full program implements both forward and reverse complementary approximate repeats and can analyse sequences of hundreds of thousands of characters.

A later algorithm [CDA07] can run on complete genomes.

Low Information Content Sequence:

Here is a low-information-content sequence:


NB. It is assumed that the length of the sequence is "common knowledge", otherwise the length should also be included in the encoding, using your favourite prior for lengths.
The formula used to estimate the cost of the hypothesis (H) and its parameters assumes that the sequence is much longer than the number of parameters (12 for a first-order MM over DNA).
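The 12-parameter first-order model can be costed in the same simplified style. Again this is a sketch, not the program's exact formulae: the parameter cost is approximated as (k/2)·log2(N) bits with k = 12 (four contexts, three free probabilities each), and the first base, which has no context, is charged 2 bits.

```python
from math import log2

def first_order_cost(seq):
    """Total MML-style cost, H + (D|H), under a first-order Markov model.
    k = 12 free parameters: 4 contexts x 3 free probabilities each."""
    n = len(seq)
    # Count transitions a -> b over adjacent pairs.
    trans = {a: {b: 0 for b in "ACGT"} for a in "ACGT"}
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    d_given_h = 2.0  # first base has no context: 2 bits
    for a in "ACGT":
        total = sum(trans[a].values())
        for b, k in trans[a].items():
            if k:
                d_given_h += -k * log2(k / total)
    h = 6.0 * log2(n)  # (12/2) * log2(N) bits for the parameters
    return h + d_given_h

print(first_order_cost("AT" * 50))
```

On the periodic sequence "ATAT…" the first-order model is very cheap: every transition is certain, so D|H is only 2 bits and almost the entire cost is H, the price of stating the 12 parameters.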

Random Sequence:

On the other hand, this is a sequence of 100 bases, generated uniformly and independently at random:


The more complex models find some chance patterns in the sequence and give a figure of less than 2.0 bits per character for (D|H), but when the models' complexities are included, their totals exceed 2.0 bits per character and they are rejected as hypotheses: the sequence is seen to be uniformly random after all.
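This effect can be demonstrated with the same simplified zero-order costing used above (an illustrative approximation, not the program's formulae), on a pseudo-randomly generated 100-base sequence:

```python
import random
from math import log2

random.seed(1)  # fixed seed so the demonstration is repeatable
seq = "".join(random.choice("ACGT") for _ in range(100))

n = len(seq)
counts = {c: seq.count(c) for c in "ACGT"}
# Fitted zero-order model: D|H is at most 2 bits/base, usually slightly less,
# because the model exploits chance deviations from uniform composition.
d_given_h = -sum(k * log2(k / n) for k in counts.values() if k)
h = 1.5 * log2(n)  # cost of stating the 3 free parameters

print(f"null model : {2.0 * n:.2f} bits")
print(f"zero-order : H = {h:.2f} + D|H = {d_given_h:.2f} -> {h + d_given_h:.2f} bits")
```

D|H alone dips below 2.0 bits/base, but adding the parameter cost H typically pushes the total back above the null model's 200 bits, so the fitted model is rejected.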

Long Sequences

The full algorithm (see ref. above) also implements reverse-complementary repeats and, thanks to certain speed-up techniques, can be run on sequences containing tens or hundreds of thousands of bases (below).

[Figure: information content per nucleotide along HUMHBB by the HMM / PFSA]


A later [algorithm] can be run on complete genomes.

Also See:

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP


© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)", charset=iso-8859-1.