## Alignment of Low to Medium Information Content Sequences

 home1 home2  Bib  Algorithms  Bioinfo  FP  Logic  MML  Prog.Lang and the  Book

Bioinformatics
Compression
+alignment

Also see
#Alignment
software
 ``` A C G T .-------------------- P(S[i]|S[i-1]) A| 1/12 1/12 1/12 9/12 | C| 9/20 1/20 1/20 9/10 | G| 9/20 1/20 1/20 9/10 | T| 9/12 1/12 1/12 1/12 ``` MMg: an AT-rich, order-1 Markov model.

The two sequences initially in the `FORM` below are generated from the AT-rich model MMg (right). They are completely unrelated, apart from both being generated from MMg, and thus sharing some general statistical biases.

If the sequences are aligned under the usual uniform model (try `Run'), i.e. under the assumption that all characters are equally likely then you will conclude that they are related by an acceptable optimal alignment, but the high number of matches is only due to their both coming from MMg. If they are aligned under the correct model (here, MMg) or under a model that can be fitted to their biases (e.g. an order-1 Markov model), then it is revealed that they are not related by an acceptable alignment.

You can replace the sequences with others of similar length and have them aligned. The most interesting behaviour comes with sequences that are distantly related and of medium or low information content. In such cases, aligning with knowledge of the true model (of the sequences) or at least with knowledge of the model class, gives more reliable results.

Demonstration program requires 2 seqs. separated by at least one blank line:
S1:

S2:

[precomputed]

NB. It is assumed that the lengths of the sequences are "common knowledge", otherwise the lengths should also be included in the encodings, using your favourite prior for a pair of lengths.
The formula used to estimate the cost of a Hypothesis (H) and its parameters assumes that any sequence is much longer than the number of parameters (12 for an order-1 MM over DNA).
This demonstration is based on the simple point-mutation model of sequence mutation; it is inappropriate to use it if the sequence lengths differ greatly, for example. But note that a similar treatment can be given to linear (affine) gap-costs, piecewise linear gap costs, global alignment, local alignment, optimal aligment, summed alignment, etc. [Powell et al 2004].

### Limits

The web-demonstration above is limited to sequences of up to a couple of hundred characters, this is only because it is run on our web-server.

### Models

The new alignment method can in general incorporate almost any model of a "sequence" -- see Allison et al (1998, 1999), Powell et al (2004) for details. It is not limited to (hidden) Markov models (HMM).

### References

 Coding Ockham's Razor, L. Allison, Springer A Practical Introduction to Denotational Semantics, L. Allison, CUP

 Linux  Ubuntu free op. sys. OpenOffice free office suite The GIMP ~ free photoshop Firefox web browser

 © L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated), Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.) Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Thursday, 02-Feb-2023 07:05:26 AEDT.