Bioinformatics: Computing for Molecular Biology

Contents of this page:

Compression of Molecular Sequences.

The complexity (information content, entropy, compressibility, ...) of DNA and other biological sequences is of interest for a variety of reasons: Repetitive subsequences (poly-A, (AT)*, etc.) have low information content; they may be of interest in their own right or may be given a low weighting in sequence matches if they can be identified. Informally, two extreme kinds of sequence are "boring", those with near zero information content, e.g. AAAAAAA...., and those with maximum information content, i.e. random sequences, as "typed" by the proverbial monkey. Generally, interesting sequences have some structure or pattern and lie somewhere between these extremes. The Alu sequences form a family of sequences, each about 300 bases long, that occur thousands of time in human DNA. The typical Alu is about 87% similar to the concensus Alu therefore the information content of an Alu is much less than 600 bits. Gene duplication, regulatory signals, codon bias, etc. all reduce the information content below 2 bits/base and may be of biological interest.

See also the alignment of compressible sequences [below], Comp. Jrnl (1999), etc..

Alignment of Two Strings.

Probabilistic or Statistical Alignment Density Plot from Probabilistic Finite State Automata PFSA or Pair Hidden Markov Model PHMM HMM

One of the products of this work is the alignment density plot (right). This is a good way to visualize the collection of all possible alignments of two strings and the grey-scale level indicates the degree of (un)certainty as to which is the correct or "true" alignment.

The plot from the sequence alignment algorithm shows the degree of certainty and uncertainty in various sequence alignments, see:

Alignment Probabilistic Finite-State-Machine (PFSM), Automaton (PFSA), or (later) Pair Hidden Markov Models PHMM

The 1, 3 and 5-state (etc.) models are models of alignment, evolution or mutation having increasing complexity.

3-state mutation machine (right) from the MML computational biology sequence alignment program corresponds to affine or linear gap costs as commonly used in aligning DNA sequences, see:

Low- (or moderate-) information content (non-random, repetitive, compressible,...) sequences cause problems in alignment algorithms and in database searching, causing false-positive matches, for example. The information content of the sequences can be taken into account giving better results; see:

Minimum message length [MML] is used to compare sequences objectively. MML also gives a natural null-theory and a significance-test for hypotheses. The models are based on finite-state machines (~hidden Markov models). The theory of multi-state distributions is used to calculate the accuracy of the parameter estimates.


Alignment of Three Strings.

Note that each internal node in an unrooted binary tree has three neighbours so that a good heuristic for k-way alignment, given an evolutionary-tree, is to do repeated 3-way alignment, inferring the strings at internal nodes in the process.

Multiple Alignment and Evolutionary Trees.

Restriction Site Mapping.

See also Trevor Dix, Darren Platt and C. N. Yee.

Protein Structure.

snake toy

Rubik's snake (tm) (right) makes a good digital model for protein folding and is also fun. It is made of 24 triangular prisms which meet on their square faces. Two meeting prisms can rotate relative to each other about the centre of the meeting face so as to vary the "dihedral" angle. The snake can be packed in a ball as a "globular" protein and can form "helices" and "extended" conformations.

Minimum Message Length Encoding.

Many of the above projects make use of a method of inductive inference known as Minimum Message Length [MML] encoding. MML is a realisation of Occam's razor. It is invaluable for dealing with error and uncertainty in data and for comparing models that have different complexities. For example, an alignment-model with gap-penalties is more complex (has more degrees of freedom) than one without gap-penalties. The former will in general fit the data strings better than the latter - but is the improvement enough to "pay for" the extra complexity of the model? MML can judge this trade-off fairly. (Note that maximum-likelihood ignores the complexity of the model which can result in the well known phenomenon of over-fitting.)

Hidden Markov Models

(Hidden) Markov Models (HMM) are often called probabilistic finite state automata (PFSA), or probabilistic finite state machines (PFSM), in computing because of their close relationship to finite state automata, as used in formal language theory. Whatever they are called, they are useful in pattern & structure discovery, compression, alignment of pair of sequences, etc.. [more]


Compression and Analysis of Biological Sequences. Why and How [2007], Modelling Alignment, Bioinformatics at CSSE [9/2001], Sequence Complexity.


Dr. , Dr. Trevor Dix, Prof. C. S. Wallace.