Bioinformatics: Computing for Molecular Biology
Contents of this page: 

Compression of Molecular Sequences.
The complexity
(information content, entropy, compressibility, ...
)
of DNA and other biological sequences is
of interest for a variety of reasons:
Repetitive subsequences (polyA, (AT)^{*}, etc.)
have low information content;
they may be of interest in their own right or may be given
a low weighting in sequence matches if they can be identified.
Informally, two extreme kinds of sequence are "boring",
those with near zero information content, e.g. AAAAAAA....
,
and those with maximum information content, i.e. random sequences,
as "typed" by the proverbial monkey.
Generally, interesting sequences have some structure or pattern
and lie somewhere between these extremes.
The Alu sequences form a family of sequences, each about 300 bases
long, that occur thousands of time in human DNA.
The typical Alu is about 87% similar to the concensus Alu therefore
the information content of an Alu is much less than 600 bits.
Gene duplication, regulatory signals, codon bias, etc.
all reduce the information content below 2 bits/base
and may be of biological interest.
D.R. Powell, D.L. Dowe, L. Allison and T.I. Dix. Discovering Simple DNA Sequences by Compression. Pacific Symposium on Biocomputing 3 (PSB98), pp.595606, 1998.
L. Allison, T. Edgoose & T. I. Dix, Compression of Strings with Approximate Repeats, Intelligent Systems in Molecular Biology (ISMB'98), pp.816, Montreal, 28 June  1 July 1998, [HTML] [pdf].
AED's Approximate Repeats Model (ARM) for DNA sequences allows for approximate forward and reversecomplementary repeats and achieves good sensitivity at pattern discovery.L. Allison, L. Stern, T. Edgoose & T. I. Dix. Sequence Complexity for Biological Sequence Analysis. Computers and Chemistry 24(1), pp.4355, Jan' 2000.

L. Stern, L. Allison, R. L. Coppel, & T. I. Dix. Discovering Patterns in Plasmodium falciparum Genomic DNA. Molecular and Biochemical Parasitology, 118(2), pp.175186, Dec' 2001.

Minh Duc Cao, Trevor I. Dix, Lloyd Allison & Chris Mears. A Simple Statistical Algorithm for Biological Sequence Compression. IEEE Data Compression Conference (DCC), 2007.
The expert model (XM) gives good compression of DNA and also protein sequences, and it is fast. 
T. I. Dix, D. R. Powell, L. Allison, J. Bernal, S. Jaeger & L. Stern. Comparative Analysis of Long DNA Sequences by Per Element Information Content Using Different Contexts. BMC Bioinformatics, 8(Suppl.2):S10, May 2007.
Informationcontent plots are compact and show what new information a given context tells about a sequence, and where.
See also the alignment of compressible sequences [below], Comp. Jrnl (1999), etc..
Alignment of Two Strings.
Personal publications on sequence alignment algorithms.
TR90/148 Inductive Inference over MacroMolecules (HTML) (or postscript)
Source code for 1, 3 and 5state alignment, both optimal alignment and rtheory.

One of the products of this work is the alignment density plot (right). This is a good way to visualize the collection of all possible alignments of two strings and the greyscale level indicates the degree of (un)certainty as to which is the correct or "true" alignment.
The plot from the sequence alignment algorithm shows the degree of certainty and uncertainty in various sequence alignments, see:
L. Allison, C. S. Wallace & C. N. Yee. When is a String Like a String? Artificial Intelligence in Mathematics (AIM), Ft. Lauderdale, Florida, January 1990 [HTML].
L.Allison, C.S.Wallace & C.N.Yee. Inductive Inference over MacroMolecules, Technical Report 90/148 [HTML] or....
L.Allison, C.S.Wallace & C.N.Yee, FiniteState Models in the Alignment of MacroMolecules, J.Molec.Evol. 35(1) 7789, 1992.
C. N. Yee & L. Allison. Reconstruction of Strings Past, Comp. Appl. in BioSciences (CABIOS) 9(1) pp.17, 1993, [HTML].
By summing the probability over all alignments even very distant relationships can be detected. By averaging parameter estimates over all alignments, rather than just the optimal alignment, unbiased estimates of parameter values are obtained. This is demonstrated for the 1state (simple) and 3state (linear, affine gap costs) models.

The 1, 3 and 5state (etc.) models are models of alignment, evolution or mutation having increasing complexity.
3state mutation machine (right) from the MML computational biology sequence alignment program corresponds to affine or linear gap costs as commonly used in aligning DNA sequences, see:
L.Allison, Normalization of Affine Gap Costs Used in Optimal Sequence Alignment, J.Theor.Biol. 161 263269, 1993 [www inc' pdf paper].
You can interpret your favourite integer penalties or scores as probabilities or you can choose integer penalties (scores) that approximate probabilities that you believe are appropriate.
Low (or moderate) information content (nonrandom, repetitive, compressible,...) sequences cause problems in alignment algorithms and in database searching, causing falsepositive matches, for example. The information content of the sequences can be taken into account giving better results; see:
D. R. Powell, L. Allison, T. I. Dix and D. L. Dowe. Alignment of low information sequences. Proceedings of the Fourth Australasian Theory Symposium [CATS98] pp.215229, February 1998.
L. Allison. InformationTheoretic Sequence Alignment (HTML), TR98/14 School of Comp. Sci. & SWE, Monash University, June 1998,  on the alignment of nonrandom (compressible, repetitive, lowinformation content) sequences.
L. Allison, D. Powell & T. I. Dix. Compression and Approximate Matching, Computer Journal, 42(1), pp.110, 1999 [www inc' pdf paper].
L. Allison, D. Powell and T. I. Dix. Modelling Is More Versatile Than Shuffling. [TR 2000/98] (HTML), 2000.
Fewer false positives, fewer false negatives, can (and should) change rankorder of alignments, applicable to more general models than the "shuffling" correction for nonuniform populations of sequences, identifies correct population model(s).Seminars: (DNA) Sequence Complexity and Alignment, to the Walter and Eliza Hall (WEHI) Bioinformatics Group, 13 Feb. 2001, on the relationship between sequence alignment, compression & modelling of biological sequences, and alignment of nonrandom sequences.
Modelling v. Shuffling, to the Bioinformatics Group & Dept. Computer Science [UWA], 11 Aug. 2004, on PRSS, masking, shuffling, Malignment, low to medium information content sequences, and alignment accuracy, significance & ranking.
D. R. Powell, L. Allison, T. I. Dix. ModellingAlignment for NonRandom Sequences. 17th ACS Australian Joint Conf. on Artificial Intelligence (AI2004), pp.203214, 610 Dec 2004, SpringerVerlag, LNCS/LNAI 3339, isbn:3540240594.
Comparison with PRSS from FASTA package. Link leads to software and paper. 
Variations on Sequence Alignment (2007), dynamic programming and related algorithms for two or more sequences.
Minimum message length [MML]
is used to compare sequences objectively.
MML also gives a natural nulltheory and a significancetest for hypotheses.
The models are based on finitestate machines
(~hidden Markov models).
The theory of multistate distributions is used to calculate
the accuracy of the parameter estimates.
Algorithms.
L. Allison & T. I. Dix. A bitstring longest common subsequence algorithm, Inf. Proc. Lett., 23(6), pp.305310, 1986 [paper] [code].
Uses bitvector operations to get a speedup roughly equal to the wordlength of the computer.L. Allison, Lazy dynamic programming can be eager. Inf. Proc. Lett., 43(4), pp.207212, Sept' 1992 [paper]
When written in a lazy functional programming language, the simple, O(n^{2})time dynamic programming algorithm (DPA) can be made to run in just O(n.d)time by adjusting the central comparison test.L. Allison. Using Hirschberg's algorithm to generate random alignments. Inf. Proc. Lett., 51 pp.251255, 1994.
Allows alignments of two strings to be sampled at random from their posterior probability distribution. This gives a MonteCarlo method for estimating evolutionary parameters and, when cooled, gives a simulated annealing method for optimal alignment. It has been applied to multiple alignment, see Allison & Wallace Jrnl. Molec. Evol., 39, pp.418430, 1994 under multiple alignment.D. R. Powell, L. Allison & T. I. Dix. A versatile divide and conquer technique for optimal string alignment. Inf. Proc Lett., 70(3), pp.127139, 1999 [www inc' pdf document]
Common string alignment algorithms such as the basic dynamic programming algorithm (DPA) and the time efficient Ukkonen algorithm use quadratic space to determine an alignment between two strings. In this paper we present a technique that can be applied to these algorithms to obtain an alignment using only linear space, while having little or no effect on the time complexity. This new technique has several advantages over previous methods for determining alignments in linear space, such as: simplicity, the ability to use essentially the same technique when using different cost functions, and the practical advantage of easily being able to trade available memory for running time.L. Allison. Longest biased interval and longest nonnegative sum interval. Bioinformatics, 19(10), pp.12941295, 1 July 2003.
Finds the longest interval of a sequence that has at least a specified minimum bias (e.g. 80%) towards certain specified characters (bases, amino acids, e.g. AT). It can process very long sequences quickly.L. Allison. Finding approximate palindromes quickly and simply. TR 2004/162, School of Computer Science & Software Eng., Monash U., 2004.
Alignment of Three Strings.
Fast 3way alignment procedure: O(n.d^{2}) time at worst, O(n+d^{3}) on average, where the strings' lengths are ~n and d is the 3way editdistance.
L. Allison. A fast algorithm for the optimal alignment of three strings. J. Theor. Biol. 164(2) pp.261269, 1993 [www inc' pdf paper].
Simple mutation costs; for affine gap costs see Powell et al 2000D. R. Powell, L. Allison & T. I. Dix. Fast, optimal alignment of three sequences using linear gap costs. J. Theor. Biol. 207(3) pp.325336 Dec 2000 [www inc' pdf paper].
Gap (indel) costs of the form a+b*k for a gap of length k, where a and b are small integers. The algorithm runs in O(n+d^{3}) time on average, where the string lengths are ~n and d is the 3way edit distance. i.e. It is fast when the strings are similar, and the edit distance, not string length, is the main influence on running time.
Note that each internal node in an unrooted binary tree has three neighbours so that a good heuristic for kway alignment, given an evolutionarytree, is to do repeated 3way alignment, inferring the strings at internal nodes in the process.
Multiple Alignment and Evolutionary Trees.
Personal publications in multiple alignment.
Technical Report 91/155 Minimum Message Length Encoding, Evolutionary Trees and Multiple Alignment.
L. Allison, C. S. Wallace and C. N. Yee. Minimum message length encoding, evolutionary trees and multiple alignment. Hawaii Int. Conf. Sys. Sci., HICCS25, v1, pp.663674, Jan 1992 [paper].
L.Allison & C. S. Wallace. The Posterior Probability Distribution of Alignments and its Application to Estimation of Evolutionary Trees and Optimisation of Multiple Alignments, Technical Report 93/188, July 1993.
L. Allison & C. S. Wallace. An Information Measure for the String to String Correction Problem with Applications, 17th Annual Annual Comp. Sci. Conf., ACSC17, Christchurch, New Zealand, and Australian Computer Science Communications 16(1C), pp.659668, Jan' 1994 [paper (HTML)]
L. Allison and C. S. Wallace. The Posterior Probability Distribution of Alignments and its Application to Estimation of Evolutionary Trees and Optimisation of Multiple Alignments. Jrnl. Molec. Evol. 39 pp.418430, 1994.
Minimum message length [MML] is used to determine the accuracy with which branchlengths in an evolutionary tree can be estimated  it is rather low in general. Simulated annealing is used as a way out of the localoptima problem for a practical multiplealignment algorithm. Gibbssampling is used as a practical way of estimating edge lengths, and standard deviations of estimates give another indication of reliable accuracy.Source code of a stochastic multiple alignment computer program [6/'95].
This does iterated, stochastic 3way alignment at each internal node of the tree to sample ancestral strings. This strategy should be straightforward to adapt to sophisticated mutation models. There is also good potential for speed up on a parallel computer or on a network of workstations.Technical Report 95/225 postscript (50K) Towards Modelling Evolution = Mutation modulo Selection in Sequence Alignment.
Biological evolution takes place through the mutation of DNA modulated by the pressure of selection which determines which changes are viable. A proposal for modelling both of these factors in the multiplealignment problem is presented. It is argued that treebased alignment methods primarily model mutations events, and that most nontree methods (e.g. profiles) model selection pressures on a family of sequences. A treebased method specifies mutation probabilities and, if used in isolation, the hope must be that biases due to selection are small or average out. A nontree method can be used to model the family of sequences  which sequences are fit and viable, and which are nonviable, as members of the family. It is suggested that the two approaches can be combined: the latter can be used to modify the mutation probabilities of the former  to model which mutations are viable and are observed.
Restriction Site Mapping.
 Personal publications in mapping.
 L. Allison & C. N. Yee. Restriction Site Mapping is in Separation Theory. CABIOS 4(1), pp.97101, Jan. 1988 (now Jrnl Bioinformatics).
See also Trevor Dix, Darren Platt and C. N. Yee.
Protein Structure.
Personal publications in protein structure (short).
Technical Report 92/163 A Decision Graph Explanation of Protein Secondary Structure Prediction. [HTML]
D. L. Dowe, J. Oliver, T. I. Dix, L. Allison & C. S. Wallace A decision graph explanation of protein secondary structure prediction. 26th Hawaii Int. Conf. Sys. Sci. Vol.1 pp.669678 Jan' 1993 [HTML]
D. L. Dowe, L. Allison, T. I. Dix, L. Hunter, C. S. Wallace & T. Edgoose Circular clustering of protein dihedral angles by minimum message length. Pacific Symposium on Biocomputing '96 (PSB96), World Scientific, pp.242255, Jan' 1996 [paper (ps)]
T. Edgoose, L. Allison & D. L. Dowe. An MML classification of protein structure that knows about angles and sequence. Pacific Symposium on Biocomputing '98 (PSB98), pp.585596, Jan' 1998 [paper (pdf)]
Rubik's snake (tm) (right) makes a good digital model
for protein folding and is also fun. It is made of 24 triangular
prisms which meet on their square faces. Two meeting
prisms can rotate relative to each other about the centre
of the meeting face so as to vary the "dihedral" angle.
The snake can be packed in a ball as a "globular" protein
and can form "helices" and "extended"
conformations.
Minimum Message Length Encoding.
Many of the above projects make use of a method of inductive inference known as Minimum Message Length [MML] encoding. MML is a realisation of Occam's razor. It is invaluable for dealing with error and uncertainty in data and for comparing models that have different complexities. For example, an alignmentmodel with gappenalties is more complex (has more degrees of freedom) than one without gappenalties. The former will in general fit the data strings better than the latter  but is the improvement enough to "pay for" the extra complexity of the model? MML can judge this tradeoff fairly. (Note that maximumlikelihood ignores the complexity of the model which can result in the well known phenomenon of overfitting.)
Hidden Markov Models
(Hidden) Markov Models (HMM) are often called probabilistic finite state automata (PFSA), or probabilistic finite state machines (PFSM), in computing because of their close relationship to finite state automata, as used in formal language theory. Whatever they are called, they are useful in pattern & structure discovery, compression, alignment of pair of sequences, etc.. [more]
Misc'
Compression and Analysis of Biological Sequences. Why and How [2007], Modelling Alignment, Bioinformatics at CSSE [9/2001], Sequence Complexity.