
Abstract:
A new statistical model for DNA considers a sequence to be a mixture of
regions with little structure and regions that are approximate repeats of
other subsequences, i.e. instances of repeats do not need to match
each other exactly. Both forward and reverse-complementary repeats
are allowed. The model has a small number of parameters which are
fitted to the data. In general there are many explanations for a
given sequence, and it is shown how to compute the total probability
of the data given the model. Computer algorithms are described
for these tasks. The model can be used to compute the
information content of a sequence, either in total or base by base.
This amounts to looking at sequences from a data-compression point of
view, and it is argued that this is a good way to tackle
intelligent sequence analysis in general.
Keywords:
Algorithm; DNA; Complexity; Entropy; Pattern discovery; Sequence analysis
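
To make the notion of base-by-base information content concrete, the following is a minimal sketch, not the repeat model summarised in the abstract. It assumes a much simpler adaptive order-0 model with add-one smoothing as a baseline, and charges each base -log2 of its predicted probability. The function names `reverse_complement` and `per_base_information` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

# Watson-Crick complement of each DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def per_base_information(seq, alphabet="ACGT"):
    """Return the cost in bits of each base under a baseline model.

    Assumption: an adaptive order-0 model with add-one smoothing, i.e.
    each base is predicted from the counts of the bases seen so far.
    This is a stand-in baseline, not the approximate-repeats model.
    """
    counts = Counter({base: 1 for base in alphabet})  # add-one prior
    total = len(alphabet)
    bits = []
    for base in seq:
        probability = counts[base] / total
        bits.append(-math.log2(probability))  # information content in bits
        counts[base] += 1
        total += 1
    return bits


if __name__ == "__main__":
    sequence = "AAAAAAAACGT"
    costs = per_base_information(sequence)
    print(reverse_complement(sequence))
    print(["%.2f" % c for c in costs])
    print("total bits: %.2f" % sum(costs))
```

Under this baseline, a run of identical bases becomes progressively cheaper to encode, while an unexpected base costs more than 2 bits. A model that also recognises approximate repeats would be expected to assign lower costs inside repeated regions, which is what makes the base-by-base profile informative.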

