Search engine run on: http://users.monash.edu.au/
Glookbib search for: MolBio LAllison
%A S. Rajapaksa
%A L. Allison
%A P. J. Stuckey
%A M. Garcia de la Banda
%A A. S. Konagurthu
%T The divergence time of protein structures modelled by Markov matrices and
its relation to the divergence of sequences
%J arXiv
%M AUG
%D 2023
%K TR, c2023, c202x, c20xx, zz0723, MolBio, protein, sequence alignment,
similarity, time, evolution, related, LAllison, ArunK, Sandun,
midnight, twilight zone
%X "A complete time-parameterized statistical model quantifying the divergent
evolution of protein structures in terms of the patterns of conservation of
their secondary structures is inferred from a large collection of protein
3D structure alignments. This provides a better alternative to
time-parameterized sequence-based models of protein relatedness, that have
clear limitations dealing with twilight and midnight zones of sequence
relationships ..."
-- [arxiv:2308.06292]['23].
%A P. R. Amarasinghe
%A L. Allison
%A P. J Stuckey
%A M. Garcia de la Banda
%A A. M. Lesk
%A A. S. Konagurthu
%T Getting 'φψχal' with proteins: minimum message length inference
of joint distributions of backbone and sidechain dihedral angles
%J Bioinformatics
%V 39
%N s.1
%P i357-i367
%M JUN
%D 2023
%O ISMB/ECCB 23-27 July 2023, Lyon France
%K jrnl, MolBio, Piyumi, c2023, c202x, c20xx, zz0723, phipsichial, ISMB, ECCB,
protein structure, angle, mixture, von Mises, vonMises, Dunbrack,
rotamer library, backbone, side chain, sidechain, statistical model,
MML, mdl, LAllison, AMLesk, ArunK
%X "... We model the joint distribution of the observed mainchain & sidechain
dihedral angles ... by a mixture of a product of von Mises probability
distributions. ..."
-- [doi:10.1093/bioinformatics/btad251]['23].
Also see AK & [LCB][2023].
[Also search for: MolBio MML].
%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T Bridging the Gaps in Statistical Models of Protein Alignment
%J Bioinformatics
%V 38
%N s.1
%P i229–i237
%M JUL
%D 2022
%O ISMB July 2022, Madison, USA
%K conf, ISMB, MolBio, c2022, c202x, c20xx, zz0722, LAllison, ArunK, Dinithi,
sequence alignment, gap, indel, model, substitution, PAM, BLOSUM, MML, MDL
%X "... To overcome this gap, this article demonstrates how a complete stat.
model quantifying the evolution of pairs of aligned proteins can be
constructed using a time-parameterized substitution matrix &
a time-parameterized alignment state machine. Methods to derive all
parameters of such a model from any benchmark collection of aligned protein
seqs. are described here. This has not only allowed us to generate a unified
stat.model for each of the nine widely used subst.matrices (PAM, JTT, BLOSUM,
JO, WAG, VTML, LG, MIQS & PFASUM), but also resulted in a new unified model,
MMLSUM. Our underlying methodology measures the Shannon info. content using
each model to explain losslessly any given collection of alignments, which
has allowed us to quantify the performance of all the above models on six
comprehensive alignment benchmarks. Our results show that MMLSUM results in a
new & clear overall best performance ..."
-- [doi:10.1093/bioinformatics/btac246]['22],
& 2010.00855@[arXiv]['20].
%A S. Rajapaksa
%A D. Sumanaweera
%A A. M. Lesk
%A L. Allison
%A P. Stuckey
%A M. Garcia de la Banda
%A P. Stuckey
%A D. Abramson
%A A. S. Konagurthu
%T On the Reliability and Limits of Protein Sequence Alignments
%J Bioinformatics
%V 38
%N s.1
%P i255–i263
%M JUL
%D 2022
%O ISMB July 2022, Madison, USA
%K conf, ISMB, MolBio, c2022, c202x, c20xx, zz0722, sequence alignment,
protein, structure, sequence, proximity, twilight, midnight, zone,
AMLesk, LAllison, ArunK, Sandun
%X "... Using techniques not prev. applied to these questions, by weighting
every possible seq. alignment by its posterior prob. we derive a formal
math. expectation, & develop an efficient alg. for computation of the
distance between alternative alignments ... By analyzing the seqs. &
structures of 1 million protein domain pairs, we report the variation of the
expected distance between seq.-based & structure-based alignments, as a fn
of (Markov time of) seq. divergence. Our results clearly demarcate the
'daylight', 'twilight' & 'midnight' zones for interpreting residue-residue
correspondences from seq. information alone."
-- [doi:10.1093/bioinformatics/btac247]['22].
%A S. Rajapaksa
%A D. Sumanaweera
%A M. Garcia de la Banda
%A P. Stuckey
%A D. Abramson
%A L. Allison
%A A. Lesk
%A A. Konagurthu
%Y On identifying statistical redundancy at the level of amino acid subsequences
%J IEEE Int. Conf. on Bioinformatics and Biomedicine (BIBM)
%W Houston, USA
%M DEC
%D 2021
%K conf, BIBM, c2021, c202x, c20xx, zz0122, MolBio, LAllison, AMLesk, ArunK,
protein, information content, stats, compress, compression, library,
fragments, sequence, subsequence, structure, indels, source code
%X "... presents a framework to characterize & identify local sequences of
proteins that are statistically redundant under the measure of Shannon
information content while accounting for variations in their occurrences over
evolutionary insertions, deletions, & substitutions of amino acids. The
identification of such local seqs. provides insights for downstream studies
on proteins. Here, we have applied our methods to amino acid seq. data sets
derived from a database corr.to 935,552 substructural regions of varying
sizes, covering 113,724 proteins from the protein data bank. The results
identify, among others, a surjective mapping between 110,598 local seqs.
(with an avg. length of 82 AAs/seq.) & 1,493 topological shapes. ..."
-- [doi:10.1109/BIBM52615.2021.9669282]['22],
[more].
Also see supplementary@[lcb]['22].
[Also search for: MolBio protein compression].
%A A. S Konagurthu
%A R. Subramanian
%A L. Allison
%A D. Abramson
%A P. J. Stuckey
%A M. Garcia de la Banda
%A A. M. Lesk
%T Universal architectural concepts underlying protein folding patterns
%J Front. Mol. Biosci.,
A Journey Through 50 Years of Structural Bioinformatics in
Memoriam of Cyrus Chothia
%P ?-?
%M APR
%D 2021
%K jrnl, FMB, MolBio, bioinformatics, c2021, c202x, c20xx, zz0121,
procodic, prosodic, 3D protein structure, fold, folds, tableau, motif,
motifs, pattern, model, concept, structural, terms, fragment, supersecondary,
local, dictionary, recurring, library, MML, II, ArunK, LAllison, AMLesk
%X "What is the architectural 'basis set' of the observed universe of protein
structures? Using information-theoretic inference, we answer this Q. with a
comprehensive dictionary of 1,493 substructures - called concepts - at a
sub-domain level, based on an unbiased subset of known protein structs. ...
An interactive site, PROCODIC, at ...[prosodic] provides access to
& navigation of the entire dictionary of concepts, & all assoc. info.."
-- [doi: 10.3389/fmolb.2020.612920]['21],
[more].
%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T Bridging the gaps in statistical models of protein alignment
%J arXiv
%M OCT
%D 2020
%K TR, MolBio, c2020, c202x, c20xx, zz1020, MML, protein alignment, sequence,
DPA, stats, model, gap, indel, indels, substitution, scoring, matrix,
proteins, PAM, BLOSUM, PFASUM, MMLSUM, HMM, minimum message length, MDL,
information, Dinithi, LAllison, ArunK
%X "... demonstrates how a complete statistical model quantifying the evolution
of pairs of aligned proteins can be constructed from a time-parameterised
substitution matrix and a time-parameterised 3-state alignment machine. All
parameters of such a model can be inferred from any benchmark data-set of
aligned protein seqs.. This allows us to examine nine well-known sub.matrices
on six benchmarks curated using various structural alignment methods; any
matrix that does not explicitly model a 'time'-dependent Markov process is
converted to a corr. base-matrix that does. [&] a new optimal matrix is
inferred for each of the six benchmarks. Using Minimum Message Length (MML)
inference, all 15 matrices are compared in terms of measuring the Shannon
information content of each benchmark. This has resulted in a new & clear
overall best performed time-dependent Markov matrix, MMLSUM, & its assoc.
3-state m/c, whose properties we have analysed in this work. For std use, the
MMLSUM series of (log-odds) scoring matrices derived from the above Markov
matrix, are available at lcb.infotech.monash.edu.au/mmlsum "
-- 2010.00855@[arXiv]['20].
%A A. M. Lesk
%A A. S. Konagurthu
%A L. Allison
%A M. Garcia de la Banda
%A P. J. Stuckey
%A D. Abramson
%T Computer modelling of a potential agent against SARS-Cov-2 (COVID-19)
protease
%J Proteins: Structure, Function and Bioinformatics
%V 88
%N 12
%P 1557-1558
%M JUL
%D 2020
%K jrnl, MolBio, c2020, c202x, c20xx, zz0720, SARS-CoV-2, SARSCoV2, protease,
protein, Mpro, 3CLpro, nsp5, covid19, covalent, inSilico, ligand,
inhibitor, Cys145, AMLesk, LAllison, ArunK
%X "We have modelled modifications of a known ligand to the SARS‐CoV‐2
(COVID‐19) protease, that can form a covalent adduct, plus additional
ligand‐protein hydrogen bonds."
-- [doi:10.1002/prot.25980]['20] (online 14/7/2020).
[Also search for: SARSCoV2 protease].
%A D. Sumanaweera
%A L. Allison
%A A. Konagurthu
%T Statistical compression of protein sequences and inference of marginal
probability landscapes over competing alignments using finite state models
and Dirichlet priors
%J Bioinformatics
%V 35
%N 14
%P i360–i369
%M JUL
%D 2019
%O ISMB/ECCB, Basel, .ch, July 2019
%K jrnl, conf, MolBio, c2019, c201x, c20xx, zz0719, LAllison, ArunK,
bioinformatics, sequence, alignment, landscape, DPA, homology, protein,
proteins, probability, probabilistic, Bayesian, information, MML, MDL,
ISMB, ECCB
%X "The information criterion of minimum message length (MML) provides a
powerful statistical framework for inductive reasoning from observed data. We
apply MML to the problem of protein sequence comparison using finite state
models with Dirichlet distributions. The resulting framework allows us to
supersede the ad hoc cost functions commonly used in the field, by
systematically addressing the problem of arbitrariness in alignment
parameters, and the disconnect between substitution scores and gap costs.
Furthermore, our framework enables the generation of marginal probability
landscapes over all possible alignment hypotheses, with potential to
facilitate the users to simultaneously rationalize and assess competing
alignment relationships between protein sequences, beyond simply reporting a
single (best) alignment. We demonstrate the performance of our program on
benchmarks containing distantly related protein sequences."
-- [doi:10.1093/bioinformatics/btz368]['19].
Also see [protein].
%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T The bits between proteins
%J Data Compression Conference (DCC)
%I IEEE
%W Snowbird, Utah, USA
%M MAR
%D 2018
%K conf, DCC, DCC2018, MolBio, c2018, c201x, c20xx, zz0418, bioinformatics,
LAllison, ArunK, Dinithi, DCC, DCC2018, protein, sequence, alignment, DPA,
algorithm, evolutionary, model, minimum message length, MDL, MML, AIC,
homology, information, similarity
%X "Comparison of protein sequences via alignment is an important routine in
modern biological studies. Although the technologies for aligning proteins
are mature, the current state of the art continues to be plagued by many
shortcomings, chiefly due to the reliance on: (i) naive objective functions,
(ii) fixed substitution scores independent of the sequences being considered,
(iii) arbitrary choices for gap costs, and (iv) reporting, often, one
optimal alignment without a way to recognise other competing sequence
alignments. Here, we address these shortcomings by applying the
compression-based Minimum Message Length (MML) inference framework to the
protein sequence alignment problem. This grounds the problem in statistical
learning theory, handles directly the complexity-vs-fit trade-off without
ad hoc gap costs, allows unsupervised inference of all the statistical
parameters, and permits the visualization and exploration of competing
sequence alignment landscape."
-- [more],
[doi:10.1109/DCC.2018.00026]['18].
(Also see [protein].)
%A A. M. Lesk
%A R. Subramanian
%A L. Allison
%A D. Abramson
%A P. J. Stuckey
%A M. G. de la Banda
%A K. S. Konagurthu
%T Universal architectural concepts underlying protein folding patterns
%J bioRxiv
%N 480194
%M DEC
%D 2018
%K TR, MolBio, c2018, c201x, c20xx, zz1120, AMLesk, LAllison, ArunK, protein,
3D, structure, procodic, prosodic, fold, folding, concept, concepts,
motif, pattern, substructure
%X "What is the architectural 'basis set' of the observed universe of protein
structures? Using information-theoretic inference, we answer this question
with a comprehensive dictionary of 1,493 substructural concepts. Each concept
represents a topologically-conserved assembly of helices and strands that
make contact. Any p.structure can be dissected into instances of concepts
from this dictionary. We dissected the world-wide protein data bank ..."
-- 480194@[bioRxiv]['18].
%A R. Subramanian
%A L. Allison
%A P. J. Stuckey
%A M. Garcia de la Banda
%A D. Abramson
%A A. M. Lesk
%A A. S. Konagurthu
%T Statistical compression of protein folding patterns for inference of
recurrent substructural themes
%J Data Compression Conf. (DCC)
%I IEEE
%W Snowbird, Utah, USA
%P 340-349
%M APR
%D 2017
%K conf, DCC, MolBio, c2017, c201x, c20xx, zz0517, protein, tertiary, 3D,
super secondary, structure, motif, motifs, pattern, blocks, library,
discovery, description, aic, MDL, minimum message length, MML,
LAllison, AMLesk, ArunK
%X "Computational analyses of the growing corpus of three-dimensional (3D)
structures of proteins have revealed a limited set of recurrent substructural
themes, termed super-secondary structures. Knowledge of super-secondary
structures is important for the study of protein evolution and for the
modeling of proteins with unknown structures. Characterizing a comprehensive
dictionary of these super-secondary structures has been an unanswered
computational challenge in protein structural studies. This paper presents an
unsupervised method for learning such a comprehensive dictionary using the
statistical framework of lossless compression on a database comprised of
concise geometric representations of protein 3D folding patterns. The best
dictionary is defined as the one that yields the most compression of the
database. Here we describe the inference methodology and the statistical
models used to estimate the encoding lengths. An interactive website for this
dictionary is available ..."
-- [more]
& [doi:10.1109/DCC.2017.46]['17].
(Also see [protein].)
%A J. H. Collier
%A L. Allison
%A A. M. Lesk
%A P. J. Stuckey
%A M. Garcia de la Banda
%A A. S. Konagurthu
%T Statistical inference of protein structural alignments using information and
compression
%J J. Bioinformatics
%I OUP
%V 33
%N 1
%P 1005-1013
%M APR
%D 2017
%O bioRxiv, June 2016
%K jrnl, OUP, MolBio, c2017, c201x, c20xx, zz0317, protein, alignment,
tertiary structure, 3D, information, MML, MMLigner, software,
JHC, JHCollier, ArunK, LAllison, AMLesk, AIC, bic, mdl
%X "... present here a statistical framework for the precise inference of
structural alignments, built on the Bayesian and information-theoretic
principle of Minimum Message Length (MML). The quality of any alignment is
measured by its explanatory power—the amount of lossless compression achieved
to explain the protein coordinates using that alignment. ..."
-- [doi:10.1093/bioinformatics/btw757][2017] (online January 2017),
[bioRxiv][6/2016],
[more].
(Also see MMLigner@[LCB][2016].)
%A M. Duc Cao
%A L. Allison
%A T. I. Dix
%A M. Boden
%T Robust estimation of evolutionary distances with information theory
%J Mol. Biol. Evol.
%V 33
%N 5
%P 1349-1357
%M MAY
%D 2016
%K jrnl, MBE, MolBio, c2016, c201x, c20xx, zz0816, bioinformatics, evolutionary,
phylogenetic, distance, trees, sequences, information, maximum, parsimony,
DNA, likelihood, MML, MDL, AIC, Minh Duc Cao, LAllison, TIDix, MBoden
%X "Methods for measuring genetic dist. in phylogenetics are known to be
sensitive to the evol. model assumed. However, there is a lack of established
methodology to accommodate the trade-off between incorporating suff. biol.
reality & avoiding model overfitting. [&] as trad. methods measure dists.
based on the observed # of substitutions, they tend to underestimate dist.
between diverged seqs. due to backward & parallel subsns. Various techniques
were proposed to correct this, but they lack the robustness [v.] sequs. that
are distantly related & of unequal base freqs.. ... we present a novel
genetic dist. est. based on inf.theory that overcomes the above two hurdles.
Instead of examining the obs. # of subsns, [it] ests. genetic dist. using
Shannon's mutual information. This naturally provides an effective framework
for balancing model complexity & goodness of fit. Our dist. est. is shown to
be approx. linear to elapsed time & hence is less sensitive to the divergence
of seq. data & compositional biased seqs.. Using extensive simulation data,
we show that our method 1) consistently reconstructs more accurate phylogeny
topologies than existing methods, 2) is robust in extreme condns such as
diverged phylogenies, unequal base freqs. data, & heterogeneous mutation
patterns, & 3) scales well with large phylogenies." [online Feb 2016.]
-- [doi:10.1093/molbev/msw019]['16].
[Also search for: MolBio evolutionary MML].
%A A. S. Konagurthu
%A P. Kasarapu
%A L. Allison
%A J. H. Collier
%A A. M. Arthur
%T On sufficient statistics of least-squares superposition of vector sets
%J J. Comp. Biol.
%V 22
%N 6
%P 487-497
%M MAY
%D 2015
%K jrnl, JCB, MolBio, bioinformatics, ArunK, LAllison, AMLesk, JHC, JHCollier,
c2015, c201x, c20xx, zz0116, 3D, point set, tertiary, protein, structure,
structural alignment, superposition, match, matching, estimation, stats
%X "The problem of superposition of two corr. vector sets by minimizing their
sum-of-squares error under orthogonal transformation ... can be solved
exactly using an alg. whose time complexity grows linearly with the # of
correspondences. ... particularly in studies involving macromolecular
structs.. ... formally derives a set of suff.stats. for the least-squares
superposition problem. These s. are additive. This permits a highly efficient
(const. time) computation of superpositions (& s.stats.) of vector sets that
are composed from its constituent v.sets under addition or deletion op.,
where the s.stats. of the constituent sets are already known (that is, [they]
have been previously superposed). ... a drastic improvement in the run time
of the methods that commonly superpose v.sets under addition or deletion
ops., where previously these ops. were carried out ab initio (ignoring the
s.stats.). ... demonstrate the improvement our work offers in the context of
protein structural alignment programs that assemble a reliable structural
alignment from well-fitting (substructural) fragment pairs. A C++ library for
this task is available online under an open-source license."
-- [doi:10.1089/cmb.2014.0154]['16].
(Based on the 2014 RECOMB paper.)
%A J. Collier
%A L. Allison
%A A. Lesk
%A M. Garcia de La Banda
%A A. Konagurthu
%T A new statistical framework to assess structural alignment quality using
information compression
%J ECCB
%W Strasbourg
%M SEP
%D 2014
%K conf, ECCB 14, MolBio, c2014, c201x, c20xx, zz0914, LAllison, ArunK, AMLesk,
JHCollier, protein, 3D, similar, structure, alignment, match, MML, MDL, AIC,
complexity, bioinformatics, 13th Euro, Conf, Comp, Biology, I value, Ivalue
%X "... proposes a new statistical framework to assess structural alignment
quality and significance based on lossless information compression. This is
a radical departure from the traditional approach of formulating scoring
functions. It links the structural alignment problem to the general class of
statistical inductive inference problems, solved using the
information-theoretic criterion of minimum message length. Based on this, we
developed an efficient and reliable measure of structural alignment quality,
I-value. The performance of I-value is demonstrated in comparison with a
number of popular scoring functions, on a large collection of competing
alignments. Our analysis shows that I-value provides a rigorous and reliable
quantification of structural alignment quality, addressing a major gap in
the field."
-- [doi:10.1093/bioinformatics/btu460]['14],
[more].
%A L. Allison
%T On the complexity of graphs (networks) by information content, and
conditional (mutual) information given other graphs
%I Clayton School of the Faculty of Info. Tech., Monash Uni.
%R 2014/277
%P 14
%M JUN
%D 2014
%K c2014, c201x, c20xx, zz0614, LAllison, TR, complex, compress, compression,
graph, network, model, graphModel, complexity, minimum message length, MML,
simple, random, description, AI, information content, AIC, MDL, mutual,
similarity, viagra, cialis, xanax, valium, structure, SMILES, organic, maths,
chemical compound, MolBio, molecule, probability, likelihood, LAllison
%X '... concerns the information content of a graph, optionally conditional on
one or more background, "common knowledge" graphs. It describes an algorithm
to estimate this information content, and includes some examples based on
chemical compounds.'
-- [more],
& 2008.04744@[arXiv][2020].
%A A. S. Konagurthu
%A P. Kasarapu
%A L. Allison
%A J. H. Collier
%A A. M. Lesk
%T On sufficient statistics of least-squares superposition of vector sets
%J RECOMB
%I SpringerVerlag
%S LNCS/LNBI
%V 8394
%M APR
%P 144-159
%D 2014
%K conf, RECOMB, MolBio, c2014, c201x, c20xx, zz0514, ArunK, LAllison, AMLesk,
JHCollier, bioinformatics, RECOMB18, protein, structure, alignment,
least squares, RMS, error, 3D, match, matching, additive, orthogonal, rigid,
vector set, Kearsley, algorithm
%X "Superposition by orthogonal transformation of vector sets by minimizing the
least-squares error is a fundamental task in many areas of science, notably
in structural molecular biology. Its widespread use for structural analyses
is facilitated by exact solns of this problem, computable in linear time.
However, in several of these analyses it is common to invoke this
superposition routine a very large number of times, often operating (through
addition or deletion) on previously superposed vector sets. This paper
derives a set of sufficient statistics for the least-squares orthogonal
transformation problem. These sufficient statistics are additive. This
property allows for the superposition parameters (rotation, translation, &
root mean square deviation) to be computable as constant time updates from
the statistics of partial solutions. We demonstrate that this results in a
massive speed up in the computational effort, when compared to the method
that recomputes superpositions ab initio . Among others, protein structural
alignment algorithms stand to benefit from our results."
-- [doi:10.1007/978-3-319-05269-4_11]['14],
[more].
%A A. S. Konagurthu
%A L. Allison
%A D. Abramson
%A P. J. Stuckey
%A A. M. Lesk
%T How precise are reported protein coordinate data?
%J Acta Cryst.
%V D70
%N 3
%P 904-906
%M MAR
%D 2014
%K jrnl, MolBio, c2014, c201x, c20xx, zz0314, protein, 3D, tertiary, structure,
precision, accuracy, PDB, ArunK, LAllison, AMLesk
%X "Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally
reported to greater precision than the experimental structure determinations
have actually achieved. By using information theory & data compression to
study the compressibility of protein atomic coordinates, it is possible to
quantify the amount of randomness in the coordinate data & thereby to
determine the realistic precision of the reported coordinates. On avg., the
value of each C_alpha coordinate in a set of selected p.structures solved at
a variety of resolutions is good to about 0.1A."
-- [doi:10.1107/S1399004713031787]['14],
[more].
%A A. S. Konagurthu
%A A. M. Lesk
%A D. Abramson
%A P. J. Stuckey
%A L. Allison
%T Statistical inference of protein "LEGO bricks"
%J ICDM
%M DEC
%D 2013
%K conf, ICDM, ICDM13, MolBio, c2013, c201x, c20xx, zz1213, LAllison, ArunK,
AMLesk, protein, tertiary, 3D, MML, structure, structures, recurrent,
structural, motifs, backbone, folds, MDL, library, dictionary, fragment,
blocks, fragments, bioinformatics, data mining
%X "Proteins are biomolecules of life. They fold into a great variety of
three-dimensional (3D) shapes. Underlying these folding patterns are many
recurrent structural fragments or building blocks (analogous to "LEGO(r)
bricks"). This paper reports an innovative statistical inference approach to
discover a comprehensive dictionary of protein structural building blocks
from a large corpus of experimentally determined protein structures. Our
approach is built on the Bayesian and information-theoretic criterion of
minimum message length [MML]. To the best of our knowledge, this work is the
first systematic and rigorous treatment of a very important data mining
problem that arises in the cross-disciplinary area of structural
bioinformatics. The quality of the dictionary we find is demonstrated by its
explanatory power - any protein within the corpus of known 3D structures can
be dissected into successive regions assigned to fragments from this
dictionary. This induces a novel one-dimensional representation of three-
-dimensional protein folding patterns, suitable for application of the rich
repertoire of character-string processing algorithms, for rapid
identification of folding patterns of newly determined structures. This paper
presents the details of the methodology used to infer the dictionary of
building blocks, and is supported by illustrative examples to demonstrate its
effectiveness and utility."
-- [doi:10.1109/ICDM.2013.73]['14],
[more], and
1310.1462@[arXiv]['13].
%A A. S. Konagurthu
%A A. M. Lesk
%A L. Allison
%T Minimum message length inference of secondary structure from protein
coordinate data
%J J. Bioinformatics
%I OUP
%V 28
%N 12
%P i97-i105
%M JUN
%D 2012
%O ISMB, Long Beach
%K conf, ISMB12, MolBio, c2012, c201x, c20xx, zz0612, LAllison, ArunK, AMLesk,
SST, bioinformatics, protein, secondary structure, DSSP, assignment, helix,
extended strand, sheet, coil, mmld, fold, MML, MDL, model
%X "Motivation: Secondary structure underpins the folding pattern and
architecture of most proteins. Accurate assignment of the SS elts is
therefore an important problem. Although many approx. solns of the SS
assignment problem exist, the statement of the problem has resisted a
consistent & math. rigorous defn. A variety of comparative studies have
highlighted major disagreements in the way the available methods define &
assign SS to coord.data.
Results: We report a new method to infer SS based on the Bayesian method of
Minimum Message Length (MML) inference. It treats assignments of SS as
hypotheses that explain the given coord.data. The method seeks to maximise
the joint probability of a hypothesis & the data. There is a natural null
hypothesis & any assignment that cannot better it is unacceptable. We
developed a program SST based on this approach & compared it to popular
programs such as DSSP & STRIDE amongst others. Our evaln suggests that SST
gives reliable assignments even on low resolution structures."
-- [doi:10.1093/bioinformatics/bts223]['12].
More: [www]['12].
%A A. S. Konagurthu
%A L. Allison
%A P. J. Stuckey
%A A. M. Lesk
%T Piecewise linear approximation of protein structures using the principle of
minimum message length
%J J. Bioinformatics
%V 27
%N 13
%P i43-i51
%M JUL
%D 2011
%K conf, MolBio, MML, c2011, c201x, c20xx, zz0711, ISMB, LAllison, ArunK,
AMLesk, protein, fold, cartoon, description, ribbon diagram, structure,
segmentation, minimum message length, MDL, information theoretic
%X "Simple & concise representations of protein-folding patterns provide
powerful abstractions for visualizations, comparisons, classifications,
searching & aligning structural data. Structures are often abstracted by
replacing standard secondary structural features - that is, helices & strands
of sheet - by vectors or linear segments. Relying solely on std secondary
structure may result in a sig. loss of structural information. Further,
traditional methods of simplification crucially depend on the consistency &
accuracy of external methods to assign SS to protein coord.data. Although
many methods exist automatically to identify SS, the impreciseness of
definitions, along with errors & inconsistencies in experimental structure
data, drastically limit their applicability to generate reliable simplified
representations, especially for structural comparison.
This article introduces a mathematically rigorous alg. to delineate protein
structure using the elegant statistical & inductive inference framework of
minimum message length (MML). Our method generates consistent & statistically
robust piecewise linear explanations of protein coordinate data, resulting in
a powerful & concise representation of the structure. The delineation is
completely independent of the approaches of using hydrogen-bonding patterns
or inspecting local substructural geometry that the current methods use.
Indeed, as is common with applications of the MML criterion, this method is
free of parameters & thresholds, in striking contrast to the existing
programs which are often beset by them.
The analysis of results over a large number of proteins suggests that the
method produces consistent delineation of structures that encompasses, among
others, the segments corresponding to standard secondary structure."
-- [doi:10.1093/bioinformatics/btr240]['11].
(Also see [more].)
%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T A biological compression model and its applications
%B Software Tools and Algorithms for Biological Systems
%E H. R. Arabnia
%E Q.-N. Tran
%I SpringerVerlag
%S AEMB (Advances in Experimental Medicine and Biology)
%V 696
%P 657-666
%M APR
%D 2011
%K MolBio, AEMB, book chapter, c2011, c201x, zz0411, c20xx, LAllison, TIDix,
Minh Duc Cao, bioinformatics, entropy, compression, DNA, protein, sequence,
data, compress, MML, MDL
%X "A biological compression model, expert model, is presented which is superior
to existing compression algorithms in both compression performance and speed.
The model is able to compress whole eukaryotic genomes. Most importantly, the
model provides a framework for knowledge discovery from biological data. It
can be used for repeat element discovery, sequence alignment and phylogenetic
analysis. We demonstrate that the model can handle statistically biased
sequences and distantly related sequences where conventional knowledge
discovery tools often fail."
-- [more],
chapter 67 [doi:10.1007/978-1-4419-7046-6_67]['11],
in: hb us$204; uk us isbn:1441970452; uk us isbn13:978-1441970459.
%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T A genome alignment algorithm based on compression
%J BMC Bioinformatics
%V 11
%N 1
%P 599
%M DEC
%D 2010
%K jrnl, eJrnl, MolBio, c2010, c201x, c20xx, Minh Duc Cao, LAllison, TIDix,
highly accessed paper, DNA, whole genome alignment, zz0111, local alignment,
sequence, long, compress, compression, MML, MDL, BIC, poa, XM, expert model,
XMAligner
%X "... Since genomic sequences carry genetic info., this article proposes that
the info. content of each nucleotide in a posn should be considered in
sequence alignment. An info.-theoretic approach for pairwise genome local
alignment, namely XMAligner, is presented. Instead of comparing sequences at
the character level, XMAligner considers a pair of nucleotides from two
seqs. to be related if their mutual info. in context is significant. The
info.content of nucleotides in sequences is measured by a lossless
compression technique. ... Experiments on both simulated data & real data
show that XMAligner is superior to conventional methods especially on
distantly related seqs. & statistically biased data. XMAligner can align
seqs. of eukaryote genome size with only a modest hardware requirement. ..."
-- [more],
[doi:10.1186/1471-2105-11-599][12/'10] (BMC 17/2/12 mail: 3372 accesses),
21159205@[pubmed][2/'11].
%A A. Konagurthu
%A L. Allison
%A T. Conway
%A B. Beresford-Smith
%A J. Zobel
%T Design of an efficient out-of-core read alignment algorithm
%J WABI
%I SpringerVerlag
%S LNCS/LNBI
%V 6293
%P 189-201
%M SEP
%D 2010
%K wShop, MolBio, c2010, c201x, c20xx, zz1010, WABI, WABI10, nicta, LAllison,
Syzygy, ArunK, Algorithms in Bioinformatics, short read, reads, NGS, align,
mapping, next generation, DNA, sequencing, algorithm
%X "New genome sequencing technologies are poised to enter the sequencing
landscape with significantly higher throughput of read data produced at
unprecedented speeds & lower costs per run. However, current in-memory
methods to align a set of reads to one or more reference genomes are
ill-equipped to handle the expected growth of read-throughput from newer
technologies. ... reports the design of a new out-of-core read mapping alg.,
Syzygy, which can scale to large volumes of read & genome data. The alg. is
designed to run in a constant, user-stipulated amount of main memory -
small enough to fit on standard desktops - irrespective of the sizes of read
& genome data. Syzygy achieves a superior spatial locality-of-reference that
allows all large data structures used in the alg. to be maintained on disk.
We compare our prototype implementation with several popular read alignment
programs. Our results demonstrate clearly that Syzygy can scale to very
large read volumes while using only a fraction of memory in comparison,
without sacrificing performance."
-- [more],
[doi:10.1007/978-3-642-15294-8_16]['10].
(In: uk us isbn:3642152937; uk us isbn13:978-3-642-15293-1.)
%A M. D. Cao
%A L. Allison
%A T. I. Dix
%T A distance measure for genome phylogenetic analysis
%J AI09
%I SpringerVerlag
%S LNCS
%V 5866/2009
%P 71-80
%M DEC
%D 2009
%K conf, MolBio, c2009, c200x, c20xx, zz1209, Minh Duc Cao, TIDix, AI, II, AI09,
LAllison, NICTA, bioinformatics, MML, MDL, phylogenetic, tree, evolutionary,
XM, trees, distance, measure, matrix, information, maximum parsimony,
maximum likelihood, method, methods, construction, reconstruction, inference,
Plasmodium, Plasmodia
%X "Phylogenetic analyses of species based on single genes or parts of the
genomes are often inconsistent because of factors such as variable rates of
evolution & horizontal gene transfer. ... phylogeny construction from
complete genomes that is less sensitive to such inconsistency. For such long
seqs., construction methods like maximum parsimony & maximum likelihood are
often not possible due to their intensive computational requirement. Another
class of tree construction methods, namely distance-based methods, require a
measure of distances between any 2 genomes. ... presents an information
theoretic measure of genetic distances between genomes based on the
biological compression algorithm 'expert model' [XM]. ... tree of a number
of Plasmodium parasites from their genomes ... successfully construct a
plausible evolutionary tree for the gamma-Proteobacteria group whose genomes
are known to contain many horizontally transferred genes."
-- [doi:10.1007/978-3-642-10439-8_8]['09],
& [phylogenetics].
(2009; uk us isbn:364210438X; uk us isbn13:978-3642104381).
%A J. Bernal
%A T. I. Dix
%A L. Allison
%A A. Bartholomeusz
%A L. Yuen
%T Modelling Hepatitis B virus antiviral therapy and drug resistant mutant
strains
%J ACAL09
%I SpringerVerlag
%S LNCS
%V 5865
%P 159-168
%M DEC
%D 2009
%K conf, ACAL, MolBio, c2009, c200x, c20xx, zz1109, TIDix, LAllison, NICTA,
vidrl, Hepatitis B, Hep B, HepB, HBV, bioinformatics, drug resistance, model,
simulation
%X "... an individual-based model of HBV inside an artificial liver, &
associated blood serum, undergoing antiviral therapy. This model aims to
provide insights into the evolution of the HBV quasispecies & the
individual contribution of HBV mutations in the outcome of therapy."
-- [doi:10.1007/978-3-642-10427-5_16]['09].
(2009; uk us isbn:3642104266; uk us isbn13:978-3642104268.)
(Also see [Bioinformatics].)
%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T Computing substitution matrices for genomic comparative analysis
%J Advances in Knowledge Discovery and Data Mining
%I SpringerVerlag
%S LNCS
%V 5476 / 2009
%P 647-655
%M APR
%D 2009
%K conf, PAKDD, MolBio, c2009, c200x, c20xx, zz0509, bioinformatics, PAKDD09,
PAKDD13, DNA, genome, comparative analysis, comparison, substitution matrix,
PAM, blosum, estimation, alignment, free, homology, MML, MDL, information,
XM, Minh Duc Cao, TIDix, LAllison, (NICTA), malaria genome
%X "Substitution matrices ... and are important for many knowledge discovery
tasks such as phylogenetic analysis and sequence alignment. ... present a
novel algorithm that addresses this by computing a nucleotide substitution
matrix specifically for the two genomes being aligned. ... uses compression
... reconstructs, with high accuracy, the substitution matrix for
synthesised data generated from a known matrix with introduced noise. ...
successfully applied to real data for various malaria parasite genomes,
which have differing phylogenetic distances and composition that lessens the
effectiveness of standard statistical analysis techniques."
pdf@[doi:10.1007/978-3-642-01307-2_64]['09], and
[substitution matrices].
%A M. D. Cao
%A T. I Dix
%A L. Allison
%T A genome alignment algorithm based on compression
%R 2009/233
%I Faculty of Info. Tech. (Clayton), Monash University
%M JAN
%D 2009
%K TR, TR233, MolBio, c2009, c200x, c20xx, bioinformatics, zz0109, FIT, Monash,
Minh Duc Cao, TIDix, LAllison, data, DNA, sequence, compression, compressed,
compress, MML, MDL, align, strings, information, content, free, XM, XMAligner
%X "Traditional genome alignment methods based on dynamic programming are often
a. computational expensive, b. unable to compare the genomes of distant
species, & c. unable to deal with low information regions. ... information-
-theoretic approach for pairwise genome local alignment. ... the expert model
aligner, the XMAligner, relies on the expert model compression alg.. To align
2 seqs., XMAligner 1st compresses one sequence to measure the info. content
at each posn in the seq.. Then the seq. is compressed again but this time
with the background knowledge from the other seq. to obtain the conditional
info. content. The info. content & the conditional info. content from the
2 compressions are examined. Similar regions in the compressed seq. should
have the conditional info. content lower than the individual info. content.
... applied to align the genomes of Plasmodium falciparum & P. knowlesi v.
other 3 P. genomes with different levels of diversity. Despite the
differences in nucleotide composition of the reference seqs., the conserved
regions found by XMAligner in 3 alignments are relatively consistent. A
strong correlation was found between the similar regions detected by the
XMAligner & the hypothetical annotation of Plasmodium species. The alignment
results can be integrated into the DNAPlatform for visualisation."
-- [abs]['09].
[Also search for: Cao Dix Allison Bioinformatics c2010],
and see [compression]['09].
%A T. I. Dix
%A D. R. Powell
%A L. Allison
%A J. Bernal
%A S. Jaeger
%A L. Stern
%T Comparative analysis of long DNA sequences by per element
information content using different contexts
%J BMC Bioinformatics
%V 8(S2):S10
%M MAY
%D 2007
%K jrnl, ejrnl, MolBio, c2007, c200x, c20xx, zz0407, TIDix, LAllison, sequence,
analysis, data, compression, compress, low, mutual, information content,
simple, complex, DNA, plot, II, MDL, AI, context, description, profile,
signature, pattern discovery, genomic, minimum message length, MML,
Cyanidioschyzon merolae, alga, algae, Plasmodium falciparum, gui, browser
%X "... The paper presents a methodology for exploring long DNA sequences, of
the order of millions of bases, by means of their information content. ..."
-- [long DNA],
(@[csse]),
[doi:10.1186/1471-2105-8-S2-S10]['08],
17493248@[pubmed]['11].
%A M. D. Cao
%A T. I. Dix
%A L. Allison
%A C. Mears
%T A simple statistical algorithm for biological sequence compression
%J Data Compression Conf.
%W Snowbird, Utah
%I IEEE
%P 43-52
%M MAR
%D 2007
%K conf, MolBio, II, DCC, DCC07, c2007, c200x, c20xx, zz0307, DNA, protein,
minh, minhc, Minh Duc Cao, LAllison, TIDix, MML, MDL, compress, entropy,
information theory, expert model, XM, profile, models, bioinformatics,
dnacompress, pattern discovery, protein, compressible
%X "This paper introduces a novel algorithm for biological sequence compression
that makes use of both statistical properties & repetition within sequences.
A panel of experts is maintained to estimate the probability distribution of
the next symbol in the sequence to be encoded. Expert probabilities are
combined to obtain the final distribution. The resulting information sequence
provides insight for further study of the biological sequence. Each symbol is
then encoded by arithmetic coding. Experiments show that our algorithm
outperforms existing compressors on typical DNA & protein sequence datasets
while maintaining a practical running time."
[compression],
[compression],
[reprint.pdf],
-- [doi:10.1109/DCC.2007.7]['07].
%A T. I. Dix
%A D. R. Powell
%A L. Allison
%A S. Jaeger
%A J. Bernal
%A L. Stern
%T Exploring long DNA sequences by information content
%I Probabilistic Modeling and Machine Learning in Structural and
Systems Biology, Workshop Proc.
%W Tuusula, Finland
%P 97-102
%M JUN
%D 2006
%K MolBio, wShop, c2006, c200x, c20xx, modelling, models, sequence, MML, MDL,
LAllison, minimum message length, bioinformatics, comparative, chromosome,
description, 1D, TIDix, zz0806, approximate repeats, model, ARM, dotter,
dot plot, plots, 2D
%X ... long ... order of millions of bases, ...
Also see Dix et al, BMC Bioinf., 8(S2):S10, May '07,
[doi:10.1186/1471-2105-8-S2-S10]['08],
and [Bioinformatics].
CR classn '98: F.2, I.2.6, G.3, J.3.; isbn:9521032774.
%A L. Allison
%T Finding approximate palindromes in strings quickly and simply
%I School of Computer Science and Software Engineering, Monash University,
Australia 3800
%R 2004/162
%M DEC
%D 2004
%K TR 162, TR162, palindrome, MolBio, bioinformatics, algorithm, string,
LAllison, c2004, c200x, c20xx, zz1204
%X Described are two algorithms to find long approximate palindromes in
a string, for example a DNA sequence. A simple algorithm requires
O(n)-space and almost always runs in O(k.n)-time where n is the
length of the string and k is the number of ``errors'' allowed in the
palindrome. Its worst-case time-complexity is O(n^2) but this does
not occur with real biological sequences. A more complex algorithm
guarantees O(k.n) worst-case time complexity.
-- [arxiv...pdf][12/'04],
[approx.palindromes].
%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T Modelling alignment for non-random sequences
%J Advances in Artificial Intelligence
%I SpringerVerlag
%S LNCS/LNAI
%V 3339
%P 203-214
%M DEC
%D 2004
%O 17th ACS Australian Joint Conf. on Artificial Intelligence (AI2004)
%K conf, AI, MolBio, dynamic programming algorithm, DPA, c2004, c200x, c20xx,
similar, sequence, homology, Malignment, minimum message length, MML, MDL,
PRSS, FASTA, BLAST, Smith Waterman, context, significance test, P values,
Pvalue, biased, DNA, bias, plasmodium falciparum, low information content,
score, dependent, pattern, repeats, repetitive, shuffling, masking, scoring,
quality, Markov, model, models, hidden, HMM, PHMM, HMMER, description, pair,
strings, DRPowell, LAllison, TIDix, bioinformatics, probabilistic, total,
average
%X "Populations of biased, non-random seqs. may cause standard alignment
algorithms to yield false-positive matches & false-negative misses. A std
significance test based on the shuffling of sequences is a partial solutions
applicable to pop'ns that can be described by simple models. Masking-out
low information content intervals throws information away. ... new & general
method, modelling alignment: Population models are incorporated into the
alignment process, which can (& should) lead to changes in the rank-order of
matches between a query seq. & a collection of seqs., compared to results
from std algorithms. The new method is general & places very few conditions
on the nature of the models that can be used with it. We apply modelling-
alignment to local alignment, global alignment, optimal alignment & the
relatedness problem. Results: As expected, modelling-alignment & the
standard PRSS program from the FASTA package have similar accuracy on
sequence populations that can be described by simple models, e.g. 0-order
Markov models. However, modelling-alignment has higher accuracy on popns that
are mixed or that are described by higher-order models: It gives fewer false
positives & false negatives as show by ROC curves & other results from tests
on real and artificial data". isbn:3540240594.
-- [doi:10.1007/978-3-540-30549-1_19]['11],
[mAlign] inc software.
[Also search for: Allison COMPJ c1999].
%A L. Allison
%T Longest biased interval and longest non-negative sum interval
%J J. Bioinformatics
%V 19
%N 10
%P 1294-1295
%M JUL
%D 2003
%K MolBio, jrnl, sequence analysis, DNA, RNA, zz0703, c2003, c200x, c20xx,
bias, density, skew, constraints, low information content, substring,
algorithm, program, locating, maximum, maximal, heaviest, longest, segment,
segments, intervals, regions, positive sum, nonnegative total, AT CG rich,
ATrich, CGrich, problem, CpG, LBI, CABIOS, biased interval, LAllison,
average, score, scores, highest
%X ... an algorithm to find the longest interval having at least a specified
minimum bias in a sequence of characters (bases, amino acids),
e.g. 'at least 0.95 (A+T)-rich'. It is based on an algorithm to find
the longest interval having a non-negative sum in a sequence of positive
and negative numbers. In practice, it runs in linear time; this can be
guaranteed if the bias is rational. ...
-- [LBI, inc' code],
[LBI, inc' code].
[can analyse long sequences, e.g. whole chromosomes, very quickly.]
-- [doi:10.1093/bioinformatics/btg135]['05],
[.pdf@bioinformatics.oxfordjournals.org][7/'03],
[.pdf@bioinformatics.oxfordjournals.org][7/'03].
%A L. Stern
%A L. Allison
%A R. L. Coppel
%A T. I. Dix
%T Discovering patterns in Plasmodium falciparum genomic DNA
%J Molecular and Biochemical Parasitology
%V 118
%N 2
%P 175-186
%M DEC
%D 2001
%K MolBio, LAllison, jrnl, c2001, c200x, c20xx, Plasmodium falciparum,
minimum message length, description, Markov, MDL, MBP, MML, Malaria, entropy,
sequence analysis, HMM, PFSA, PFSM, information content, genome, DNA,
pattern, repeat, repeats, DNA models, AED, approximate repeats model, repeat,
pattern discovery, knowledge, hidden, GHMM, Bayesian, LStern, TIDix,
bioinformatics
%X "A method has been developed for discovering patterns in DNA sequences.
Loosely based on the well-known Lempel Ziv model for text compression, the
model detects repeated seqs. in DNA. The repeats can be forward or inverted,
and they need not be exact. The method is particularly useful for detecting
distantly related seqs., and for finding patterns in seqs. of biased
nucleotide composition, where spurious patterns are often observed because
the bias leads to coincidental nucleotide matches. We show here the utility
of the method by applying it to genomic seqs. of Plasmodium falciparum. A
single scan of chromosomes 2 and 3 of P.falciparum, using our method and no
other a priori information about the seqs., reveals regions of low complexity
in both telomeric and central regions, long repeats in the subtelomeric
regions, and shorter repeat areas in dense coding regions. Application of the
method to a recently sequenced contig of chromosome 10 that has a
particularly biased base composition detects a long internal repeat more
readily than does the conventional dot matrix plot. Space requirements are
linear, so the method can be used on large sequences. The observed repeat
patterns may be related to large-scale chromosomal organization and control
of gene expression. The method has general application in detecting patterns
of potential interest in newly sequenced genomic material."
-- [paper],
[paper.pdf],
[doi:10.1016/S0166-6851(01)00388-7][6/'04].
More on [compression] and
[Bioinformatics].
%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T Fast, optimal alignment of three sequences using linear gap costs
%J J. Theor. Biol.
%V 207
%N 3
%P 325-336
%M DEC
%D 2000
%K jrnl, JTB, MolBio, Biology, multiple, sequence, alignments, similarity,
affine, linear, cost, gaps, insert, delete, indel, indels, DNA, time, speed,
fast, string, strings, iterative, phylogenetic, family tree, Ukkonen,
edit distance, dynamic programming algorithm, DPA, DRPowell, LAllison, TIDix,
c2000, c200x, c20xx, zz1100, J Theoretical Biology, bioinformatics
%X [...] The obvious dynamic programming algorithm for optimally
aligning k sequences of length n runs in O(n^k) time. This is
impractical if k >= 3 and n is of any reasonable length.
[...] new algorithm [is] guaranteed to find the optimal alignment [...]
particularly fast when the (three-way) edit distance is small. [...]
O(n + d^3) on average.
[paper][11/'00] and code,
[doi:10.1006/jtbi.2000.2177]['04]
more on [bioinformatics].
%A L. Allison
%A D. Powell
%A T. I. Dix
%T Modelling is more versatile than shuffling
%R 2000/83
%I School of Computer Science and Software Engineering, Monash University,
Australia 3168
%D 2000
%K MolBio, pair, two, sequence, alignment, PFSA, hidden Markov model, PHMM,
low medium information content, repeat, repetition, structure, pattern,
model, shuffle, shuffling, randomize, DNA, permute, tuples, frequencies,
family, PFSM, HMM, dynamic programming algorithm, DPA, homology, algorithm,
minimum message length, MML, description, MDL, LAllison, DRPowell, TIDix,
mAlignment, TR 83, TR83, c2000, c200x, c20xx, bioinformatics
%X It is shown how to incorporate almost any (left to right) model of a
population of sequences into the alignment DPA. Doing so is an alternative
to shuffling/ randomizing (Fitsch; Altschul, Erickson ...) the sequences
to correct for population biases. The resulting algorithm gives
fewer false positives, fewer false negatives, and can (and should)
change the rank ordering of alignments.
[also search for: modelling alignment]
[more],
[mon], and
[Bioinformatics].
%A L. Allison
%A L. Stern
%A T. Edgoose
%A T. I. Dix
%T Sequence complexity for biological sequence analysis
%J Computers and Chemistry
%O (jrnl later became Computational Biology and Chemistry, c2006)
%V 24
%N 1
%P 43-55
%D 2000
%K jrnl, comp chem, MolBio, c2000, c200x, c20xx, data compression, compress,
MML, MDL, DNA, mine, approximate repeats, repeat, AED, model, repetitive,
complexity, high, moderate, low, information content, bioinformatics,
LAllison, TIDix, computational biology and chemistry, LStern, description,
minimum message length, bioinformatics, AI, inductive inference,
data mining, time series, timeSeries
%X "A new statistical model for DNA considers a sequence to be a mixture of
regions with little structure and regions that are approximate repeats of
other subsequences, i.e. instances of repeats do not need to match each other
exactly. Both forward- and reverse-complementary repeats are allowed. The
model has a small number of parameters which are fitted to the data. In
general there are many explanations for a given sequence and how to compute
the total probability of the data given the model is shown. Computer
algorithms are described for these tasks. The model can be used to compute
the information content of a sequence, either in total or base by base. This
amounts to looking at sequences from a data-compression point of view and it
is argued that this is a good way to tackle intelligent sequence analysis
in general."
-- [more],
[paper.pdf],
[doi:10.1016/S0097-8485(00)80006-6]['02].
Also see [Bioinformatics].
%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T A versatile divide and conquer technique for optimal string alignment
%J IPL
%V 70
%N 3
%P 127-139
%D 1999
%K IPL, jrnl, c1999, c199x, c19xx, zz0899, dynamic programming algorithm, DPA,
MolBio, DRPowell, LAllison, TIDix, bioinformatics, space, complexity,
strings, Edit Distance, similarity, LCS, Fast, Hirschberg, Ukkonen, Myers,
time, speed, Linear Space, Check Point, pointing, checkpoint, algorithms
%X A check-pointing (CP) technique uses O(n) space but is simpler
than Hirschberg's O(n)-space technique; H' ('97) attributes
an O(N**2)-time simple edit-distance CP to Eppstein. Here, CP
is applied to more complex cost functions, e.g., linear gap costs, and ...
to Ukkonen's O(n*d)-time DPA, even including linear gap costs,
to give O(n)-space, O(n.log d + d**2)-average-time,
effectively O(d**2)-time in many practical situations.
-- [doi:10.1016/S0020-0190(99)00053-8][6/'04],
& [Divide-and-C.].
%A L. Allison
%A D. Powell
%A T. I. Dix
%T Compression and approximate matching
%J COMPJ
%V 42
%N 1
%P 1-10
%D 1999
%K jrnl, MolBio, COMPJ, Computer Journal, LAllison, DRPowell, TIDix, pair,
string, strings, sequence, alignment, analysis, algorithm, homology,
similarity, limits, HMM, MML, MDL, II, normalized, limit, significance test,
testing, jie, med, DPA, low information content, repeats, repetitive, wei,
non-random, nonrandom, compressible, bioinformatics, pair, probabilistic,
information theory, features, complexity, time, fast, speed, shuffling,
shuffle, randomize, DNA, edit distance, hidden Markov model, HMM, PHMM,
c19xx, c1999, c199x, modelling, Malignment, context dependent, scoring
%X A population of sequences is called non-random if there is a statistical
model and an associated compression algorithm that allows members of the
population to be compressed, on average. Any available statistical model
of a population should be incorporated into algorithms for alignment of
the sequences and doing so changes the rank-order of possible alignments
in general. The model should also be used in deciding if a resulting
approximate match between two sequences is significant or not. It is
shown how to do this for two plausible interpretations involving pairs
of sequences that might or might not be related. Efficient alignment
algorithms are described for quite general statistical models of sequences.
The new alignment algorithms are more sensitive to what might be termed
`features' of the sequences. A natural significance test is shown to be
rarely fooled by apparent similarities between two sequences that are merely
typical of all or most members of the population, even unrelated members.
-- [more],
[doi:10.1093/comjnl/42.1.1]['06],
[pdf@compj]['05].
Also see [Powell AI2004],
and [bioinformatics].
%A L. Allison
%A T. Edgoose
%A T. I. Dix
%T Compression of strings with approximate repeats
%J Intell. Sys. in Mol. Biol. '98
%P 8-16
%M JUN
%D 1998
%K conf, MolBio, c1998, c199x, c19xx, LAllison, Monash, ISMB6, ISMB98, ISMB 98,
DPA, inductive inference, II, DNA, Chen, Behzadi, F. Le Fessant, protein,
sequence analysis, data, text, repeat, compression, bioinformatics, Li,
repetitive, kwong, hidden Markov models, HMM, low information content,
lines, ALU, sines, ming, self similarity, AED, model, minimum message length,
MML, description, MDL, compress, algorithm
%X "We describe a model for strings of characters that is loosely based on the
Lempel Ziv model with the addition that a repeated substring can be an
approximate match to the original substring; this is close to the situation
of DNA, for example. ... The model has a small number of parameters and these
can be estimated from the given string by an expectation-maximization (EM)
algorithm. Each iteration of the EM algorithm takes O(n^2) time and a few
iterations are typically sufficient. O(n^2) complexity is impractical for
strings of more than a few tens of thousands of characters and a faster
approximation algorithm is also given. The model is further extended to
include approximate reverse complementary repeats when analyzing DNA
strings. Tests include the recovery of parameter estimates from known
sources and applications to real DNA strings."
-- [more] ISMB98, 28 June - 1 July 1998,
[.pdf] Montreal, Canada; uk us isbn:1577350537.
Also see [bioinformatics].
%A L. Allison
%T Information-theoretic sequence alignment
%I School of Computer Science and Software Engineering, Monash University
%R 98/14
%M JUN
%D 1998
%K TR14, TR 14, MolBio, LAllison, string, strings, similarity, edit-distance,
homology, approximate match, matching, DNA, DPA, hidden Markov model, HMM,
low information content, repeats, repetitive, compressible, MML, MDL,
data compression, content, sequences, c1998, c199x, c19xx, bioinformatics
%X [TR98/14],
[TR98/14].
Also see: Allison, Powell and Dix, `Compression and Approximate Matching',
Comp. J. 42(1) pp1-10, 1999, for a fuller explanation and later results.
Also see [Bioinformatics].
%A D. R. Powell
%A L. Allison
%A T. I. Dix
%A D. L. Dowe
%T Alignment of low information sequences
%J Australian Computer Science Theory Symposium, CATS '98
%W Perth
%P 215-230
%I NUS
%M FEB
%D 1998
%K conf, MolBio, align, HMM, DPA, DNA, ACSC, CATS, CATS98, LAllison, DLDowe,
bioinformatics, probability, low information, simple, features,
TIDix, Monash, c1998, c199x, c19xx, DRPowell
%X "Alignment of two random sequences over a fixed alphabet can be shown to be
optimally done by a Dynamic Programming Algorithm (DPA). It is normally
assumed that the sequences are random and incompressible and that one
sequence is a mutation of the other. However, DNA and many other sequences
are not always random and unstructured, and the issue arises as how to best
align compressible sequences. Assuming our sequences to be non-random and
to emanate from mutations of a first order Markov model, we note that
alignment of high information regions is more important than alignment of
low information regions and arrive at a new alignment method for low
information sequences which performs better than the standard DPA for data
generated from mutations of a first order Markov model."
-- [more], uk us isbn:9813083921,
[paper.ps]['98].
(Also see [Bioinformatics].)
%A T. Edgoose
%A L. Allison
%A D. L. Dowe
%T An MML classification of protein structure that knows about angles and
sequence
%J Pacific Symposium on Biocomputing '98
%P 585-596
%M JAN
%D 1998
%K conf, MolBio, PSB, PSB3, PSB98, von Mises, vonMises, angle, dihedral, class,
cluster, clustering, HMM, SNOB, time series, timeSeries, ARC A49602504,
LAllison, MDL, directional, distribution, bioinformatics, Monash,
c1998, c199x, c19xx
%I World Scientic
%X SNOB + vonMises circular probability distribution + 1st order Markov model.
phi-psi pairs give 17 classes and a class seq' correlation matrix.
[paper]
[paper]
[paper.pdf@stanford.edu]['98]; uk us isbn:9810232780.
von Mises, probability density:
f(x | mu, kappa) = (1/(2.pi.I0(kappa))).exp(kappa.cos(x-mu))
where I0(kappa) is a normalisation constant.
[Bioinformatics].
%A D. R. Powell
%A D. L. Dowe
%A L. Allison
%A T. I. Dix
%T Discovering simple DNA sequences by compression
%J Pacific Symposium on Biocomputing '98
%P 597-608
%M JAN
%D 1998
%K conf, MolBio, PSB, PSB3, PSB98, repeat, repetition, data, model, random,
ARC A49602504, bioinformatics, DRPowell, LAllison, DLDowe, TIDix, Monash,
zz0198, simple, low information, Milosavljevic, Jurka, c1998, c199x, c19xx,
compress
%I World Scientic
%X "An information-theoretic DNA compression scheme devised by Milosavljevic and
Jurka(1993) has been used in many places in the literature for both the
discovery of new genes and the compression of DNA. Their compression method
applies an encoding of previously occurring runs. They use 5 different
code-words: four being the DNA bases, A,C,G and T, and the other being a
pointer to a previously occurring run. They advocate a code-word of length
log2(5) for each of these and then encoding a run by a code-word of length
2log2(n), where n is the length of the sequence. This scheme encodes the
start of the sequence with a code-word of length log2(n) and likewise encodes
the end of the sequence with a code-word of length log2(n). In this paper,
we show the above coding scheme to be inefficient in various ways and improve
upon it so that it can compress DNA. We discuss our implementation of various
schemes some of which run in linear time."
-- [more],
[paper.pdf]['98]
[paper]
uk us isbn:9810232780.
Also see [MML page]
and Milosavljevic '95.
%A D. L. Dowe
%A L. Allison
%A T. I. Dix
%A L. Hunter
%A C. S. Wallace
%A T. Edgoose
%T Circular clustering of protein dihedral angles by minimum message length
%J Pacific Symposium on Biocomputing '96
%M JAN
%P 242-255
%D 1996
%I World Scientific
%O TR 95/237, Dept. Computer Science, Monash University, Oct 1995
%K PSB, PSB96, TR 237, TR237, Monash, DLD, CSW, CSWallace, LAllison, MolBio,
Monash, classification, angle, von Mises, vonMises, protein structure,
inductive inference, II, MML, MDL, conf, bioinformatics, c1996, c199x, c19xx
%X L. Hunter - NLM, NIH. PSB '96: 3-6 Jan 1996, Hawaii; uk us isbn:9810225784.
[paper],
[paper.ps][1/'96],
[[eProceedings]][1/'96].
%A L. Allison
%T Towards modelling evolution = mutation modulo selection in sequence
alignment
%R 95/225
%I Dept. Computer Science, Monash University
%M JUN
%D 1995
%K LAllison, Monash, TR225, TR 225, MolBio, evolution, pressure, selection,
fit, fitness, family, phylogenetic, evolutionary tree, trees, sequence,
multiple alignment, zz0795, c1995, c199x, c19xx, bioinformatics
%X [bioinformatics].
%A L. Allison
%T Using Hirschberg's algorithm to generate random alignments of strings
%J IPL
%V 51
%N 5
%P 251-255
%M SEP
%D 1994
%K c1994, c199x, c19xx, LAllison, Monash, jrnl, IPL, MolBio, bioinformatics, GS,
DNA, methods, MML, minimum message length encoding, II, inductive inference,
Hirschberg, string, sequence, alignment, similarity, homology, approximate,
match, matching, LCS, LCSS, edit distance, Bayesian, Gibbs sampling, MCMC,
random, sample, DPA, simulated annealing, SA, dynamic programming algorithm,
divide and conquer, hidden Markov model, HMM, probability,
posterior distribution, stochastic
%X Hirschberg's (CACM '75) recursive divide and conquer technique for
the dynamic programming technique (LCS, LCSS, Edit Distance) is
applied to the problem of sampling alignments of two strings
at RANDOM from the alignments' posterior probability distribution.
[more],
[reprint.ps],
[doi:10.1016/0020-0190(94)90004-3]['04].
Also see [Bioinformatics].
%A L. Allison
%A C. S. Wallace
%T An information measure for the string to string correction problem with
applications
%J 17th Australian Comp. Sci. Conf.
%P 659-668
%M JAN
%D 1994
%W Christchurch, N. Z.
%K LAllison, CSW, CSWallace, Monash, conf, MolBio, inductive inference, II,
string, sequence, family, evolutionary, phylogenetic, tree, trees,
variation, variance, uncertainty, estimate, estimation, parameters, DNA,
multiple alignment, Gibbs sampling, sample, GS, simulated annealing SA,
minimum message length MML, Bayesian, temperature, cooling, probabilistic,
NZ, New Zealand, c1994, c199x, c19xx, ACSC 17, 94, ACSC17, ACSC94,
bioinformatics, Monash
%O Australian Comp. Sci. Comm., Vol 16, No 1(C), 1994, isbn:047302313X.
%X It has been shown how to calculate a probability for an alignment.
Alignments are sampled from their posterior probability distribution.
This is extended to multiple alignments (of several strings). Averaging
over many such alignments gives good estimates of how closely the strings
are related and in what way. In addition, sampling in an increasingly
selective way gives a simulated annealing search for an optimal alignment.
[Bioinformatics],
[paper].
See also the related paper J. Mol. Evol. (39, pp418-430, 1994),
"The posterior probability distribution ...", for more results.
%A L. Allison
%A C. S. Wallace
%T The posterior probability distribution of alignments and its application
to parameter estimation of evolutionary trees and to optimization of
multiple alignments
%J J. Mol. Evol.
%V 39
%N 4
%P 418-430
%M OCT
%D 1994
%O An earlier version is TR 93/188, Dept. Comp. Sci., Monash U., July '93
%K jrnl, MolBio, JME, c1994, c199x, c19xx, LAllison, CSWallace, CSW, DNA,
bioinformatics, optimisation, estimate, infer, parameters, algorithm,
multiple, alignment, data, string, molecular, sequence, homology, Markov,
family, phylogenetic, tree, trees, edit distance, Monte Carlo method, mcmc,
simulated annealing, SA, inductive inference, II, sample, speed, Bayesian,
dynamic programming algorithm, DPA, stochastic, methods, GS, Gibbs sampling,
minimum message length encoding, MML, chain, minimum description length, MDL,
transthyretin, chloramphenicol resistance gene, CAT, CATB, CATD, CATP, CATQ,
CCOLI, ECOLI, algorithmic, mutual information, theory, significance,
probabilistic, temperature, limits, TR 93/188, TR188
%X "It is shown how to sample alignments from their posterior probability
distribution given two strings. This is extended to sampling alignments of
more than two strings. The result is firstly applied to the estimation of
the edges of a given evolutionary tree over several strings. Secondly,
when used in conjunction with simulated annealing, it gives a stochastic
search method for an optimal multiple alignment."
-- [paper] and source code,
[reprint.ps],
[doi:/10.1007/BF00160274]['07].
(The JME paper is a much expanded and changed version of TR 93/188,
[TR93/188](.ps))
%A L. Allison
%T A fast algorithm for the optimal alignment of three strings
%J J. Theor. Biol.
%V 164
%N 2
%P 261-269
%M SEP
%D 1993
%O TR 92/168 Dept. Computer Science, Monash University, Oct '92.
%K LAllison, Monash, jrnl, II, JTB, MolBio, bioinformatics, multiple alignment,
edit distance, Ukkonen, three, string, strings, sequence, sequences,
dynamic programming algorithm, DPA, TR 92 168 TR92-168 TR168,
c1993, c199x, c19xx, J Theoretical Biology
%X Given 3 strings, length ~ n, 3-way edit-distance d,
O(n.d^2) time algorithm worst case, O(n+d^3) typically.
Tree costs 0/1/2. ie. xxx :0; xxy, xx-, x-- :1; xyz, xy- :2
NB. Each internal node of an unrooted binary tree has 3 neighbours.
[more],
[reprint.ps],
[paper] inc' pdf paper and code,
[doi:10.1006/jtbi.1993.1153]['04],
and more on [bioinformatics].
%A L. Allison
%T Normalisation of affine gap costs used in optimal sequence alignment
%J J. Theor. Biol.
%M MAR
%D 1993
%V 161
%N 2
%P 263-269
%K LAllison, Monash, jrnl, MolBio, JTB, alignment, string, sequence analysis,
edit distance, homology, gap, gaps, linear, indel, insert delete, Altschul,
mutual information, similarity, hidden Markov model, HMM, bioinformatics,
inductive inference, II, c1993, c199x, c19xx, DNA, J Theoretical Biology
%X "It is shown how to normalize the costs of an alignment algorithm that
employs affine or linear gap costs. The normalized costs are interpreted as
the -log probabilities of the instructions of a finite-state edit-machine.
This gives an explicit model relating sequences that can be linked to
processes of mutation and evolution."
-- [more],
[reprint.ps],
[paper] inc' pdf.
%A L. Allison
%T Lazy dynamic programming can be eager
%J IPL
%V 43
%N 4
%P 207-212
%M SEP
%D 1992
%K LAllison, Monash, jrnl, FP, lazy functional programming, Haskell, fast,
efficient, dynamic programming algorithm, DPA, edit distance, LCS, LCSS,
MolBio, approximate, similar, string, strings, match, matching, sequence,
alignment, c1992, c199x, c19xx, bioinformatics
%X Lazy evaluation in a functional language is exploited to make the simple
dynamic programming algorithm for the edit-distance problem run quickly
on similar strings: being lazy can be fast.
Runs in O(n*d) time thanks to laziness.
-- [more],
[reprint.ps]
[html]
[doi:10.1016/0020-0190(92)90202-7]['07].
%A C. N. Yee
%A L. Allison
%T Fast string alignment with linear indel costs
%R TR 92/165
%I Computer Science, Monash University
%M JUL
%D 1992
%K LAllison, MolBio, string alignment, similarity, homology, Ukkonen,
edit distance, linear indel gap cost, costs, Ukkonen, Monash,
TR 165, TR92/165, TR165, II, c1992, c199x, c19xx, bioinformatics
%X two strings, O(n*d) time.
The constants in the cost function have to be "small" integers.
[Computing for Molecular Biology].
[Also search for: Ukkonen].
%A D. L. Dowe
%A J. Oliver
%A L. Allison
%A T. I. Dix
%A C. S. Wallace
%T Learning rules for protein secondary structure prediction
%J Proc. 1992 Department Research Conf.
%I Dept. Computer Science, University of Western Australia
%E C. McDonald
%E J. Rohl
%E R. Owens
%M JUL
%D 1992
%O TR 92/163, Dept. Computer Science, Monash University, JUN '92
%K LAllison, CSW, DLD, Monash, UWA, WA, conf, MolBio, decision tree, trees,
graph, protein, amino acid, AA, secondary structure, SS, prediction,
rule, rules, alpha helix, beta strand, extended sheet, coil, turn,
CSWallace, inductive inference, II, MML, minimum message length, c1992,
c199x, c19xx, bioinformatics, TR 92 163, TR92-163, TR163
%X [TR92/163.ps]
Also see [Bioinformatics],
and TR 92/163.
[CSci UWA home]['00]; uk us isbn:0864221959.
%A D. L. Dowe
%A J. Oliver
%A T. I. Dix
%A L. Allison
%A C. S. Wallace
%T A decision graph explanation of protein secondary structure prediction
%J 26th Hawaii Int. Conf. Sys. Sci.
%V 1
%P 669-678
%M JAN
%D 1993
%K LAllison, CSW, Monash, conf, MolBio, protein secondary structure prediction,
conformation, alpha helix, ss, AA, beta sheet extended strand, turn, coil,
II, inductive inference, decision graph tree, DTree, CSWallace, CSW,
MML, Minimum message length encoding, description, MDL, Bayesian,
TR163 163, c1993, c199x, c19xx, bioinformatics, HICSS, HICSS26, HICSS93
%X Oliver and Wallace (IJCAI '91) introduced `decision graphs' -
a generalisation of decision trees - here applied to protein secondary
structure prediction.
[more],
[paper (HTML)].
Also see TR 92/163.
%A C. N. Yee
%A L. Allison
%T Reconstruction of strings past
%J Bioinformatics
%O (was Comp. Appl. BioSci.)
%V 9
%N 1
%P 1-7
%M FEB
%D 1993
%O TR 92/162, Dept. Computer Science, Monash University, May '92
%K LAllison, Monash, jrnl, MML, minimum message length, mdl, J. Bioinformatics,
encoding, hidden Markov model, HMM, II, inductive inference, probabilistic,
MolBio, string, sequence, alignment, homology, similarity, edit,
evolutionary, distance, parameter estimation, r-theory,
TR TR92-162 TR162 162, c1993, c199x, c19xx, CABIOS, J. Bioinformatics
%X Use of single optimal alignment gives biased estimates of the evolutionary
"distance" between two strings but the r-theory, average all alignments,
recovers accurate estimates over a very wide range of similarity.
-- [more],
[doi:10.1093/bioinformatics/9.1.1]['11],
[reprint.ps]
[paper.html].
Also see [Bioinformatics].
[now J. Bioinformatics].
%A L. Allison
%T Some algorithmic attacks on multiple alignment (abstract)
%J Boden Conf.
%W Thredbo, Australia
%M FEB
%D 1993
%K LAllison, Monash, RSBS, ANU, conf, MolBio, MML, II, string,
sequence, approximate match, matching, three, DPA, bioinformatics
%X also see [Bioinformatics].
%A L. Allison
%T Estimating parameters and evolutionary distances in the inference of
evolutionary trees (abstract)
%J Robertson Symposium.
%W Australian National University
%M JAN
%D 1993
%K LAllison, Monash, RSBS, ANU, conf, MolBio, MML, inductive inference,
II, string, sequence, multiple alignment, approximate match,
matching, phylogenetic, tree, trees, c1993, c199x, c19xx, bioinformatics
%X also see [Bioinformatics].
%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T Minimum message length encoding, evolutionary trees and multiple alignment
%J 25th Hawaii Int. Conf. on Sys. Sci.
%K LAllison, CSW, Monash, conf, MolBio, minimum message length encoding, MML,
ML, evolutionary, family, phylogenetic, tree, trees, CSWallace, CSW, human,
Bayesian, finite state, model, machine, FSM, hidden Markov model, primate,
HMM, DNA, multiple alignment, inductive inference, II, bioinformatics, chimp,
HICSS, HICSS25, HICCS92, TR 91 155, TR91-155, TR155, c1992, c199x, c19xx
%V 1
%P 663-674
%M JAN
%D 1992
%O TR 91/155, Dept. Computer Science, Monash University '91
%X "A method of Bayesian inference known as MML encoding is applied to inference
of an evolutionary tree and to multiple alignment for K >= 2 strings.
It allows the posterior odds-ratio of two competing hypotheses, for
example two trees, to be calculated. A tree that is a good hypothesis forms
the basis of a short message describing the strings. The mutation
process is modelled by finite-state machine. It is seen that tree
inference and multiple alignment are intimately connected."
-- [paper],
there is an example on the primate globin pseudo-genes.
(Also see [Bioinformatics].)
%A S. T. S. Ho
%A L. Allison
%A C. N. Yee
%A T. I. Dix
%T Constraint checking for restriction site mapping
%J 24th Hawaii Int. Conf. on Sys. Sci.
%V ?
%P ???-???
%M JAN
%D 1991
%O TR 89/129 Computer Science, Monash University, JUL '89
%K LAllison, Monash, conf, MolBio, HICSS, HICSS24, HICSS91, bioinformatics,
restriction site mapping, RSM, map, constraint satisfaction problem CSP,
separation theory, TR 89/129 89 129 TR89/129 TR129, c1991, c199x, c19xx
%X also see [mapping].
%A S. Ho
%A L. Allison
%A C. N. Yee
%T Restriction site mapping for three or more enzymes
%J Comp. Appl. BioSci. (J. Bioinformatics)
%V 6
%N 3
%P 195-204
%M JUL
%D 1990
%O TR 88/117 Dept. Comp. Sci., Monash University Oct '88
%K LAllison, Monash, jrnl, MolBio, CABIOS, RSM, restriction site map, mapping,
constraint satisfaction problem, programming, three, separation theory,
TR 88 117 TR117 TR88/117, c1990, c199x, c19xx, J. Bioinformatics, enzyme
%X "Restriction site mapping requires a generator to put forward possible maps &
a constraint checker to reject false maps. Ideally these combine to give an
algorithm which calculates a sound & complete solution set. 3 algorithms for
generation are presented & compared. Two decompose a multi-enzyme problem
(3+) into subproblems. The constraint checker is based on separation theory.
Some insights into the extent of constraint checking involved in & the
feasibility of more checking for three or more enzymes are discussed. The
trade-off between comp'n time & the soundness of the soln set is examined."
[now J. Bioinformatics]
-- [doi:10.1093/bioinformatics/6.3.195],
[jrnl]['07].
Also see [mapping].
%A L. Allison
%A Du Xiaofeng
%T Relating three strings by minimum message length encoding (abstract)
%P 13
%J International Conference on Genes, Proteins and Computers
%W Chester
%I SERC Daresbury Laboratory
%M APR
%D 1990
%K LAllison, Monash, conf, MolBio, multiple, three, triple, alignment,
LCS, LCSS, MML, family, evolutionary phylogenetic tree, bioinformatics,
inductive inference, II, DNA, c1990, c199x, c19xx
%X also see [Bioinformatics].
%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T Finite-state models in the alignment of macro-molecules
%J J. Mol. Evol.
%V 35
%N 1
%P 77-89
%M JUL
%D 1992
%K LAllison, jrnl, MolBio, c1992, c199x, c19xx, TR 90/148, macromolecules,
TR90/148, TR148, 148 inductive inference, II, DNA, bioinformatics, DPA,
dynamic programming algorithm, mutual information, ML, string, sequence,
comparison, alignment, minimum message length encoding, MML, FSM, FSA,
finite state model, analysis, minimum description length, MDL, methods,
Hidden Markov model, HMM, homology, similarity, LCS, LCSS, significance,
evolutionary, edit distance, sequence, r-theory, linear, gap, indel, insert,
delete, Algorithm, Time, Speed, JME, AAAI, Bayes, Bayesian, CSWallace, CSW
%O An extended abstract titled:
Inductive inference over macro-molecules
in joint sessions at AAAI Symposium, Stanford MAR 1990
on (i) Artificial Intelligence and Molecular Biology, p5-9,
& (ii) Theory and Application of Minimal-Length Encoding, p50-54,
also an early version in Technical Report 90/148,
Dept. Comp. Sci., Monash U., Australia 3168.
%X MML encoding is a technique of inductive inference with theoretical and
practical advantages. It allows the posterior odds-ratio of two theories
or hypotheses to be calculated. Here it is applied to the problem of
aligning or relating two strings, in particular biological macro-molecules.
We compare the r-theory, that the strings are related, with the null-theory,
that they are not related. If they are related the probabilities of the
various alignments can be calculated. This is done for the one-, three-
and five-state models of relation or mutation. These correspond to linear
and piecewise linear cost functions on runs of indels. We describe how
to estimate the parameters of a model. The validity of a model is itself
a hypothesis and can be tested objectively. This is done on real DNA
and on artificial data. The tests on artificial data indicate limits on
what can be inferred in various situations. The tests on real DNA support
either the three- or the five-state models over the one-state model.
Finally, a fast, approximate minimum message length string comparison
algorithm is described.
-- [doi:10.1007/BF00160262]['07].
[reprint] and software,
See C. S. Wallace & D.M Boulton
An information measure for classification.
CompJ 11(2) 185-194 Aug '68 (appendix)
for the derivation of the coding scheme for multi-state data.
See also (i) Bishop & Thompson
(ii) Thorne, Kishino & Felsenstein, and
[AIMB](.ps),
[Alignment].
%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T When is a string like a string?
%J Int. Symposium on Artificial Intelligence and Mathematics
%W Ft. Lauderdale, Florida, USA
%M JAN
%D 1990
%K LAllison, CSW, CSWallace, Monash, conf, inductive inference, II, homology,
alignment, LCS, edit distance, string, sequence, comparison, similarity,
r-theory, macro-molecule, MolBio, DNA, uncertainty, pattern matching, MML,
minimum message length encoding, AIM AIM90, Hidden Markov model, HMM,
c1990, c199x, c19xx, bioinformatics
%X [more],
[html],
[.ps](.ps)
also see [TR90/148](html)
and [TR90/148](.ps).
%A L. Allison
%A C. N. Yee
%T Minimum message length encoding and the comparison of macromolecules
%J Bulletin of Mathematical Biology
%V 52
%N 3
%M MAY
%D 1990
%P 431-453
%O TR 89/126 Computer Science, Monash University, MAY '89
%K LAllison, Monash, jrnl, minimum message length encoding, MML, c1990, c199x,
c19xx, MDL, ML, inductive inference, II, minimum description length, DNA,
approximate, alignment, similarity, homology, LCS, LCSS, pattern matching,
string, sequence, comparison, Bayesian, Hidden Markov model, HMM,
mutual information, MolBio, bioinformatics, BMB, TR 89/126, 89 126, TR89/126
%X "A method of inductive inference known as minimum message length encoding is
applied to string comparison in molecular biology. The question of whether or
not two strings are related and, if so, of how they are related and the
problem of finding a good theory of string mutation are treated as inductive
inference problems. The method allows the posterior odds-ratio of two string
alignments or of two models of string mutation to be computed. The
connection between models of mutation and existing string alignment
algorithms is made explicit. A fast minimum message length alignment
algorithm is also described."
[more],
[doi:10.1007/BF02458580]['06].
%A L. Allison
%A C. Y. Yee
%T Restriction site mapping is in separation theory
%J Comp. Appl. BioSci. (J. Bioinformatics)
%V 4
%N 1
%P 97-101
%M JAN
%D 1988
%K LAllison, Monash, jrnl, CABIOS, plasmid, map, maps, mapping, combinatorial,
enzyme, restriction site, DNA, phage, RSM, MolBio, separation theory,
linear, constraint, programming, algorithm, inequality, c1988, c198x, c19xx,
J. Bioinformatics, NAR, Nucl. Acids Res. special issue
%X Paper examines constraint checking during backtracking generation
of solutions to the plasmid mapping problem.
The constraints fall into Vaughn Pratt's separation theory.
It is necessary and sufficient to check all simple cycles
in the graph of restriction sites linked by fragments
from the two single digests and the one double digest (can be generalized
to more than two enzymes).
In general, at level 'n' there are at most n new constraints to check.
As an efficiency matter, only those cycles containing 1, 2 or 3
indivisible cycles can be checked which allows a very few false maps to
be generated. This reduces total constraints/map from n*n to 3*n.
-- [more],
[doi:10.1093/comjnl/33.5.460]['11],
[paper (HTML)],
[paper (.ps)].
[jrnl is now J. Bioinformatics].
%A L. Allison
%A T. I. Dix
%T A bit-string longest-common-subsequence algorithm
%J IPL
%V 23
%M DEC
%D 1986
%P 305-310
%K LAllison, TIDix, Monash, UWA, jrnl, MolBio, LCS, LCSS, c1986, c198x, c19xx,
fast, algorithm, bits string, bit vector, DPA, dynamic programming algorithm,
similarity, homology, bioinformatics, IPL, sequence, approximate, match,
matching, strings, distance, comparison, alignment, practical, speedup,
speed
%X "A longest-common-subsequence algorithm is described which operates in terms
of bit or bit-string operations. It offers a speedup of the order of the
word-length on a conventional computer."
Speedup is ~ wordlength (eg. a factor of 32 or 64), time is O(n^2/wordlength).
[more]+source code,
[HTML],
[.ps],
[doi:10.1016/0020-0190(86)90091-8]['07].