
# Information Content Using Different Contexts

1. e.g.1, e.g.2,
2. entropy,
3. arithmetic coding,
4. multinomial distribution,
5. LZ / ZL,
6. PPM prediction by partial match,
7. with mismatches,
8. the approximate repeats model (ARM),
9. the eXpert Model (XM),
10. contexts and difference plots (HBV), and
11. references.
users.monash.edu/~lloyd/Seminars/2007-Information/index.shtml
Why compress biological sequences?

Not to save disc space (at least not yet[*]) --

savings are under ~20% for DNA, maybe under ~10% for protein.

The reason is the strong connection of compression to modelling:

Good compression <=> good model.

To discover pattern, structure, hypotheses.

([*]Although when several sequences are close relatives, then compression can give significant space savings.)

E.g.1, assuming

pr(A) = pr(C) = pr(G) = pr(T) = 1/4,

- log2(1/4) = +log2(4) = 2 bits

A -> 00, C -> 01, G -> 10, T -> 11 (binary) is optimal.

Expected code length = 1/4×2 + 1/4×2 + 1/4×2 + 1/4×2 = 2 bits/symbol

E.g.2, assuming

pr(A) = 1/2; pr(C) = 1/4; pr(G) = pr(T) = 1/8,

(Note 1/2 + 1/4 + 1/8 + 1/8 = 1)

- log2(1/2) = 1; - log2(1/4) = 2; - log2(1/8) = 3

A -> 0, C -> 10, G -> 110, T -> 111 (binary) is optimal.

Expected code length = 1/2×1 + 1/4×2 + 1/8×3 + 1/8×3 = 1 3/4 bits/symbol
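The arithmetic in e.g.1 and e.g.2 can be checked in a few lines of Python (a sketch: the expected length is just ∑ pr(x) × (−log2 pr(x))):

```python
from math import log2

# Expected code length, bits/symbol, when each symbol x gets a
# codeword of length -log2(pr(x)) bits (as in e.g.1 and e.g.2).
def expected_length(pr):
    return sum(p * -log2(p) for p in pr.values())

eg1 = {'A': 1/4, 'C': 1/4, 'G': 1/4, 'T': 1/4}
eg2 = {'A': 1/2, 'C': 1/4, 'G': 1/8, 'T': 1/8}

print(expected_length(eg1))  # 2.0  bits/symbol
print(expected_length(eg2))  # 1.75 bits/symbol
```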

Entropy

H = − E_x[log pr(x)] = ∑_{x∈X} − pr(x)·log pr(x),

which depends on pr( ) . . . Do you have the true model?

In sequences, pr( ) can be context dependent, e.g., pr(x_i | x_1..i-1).

The entropy gives a lower limit for compression per symbol -- can't beat it on average.

BTW, the Kullback-Leibler divergence,
KL(pr1(), pr2()) = ∑_{x∈X} pr1(x)·{ log pr1(x) − log pr2(x) } ≥ 0.
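KL is exactly the expected extra bits paid when data from pr1 is coded with lengths matched to pr2; a small sketch, reusing the two example distributions:

```python
from math import log2

# KL(pr1 || pr2): expected extra bits/symbol when data from pr1 is
# coded with lengths -log2 pr2(x) instead of the matched -log2 pr1(x).
def kl(pr1, pr2):
    return sum(pr1[x] * (log2(pr1[x]) - log2(pr2[x])) for x in pr1)

uniform = {'A': 1/4, 'C': 1/4, 'G': 1/4, 'T': 1/4}
skewed  = {'A': 1/2, 'C': 1/4, 'G': 1/8, 'T': 1/8}

print(kl(skewed, skewed))   # 0.0  -- no penalty with the true model
print(kl(skewed, uniform))  # 0.25 bits/symbol, and always >= 0
```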
. . .
. . . entropy

note, "the" entropy is essentially unknowable

unless you fix on one model for some reason

(but what is the "true" model for real data?).

Actual compression also depends on what is known (context, background knowledge, common knowledge)

but entropy is still a useful abstract concept.

. . .
. . . entropy, & coding v. stats.

Q: So if code_length = - log( probability )  (& probability = e^(- code_length)),  is that "all"?

A: yes and no:

Coding keeps us "honest":
every parameter (unless common knowledge) must be transmitted (at optimum precision).

There are many useful techniques & heuristics from file compression that do not correspond to any obvious "classical" statistical model.

Arithmetic coding

can get arbitrarily close to the theoretical lower limit
(the entropy of the data)
in bits per symbol, e.g., [Lan84],

provided it is given a good model of the data.

Hence
"Compression = Modelling + Coding",

and coding is now "solved", so . . .

a data compressor is as good, or bad, as its model.

NB. AC is capable of coding a symbol in fractions of a bit!
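A sketch of the modelling half (this is not [Lan84]'s coder): an adaptive order-0 model assigns each symbol a probability from the counts seen so far, and an arithmetic coder would then spend exactly −log2 pr(x) bits on it, so a symbol can indeed cost a fraction of a bit:

```python
from math import log2

# Total cost, in bits, of coding seq with an adaptive order-0 model
# (counts start at 1 for each symbol; an arithmetic coder achieves
# -log2 pr(x) bits per symbol to within a tiny overhead).
def adaptive_cost(seq, alphabet="ACGT"):
    counts = {a: 1 for a in alphabet}
    bits = 0.0
    for x in seq:
        bits += -log2(counts[x] / sum(counts.values()))
        counts[x] += 1
    return bits

# the 8th 'A' costs -log2(8/11), about 0.46 bits -- a fraction of a bit
print(adaptive_cost("AAAAAAAA"))  # about 7.37 bits in total
```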

Multinomial distribution

e.g., α = {H,T}, or {A,C,G,T}, or {amino-acids}, etc.

1. combinatorial code,
2. adaptive code,
3. θ + data|θ.

code1 = code2
= log((|data|+|α|−1)!) − log((|α|−1)!) − log((#α_1)!) − ... − log((#α_|α|)!),

code3 = code1 + (|α|-1) × a fraction of a bit per parameter,
provided θ is stated to optimum precision [BW69],

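The equality code1 = code2 can be verified numerically (a sketch: `lfact2` is log2 of a factorial via `lgamma`, and the adaptive code uses the +1 "Laplace" estimator):

```python
from collections import Counter
from math import lgamma, log, log2

def lfact2(n):  # log2(n!)
    return lgamma(n + 1) / log(2)

# code1: combinatorial code for the multiset of symbols in seq
def combinatorial_bits(seq, alphabet="ACGT"):
    n, a, c = len(seq), len(alphabet), Counter(seq)
    return lfact2(n + a - 1) - lfact2(a - 1) - sum(lfact2(c[s]) for s in alphabet)

# code2: adaptive code, pr(x) = (count(x) + 1) / (n_so_far + |alphabet|)
def adaptive_bits(seq, alphabet="ACGT"):
    counts, bits = {s: 0 for s in alphabet}, 0.0
    for x in seq:
        bits += -log2((counts[x] + 1) / (sum(counts.values()) + len(alphabet)))
        counts[x] += 1
    return bits

seq = "ACGTAACCGGTTAAA"
print(combinatorial_bits(seq), adaptive_bits(seq))  # equal, up to rounding
```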

. . .
. . . multinomial

NB. DNA
order_0:   3 parameters,
order_1: 12 parameters  (four × 4-nomials),
etc.

and protein
order_0:   19 parameters,
order_1: 380 parameters  (20 × 20-nomials)
etc.

unless some structure (restriction) is imposed on models.
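The parameter counts above follow from |α|^k contexts, each a |α|-nomial with |α|−1 free parameters:

```python
# Free parameters of an order-k Markov model over an alphabet of size a:
# a**k contexts, each an a-nomial distribution with a-1 free parameters.
def n_params(a, k):
    return a**k * (a - 1)

print(n_params(4, 0))   # 3    DNA, order 0
print(n_params(4, 1))   # 12   DNA, order 1
print(n_params(20, 0))  # 19   protein, order 0
print(n_params(20, 1))  # 380  protein, order 1
```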

LZ [LZ76, ZL77]

very influential in text (file) compression,

two approaches:

build a header (dictionary) of repeated sub-strings,
replace each instance with a pointer-code into dictionary,

or
replace a repeated instance with
<escape code, pointer to 1st instance, length>.
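A minimal sketch of the second approach (greedy longest match; real LZ coders bound the search window and code the pointers and escapes compactly, which this toy parse does not):

```python
# Greedy LZ-style parse: at each position, find the longest match that
# starts earlier in s; emit (start_of_1st_instance, length), else a literal.
def lz_parse(s, min_len=3):
    out, i = [], 0
    while i < len(s):
        best_pos, best_len = -1, 0
        for j in range(i):
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len >= min_len:
            out.append((best_pos, best_len))
            i += best_len
        else:
            out.append(s[i])
            i += 1
    return out

print(lz_parse("ACGTACGTACGTTT"))  # ['A', 'C', 'G', 'T', (0, 8), 'T', 'T']
```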

And a nice quote:
"Roughly speaking, the complexity of a finite fully specified sequence is a measure of the extent to which the given sequence resembles a random one." --p.75 [LZ76]

. . .

. . . PPM [CW84]

is an LZ-inspired text compression algorithm which

builds a tree of contexts ("frequent" substrings). It consists of methods for

growing the tree,

gathering stats. per context, and

weighting predictions from short frequent contexts v. long infrequent ones.
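A toy sketch of that last step, weighting short frequent contexts against long infrequent ones (this is not [CW84]'s escape mechanism; the frequency-proportional weights are an assumption for illustration):

```python
from collections import defaultdict

def train(seq, max_order=3):
    # counts[context][next_symbol], for every context up to max_order
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq)):
        for k in range(max_order + 1):
            if i >= k:
                counts[seq[i-k:i]][seq[i]] += 1
    return counts

def predict(counts, context, x, alphabet="ACGT"):
    # blend +1-smoothed estimates from each suffix of the context,
    # weighting each context by how often it has been seen
    p, weight_sum = 0.0, 0.0
    for k in range(len(context) + 1):
        c = counts.get(context[len(context)-k:])
        if c is None:
            continue
        total = sum(c.values())
        p += total * (c[x] + 1) / (total + len(alphabet))
        weight_sum += total
    return p / weight_sum

model = train("ACGACGACGACG")
print(predict(model, "ACG", "A"))  # ~0.42, above the order-0 estimate 5/16
```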

. . .

. . . LZ & PPM, with mismatches [LY96] for Biol.

Repeats are exact in LZ and PPM.

[LY96] allowed some mismatches against a context;

must descend 4× as many subtrees per mismatch per position,

must blend predictions from exact v. 1-mismatch v. 2-mismatches v. ... etc.,

as well as predictions from long v. short contexts which

is controlled by "several" parameters, to be estimated.

The Approximate Repeats Model (ARM) [AED98]

(or similar)
. . .
. . . ARM

ARM is simple (has few parameters),

its parameters relate directly to useful concepts (frequency, length, fidelity),

inspired by Ziv-Lempel [LZ76, ZL77]: copies, but now with mutations,

can think of ARM as

~ some base_model + some alignment_model.

. . .
. . . ARM

can make an ARM out of potentially any (self-)alignment model, and

any base model --

even another copy of ARM.

(And note that in modelling alignment the generic DPA is extended by adding a compression model parameter, e.g., ARM.)

. . .
. . . ARM

1. "base model" for background properties/statistics;

2. param for frequency of copies,

change- & indel-params for fidelity of copies, and

param for length of copies

(more frequent, more faithful, longer => more compression).

Similarly for 'reverse-complementary' repeats, & can augment to model 'longer but less frequent' repeats, say.

. . .
. . . ARM & an information theory interpretation

as in local alignment (but here against self) an optimal path is a hypothesis,
. . .
. . . ARM

can search for an optimal one of these hypotheses, but

any two such hypotheses are exclusive so,

as in alignment, we can sum all their probabilities,

in O(|s|²)-time.
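Summing the probabilities of exclusive hypotheses is done in −log space; a small sketch (code lengths in bits, so pr = 2^(−len)):

```python
from math import log2

# Message length of "H1 or H2 or ...": -log2 of the summed probabilities,
# computed stably relative to the shortest (most probable) hypothesis.
def sum_hypotheses(bits):
    m = min(bits)
    return m - log2(sum(2.0 ** (m - b) for b in bits))

print(sum_hypotheses([10.0, 10.0]))  # 9.0: two 10-bit hypotheses together = 9 bits
print(sum_hypotheses([10.0, 20.0]))  # just under 10.0: a poor rival adds little
```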

For  l o n g  seqs., ∃ a faster approximation algorithm (& also see the [XM]).
. . .
. . . ARM

[example-R] is [random], |s|=100,

base-line = 200* bits (2 b/sym)

order_0 Markov Model,
hypothesis (H) = 8.2, data (D|H) = 197.4, total = 205.6 bits (2.06 b/sym)

order_1 Markov Model,
H = 21.1, D|H = 191.0 (exc. H), total = 212.1 bits (2.12 b/sym)

ARM,
H = 31.6, D|H = 191.1, total = 222.7 bits (2.23 b/sym)
. . .
*winner
. . . ARM

[example-NR] is [non-random], |s|=128,

base-line = 256 bits (2 b/sym)

order_0 Markov Model,
H = 10, D|H = 211, total = 221 bits (1.73 b/sym)

order_1 Markov Model,
H = 30, D|H = 152.1, total = 182.2 bits (1.42 b/sym)

ARM,
hypothesis (H) = 49.5, data (D|H) = 128.5, total = 178.0* bits (1.39 b/sym)
*winner
(more: [P.falciparum])

The eXpert Model (XM) [CDA07]

A blend of "expert" opinions:

Markov expert(s) for background "local" stats.,

copy experts,

reverse complementary copy experts.

Bayesian blending is based on estimates of posterior probabilities.

. . .
. . . expert model

the key parts of the algorithm:

1. the kinds of individual "experts",

2. blending the expert opinions,

3. proposing and selecting (good) copy experts

crucial to algorithmic efficiency,
too few => poor compression.
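The blending step can be sketched as follows (a toy illustration of the idea, not [CDA07]'s algorithm: each expert's weight is proportional to the probability it assigned to the recent history, i.e. an estimate of its posterior):

```python
from math import prod

# Mix expert predictions; weight each expert by the probability it gave
# to the recent symbols (proportional to its posterior, uniform prior).
def blend(expert_preds, expert_histories):
    weights = [prod(h) for h in expert_histories]
    total = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, expert_preds)) / total
            for s in expert_preds[0]}

markov = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}   # background expert
copy   = {'A': 0.90, 'C': 0.05, 'G': 0.03, 'T': 0.02}   # a copy expert
# the copy expert predicted the last 5 symbols much better, so it dominates:
mixed = blend([markov, copy], [[0.25] * 5, [0.9] * 5])
print(mixed['A'])  # ~0.90
```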
. . . XM on DNA (bits/sym), e.g.,

| BioC | GenC | DNAC | DNAP | CDNA |
|------|------|------|------|------|
| 1.88 | 1.82 | 1.79 | 1.78 | 1.77 |

and XM on a proteome (bits/sym), e.g.,

| CP   | ProtComp | LZ-CTW |
|------|----------|--------|
| 4.15 | 3.94     | 3.95   |

more . . .
. . . XM on DNA, b/sym,   (read this later or see [CDA07])   thanks to Minh Duc Cao,
| Sequence | CHMPXX | CHNTXX | HEHCMVCG | HUMDYSTR | HUMGHCSA | HUMHBB | HUMHDAB | HUMHPRTB | MPOMTCG | MTPACG | VACCG | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length | 121024 | 155844 | 229354 | 33770 | 66495 | 73308 | 58864 | 56737 | 186609 | 100314 | 191737 | - |
| gzip | 2.2818 | 2.3345 | 2.3275 | 2.3618 | 2.0648 | 2.2450 | 2.2389 | 2.2662 | 2.3288 | 2.2919 | 2.2518 | 2.2721 |
| bzip | 2.1218 | 2.1845 | 2.1685 | 2.1802 | 1.7289 | 2.1481 | 2.0678 | 2.0944 | 2.1701 | 2.1225 | 2.0949 | 2.0983 |
| ac-o2 | 1.8364 | 1.9333 | 1.9647 | 1.9235 | 1.9377 | 1.9176 | 1.9422 | 1.9283 | 1.9654 | 1.8723 | 1.9040 | 1.9205 |
| ac-o3 | 1.8425 | 1.9399 | 1.9619 | 1.9446 | 1.9416 | 1.9305 | 1.9466 | 1.9352 | 1.9689 | 1.8761 | 1.9064 | 1.9267 |
| gzip-4 | 1.8635 | 1.9519 | 1.9817 | 1.9473 | 1.7372 | 1.8963 | 1.9141 | 1.9207 | 1.9727 | 1.8827 | 1.8741 | 1.9038 |
| bzip-4 | 1.9667 | 2.0090 | 2.0091 | 2.0678 | 1.8697 | 1.9957 | 1.9921 | 2.0045 | 2.0117 | 1.9847 | 1.9520 | 1.9875 |
| dna2 (MR04) | 1.6733 | 1.6162 | 1.8487 | 1.9326 | 1.3668 | 1.8677 | 1.9036 | 1.9104 | 1.9275 | 1.8696 | 1.7634 | 1.7891 |
| Off-line (AL00) | 1.9022 | 1.9985 | 2.0157 | 2.0682 | 1.5993 | 1.9697 | 1.9740 | 1.9836 | 1.9867 | 1.9155 | 1.9075 | 1.9383 |
| BioCompress (GT94) | 1.6848 | 1.6172 | 1.848 | 1.9262 | 1.3074 | 1.88 | 1.877 | 1.9066 | 1.9378 | 1.8752 | 1.7614 | 1.7837 |
| GenCompress (CKL00) | 1.673 | 1.6146 | 1.847 | 1.9231 | 1.0969 | 1.8204 | 1.8192 | 1.8466 | 1.9058 | 1.8624 | 1.7614 | 1.7428 |
| CTW+LZ (MSI00) | 1.6690 | 1.6129 | 1.8414 | 1.9175 | 1.0972 | 1.8082 | 1.8218 | 1.8433 | 1.9000 | 1.8555 | 1.7616 | 1.7389 |
| DNACompress (CLM02) | 1.6716 | 1.6127 | 1.8492 | 1.9116 | 1.0272 | 1.7897 | 1.7951 | 1.8165 | 1.8920 | 1.8556 | 1.7580 | 1.7254 |
| DnaPack (BF05) | 1.6602 | 1.6103 | 1.8346 | 1.9088 | 1.0390 | 1.7771 | 1.7394 | 1.7886 | 1.8932 | 1.8535 | 1.7583 | 1.7148 |
| CDNA (LY99) | - | 1.65 | - | 1.93 | 0.95 | 1.77 | 1.67 | 1.72 | 1.87 | 1.85 | 1.81 | - |
| GeMNL (KT05) | 1.6617 | 1.6101 | 1.8420 | 1.9085 | 1.0089 | - | 1.7059 | 1.7639 | 1.8822 | 1.8440 | 1.7644 | - |
| eXpert Model (CDA07) | 1.6575 | 1.6086 | 1.8404 | 1.9031 | 0.9845 | 1.7524 | 1.6696 | 1.7378 | 1.8783 | 1.8466 | 1.7639 | 1.6946 |
and on protein, b/sym:
| Sequence | Haemophilus influenzae | Saccharomyces cerevisiae | Methanococcus jannaschii | Homo sapiens | Average |
|---|---|---|---|---|---|
| CP (MW99) | 4.143 | 4.146 | 4.051 | 4.112 | 4.113 |
| ProtComp (HT04) | 4.108 | 3.938 | 4.008 | 3.824 | 3.9695 |
| LZ+CTW (MSI00) | 4.1177 | 3.9514 | 4.0279 | 4.0055 | 4.0256 |
| eXpert Model (CDA07) | 4.1022 | 3.8859 | 4.0002 | 3.7860 | 3.9434 |

Contexts and difference plots [DPA07]

info(S | cntxt1) − info(S | cntxt2),   per element,

shows what new information cntxt2 gives about S, and where;
it ignores "background" already known from S or cntxt1.

Requires a fast compression alg., such as XM,

1-D plots take linear space, feasible for large genomes,

smoothing, zoom, difference, threshold, take linear-time, feasible for large genomes,
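The per-element difference and smoothing steps can be sketched as follows (the info values below are made-up numbers for illustration, not XM output):

```python
# Per-element difference plot: bits saved at each position when the
# extra context is available.
def difference(info_ctx1, info_ctx2):
    return [a - b for a, b in zip(info_ctx1, info_ctx2)]

# Centred moving average, window w (naive O(n*w) here; a running sum
# makes it linear-time, which matters for whole genomes).
def smooth(xs, w=21):
    half = w // 2
    return [sum(xs[max(0, i - half):i + half + 1])
            / (min(len(xs), i + half + 1) - max(0, i - half))
            for i in range(len(xs))]

saving = difference([2.0, 2.0, 1.9, 2.0], [1.9, 1.0, 0.9, 1.8])
print(smooth(saving, w=3))  # peaks where the second context helps most
```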

examples . . .

. . . a small example, hepatitis B [HBV], xm --hs=2
1.98 b/sym   human;
1.36 b/sym   human | woollyMonkey;
1.84 b/sym   human | woodChuck;
1.96 b/sym   human | duck;
[Plots: 0-900, 800-1600, 1600-2400, and 2400-end (~3200); smoothed w=21; top: human HBV; blue = saving | w.monkey; brown = saving | woodchuck; green = saving | duck; note 200-300.]
Thanks, in αβ-ical order to

Julie Bernal, Minh Duc Cao, Trevor Dix, Tim Edgoose, Samira Jaeger, Chris Mears, David Powell, Linda Stern, Chris Wallace, and Chut N. Yee.

References

[AZM02] D. Adjeroh, Y. Zhang, A. Mukherjee, M. Powell, T. Bell, DNA sequence compression using the Burrows-Wheeler transform,
IEEE Computer Soc. Conf. on Bioinformatics, pp.303, 2002.
[AED98] L. Allison, T. Edgoose, T. I. Dix, Compression of Strings with Approximate Repeats,
Intelligent Systems in Molec. Biol. (ISMB'98), pp.8-16, June 1998 [more]
"... a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA ... a faster approximation alg. is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings."
[ASE00] L. Allison, L. Stern, T. Edgoose, T. I. Dix, Sequence complexity for biological sequence analysis,
Comp. Biol. & Chem (was Comp. & Chem.), 24(1), pp.43-55, 2000, doi:10.1016/S0097-8485(00)80006-6 [more].
A model which relates directly to biological concepts and knowledge, and gives good compression of DNA sequences.
[AL00] A. Apostolico, S. Lonardi, Compression of biological sequences by greedy off-line textual substitution,
Proc. IEEE Data Compression Conf., pp.143-152, 2000.
[BF05] B. Behzadi, F. L. Fessant, DNA compression challenge revisited: A dynamic programming approach,
CPM, pp.190-200, 2005.
[BW69] D. M. Boulton, C. S. Wallace, The information content of a multistate distribution,
J. Theor. Biol., 23, pp.269-278, 1969, doi:10.1016/0022-5193(69)90041-1.
(i) Cleared up a long running misconception about the information-content of a sequence of symbols, and (ii) showed that if a code based on estimates of symbol-probabilities is used, those estimates must also be encoded at optimum precision.
[CDA07] M. D. Cao, T. I. Dix, L. Allison, C. Mears, A simple statistical algorithm for biological sequence compression,
IEEE Data Compression Conf. (DCC07), pp.43-52, March 2007, doi:10.1109/DCC.2007.7 [more].
A "panel of experts model" (XM) achieves good compression of DNA and of protein sequences.
[CKL00] X. Chen, S. Kwong, M. Li, A compression algorithm for DNA sequences and its applications in genome comparison,
RECOMB, pp.107, 2000.
[CLM02] X. Chen, M. Li, B. Ma, T. John, DNACompress: Fast and effective DNA sequence compression,
Bioinformatics, 18(12), pp.1696-1698, Dec. 2002.
[CW84] J. G. Cleary, I. H. Witten, Data compression using adaptive coding and partial string matching,
IEEE Trans. on Communications, 32(4), pp.396-402, 1984.
Variable-length matching of the recent past context with the previously transmitted sequence gives predictors for the next symbol. Compresses English text at ~2.2 bits/character. (PPM = Prediction by Partial Matching.)
[DPA07] T. I. Dix, D. R. Powell, L. Allison, J. Bernal, S. Jaeger, L. Stern,
Comparative analysis of long DNA sequences by per element information content using different contexts,
BMC Bioinformatics, 8(S2):S10, May 2007, doi:10.1186/1471-2105-8-S2-S10 [more].
Calculating the information content, per symbol, in different contexts, and taking difference plots, shows what information a context gives about a sequence. These quantities can be computed efficiently for very long sequences, even whole genomes.
[GT93] S. Grumbach, F. Tahi, Compression of DNA sequences,
IEEE Data Compression Conf., pp.340-350, 1993.
[GT94] S. Grumbach, F. Tahi, A new challenge for compression algorithms: Genetic sequences,
J. of Information Proc. and Management, 30(6), pp.875-886, 1994.
[HT04] A. Hategan, I. Tabus, Protein is compressible,
Proc. 6th Nordic Signal Processing Symp. (NORSIG), Espoo, Finland, pp.192-195, June 2004.
Not sure about one of the results, but rekindled interest in compression of protein sequences, and has a catchy title -- see [MW99].
[KT05] G. Korodi, I. Tabus, An efficient normalized maximum likelihood algorithm for DNA sequence compression,
ACM Trans. on Inf. Syst., 23(1), pp.3-34, 2005.
[Lan84] G. G. Langdon, An introduction to arithmetic coding,
IBM J. Res. & Dev., 28(2), pp.135-149, 1984, @ibm.
Arithmetic coding (compression) can get arbitrarily close to the lower limit (the entropy of the data) in bits per symbol -- provided it is given a good model of the data.
[LZ76] A. Lempel, J. Ziv, On the complexity of finite sequences,
IEEE Trans. Inf. Theory, IT-22, pp.783-795, Nov. 1976.
Represents repeated substrings as words in a dictionary. And... "Roughly speaking, the complexity of a finite fully specified sequence is a measure of the extent to which the given sequence resembles a random one." --p.75. Also see [ZL77].
[LY96] D. M. Loewenstern, P. N. Yianilos, Significantly lower entropy estimates for natural DNA sequences,
IEEE Data Compression Conf. (DCC97), pp.151-160, 1997, doi:10.1109/DCC.1997.581998.
Took a text compression algorithm (PPM±) and modified it to allow some mismatches with previous contexts and to form a weighted average of such predictors.
[MR04] G. Manzini, M. Rastero, A simple and fast DNA compressor,
Software Practice and Experience, 34(14), pp.1397-1411, 2004.
[MSI00] T. Matsumoto, K. Sadakane, H. Imai, Biological sequence compression algorithms,
Genome Informatics, 11, pp.43-52, 2000.
[MW99] C. G. Nevill-Manning, I. H. Witten, Protein is incompressible,
IEEE Data Compression Conf., pp.257-266, March 1999, doi:10.1109/DCC.1999.755675.
Put a damper on protein compression for while by suggesting that any compression is due to the "skew" in the use of the alphabet, i.e. a 0-order Markov model does best. (If you search for a copy of their data-set on the web, make sure that you get a valid one.) Also see [HT04] and [CDA07].
[RDD96] E. Rivals, J.-P. Delahaye, M. Dauchet, O. Delgrange, A guaranteed compression scheme for repetitive DNA sequences,
Data Compression Conf., pp.453, 1996.
[Sha48] C. E. Shannon, A mathematical theory of communication,
Bell Syst. Technical Jrnl., 27, pp.379-423 & pp.623-656, 1948.
Foundational work on communication and information theory.
[TKR03] I. Tabus, G. Korodi, J. Rissanen, DNA sequence compression using the normalized maximum likelihood model for discrete regression,
IEEE Data Compression Conf., pp.253, 2003.
[ZL77] J. Ziv, A. Lempel, A universal algorithm for sequential data compression,
IEEE Trans. Inf. Theory, IT-23, pp.337-343, May 1977.
"... encoding future segments of the source-output via maximum-length copying from a buffer containing the recent past output. The transmitted codeword consists of the buffer address [a pointer] and the length of the copied segment. ..." Also see [LZ76].

© Lloyd Allison Faculty of Information Technology (Clayton), Monash University, Australia 3800.