
CSE423 Prac #1, 2000.

Due date: Midday Friday, week 6,
CSSE general office - hand in a paper copy of:
(a) the program source,
(b) any instructions that you deem necessary,
(c) test results and discussion.
Email lloyd at csse.monash.edu.au a single source file, yourName.c, being the source code of your program. Running it should be "self explanatory".
Demonstrations will be arranged.
Systems: gcc, Irix, Indy
Marks: 20%

The Lempel-Ziv (LZ) method of data compression spawned many research papers and file compression programs. The basic model is that a sequence consists of a mixture of (i) random symbols (characters) and (ii) repeats from earlier parts of the sequence. It can be instantiated by allocating an extra code for <repeat, , > (in addition to the codes for the symbols proper) with two parameters, the start position of the repeat and its length: <repeat,start,length>. e.g. mississippi can be coded as miss<repeat,2,4>ppi, which may be shorter than coding every character literally.
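To make the token model concrete, here is a minimal decoder sketch in C; the Token type and the lz_decode name are illustrative choices for this handout, not something the assignment prescribes:

```c
#include <string.h>

/* One token of the model: either a literal symbol, or a <repeat,start,length>
   pointing back into the decoded text (start is 1-based, as in the
   mississippi example above). */
typedef struct { char sym; int start, len; } Token;  /* sym == '\0' => repeat */

/* Expand a token sequence into dst (assumed large enough); returns length. */
int lz_decode(const Token *t, int n, char *dst) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (t[i].sym != '\0') {
            dst[out++] = t[i].sym;                    /* a random symbol */
        } else {
            for (int j = 0; j < t[i].len; j++)        /* copy earlier text; */
                dst[out++] = dst[t[i].start - 1 + j]; /* copying char by char */
        }                                             /* makes overlap safe  */
    }
    dst[out] = '\0';
    return out;
}
```

Note that <repeat,2,4> after miss reads position 5, a character produced by the copy itself; copying one character at a time handles such overlapping repeats.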

To implement an LZ scheme, you need to choose a probability for <repeat, , >, and probability distributions for the starting positions of repeats and for the lengths of repeats. For example, you might consider all possible starting positions to be equally likely; alternatively, in some applications "close" repeats might be more probable than "distant" ones.

It was shown that if the data comes from a very wide class of sources, then asymptotically the LZ model approaches the optimal level of compression. A drawback is that the convergence is often slow.

  1. A. Lempel & J. Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, IT-22, pp783-795, Nov. 1976.
  2. J. Ziv. & A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, IT-23, pp337-343, May 1977.

The algorithm for a raw LZ scheme attempts to find the single best way of encoding a string that gives the most compression. A lot of research has gone into efficient algorithms and data structures to create programs (for something like LZ) that run very quickly - a prime consideration for file compression programs. We are not interested in such matters here.

There are generally multiple ways to encode a string:
mississippi   -- i.e. no repeats
miss<repeat,2,4>ppi
missi<repeat,3,3>ppi
. . . etc.
Each way can be considered to be a distinct (exclusive) hypothesis about how the string was created.

By choosing the single most probable hypothesis, an algorithm ignores the probabilities of the alternative hypotheses. It is more appropriate, from the learning and modelling point of view, to add together the probabilities of all the hypotheses. It is actually possible to devise a code that achieves a message length equal to -log2 of this sum.

Task:

Investigate the extra compression obtainable by summing over the probabilities of all (or at least many) hypotheses under the LZ model. Use a variety of data - ascii text, computer program source code (various languages), data files, etc. It is optional to implement an actual encoder and a decoder; it is sufficient to calculate what the code lengths would be, provided that you do not "cheat".

Be careful with numerical accuracy. Probabilities of strings of any non-trivial length will be very small. Logs of probabilities fall in a more reasonable numerical range.

You "may" need a function that gives this effect:
logPlus(msgLen1, msgLen2)   =   -log2( 2^-msgLen1 + 2^-msgLen2 )
being careful of numerical accuracy of course; you will not be very successful if you naively implement the above formula. We can provide an efficient numerical approximation.

We are not mainly concerned with the speed of your program. It should be practical to use on strings of at least several thousand characters and, all other things being equal, you should certainly try for a fast program, but we are not very interested in the "constant".

The exercise is not to produce the world's best text compression program! It is to investigate the extra compression, if any, obtainable by the specified variation on the basic LZ model.

Example

Human Tissue Plasminogen Activator Gene, positions 24321-24448 (128 bp).
TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG
ATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATG
ATAGACAGATTGATAGATGATAGAGAGA
used as an example in
A. Milosavljevic and J. Jurka. Discovering simple DNA sequences by the algorithmic significance method. CABIOS 9(4) pp407-411, 1993. [I have a copy.]
The composition of the sequence is highly biased (only two Cs), and there are lots of repeated elements, e.g. 9 x TAGAT. Their paper sketches a dynamic programming algorithm (p409) to find a single best hypothesis for a sequence under an LZ-like model (compression scheme), for the case that the code length of <repeat,?,?> is a constant.

July 2000, L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168