yourName.c
, being the source code of your program.
Running it to be "self explanatory".
The Lempel-Ziv (LZ) method of data compression
spawned many research papers and file compression programs.
The basic model is that a sequence consists of a mixture of
To implement an LZ scheme, you need to choose a probability for <repeat, , >, and probability distributions for the starting positions of repeats and for the lengths of repeats. e.g. You might consider that all possible starting positions are equally likely, or e.g. It might be the case that "close" repeats are more probable than "distant" ones in some applications.
It was shown that if the data comes from a very wide class of sources, then asymptotically the LZ model approaches the optimal level of compression. A drawback is that the convergence is often slow.
The algorithm for a raw LZ scheme attempts to find the single best way of encoding a string that gives the most compression. A lot of research has gone into efficient algorithms and data structures to create programs (for something like LZ) that run very quickly - a prime consideration for file compression programs. We are not interested in such matters here.
By choosing the single most probable hypothesis, an algorithm is ignoring the probability of the alternative hypotheses. It is more appropriate from the learning and modelling point of view to add together the probabilities of all the hypotheses. It is actually possible to devise a code that achieves a message length being the -log2 of this sum.
Investigate the extra compression obtainable by summing over the probabilities of all (or at least many) hypotheses under the LZ model. Use a variety of data - ascii text, computer program source code (various languages), data files, etc.. It is optional to implement an actual encoder and a decoder; it is sufficient to calculate what the code lengths would be, provided that you do not "cheat".
Be careful with numerical accuracy. Probabilities of strings of any non-trivial length will be very small. Logs of probabilities fall in a more reasonable numerical range.
We are not mainly concerned with the speed of your program. It should be practical to use it on strings of at least several thousand characters and, all other things being equal, you should certainly try for a fast program, but we are not very interested in the "constant".
The exercise is not to produce the world's best text compression program! It is to investigate the extra compression, if any, obtainable by the specified variation on the basic LZ model.
TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG ATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATG ATAGACAGATTGATAGATGATAGAGAGA
C
s),
and there are lots of repeated elements,
TAGAT