Coding 

We are interested in noiseless coding, i.e. in efficient codes that can be used to send data over a communication line that does not introduce noise (errors). Such codes can also be used to compress data for storage purposes. Here `efficiency' refers only to the code itself, not to the amount of time or space required to do the encoding and decoding processes.
The average codeword length (message length), for events v from S, is
DecodingA code is uniquely decodable if every finite string of bits represents at most one sequence of source text. i.e. There is no possibility of two different source texts giving the same coded string. Synchronizable CodesA code is commafree if it is impossible
to get one outofframe codeword in the middle of an encoded string.
In the early days of molecular biology it
was briefly suspected that the genetic code
(DNA^{3}>amino acids)
was commafree, but this turned out not to be the case.
A code has a stagger limit of `k'
if the synchronisation of all windows of length k+1 or greater
is apparent without any more context. The genetic code is very nearly a Gray code,
i.e. an error in the Prefix PropertyAnother desirable property for a code to have is the prefix property: No code word is a prefix of any other code word. A code having the prefix property is uniquely decodable, but not necessarily v.v.. The prefix property allows us to identify the end of the first (and second etc.) code word in a data stream immediately, and is required to be able to decode a stream of data unambiguously without lookahead. While the prefix property seems to be quite strict, some fundamental results show that such codes are all that we need to consider for the purposes of inference. ExampleRecalling the biased four sided dice, P(a)=1/2, P(c)=1/4, P(g)=P(t)=1/8, the following is an efficient code for data generated by the dice:
Codes and EntropySuppose we have events with probabilites p_{1}, p_{2} ..., p_{N} and there is a code for these events with codeword length l_{1}, l_{2} ..., l_{N}, where ∑_{i=1..N}{2^{li}}≤1. The average codeword length will be
Kraft InequalityThere is a prefix code with codewords of lengths
l_{1}, l_{2}, ..., l_{N} iff
Proof (a) code exists => inequality: There must be a maximum codeword length, say max. Consider the full binary tree of depth max with edges labelled (01), e.g. below if max=3 (root at the left):
Every codeword must appear somewhere in the tree and because of the prefix property no other codeword can be a descendant or an ancestor of it in the tree. There are 2^{k} bitstrings of length k. A bitstring of length max, has probability 2^{max}. The probability of a shorter bitstring is the sum of the probabilities of the maxbit strings below it. The sum of the probabilities of the codewords cannot exceed 1.0. Proof (b) inequality => code exits. Sort the lengths so that
l_{1}≤l_{2}≤ ... ≤l_{N}.
TheoremThere exists a prefix code with codewords of length l_{1}, l_{2}, ..., l_{N}, iff there is a uniquely decodable code with these lengths G.F.'s notes give Welsh Codes and Cryptography, OUP, 1988, as a reference. So it is reasonable to insist on the use of prefix codes because if there is any uniquely decodable code then there is a prefix code. Shannon (1948)Noiseless coding theorem for sources without memory: Let S be a source with entropy H.
See G.Farr (1999) p9 for (1). Proof (2):
Huffman (1952)Algorithm for the construction of minimum redundancy codes. Huffman's algorithm
starts with a list of probabilities for events,
p_{1}, p_{2}, .., p_{N}.
It forms a binary tree with these probabilities
at leaves of the tree:
Huffman's algorithm does give an optimal code if one is restricted to transmiting symbols one at a time. It may be possible to do better by grouping symbols together. (Arithmetic coding effectively takes this to the extreme.) (Huffman's algorithm is bottomup  in the initial phase. The other strategy that one might think of is "topdown" but this does not lead to efficient codes in general.) Arithmetic CodingArithmetic coding works by dividing the interval [0, 1),
Note that 1/2 = 0.5_{10} = 0.1_{2}, 1/4 = 0.25_{10} = 0.01_{2}, 1/8 = 0.125_{10} = 0.001_{2}, etc. Langdon (1984)
gives an excellent tutorial introduction to arithmetic coding
[But beware:
fig.3 seems not to correspond to In arithmetic coding, probabilities of events need not be neat powers of 1/2. e.g. he gives a 2symbol example where P(F)=1/8, P(T)=7/8 in position one and P(F)=P(T)=1/2 thereafter. (The probabilities are allowed to vary with position in the sequence, so long as encoder and decoder both know them or can calculate them.) For input "TTFT..." the successive interval widths would ideally be 7/8 (T), 7/16 (T), 7/32 (F), 7/64 (T). In general, the precision required to state an interval width could increase without limit. Arithmetic coding places a limit on the accuracy required; the price is a small loss in coding efficiency and Langdon (1984) states "less than 4%" for the choices in that paper.
Note that the secondlevel partition is 1/4:5/8 not the ideal 7/16:7/16. Successive intervals [C, C+A), of lefthand end `C' and width `A', are encoded by C: "0...", "0...", "0...", "1...", ... . The end of the output is padded out by sufficient zeros. DecodingThe decoder follows a similar set of actions to the encoder which enables the input source string to be recovered. PrecisionThe lefthand end, C, of the current interval is a binary number. The method guarantees that it (and, A, the interval width) can be represented with bounded precision, i.e. of the form 0.000...000{0  1}^{k}, and can be represented by a fixedpoint number 1.{0  1}^{k1} (and exponent). This removes any problems of numerical accuracy. It is required that P(F)≤P(T) and P(F) is approximated by the smallest power of 1/2 no larger than it. P(T) is approximated by "what's left" of the current interval. If P(F)>P(T), and note that both the encoder and decoder must know this, they can switch their roles and carry on as before. CarryOverIt can happen that when additions are made to C, to move the lefthand end of the interval, a carry can propagate. This requires keeping the output string in a FIFO buffer, potentially of arbitrary length. The length of the buffer can be limited (e.g. to 16 bits Langdon (1984)) in a technique called bitstuffing (Langdon & Rissanen 1981) by forcing a `0' out if 16 `1's have been output. The decoder can detect this situation. Multisymbol AlphabetsLarger input symbolsets can be handled either by converting the symbols to binary strings which are then encoded as above, or by more direct extensions of the method to deal with multiple symbols and their individual probabilities. EntropyAs described, arithmetic coding is very efficient
and extensions of it can get arbitrarily close
to the entropy of the data.
In some sense it has "solved" the coding problem,
making it clear that
... probability ... predict ... model ... inference ...  LA 1999
Notes


