Integer Distributions

home1 home2
and the

   #tree code


also see
More Ints

This section considers positive integers, n≥1, unless otherwise stated. (If you have non-negative integers, n≥0, just add one first.) This is the first, and perhaps the most fundamental, infinite data-space. A code over positive integers can be used to transmit data from any enumerable set.

Geometric Distribution and Unary Code

More on the
[Geometric (click)].

The geometric distribution:

P(n) = (1-p)n-1p,    n>0
where 0<p<1
It models, for example, the number of coin-tosses until a head is thrown, including the head, using a coin where P(head)=p, i.e. P(Tn-1H) = P(T)n-1.P(H).

The geometric distribution is a proper distribution:

(1-p)n-1p = p(1 + q + q2 + q3 +...)
where q=1-p
= p/(1-q) = p/p = 1
The expected value of n is given by
nqn-1p = p(1 + 2q + 3q2 + ...)
= p/(1-q)2 = 1/p

An efficient code can be based on the geometric distribution (if the source comes from such a distribution):

msgLen(n) = -(n-1).log2(1-p) - log2(p)

E.g., choosing a geometric distribution with p=0.5 amounts to the use of a unary code:

integer code word probability
1 0 1/2
2 10 1/4
3 110 1/8
4 1110 1/16


We can describe a geometric distribution by a finite-state machine with states S and S'. The machine starts in state S; it either outputs a `1' and returns to S or outputs a `0' and goes to S':

S >-----(1)-----> S

S >-----(0)-----> S'
The transitions out of state S, which it is natural to label `1' and `0', form the "states" of a 2-state / binomial distribution. (Try not to confuse the states of the machine and of the distribution.)


  • L.Allison, C.S.Wallace and C.N.Yee, Finite-State Models in the Alignment of Macro-Molecules, J.Molec.Evol. 35(1) pp.77-89, 1992, or see [TR90/148].
    The geometric distribution is sometimes the "right" distribution for a problem, but even if not it can still be a good approximation, and the fact that its -log is a linear function of `n' can be convenient algorithmically (fast) -- (probabilistic) finite-state automata (PFSA, FSA, hidden Markov models HMM) "count" in unary. A mixture of two or more geometrics (i.e. piece-wise linear "costs") is still convenient algorithmically and may be a better model for some problems; see fig 9 of [TR90/148].

Poisson Distribution

More on the
[Poisson (click)].

The Poisson distribution with parameter α>0, for n≥0:

P(n) =
e αn
NB. n0. The expectation equals the variance equals α.

P(n) = (α/n) P(n-1), so P(n) increases while n<α and decreases when n>α, i.e. the mode is α; this is also the maximum likelihood estimate of α given observed data `n'.

The Poisson distribution can be derived (e.g. Meyer 1970) as the distribution of the number of particle decays in a radioactive source in unit time where α is the rate, i.e. where the probability of a decay in a small time interval, dt, is α.dt.


We observe a value of `n' (e.g. n decays in unit time). The negative log likelhood, i.e. -log P(n) is

= α - n.log α + log n!
(Recalling Stirling's approximation, loge(n!) = n.loge(n)-n+0.5loge(n)+0.5loge(2pi)+..., we see that the message length goes up roughly in proportion with n.log(n).)

The second derivative with respect to the parameter α is n/α2.
The expectation of this over n, i.e. the Fisher information, is

α/α2 = 1/α
The MML estimate of α is that value that minimises the message length
-log(h(α)) -log(p(n|α)) + 1/2 log F(α) +(-log 12 + 1)/2
of this prior = A.
For the prior h(α) = (1/A).e-α/A, differentiate the msgLen w.r.t. α and set to zero:
d/d α { -log 1/A + α/A   //from h
+ α - n.log α + log n! //from likelihood
- 1/2 log α //from F
+ (-log 12 +1)/2 }  
= 1/A + 1 - n/α - 1/(2.α)
= 1/A + 1 - (n+1/2)/α
= 0
i.e. we make the MML estimate (inference)  α' = (n+1/2) / (1+1/A).
The uncertainty region in the estimate of the parameter is about sqrt(12/F(α')), i.e. sqrt(12 α').

Poisson Process

The Poisson process models, for example, the number of radioactive decays in a given time t:

P(n,t) = 
If we observe data `n' over time `t', the MML estimate of α (Wallace & Dowe 1997) is  α' = (n+1/2)/(t+1/A)


  • C. S. Wallace & D. L. Dowe. MML Mixture Modelling of Multi-state, Poisson, von Mises Circular and Gaussian Distributions. Proc. 6th Int. Workshop on Artificial Intelligence and Statistics, pp.529-536, 1997


A parameter-less probability distribution for positive integers:

P(n) =
This is a proper probability distribution:
 = 1 - 1/2 + 1/2 - 1/3 + 1/3 ...  = 1
So a non-redundant code can be based on it
msgLen(n) = log2(n) + log2(n+1), for n>0

The expectation of the distribution is infinite:

{i/(i.(i+1))} = 


 = ∞

(so perhaps it is not a good distribution to use if you anticipate mostly "small" integers).

integer probability
1 1/2
2 1/6
3 1/12
4 1/20
... ...


Note that   P(n) ~ 1/n   cannot be a proper probability distribution because


 = ∞

Therefore a code with message lengths ~ -log2(1/n) = log2(n)   for all n≥1 is impossible. However, the log* prior (and code) gets close to 1/n.

Log2* and Relatives

If you know that an integer, n, lies in the interval [1,N] (or in [0,N-1]) then it can be encoded in log2(N) bits, (and this is an optimal code if the probability distribution is uniform). What to do when there is no such bound N? Obviously transmit the length of the code word for n first. But how to transmit the length? Transmit its length first, of course! A sound code can in fact be based on this intuitive idea; note that logk(n) decreases very rapidly as k increases.

The leading bit of n is necessarily ``1'' so there is no need to transmit it, except that it can be used as a flag to determine whether the current value is a length or the final value of n proper; lengths are thus given a leading ``0''. Such a prefix code can be used to code integers of arbitrary size. Unfortunately the length of a code word as a function of n is neither convex nor smooth although it is monotonic increasing+.

integer components code word
1 1 1
2 2,2 00 10
3 2,3 00 11
4 2,3,4 00 01 100, e.g., ~ code'(3)++100
5 2,3,5 00 01 101
6 2,3,6 00 01 110
7 2,3,7 00 01 111
8 2,3,4,8 00 01 000 1000, e.g., ~ code'(4)++1000
9 2,3,4,9 00 01 000 1001
10 2,3,4,10 00 01 000 1010
15 2,3,4,15 00 01 000 1111
16 2,3,5,16 00 01 001 10000, e.g., ~ code'(5)++10000
Note, the spaces in the above code words are only there to make the components stand out.
The code above is valid, but not at all efficient. In fact we can do better, i.e. achieve a non-redundant code, by using lengths minus one:
integer components code word prob
1 1 1 1/2
2 1,2 0 10 1/8
3 1,3 0 11
4 1,2,4 0 00 100, e.g., ~ code'(2)++100 1/64
5 1,2,5 0 00 101
6 1,2,6 0 00 110
7 1,2,7 0 00 111
8 1,3,8 0 01 1000, e.g., ~ code'(3)++1000 1/128
9 1,3,9 0 01 1001
10 1,3,10 0 01 1010
15 1,3,15 0 01 1111
16 1,2,4,16 0 00 000 10000, e.g., ~ code'(4)++100000 1/2048
msgLen(n) = 1 + floor(log2 n) + msgLen(floor(log2 n)),   if n>1
    = 1,   if n=1
P(n) = 2-msgLen(n)

The probability distribution P(n) has an infinite expectation: the probability of n is greater than under the 1/(n.(n+1)) distribution (which has an infinite expectation) for large n.

Use the HTML FORM below to encode an integer `n':

n= > >
word= L
 {0..9}+  {0,1}+

Rissanen (1983) gives, r(n),

r(n) = log2*(n) + log2(2.865)
where log2*(n) = log2 n + log2 log2 n + ... NB. +ve terms only
P(n) = 2-r(n)
as a continuous approximation to code-word lengths and advocates 2-r(n) as a "universal prior" for integers. The "2.865" is a normalisation constant to make the distribution (of the approximation) sum to 1.0. Note that r(n) is continuous but it (the area under the curve) is not convex being concave at n=2, 4, 16, 216, ... etc. This can cause problems for some optimisation algorithms (Allison & Yee 1990).


  • P. Elias. Universal Codeword Sets and Representations for the Integers. IEEE Trans. Inform. Theory IT-21 pp.194-203, 1975
    Introduced this kind of code for the integers (- Farr 1999)
  • S. K. Leung-Yan-Cheong & T. M. Cover. Some Equivalences between Shannon Entropy and Kolmogorov Complexity. IEEE Trans. Inform. Theory IT-24 pp.331-338, 1978
    Investigated this kind of code for the integers (- Farr 1999)
  • J. Rissanen. A Universal Prior for Integers and Estimation by Minimum Description Length. Annals of Statistics 11(2) pp.416-431, 1983.
    Advocates the use of the log* distribution and code.
  • + L. Allison & C. N. Yee. Minimum Message Length Encoding and the Comparison of Macromolecules. Bull. Math. Biol. 52(3) pp.431-453, 1990.
  • G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, Monash University, 1999.
    An excellent source on "universal" codes (and other things).

Wallace Tree Code

The definition of (strict) binary trees is:

  1. A leaf is a binary tree.
  2. If `left' and `right' are binary trees then <left,right> is a binary tree.
A code for binary trees (Wallace & Patrick 1993) follows their definition:
  1. The code for a leaf is `0'.
  2. The code for a fork, <left,right>, is `1 code(left) code(right)'.

Code   Tree

0      [leaf]

100            <..>
              .    .
            .        .
       [leaf]        [leaf]

10100              <..>
 ^^               .    .
 ||             .        .
 |Right    [leaf]        <..>
 |                      .    .
 Left                 .        .
                 [leaf]        [leaf]

Note, a code-word always has one more zero than it has ones. This allows the end of a code word to be detected. It also allows the word to be decoded. Note for example that `1' and `10' are not code words.

The code is efficient, that is to say the sum over all words, w, of 2-|w| is one. The code is equivalent to giving a tree with code w a probability of 2-|w|: This would be difficult to prove combinatorially, but consider all infinitely long strings over {0,1}. We make them all equally probable in the sense that if they are truncated after k-bits, then the 2k truncated strings are all equally likely, i.e. of probability 1/2k, for all k. The sum of the probabilities of the truncated strings of an arbitrary length k is clearly 1.

Now 0 (pr=0.5) and 1 (pr=0.5) can be taken as the steps in a random walk. It is known that a one-dimensional random walk returns to the starting point with probability 1.0. Taking the set of infinitely long strings over {0,1}, almost every one of them, i.e. all but a subset having total probability 0.0, will at some point have one more `0' than `1's. Truncate each such string at the first such point. The probability of such a truncated string, w, is 1/2|w|. This also equals the sum of the probabilities of all of its infinite extensions. i.e. the sum of the probabilities of all such code words is 1.0.

This can be used as the basis of a code for positive integers: Enumerate the code words in order of increasing length and within that, for a given length, lexicographically say:

integer code word probability
1 0 1/2
2 100 1/8
3 10100 1/32
4 11000 1/32
5 1010100 1/128
6 1011000
7 1100100
8 1101000
9 1110000
10 101010100 1/512
11 101011000
12 101100100
13 101101000
14 101110000
15 110010100
16 110011000
17 110100100
18 110101000
19 110110000
20 111000100
21 111001000
22 111010000
23 111100000
24 10101010100 1/2048
... ... ...
The first code word of length 2k+3 is 1(01)k00 and the last is 1k+10k+2.

We see that the code-word lengths increase in smaller and more regular jumps than is the case for the log* code.


  • C. S. Wallace & J. D. Patrick. Coding Decision Trees. Machine Learning, 11, pp.7-22, 1993.
    The tree-code is used to code the topology of classification trees (also known as [decision-trees]). This allows simple trees and complex trees to be compared fairly. Other information held in the nodes of the trees is also coded.
  • logk(n) = log(log(...log(n))), k times.
Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

free op. sys.
free office suite
~ free photoshop
web browser

© L. Allison   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Tuesday, 16-Jul-2024 13:54:21 AEST.