

Bayesian Posterior Comprehension

Suppose we wish to construct an epitome, or brief summary, of a posterior distribution that could be used as a substitute for the full posterior distribution in all subsequent analyses. A general set of properties that we might reasonably expect from such an epitome is the facilitation of:

  1. Point estimation.
  2. Human comprehension (i.e., human insight and understanding).
  3. Approximation of posterior expectations.

In this paper we will refer to these properties as the properties of Bayesian Posterior Comprehension (BPC). For an epitome with BPC properties to be of any use, it must contain as much information about the posterior distribution as possible. We note that we have ruled out choosing the epitome criterion on a case-by-case basis by requiring that the epitome be suitable for all subsequent analyses. Otherwise we would choose the epitome criterion with minimum expected loss given time, computation and other constraints.

An epitome could take many forms, so we must first settle representational issues. Three alternative representations that could be considered are:

  1. Approximate the posterior distribution by fitting a parametric distribution to it.
  2. Sample from the posterior distribution - the sample is then the epitome.
  3. Choose a small weighted subset of the parameter space where the weights somehow represent the goodness of each estimate.

For many posterior distributions, the first representational option could be the most succinct and the most easily interpreted by a human operator. However, for other, more complicated posterior distributions, such as that of a non-trivial change-point problem, the fitted parametric approximation would itself be quite complicated and difficult for a human to comprehend. This would violate the second property of BPC (human comprehension). Facilitating the third property of BPC (approximation of posterior expectations) may also be difficult.

The second representational option (sampling) is now a routine part of Bayesian inference due to Markov Chain Monte Carlo methods (Gilks, Richardson, and Spiegelhalter, 1996), but a sample is not as succinct a representation as we would like. This affects the human comprehension property and also the computation time required to approximate posterior expectations.

The third representation is attractive because, if the set of estimates and weights is chosen correctly, it can fulfil the requirements of BPC. We will require that the weight assigned to each estimate somehow corresponds to the posterior probability associated with that estimate. Therefore we seek a weighted subset of the parameter space:

$\displaystyle \varepsilon = \{ (\theta_1,w_1),...,(\theta_K,w_K) \}$ (1)

where the $ \theta_i$ are associated with good posterior probability mass and their weights represent their goodness as estimates (a function of their probability mass). Such an epitome can be used for point estimation - if we are interested in inferring the single best model then we can use the $ \theta_i$ with the greatest weight. If the size of the set is small (i.e., $ K$ is small) then it can be used for posterior comprehension, as a human could inspect the set of estimates and their weights to gain an understanding, and an overview, of the posterior distribution. Posterior expectations could be approximated by normalising the weights and treating the set as a distribution.
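As a concrete illustration of how such an epitome might be used, the following sketch (our own, with made-up estimates and weights rather than anything from the paper) performs point estimation and approximates a posterior expectation from a small weighted set of the form of equation (1):

  # A small weighted epitome; the estimates and weights below are
  # illustrative placeholders only.
  epitome = [
      (0.31, 0.9),   # (theta_i, w_i)
      (0.58, 2.4),
      (0.74, 1.1),
  ]

  # Point estimation: take the estimate with the greatest weight.
  best_theta, _ = max(epitome, key=lambda pair: pair[1])

  # Approximate a posterior expectation E[g(theta)] by normalising the
  # weights and treating the set as a discrete distribution.
  total_weight = sum(w for _, w in epitome)

  def expectation(g):
      return sum((w / total_weight) * g(theta) for theta, w in epitome)

  approximate_posterior_mean = expectation(lambda theta: theta)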

Choosing a weighted set of estimates having BPC properties is a multi-objective problem. The size of the set will have a significant impact on how the conflicting BPC objectives are satisfied. If the set is too small then approximated posterior expectations will be poor and human comprehension may also suffer. If it is too large then approximated posterior expectations will require more computation time and human comprehension may again suffer.

When the parameter space is a union of subspaces of differing dimension (a variable-dimension posterior), one Bayesian approach to BPC would be to return, as the epitome, the mode of each subspace with a weight equal to the posterior probability of that subspace. This would be less than ideal when the posterior distribution contains a multimodal subspace, since important parts of the posterior may not be represented in the epitome. Another problem with this approach is that the weights can be misleading: it is possible that a subspace containing a large amount of posterior probability also contains a mode that lies in an area of relatively poor posterior probability mass.

An approach that meets some of the requirements of BPC is Occam's Window (OW) (Raftery, Madigan, and Hoeting, 1997; Madigan and Raftery, 1994). The Occam's Window algorithm was devised primarily to allow fast Bayesian Model Averaging. It selects a small set of subspaces from the parameter space using posterior sampling. The strategy is not ideally suited to BPC since, in terms of point estimates, it would suffer from the same problems discussed in the previous paragraph.

The Minimum Message Length (MML) principle, which was briefly discussed in the introduction, can be used for constructing a BPC epitome. MML methods attempt to estimate a codebook - consisting of a countable set of point estimates, $ \theta_i$, and their quasi-prior probabilities, $ p_i$. This set is defined, in information-theoretic terms, as the codebook which minimises the expected length of a special two-part message encoding the point estimate and the data (Wallace and Boulton, 1968; Wallace and Freeman, 1987; Wallace and Dowe, 1999; Wallace and Boulton, 1975). We assume that there exist a sender and a receiver who wish to communicate the observed data over a noiseless coding channel and that they share the codebook. Coding theory tells us that an event with probability $ p$ can be encoded in a message of length $ -\log p$ nits using an ideal Shannon code. So, in theory, the sender can transmit some observed data to the receiver in a two-part message. In the first part, the sender transmits a codeword corresponding to a point estimate from the codebook. This requires a message of length $ -\log p_i$ nits. In the second part, the sender transmits the data encoded using the (already) stated estimate. This requires a message of length $ -\log f(x\vert\theta_i)$ nits, where $ f(.\vert.)$ is the usual statistical likelihood function. Therefore the total message length (MessLen) of the transmission encoding a hypothesis, $ \theta_i$, and the data, $ x$, is

$\displaystyle MessLen(\theta_i,x) = -\log p_i - \log f(x\vert\theta_i)$ (2)

We expect the sender to transmit the data using the estimate from the codebook that has the minimum message length (i.e., $ argmin_{\theta_i} MessLen(\theta_i,x)$).
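To make equation (2) concrete, here is a minimal sketch (our own illustration, not taken from the paper) that computes the two-part message length for each entry of a small, assumed codebook under a Bernoulli likelihood and selects the entry the sender would transmit with:

  import math

  # An assumed codebook: point estimates theta_i with quasi-prior
  # probabilities p_i (these values are purely illustrative).
  codebook = [(0.2, 0.25), (0.5, 0.50), (0.8, 0.25)]

  # Illustrative data: 13 successes observed in 20 Bernoulli trials.
  n, successes = 20, 13

  def log_likelihood(theta):
      # log f(x | theta) for the Bernoulli model
      return successes * math.log(theta) + (n - successes) * math.log(1 - theta)

  def message_length(theta, p):
      # MessLen(theta_i, x) = -log p_i - log f(x | theta_i), in nits
      return -math.log(p) - log_likelihood(theta)

  # The sender uses the codebook entry with the minimum message length.
  best_theta, best_p = min(codebook, key=lambda entry: message_length(*entry))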

In order to minimise the length of these two-part messages on average, we seek the codebook with minimum expected message length. This creates a trade-off between model complexity and goodness of fit. For example, if the number of entries in the codebook is increased then the expected length of the first part of the message increases (but the expected length of the second part, encoding the data, decreases). Decreasing the number of entries in the codebook has the opposite effect.
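The following sketch (a toy illustration under our own assumptions, not the paper's construction) makes the trade-off visible for Bernoulli data with a uniform prior: it evaluates the expected lengths of the two message parts for a coarse and a finer codebook, assigning each possible data set to its maximum-likelihood entry as a simplification of the strictly optimal assignment.

  import math

  n = 20                          # number of Bernoulli trials (toy setting)
  dataspace = range(n + 1)        # possible success counts
  r = [1.0 / (n + 1)] * (n + 1)   # marginal P(x) under a uniform prior on theta

  def log_lik(x, theta):
      return x * math.log(theta) + (n - x) * math.log(1 - theta)

  def expected_part_lengths(thetas):
      # Assign each possible x to its maximum-likelihood codebook entry,
      # set each p_i to the marginal probability that entry i is used,
      # and return the expected lengths of the two message parts in nits.
      assign = [max(range(len(thetas)), key=lambda i: log_lik(x, thetas[i]))
                for x in dataspace]
      p = [sum(r[x] for x in dataspace if assign[x] == i)
           for i in range(len(thetas))]
      part1 = sum(r[x] * -math.log(p[assign[x]]) for x in dataspace)
      part2 = sum(r[x] * -log_lik(x, thetas[assign[x]]) for x in dataspace)
      return part1, part2

  # More entries lengthen the expected first part but shorten the second.
  print(expected_part_lengths([0.25, 0.75]))
  print(expected_part_lengths([0.1, 0.3, 0.5, 0.7, 0.9]))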

To strictly minimise the expected message length one must create a codebook that can be used to encode any data from the dataspace. This is not computationally practical as a general method of inference (see, e.g., Farr and Wallace, 2002). In practice, most MML approximations attempt to estimate only the entries of the codebook that are close to the minimum message length. It is this small, instantaneous codebook that corresponds to an epitome with the BPC properties. The weights in the MML epitome can be calculated by converting from message lengths to (unnormalised) probabilities - i.e., by taking the inverse log (antilog) of the negative message length:

$\displaystyle \varepsilon = \{ (\theta_1,antilog(- MessLen(\theta_1,x))),...,(\theta_K,antilog(- MessLen(\theta_K,x))) \}$ (3)
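In practice the conversion in equation (3) is prone to numerical underflow, since message lengths are often large. A common remedy (our own numerical note, not a step described in the paper) is to subtract the shortest message length before taking the antilog, which only rescales all of the unnormalised weights by a common factor. A minimal sketch, with illustrative estimates and message lengths:

  import math

  # Retained estimates and their message lengths in nits (made-up values).
  estimates = [0.2, 0.5, 0.8]
  message_lengths = [17.3, 14.9, 16.1]

  # antilog(-MessLen) with a common shift for numerical stability; the
  # shift cancels once the weights are normalised.
  shortest = min(message_lengths)
  epitome = [(theta, math.exp(-(ml - shortest)))
             for theta, ml in zip(estimates, message_lengths)]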

We note that this MML epitome is a function of the MML codebook and of the observed data (which enters through the message lengths). In the following sections we describe how to create such codebooks using a recent methodology called Message from Monte Carlo. We also illustrate the method with a variety of examples so that the reader may get a feel for the use of the MML instantaneous codebook.

