^CSE454^   actual: [2002] [2003] [2004]

Sample Questions 2002

This page is a work in progress. It is a collection of possible questions for the subject. The exact details of what is covered from year to year may (will) vary slightly.

• 3-points => 1.5 hours.
• Closed book.
• No calculators.
• These [tables (click)] give sufficient accuracy for any numerical caluclations.

1. This is a template for a possible question on the KL distance.
• Define the `Kullback Leibler (KL) distance' of one probability distribution, Q, form another probability distribution R.
• Give an interpretatioan in terms of coding and compression of what the KL distance represents.
• Calculate (i) the KL distance of Q from R, and (ii) the KL distance of R from Q,   <when Q and R are specified as two of the following>.  - H T Coin1 1/2 1/2 Coin2 3/4 1/4
or
 - A C G T DNA1 1/4 1/4 1/4 1/4 DNA2 1/2 1/4 1/8 1/8 DNA3 1/4 1/8 1/8 1/2 DNA4 1/8 1/8 1/2 1/4 DNA5 1/8 1/2 1/4 1/8 DNA6 1/8 1/8 1/4 1/2
or
 - 1 2 3 4 5 6 Dice1 1/4 1/4 1/8 1/8 1/8 1/8 Dice2 1/8 1/8 1/8 1/8 1/4 1/4 Dice3 1/4 1/8 1/8 1/8 1/8 1/4 Dice4 1/4 1/4 1/4 1/8 1/16 1/16

2. A four-sided dice has its faces labelled `A', `B', `C', and `D'. It is rolled 200 times and the following frequencies are counted:

```   A: 25,  B: 25,  C: 50,  D: 100
```

(a) Define the term `entropy' of a distribution.

(b) What entropy would you infer for the dice?

(c) Define the term `Kullback-Leibler (KL) distance' of two distributions.

(d) There was a mix-up during the experiment and the experimenter copied down these incorrect frequencies:

```   A: 100,  B: 50,  C: 25,  D:25
```

What is the KL distance of the distribution inferred by the experimenter from the correct distribution inferred from (b)?

3. This question is about multi-state distributions and hypothesis testing.

(a) Briefly describe the three "obvious" ways of transmitting a sequence of observations drawn from a multi-state distribution whose parameters are not known in advance.

(b) What is the relationship between the message lengths of the three methods of transmitting the observations? (You do not have to prove it.)

(c) J' claims to be able to read other peoples' minds. A test is arranged where D' tosses and observes a coin which J' cannot see. J' must read D's mind and state whether D' sees a head or tail:

results 1 2 3 4 5 6 7 8 9 10
D' sees: H T H T H H H T H H
J' reports: H T T T T H H T T H
? J' correct ? y y n y n y y y n y
Describe how minimum message length (MML) can be used to test J's claim.

(d) Estimate the message lengths, in bits, for the data under the two hypotheses (i) that J' can read minds to some significant extent and (ii) that J' cannot read minds.  What can be concluded? Discuss the results.

4. You have some data on shooting attempts by a basket-ball player over a season. The data consists of a sequence of 0's and 1's, 1 indicating a basket (success), 0 indicating a miss. It is assumed that the probability of scoring a basket is independent of the outcome of other shots. The coach believes in the `hot hands' hypothesis, i.e. that the player may shoot well during some parts of the season with a high probability of success, and shoots less well at other times. The trainer, on the other hand, does not believe in this hypothesis but believes that any variation is just caused by luck and chance.

(a) In general terms, how could this difference of opinion between the coach and the trainer be examined using MML?

(b) Formulate each hypothesis in MML terms.

5. This question is about inference by Minimum Message Length (MML).

The bureau of meteorology has records of the total annual rain-fall at Darwin, Cairns, Sydney, and Perth, for the last 200 years.
H1:  One hypothesis is that the rain-fall at a location is well modelled by a normal distribution, N(mu,sigma), where mu and sigma depend only on the city.
H2:  A second hypothesis is that there are two or more "kinds" of year, and that the rain-fall at a location is well modelled by a normal distribution, N(mu,sigma), where mu and sigma depend on the city and on the kind of the year. Further, the kind of a year is independent of the kind of the previous year and of the next year.
H3:  A third hypothesis is as for H2, except that the probability distribution of the "kind" of a year can depend on the kind of the previous year.

(a) Describe how each of the three hypotheses can be encoded so as to allow their posterior probabilities to be calculated using MML.

(b) What parts of the MML "tool kit" are relevant to calculating their message lengths? (Why and how.)

(c) Describe the outline of algorithm(s) for fitting the hypotheses to the data and for calculating their message lengths.

6. (a) Briefly describe the three "obvious" ways of transmitting a sequence of observations drawn from a multi-state distribution whose parameters are not known in advance.

(b) What is the relationship between the message lengths of the three methods of transmitting the observations? (Do not prove the result.)

(c) The chairman of the Geethorn Football Club has called an emergency meeting after 16 games of the football season to discuss firing the coach because of the terrible results. The coach maintains that although the overall results for the season to date are mediocre, there has been an improvement from game 9 onwards when Fido, the club mascot who sometimes bit players on days near a full-moon, ran away. The chairman maintains that this is a lunatic suggestion and that any apparent change is merely due to chance and statistical variation, and that there is no reason to expect the rest of the season to be other than terrible.

 Game: Result:Win/Loss 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 L W L L W L L L W W L W L L W L
Describe how MML can be used to decide which opinion is more plausible.

(d) Estimate total message lengths for the data under the two hypotheses (chairman v coach), stating any assumptions that you make. Briefly explain the reasoning behind your estimates. You may find information in the appendix helpful in estimating message lengths.

7. This question is about Probabilistic Finite State Automata (PFSAs), also known as Hidden Markov Models (HMM). Below are seven sentences from a language L over the alphabet `{A,B,C}`. The ``/`' character denotes `end of sentence'. (The example comes from Gaines, B. R. (1976) Behaviour Structure Transformations under Uncertainty Int. J. Man-Machine Studies, 8, pp337-365.)

`CAAAB/BBAAB/CAAB/BBAB/CAB/BBB/CB/`

(a) Define an efficient way of encoding a PFSA.

(b) Define an efficient way of encoding the data given a PFSA.

(c) Describe how MML and PFSAs can be used to model the sentences in languages such as L. How can MML be used to identify the best such model for L?

(d) Below are two PFSAs.
Note that for simplicity, transitions that do not occur on the above data have not been shown in the 2nd PFSA; define a sensible convention for dealing with such transitions.
Estimate the message lengths of the above data under the two PFSAs below. Give the message lengths of the PFSAs, the data given each PFSA, and the totals. Briefly explain the reasoning behind your estimates. NB. The information in the appendix may be useful.

8. You are the honours-year coordinator in CSSE, concerned with the possibility that some of the subjects taken by students are inherently more `difficult' (or are marked more severely) than other subjects and that some subjects (e.g. CSE454) are easier (or marked more leniently). You believe, as a first approximation, that each student has a single basic `ability', that largely determines how well the student would perform in a subject of a given difficulty. You also believe that a subject has a single inherent `difficulty', and that a student's mark in a subject is mostly determined by the student's ability and the subject's difficulty.

You would like to adjust the raw marks of subjects to compensate for their difficulty. The adjusted marks for a subject should reflect the abilities of the students that took it, and a student's marks should reflect his or her ability.

Note that students do not take all subjects; it might even be possible to find two students who take no subject in common. It is also possible that one or more subjects are taken mainly by the stronger students and that others may be taken mainly by the weaker students, so it is not a good idea to force all subjects to have the same mean and variance.

(a) In general terms only, how could this be formulated as an MML problem?

(b) What is the `null hypothesis' for this problem?

(c) Give examples of simple and complex hypotheses in your framework.

9. This question is about coding.

Prove the Kraft Inequality :-
Given a set of N possible events {e1, e2, ..., eN}, there is a prefix code for the set with code-words of lengths L1, L2, ..., LN if and only if the sum over i=1..N of 2-Li is less than or equal to 1.

10. (a) What do we mean by `efficient' in `efficient code'?

(b) Describe an efficient, universal code for positive integers (>=0), i.e. where large integers are reasonably likely.

(c) Show how to encode the integer 18 in your code.

(d) Show the steps by which the encoding from part (c) can be decoded to recover 18.

11. The Lempel Ziv (LZ) model (1976) of strings of characters over a finite alphabet is based on the idea that a string is a mixture of "random" characters and repeated substrings. A repeat must be an exact copy of a substring that begins earlier in the string (NB. "exact").

(a) Describe how the LZ model could be formulated as an MML inference problem.

(b) Give two good, but not necessarily optimal, explanations of the string "miss sissy of mississippi" under the LZ model. Show how such an explanation could be costed in message length terms. (It is not necessary to give a message length as a number, but show how such a number could be calculated.)

(c) Describe a different way of encoding a string which would sum over all explanations of the string under the LZ model.

(d) Briefly, what differences, if any, would you expect between inferences based on single explanations as in (b) and those based on all explanations as in (c)?

12. Consider decision (classification) trees with four binary "input" attributes Att1, Att2, Att3, Att4, and a fifth binary attribute `class' which is to be predicted.

(a) Devise code words for the following tree topological structures:

Now consider the following two trees based on the above topologies:

 e.g. Att1=0:   class=0   151 times,   class=1   149 times. Att1=1:   class=0   199 times,   class=1   151 times.

(b) For each leaf node of each of the two trees, state the class predicted and give a rough estimate of the probability of that class.

(c) Cost the above two trees on the same data set. Do this by stating all parts of the message. For each tree in turn, cost those parts of the message that you are able to cost (refer to the tables provided) and outline how to cost any other part(s) of the message.

(d) State which tree is a better explanation of the data, and why?

Also see [2002].

Appendix

Include the [tables (click)]

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.
Created with "vi (Linux & IRIX)",   charset=iso-8859-1