^CSE454^ ^Local^ ^Sample-Qs^ & ^MML^

## CSE454 Data Mining and MML 2005

### Computer Science and Software Engineering, Monash University, Australia, CSE454 Data Mining and MML(semester 1), 1th June 2005

• Duration 1.5 hours + 10 minutes reading time.
• Attempt all four (4) questions.
• This paper counts for 60% of the subject's assessment. Maximum possible marks here: 90.
• This is a closed book exam.
• No calculators.
• The attached [tables] give sufficient accuracy for any estimates of probabilities and message lengths. Where appropriate, show a numerical result as a simplified expression and/or as an estimated value, and state any assumptions.
 1. (a) Briefly, define the following as they apply to 'CSE454 Data Mining and MML'. prior probability likelihood posterior probability. [(a) 9 marks] (b) There are four DNA 'bases', {A, C, G, T}. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However, for the parasite Plasmodium falciparum (P.f.), which causes malaria, it is roughly the case that Pr(A) = Pr(T) = 0.4 and Pr(C) = Pr(G) = 0.1. Charles has eight DNA sequences of length ten in a file; exactly one of the sequences comes from P.f.. Diane is interested in one of the sequences, S=AGAAGTACTA, from the file. What is the prior probability that S is from P.f.? What is the likelihood of S for non-P.f. DNA? What is the likelihood of S for P.f. DNA? What is the posterior probability that S is from P.f.? Show working. [(b) 11 marks] [Total: 20 marks]

 2. There are four DNA 'bases', {A, C, G, T}. A and T are complementary to each other, and C and G are complementary to each other. Because genomic DNA is stored in two complementary strands, it is the case that Pr(A)=Pr(T) and Pr(C)=Pr(G) for bulk genomic DNA. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However quite a few parasites have biased composition (that is AT-rich or GC-rich). (Think carefully about how many parameters a zero-order model of genomic DNA has.)   (a) There are three "obvious", efficient ways to encode a univariate data-set over a finite alphabet: An "adaptive" code, and two other methods. What are the two other methods? [(a) 3 marks] (b) How do the message lengths under each of the three methods compare? [(b) 4 marks] (c) You are given a short section of recently sequenced genomic DNA, S=CCCCACGCTT, from one of the strands of a chromosome of some parasite. Show how the adaptive method would operate for S, and thus estimate the information content of S. [(c) 10 marks] (d) From this evidence, what are the odds that the parasite has a biased composition, as opposed to the unequal frequencies of bases in S just being due to chance? Why? State any assumptions. [(d) 3 marks] [Total: 20 marks]

patientb63MC
1 A N
2 A P
3 A N
4 C N
5 C P
6 A N
7 G P
8 A N
9 T P
10 C P
3. You are analysing data on a medical condition (MC) which might have a genetic cause. A simple test is available for a moderately common point-mutation at base 63, b63, of a certain gene. In the general population b63 = A most often. You have data on a number of patients who are positive (P) or negative (N) for MC. You use MML to analyse the data. (Note, of course a real data set would be much larger, and Pr(A), Pr(C), Pr(G) and Pr(T) can all be different for DNA in a gene.)

(a) You model the data as a mixture of one or more classes. Each class is defined as a bivariate model. Each bivariate model consists of a 4-state distribution for b63 and a 2-state distribution for MC. Carefully list everything that must be specified in a message (i) for a one-class "mixture" and the data, and (ii) for a two-class mixture and the data.
[(a) 10 marks]
(b) Estimate the message lengths (the parts and totals) for the best one-class and the best two-class mixtures. Show working. (You may use "total assignment" of data to classes in order to simplify calculations.)
[(b) 10 marks]
[Total: 20 marks]

 4.Consider classification- (decision-) trees over multivariate data with discrete and/or continuous attributes to be tested and a discrete ('class') attribute to be predicted.   (a) Describe an efficient scheme to encode (i) a tree, and (ii) the data given the tree. [(a) 10 marks]   A certain data set has the following attributes: ```@0, Gender = Male | Female @1, Party = Lib | Dem | Lab | Green @2, Seg = U | V | W @3, Age is continuous @4, Z = P | N -- the 'class' to be predicted. ``` Although Age is continuous it just happens that eleven distinct values appear, each several times, in the data set: 19, 24, 26, 32, 37, 45, 48, 57, 65, 69 and 73.     (b) Describe the encoding of the following tree (without data) in reasonable detail, and estimate its message length. Show working. State any assumptions. [(b) 10 marks] (c) The flow of the data through the tree is shown in the following diagram. Estimate the message length of (@4 of) the data in the two deepest leaf nodes only. Show working. [(c) 5 marks] (d) Estimate the change in total message length (both tree and data|tree) if the two deepest leaves, as identified in part (c), and the @0 test-node immediately above them were collapsed to one leaf. Show working. Should the change be made? Why or why not? [(d) 5 marks] [Total: 30 marks]