Lloyd Allison,
School of Computer Science & Software Engineering, FIT,
Seminar's area: Inductive inference,
Research problem: What are statistical models?
I.e. What do they do? What can be done to them? How can you combine two or more of them? And what do you get? How can you program with them? What are the operations, types, classes, semantics?
This talk describes an approach and illustrates it with a casestudy of mixed (hybrid) Bayesian networks which are applied to a challenging data set.
Some of this work was done with or was inspired by Leigh Fitzgibbon, Josh Comley and the late Chris Wallace.
This file is at users.monash.edu/~lloyd/Seminars/2005II/index.shtml and includes links to extra material.The long story: Some of us in CSSE at Monash started to write some generalized machinelearning software  with a GUI to do common, anticipated tasks, a 'scripting language' for unanticipated but not too hard tasks, and an implementation language for major extensions. We soon came to realise that the major design issue was "what was to be the notion of 'type', and/or 'class', of data and of statistical models of data?" Defining this notion was the big problem for the scripting language which really defines the heart of such a system. The GUI would naturally follow the scripting language's lead on types and classes. Almost any mainstream language would do for the implementation language (its ideas of type and/or class might or might not influence the scripting language and GUI). We got a good way using Java for the implementation language and implemented a lambdacalculus based scripting language, but we did not fully answer what is the best notion of type and/or class for this domain?
There is some evidence to suggest that functional programming (FP), which derives from lambda calculus, is a good general paradigm for machine learning (e.g. Lisp is popular in AI, and the 'S' package has a Lispish scripting language). But type checking is often weak and/or done at runtime. Perhaps it is even a trap to design a custom (scripting) language when your main focus is on some application area  will the language be professional and will it develop?
This talk describes an approach to writing machinelearning software using the stock functional language Haskell for both scripting and implementation purposes (no GUI, yet). Haskell has a compiletime parametric polymorphic type system and has typeclasses for adhoc polymorphism (overloading). It has one of the more sophisticated type systems among reasonably well supported programming languages, and type systems are a major research focus of the Haskell community. The runtime parts of the language are sufficient to implement classic, and novel, machine learning algorithms. But the type system is particularly useful for analysing, and answering, "just what is a statistical model from a programming point of view?" Machine learning is a very happy application area for it. Many statistical models are naturally polymorphic and there are useful "high order" operators on estimators and models. FP seems to add generality and lightness to statistical models.
Overfitting is a well known problem in machine learning. It becomes even more troublesome when a statistical model is used as a submodel in new combinations with others. The submodel is very likely to be used outside its comfort zone. It is essential that the complexity of submodels be under good, automatic control to make compositional programming with them practical. Wallace's minimum message length (MML) inference is a compositional criterion and is a good fit with FP.
The talk uses practical examples including Bayesian networks (BNs) with discrete and continuous attributes to illustrate 'inductive programming' through FP and MML.
Given a new problem in general computing,
the probability of having a readymade solution is ~0, ...
programming languages (e.g. Haskell)
exist to ease writing new solutions._{ }
Given a new inductive inference problem,
the probability of having a readymade solution is ~0, ...
<?what?> exists to ease writing new solutions?
users.monash.edu/~lloyd/Seminars/2005II/index.shtml
repeat_{ }{
_{ } new postgrad starts in machine learning;
_{ } devises new model (or variation);
_{ } does some maths (not just a hacker);
_{ } implements search & estimators (practical too!);
_{ } does comparative tests v. other methods (beats them);
_{ } writes thesis, gets degree & leaves.
_{ } }
We are often left with programs that are hard to (re) use

We are given a dataset over 5 attributes, @0...@4:
@0  continuous
@1  Boolean
@2  Boolean
@3  Boolean
@4  continuous.
Bayesian networks (BNs) can "learn & explain" relationships between attributes (variables).
Must infer: Network and, for each node,
(Note, BN is "just" a casestudy in I.P..)
the toy example's Bayesian network
Now, we have some general machinery[14]:
classes 


Model ... 
FunctionModel ... 
TimeSeries ... 

types 
ModelType ... 
FunctionModelType CTreeType ... 
TimeSeriesType ... 

values 
mState Normal ... 
class'n tree ... 
Markov model ... 
N. Friedman and M. Goldszmidt,_{ }
suggested using classification trees in the nodes of a Bayesian network.
Oh good, we just happen [14] to have:
estCTree  :: 










{CTleaf N(1.0,0.41)(+0.1),_,_,_,_}, @0 {CTleaf _,mState[0.5,0.5],_,_,_}, @1 {CTfork @0<>=1.4[ @2@0,@1 {CTleaf _,_,mState[0.99,0.01],_,_}, {CTfork @1=FalseTrue[ {CTleaf _,_,mState[0.98,0.02],_,_}, {CTleaf _,_,mState[0.02,0.98],_,_}]}]}, {CTleaf _,_,_,mState[0.5,0.5],_}, @3 {CTfork @2=FalseTrue[ @4@0,@2 {CTfork @0<>=1.0[ {CTleaf _,_,_,_,N(0.55,0.2)(+0.1)}, {CTfork @0<>=1.4[ {CTleaf _,_,_,_,N(1.0,0.2)(+0.1)}, {CTleaf _,_,_,_,N(1.45,0.2)(+0.1)}]}]}, {CTleaf _,_,_,_,N(3.45,0.2)(+0.1)}]}See [earlier slide]. N.B. Don't get the nodes of the network and of a tree confused!
mState[p0,p1,...]  multistate distribution[*].
N(m,s)  normal (Gaussian) distribution[*].
"classification" tree[*],
CTleaf  model[*] an attribute,
CTfork  test[*] an attribute.
([*] message lengths, i.e. complexities, direct the search & prevent overfitting.)
It took one day to program
general mixed Bayesian
(Note discrete and continuous attributes.)
data Tipe = Alzheimers  Child  Despondent  Hiker  ...
type Age = Double
data Race = White  Black deriving (Eq, Enum,...
data Gender = Male  Female deriving (Eq, Enum,...
data Topography = Mountains  Piedmont  Tidewater deriving...
data Urban = Rural  Suburban  Urban
type HrsNt = Double  hours notified
type DistIPP = Double  distance
...
(See Koester, and also Twardy & Hope.)type MissingPerson = (Maybe Tipe, Maybe Age, ...)
e0 = (estModelMaybe estMultiState) Tipe
e1 = (estModelMaybe (estNormal 0 90 1 70 0.5)) Age
...
e7 = (estModelMaybe (estNormal 0 50 0.5 30 0.2)) DistIPP
...
estimator = estVariate15 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 e12 e13 e14
NB. There is(n't:) a lot of missing data.
Net:[  @1, Age: {CTleaf _,(Maybe 50:50,N(40.6,27.5)... }, ... etc. ...  @6, HrsNt: {CTfork @1(<>=62.0?)[ {CTleaf...,(Maybe 50:50,N( 8.7, 7.6)...}, {CTleaf...,(Maybe 50:50,N(21.4,26.3))...}, {CTleaf...,(Maybe 50:50,N(20.0,..1case)..} ]}, ... ] network: 115.1 nits + data: 5396.6 nits null: 5935.6 nits (@0..@7)print warning: May lose network's lines if shrunk.
e.g. 

 
A Network Inferred for all 15 Attributes of 'lost persons' 

estModelMaybe estModel dataSet = let present (Just _) = True present Nothing = False m1 = uniformModelOfBool  [*] m2 = estModel (map (\(Just x) > x) (filter present dataSet)) in modelMaybe m1 m2
m1 = estMultiState (map present dataSet)
class Project t where  as in 'projection' select :: [Bool] > t > t selAll :: t > [Bool]  all True flags
A type, t, in class Project
can be projected onto a
selected subspace.
E.g. A 15dim'n estimator, such as estVariate15 e0...e14.
To allow parent / child selection to be under program control.
Also a related class for selective partitioning of data spaces within trees.
It took 1.5 days to apply Bayesian networks
(just an example)
to this
Some subjective [comparisons].
Have types and typeclasses for Models, FunctionModels and TimeSeries models. Polymorphic types, type classes, highorder functions, lazyevaluation  all useful.
Easy to write a new model,
[Case studies] include  mixture models (clustering), timeseries, Markov models, mixtures of Markov models, function models (regressions), (CD)Trees, BNets, mutations, segmentation, etc..
