Lloyd Allison,
School of Computer Science & Software Engineering, FIT,
Seminar's area: Inductive inference,
Research problem: What are statistical models?
I.e. What do they do? What can be done to them? How can you combine two or more of them? And what do you get? How can you program with them? What are the operations, types, classes, semantics?
This talk describes an approach and illustrates it with a case-study of mixed (hybrid) Bayesian networks which are applied to a challenging data set.
Some of this work was done with or was inspired by Leigh Fitzgibbon, Josh Comley and the late Chris Wallace.
This file is at users.monash.edu/~lloyd/Seminars/2005-II/index.shtml and includes links to extra material.The long story: Some of us in CSSE at Monash started to write some generalized machine-learning software -- with a GUI to do common, anticipated tasks, a 'scripting language' for unanticipated but not too hard tasks, and an implementation language for major extensions. We soon came to realise that the major design issue was "what was to be the notion of 'type', and/or 'class', of data and of statistical models of data?" Defining this notion was the big problem for the scripting language which really defines the heart of such a system. The GUI would naturally follow the scripting language's lead on types and classes. Almost any main-stream language would do for the implementation language (its ideas of type and/or class might or might not influence the scripting language and GUI). We got a good way using Java for the implementation language and implemented a lambda-calculus based scripting language, but we did not fully answer what is the best notion of type and/or class for this domain?
There is some evidence to suggest that functional programming (FP), which derives from lambda calculus, is a good general paradigm for machine learning (e.g. Lisp is popular in AI, and the 'S' package has a Lisp-ish scripting language). But type checking is often weak and/or done at run-time. Perhaps it is even a trap to design a custom (scripting) language when your main focus is on some application area -- will the language be professional and will it develop?
This talk describes an approach to writing machine-learning software using the stock functional language Haskell for both scripting and implementation purposes (no GUI, yet). Haskell has a compile-time parametric polymorphic type system and has type-classes for ad-hoc polymorphism (overloading). It has one of the more sophisticated type systems among reasonably well supported programming languages, and type systems are a major research focus of the Haskell community. The run-time parts of the language are sufficient to implement classic, and novel, machine learning algorithms. But the type system is particularly useful for analysing, and answering, "just what is a statistical model from a programming point of view?" Machine learning is a very happy application area for it. Many statistical models are naturally polymorphic and there are useful "high order" operators on estimators and models. FP seems to add generality and lightness to statistical models.
Over-fitting is a well known problem in machine learning. It becomes even more troublesome when a statistical model is used as a sub-model in new combinations with others. The sub-model is very likely to be used outside its comfort zone. It is essential that the complexity of sub-models be under good, automatic control to make compositional programming with them practical. Wallace's minimum message length (MML) inference is a compositional criterion and is a good fit with FP.
The talk uses practical examples including Bayesian networks (BNs) with discrete and continuous attributes to illustrate 'inductive programming' through FP and MML.
Given a new problem in general computing,
the probability of having a ready-made solution is ~0, ...
programming languages (e.g. Haskell)
exist to ease writing new solutions.
Given a new inductive inference problem,
the probability of having a ready-made solution is ~0, ...
<?what?> exists to ease writing new solutions?
users.monash.edu/~lloyd/Seminars/2005-II/index.shtml
repeat {
new postgrad starts in machine learning;
devises new model (or variation);
does some maths (not just a hacker);
implements search & estimators (practical too!);
does comparative tests v. other methods (beats them);
writes thesis, gets degree & leaves.
}
We are often left with programs that are hard to (re-) use
--
We are given a data-set over 5 attributes, @0...@4:
@0 -- continuous
@1 -- Boolean
@2 -- Boolean
@3 -- Boolean
@4 -- continuous.
Bayesian networks (BNs) can "learn & explain" relationships between attributes (variables).
Must infer: Network and, for each node,
(Note, BN is "just" a case-study in I.P..)
the toy example's Bayesian network
Now, we have some general machinery[1-4]:
classes |
|
|||
Model ... |
FunctionModel ... |
TimeSeries ... |
||
types |
ModelType ... |
FunctionModelType CTreeType ... |
TimeSeriesType ... |
|
values |
mState Normal ... |
class'n tree ... |
Markov model ... |
N. Friedman and M. Goldszmidt,
suggested using classification trees in the nodes of a Bayesian network.
Oh good, we just happen [1-4] to have:
estCTree | :: |
|
|
|
|
||
|
|
||
|
|
{CTleaf N(1.0,0.41)(+-0.1),_,_,_,_}, --@0 {CTleaf _,mState[0.5,0.5],_,_,_}, --@1 {CTfork @0<|>=1.4[ --@2|@0,@1 {CTleaf _,_,mState[0.99,0.01],_,_}, {CTfork @1=False|True[ {CTleaf _,_,mState[0.98,0.02],_,_}, {CTleaf _,_,mState[0.02,0.98],_,_}]}]}, {CTleaf _,_,_,mState[0.5,0.5],_}, --@3 {CTfork @2=False|True[ --@4|@0,@2 {CTfork @0<|>=1.0[ {CTleaf _,_,_,_,N(0.55,0.2)(+-0.1)}, {CTfork @0<|>=1.4[ {CTleaf _,_,_,_,N(1.0,0.2)(+-0.1)}, {CTleaf _,_,_,_,N(1.45,0.2)(+-0.1)}]}]}, {CTleaf _,_,_,_,N(3.45,0.2)(+-0.1)}]}See [earlier slide]. N.B. Don't get the nodes of the network and of a tree confused!
mState[p0,p1,...] -- multi-state distribution[*].
N(m,s) -- normal (Gaussian) distribution[*].
"classification" tree[*],
CTleaf -- model[*] an attribute,
CTfork -- test[*] an attribute.
([*] message lengths, i.e. complexities, direct the search & prevent over-fitting.)
It took one day to program
general mixed Bayesian
(Note discrete and continuous attributes.)
data Tipe = Alzheimers | Child | Despondent | Hiker | ...
type Age = Double
data Race = White | Black deriving (Eq, Enum,...
data Gender = Male | Female deriving (Eq, Enum,...
data Topography = Mountains | Piedmont | Tidewater deriving...
data Urban = Rural | Suburban | Urban
type HrsNt = Double -- hours notified
type DistIPP = Double -- distance
...
(See Koester, and also Twardy & Hope.)type MissingPerson = (Maybe Tipe, Maybe Age, ...)
e0 = (estModelMaybe estMultiState) --Tipe
e1 = (estModelMaybe (estNormal 0 90 1 70 0.5)) --Age
...
e7 = (estModelMaybe (estNormal 0 50 0.5 30 0.2)) --DistIPP
...
estimator = estVariate15 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 e12 e13 e14
NB. There is(n't:-) a lot of missing data.
Net:[ -- @1, Age: {CTleaf _,(Maybe 50:50,N(40.6,27.5)... }, ... etc. ... -- @6, HrsNt: {CTfork @1(<|>=62.0|?)[ {CTleaf...,(Maybe 50:50,N( 8.7, 7.6)...}, {CTleaf...,(Maybe 50:50,N(21.4,26.3))...}, {CTleaf...,(Maybe 50:50,N(20.0,..1-case)..} ]}, ... ] network: 115.1 nits + data: 5396.6 nits null: 5935.6 nits (@0..@7)print warning: May lose network's lines if shrunk.
e.g. |
|
| |||||||||||||||||||||||||||||||||||||||||||||||
A Network Inferred for all 15 Attributes of 'lost persons' |
---|
estModelMaybe estModel dataSet = let present (Just _) = True present Nothing = False m1 = uniformModelOfBool -- [*] m2 = estModel (map (\(Just x) -> x) (filter present dataSet)) in modelMaybe m1 m2
m1 = estMultiState (map present dataSet)
class Project t where -- as in 'projection' select :: [Bool] -> t -> t selAll :: t -> [Bool] -- all True flags
A type, t, in class Project
can be projected onto a
select-ed sub-space.
E.g. A 15-dim'n estimator, such as estVariate15 e0...e14.
To allow parent / child selection to be under program control.
Also a related class for select-ive partitioning of data spaces within trees.
It took 1.5 days to apply Bayesian networks
(just an example)
to this
Some subjective [comparisons].
Have types and type-classes for Models, Function-Models and Time-Series models. Polymorphic types, type classes, high-order functions, lazy-evaluation -- all useful.
Easy to write a new model,
[Case studies] include -- mixture models (clustering), time-series, Markov models, mixtures of Markov models, function models (regressions), (C|D)-Trees, BNets, mutations, segmentation, etc..
|