Inductive Programming

Lloyd Allison, School of Computer Science & Software Engineering, FIT, Monash University, Victoria, Australia 3800.

Seminar's area: Inductive inference, i.e. inferring general models from given data (artificial intelligence, data mining, machine learning).

Research problem: What are statistical models?

I.e. What do they do? What can be done to them? How can you combine two or more of them? And what do you get? How can you program with them? What are the operations, types, classes, semantics?

This talk describes an approach and illustrates it with a case-study of mixed (hybrid) Bayesian networks which are applied to a challenging data set.

Some of this work was done with or was inspired by Leigh Fitzgibbon, Josh Comley and the late Chris Wallace.

This file is at   users.monash.edu/~lloyd/Seminars/2005-II/index.shtml   and includes links to extra material.
 

The long story: Some of us in CSSE at Monash started to write some generalized machine-learning software -- with a GUI for common, anticipated tasks, a 'scripting language' for unanticipated but not too hard tasks, and an implementation language for major extensions. We soon came to realise that the major design issue was: what should the notion of 'type', and/or 'class', of data and of statistical models of data be? Defining this notion was the big problem for the scripting language, which really defines the heart of such a system. The GUI would naturally follow the scripting language's lead on types and classes. Almost any main-stream language would do for the implementation language (its ideas of type and/or class might or might not influence the scripting language and GUI). We got a good way using Java for the implementation language and implemented a lambda-calculus-based scripting language, but we did not fully answer the question: what is the best notion of type and/or class for this domain?

There is some evidence to suggest that functional programming (FP), which derives from lambda calculus, is a good general paradigm for machine learning (e.g. Lisp is popular in AI, and the 'S' package has a Lisp-ish scripting language), but in such languages type checking is often weak and/or done at run-time. Perhaps it is even a trap to design a custom (scripting) language when your main focus is on some application area -- will the language be of professional quality, and will it continue to develop?

This talk describes an approach to writing machine-learning software using the stock functional language Haskell for both scripting and implementation purposes (no GUI, yet). Haskell has a compile-time parametric polymorphic type system and has type-classes for ad-hoc polymorphism (overloading). It has one of the more sophisticated type systems among reasonably well supported programming languages, and type systems are a major research focus of the Haskell community. The run-time parts of the language are sufficient to implement classic, and novel, machine-learning algorithms, but the type system is particularly useful for analysing, and answering, "just what is a statistical model from a programming point of view?" Machine learning is a very happy application area for the language: many statistical models are naturally polymorphic, and there are useful higher-order operators on estimators and models. FP seems to add generality and lightness to statistical models.

Over-fitting is a well known problem in machine learning. It becomes even more troublesome when a statistical model is used as a sub-model in new combinations with others. The sub-model is very likely to be used outside its comfort zone. It is essential that the complexity of sub-models be under good, automatic control to make compositional programming with them practical. Wallace's minimum message length (MML) inference is a compositional criterion and is a good fit with FP.

The talk uses practical examples including Bayesian networks (BNs) with discrete and continuous attributes to illustrate 'inductive programming' through FP and MML.

 

Motivation 1

Given a new problem in general computing,
the probability of having a ready-made solution is ~0, ...
programming languages (e.g. Haskell) exist to ease writing new solutions. 

Given a new inductive inference problem,
the probability of having a ready-made solution is ~0, ...
<?what?> exists to ease writing new solutions?



Motivation 2

repeat {
    new postgrad starts in machine learning;
    devises new model   (or variation);
    does some maths     (not just a hacker);
    implements search & estimators   (practical too!);
    does comparative tests v. other methods   (beats them);
    writes thesis, gets degree & leaves.
  }

We are often left with programs that are hard to (re)use -- undocumented code, not robust, different I/O formats, reinvented wheels, etc. Can this be better organised?

 

Case Study, BNs:   1. A Toy Example

We are given a data-set over 5 attributes, @0...@4:

@0 -- continuous
@1 -- Boolean
@2 -- Boolean
@3 -- Boolean
@4 -- continuous.

Bayesian networks (BNs) can "learn & explain" relationships between attributes (variables).

Must infer: the network and, for each node, the parents -> child parameters.


(Note, the BN is "just" a case-study in I.P.)

 

[Figure: the toy example's Bayesian network -- a generalized mixed (hybrid) Bayesian network over discrete and continuous variables, as inferred.]
 

Now, we have some general machinery [1-4]:

classes:
  SuperModel    -- prior, msg1, mixture, ...
  Model         -- pr m d, msg2 m ds, ...
  FunctionModel -- condModel fm i, condPr fm i o, ...
  TimeSeries    -- predictors s, ...

types:
  ModelType, ..., FunctionModelType, CTreeType, ..., TimeSeriesType, ...

values:
  mState, Normal, mixture model, ..., class'n tree, ..., Markov model, ...
  & also [conversion functions]
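
To give the flavour in code, here is a minimal, simplified sketch -- an assumption for illustration only, not the actual class definitions of [1-4] -- in which a model of a data-space is represented by its own complexity (msg1) and a negative-log-probability function, and a function-model is a conditional model:

type MsgLen = Double                       -- message length, in nits

data Model dataSpace = Model
  { msg1 :: MsgLen                         -- 1st part: complexity of the model itself
  , nlPr :: dataSpace -> MsgLen }          -- -log pr of one datum under the model

-- 2nd part of a two-part message: the data encoded given the model.
msg2 :: Model d -> [d] -> MsgLen
msg2 m ds = sum (map (nlPr m) ds)

-- A function-model (conditional model): a model of the output for each input.
newtype FunctionModel ipSpace opSpace =
  FunctionModel { condModel :: ipSpace -> Model opSpace }

The real classes carry more operations (priors, mixtures, time-series predictors, and so on), but this is enough to show how models can be passed around, combined, and costed.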

N. Friedman and M. Goldszmidt,    Learning Bayesian networks with local structure,   UAI'96, pp.252--262, 1996,

suggested using classification trees in the nodes of a Bayesian network.


Oh good, we just happen [1-4] to have:

estCTree :: ([opSpace] -> model of opSpace)     -- arbitrary leaf estimator
         -> ([ipSpace] -> [Splitter ipSpace])   -- re partitioning ipSpace
         -> [ipSpace] -> [opSpace]              -- training data
         -> CTreeType ipSpace opSpace           -- resulting class'n tree





(Taking some small liberties with the types.-)
the toy network...
{CTleaf N(1.0,0.41)(+-0.1),_,_,_,_}, --@0

{CTleaf _,mState[0.5,0.5],_,_,_},   --@1

{CTfork @0<|>=1.4[                  --@2|@0,@1
  {CTleaf _,_,mState[0.99,0.01],_,_},
  {CTfork @1=False|True[
    {CTleaf _,_,mState[0.98,0.02],_,_},
    {CTleaf _,_,mState[0.02,0.98],_,_}]}]}, 

{CTleaf _,_,_,mState[0.5,0.5],_},   --@3

{CTfork @2=False|True[              --@4|@0,@2
  {CTfork @0<|>=1.0[
    {CTleaf _,_,_,_,N(0.55,0.2)(+-0.1)},
    {CTfork @0<|>=1.4[
      {CTleaf _,_,_,_,N(1.0,0.2)(+-0.1)},
      {CTleaf _,_,_,_,N(1.45,0.2)(+-0.1)}]}]},
  {CTleaf _,_,_,_,N(3.45,0.2)(+-0.1)}]}

See [earlier slide]. N.B. Don't get the nodes of the network and of a tree confused!

components from the parts bin:

mState[p0,p1,...]  -- multi-state distribution[*].

N(m,s)  -- normal (Gaussian) distribution[*].

"classification" tree[*],

    CTleaf  -- model[*] an attribute,

    CTfork  -- test[*] an attribute.

 

([*] message lengths, i.e. complexities, direct the search & prevent over-fitting.)
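
To make the [*] note concrete, here is a minimal, self-contained sketch -- an illustration only, not the library's actual search -- of how message lengths decide between a leaf and a fork: a split is kept only if stating the test, the branch leaves and the data costs fewer nits than a single leaf over all the data.

type MsgLen = Double   -- nits

-- Total two-part cost of one leaf: its own complexity plus the data given it.
leafCost :: MsgLen -> (d -> MsgLen) -> [d] -> MsgLen
leafCost complexity nlPr ds = complexity + sum (map nlPr ds)

-- Keep a split only if it shortens the total message.
keepSplit :: MsgLen                         -- cost of stating the fork's test
          -> [(MsgLen, d -> MsgLen, [d])]   -- (complexity, -log pr, data) for each branch's leaf
          -> (MsgLen, d -> MsgLen) -> [d]   -- the single-leaf alternative, and all the data
          -> Bool
keepSplit testCost branches (c0, nlPr0) ds =
  testCost + sum [ leafCost c nl part | (c, nl, part) <- branches ]
    < leafCost c0 nlPr0 ds

Anything that does not pay for itself in message length is rejected, which is what keeps sub-models under control when they are composed into larger models.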

Expressiveness

It took one day to program general mixed Bayesian networks [3] and their estimator.

(Note discrete and continuous attributes.)

 

BN Application:   2. A Real Example, Lost Person

data Tipe = Alzheimers | Child | Despondent | Hiker | ...

type Age = Double

data Race = White | Black deriving (Eq, Enum,...

data Gender = Male | Female deriving (Eq, Enum,...

data Topography = Mountains | Piedmont | Tidewater  deriving...

data Urban = Rural | Suburban | Urban  deriving (Eq,Ord,...

type HrsNt = Double -- hours notified

type DistIPP = Double -- distance

...

(See Koester, and also Twardy & Hope.)   NB. "Just" an application of the case-study.
 

Estimators

type MissingPerson = (Maybe Tipe, Maybe Age, ...)

e0 = (estModelMaybe estMultiState)           --Tipe

e1 = (estModelMaybe (estNormal 0 90 1 70 0.5))   --Age

...

e7 = (estModelMaybe (estNormal 0 50 0.5 30 0.2)) --DistIPP

...

 

estimator = estVariate15 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 e12 e13 e14

 

NB. There is(n't:-) a lot of missing data.
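
estVariate15 combines the fifteen per-attribute estimators into a single estimator over 15-tuples. As a rough sketch of the flavour -- an assumption for illustration, with just two attributes, models taken as negative-log-probability functions, and the components modelled independently -- it might look like:

type MsgLen = Double   -- nits

-- An 'estimator' here maps training data to a negative-log-probability function.
estVariate2 :: ([a] -> (a -> MsgLen))      -- estimator for attribute 1
            -> ([b] -> (b -> MsgLen))      -- estimator for attribute 2
            -> [(a, b)] -> ((a, b) -> MsgLen)
estVariate2 est1 est2 ds =
  let m1 = est1 (map fst ds)               -- model of attribute 1
      m2 = est2 (map snd ds)               -- model of attribute 2
  in  \(x, y) -> m1 x + m2 y               -- -log pr of a pair

In the inferred networks, the per-attribute estimators also appear as the leaf estimators inside each node's tree, as in the toy example above.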

[Figure: 1st inferred mixed (hybrid) Bayesian network for 8 attributes], and ...

... Inferred Network's Trees:

Net:[
 -- @1, Age:
 {CTleaf _,(Maybe 50:50,N(40.6,27.5)... },

 ... etc. ...

 -- @6, HrsNt:
 {CTfork @1(<|>=62.0|?)[
  {CTleaf...,(Maybe 50:50,N( 8.7, 7.6)...},
  {CTleaf...,(Maybe 50:50,N(21.4,26.3))...},
  {CTleaf...,(Maybe 50:50,N(20.0,..1-case)..}
]},

... ]

network: 115.1 nits + data: 5396.6 nits
null:   5935.6 nits (@0..@7)
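(I.e. the network's total two-part message is 115.1 + 5396.6 = 5511.7 nits, about 424 nits shorter than the 5935.6-nit null model, so the inferred network is the preferred explanation of attributes @0..@7.)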
 
[Figure: e.g. an inferred hybrid mixed generalized Bayesian network.]

 
@0:Tipe = Alzheimers | Child |...
@1:Age  
@2:Race = ...| ...
@3:Gender = ...| ...
@4:Topography = ...| ...| ...
@5:Urban = ...| ...| ...
@6:HrsNt  
@7:DistIPP  
@8:TrackOffset  
@9:Health = Well | Hurt | Dead --after!  
@10:Outcome = Find | Suspended | Invalid
@11:FindRes = Ground | Air |...| Dog
@12:FindLoc = Brush | Woods |...
@13:HrsFind  
@14:HrsTot  
A Network Inferred for all 15 Attributes of 'lost persons'
 

E.g. A Higher-Order Function of an Arbitrary Estimator

estModelMaybe estModel dataSet =
 let
  present (Just _) = True
  present Nothing  = False
  m1 = uniformModelOfBool          -- [*]
  m2 = estModel (map (\(Just x) -> x)
                (filter present dataSet))
 in modelMaybe m1 m2



[*] or if modelling missing-ness, instead use...
  m1 = estMultiState (map present dataSet)
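
To make the combination concrete, here is a minimal sketch -- an assumed helper for illustration, not the library's modelMaybe -- of the resulting negative log-probability of a Maybe datum: the presence flag is costed under m1 and, when the value is present, the value itself under m2.

-- nlPr1: -log pr of the presence flag (from m1, a model of Bool)
-- nlPr2: -log pr of a present value   (from m2, a model of the values)
nlPrMaybe :: (Bool -> Double) -> (a -> Double) -> Maybe a -> Double
nlPrMaybe nlPr1 _     Nothing  = nlPr1 False
nlPrMaybe nlPr1 nlPr2 (Just x) = nlPr1 True + nlPr2 x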

A supporting class

class Project t where     -- as in 'projection'
 select :: [Bool] -> t -> t
 selAll :: t -> [Bool]    -- all True flags

A type, t, in class Project can be projected onto a select-ed sub-space.
E.g. A 15-dim'n estimator, such as estVariate15 e0...e14.

To allow parent / child selection to be under program control.

Also a related class for select-ive partitioning of data spaces within trees.
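
As a minimal sketch of how an instance might look -- the list instance below is an assumption for illustration, not the library's code -- a row of attribute values stored as a list can be projected onto the attributes flagged True:

class Project t where                      -- the class, as above
  select :: [Bool] -> t -> t
  selAll :: t -> [Bool]

instance Project [a] where
  select flags xs = [ x | (keep, x) <- zip flags xs, keep ]
  selAll xs       = map (const True) xs

An analogous instance for a 15-dimensional estimator is what lets the network search switch candidate parents on and off under program control.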

Expressiveness

It took 1.5 days to apply Bayesian networks (just an example) to this problem [3], including devising a general method to deal with missing data.

 
 
 
 

Some subjective [comparisons].

Summary

We have types and type-classes for Models, Function-Models and Time-Series models. Polymorphic types, type classes, higher-order functions, lazy evaluation -- all useful.

Easy to write a new model, e.g. ModelMaybe & Bayesian network (more), to suit a new problem.

[Case studies] include -- mixture models (clustering), time-series, Markov models, mixtures of Markov models, function models (regressions), (C|D)-Trees, BNets, mutations, segmentation, etc..

 

  
more:
[conversion]
[mixture model]
[class'n tree]
[missing data]
[Bayes net]
[comparisons]
[1] L. Allison. Types and classes of machine learning and data mining. 26th Australasian Computer Science Conference (ACSC), Adelaide, pp.207--215, Feb. 2003.
[2] L. Allison. Inductive inference 1. TR 2003/148, School of Comp. Sci. & Software Eng., Monash University, Dec. 2003.
[3] L. Allison. Inductive inference 1.1. TR 2004/153, School of Comp. Sci. & Software Eng., Monash University, May 2004.
[4] L. Allison. Models for machine learning and data mining in functional programming. J. Functional Prog., 15(1), pp.15-32, Jan 2005 (online 23/7/2004).
[5] L. Allison. A Programming Paradigm for Machine Learning, with a Case Study of Bayesian Networks. ACSC2006, pp.103-111, Jan. 2006.
R. J. Koester. Virginia dataset on lost-person behaviour. Author's site http://www.dbs-sar.com/, 2001.
S. Peyton-Jones & others. Report on the Programming Language Haskell-98. Available at http://www.haskell.org/, 1999.
C. R. Twardy. SARbayes: Predicting lost person behavior. Presented to National Association of Search and Rescue (NASAR), Charlotte, 2002.
C. R. Twardy & L. Hope. Missing data on missing persons. Private comm. 2004.
C.R.T. also suggested the name "inductive programming."

© L. Allison, School of Computer Science and Software Engineering (CSSE), Monash University, Australia 3800; later (7/'05) Clayton School of Information Technology, Monash U..