^inductive programming^

Classification- (Decision-) Trees

Problem: Given a data-set from a data-space, 'Ip × Op', with input-space, 'Ip', and output-space, 'Op', learn a classification-tree function-model to model Op conditional upon Ip.

E.g. from Ip = <Appetite, Fever, Stomach_pain> conditionally model Op = Has_appendicitis.

users.monash.edu/~lloyd/Seminars/2005-II/Tree/index.shtml  

Partitioning on input attributes (input space)

Discrete attribute:
k-way split where k is the `arity' of the attribute.
 
Ordered attribute:
Try median, left-quartile, right-quartile, octiles, ... .
NB. depends on current (part of) the data set.
 
Continuous attribute:
As for ordered attribute.
 
Multivariate:
Interleave the splitting options of all component attributes.

Classification Trees:


data CTreeType inSpace opSpace =

  CTleaf (ModelType opSpace)   |

  CTfork MessageLength
         (Splitter inSpace)
         [CTreeType inSpace opSpace]

  | ...

  -- can also consider other non-traditional options

A `Splitter' is used to partition the input space.


instance SuperModel (CTreeType inSpace opSpace) where

 -- NB. For simplicity only, this costs the
 -- structure at 1-bit per node.  This is
 -- only optimal for binary trees.

 msg1 (CTleaf leafModel)
   = log2 + msg1 leafModel

 msg1 (CTfork fnLen f dts)
   = log2 + fnLen + (foldl (+) 0 (map msg1 dts))

Message length of a tree is that of the node plus those of the sub-trees, if any.

Make a Classification Tree an instance of a FunctionModel:


instance FunctionModel CTreeType where

  condModel (CTleaf leafModel) i = leafModel

  condModel (CTfork fnLen f dts) i
    = condModel (dts !! (applySplitter f i)) i

estCTree : Estimate a Classification Tree

It is convenient to have an inner, local ``search'' function:


estCTree  estLeafMdl splits  ipSet opSet =
 let

  search ipSet opSet =
   let
       . . .

Simplest possible tree is a single leaf node:


leaf    = CTleaf leafMdl

leafMdl = estLeafMdl opSet

leafMsg = msg (functionModel2model leaf)
              (zip ipSet opSet)

simple recursive search for the best (1- or) 2-level tree;

base case:

 
alternatives  []
 bestML bestCTree bestIpParts bestOpParts
 = (bestCTree, bestIpParts, bestOpParts)  -- done
 

and...

...general case:

alternatives (sp:sps)
 bestML bestCTree bestIpParts bestOpParts =
 let
  -- NB. the `1' below is acceptable but not optimal
  theTree  = CTfork (log (fromIntegral (length splts)))
                    sp leaves
  leaves   = map CTleaf leafMdls     -- one leaf ...
  leafMdls = map estLeafMdl opParts  -- ... per part
  partNums = map (applySplitter sp) ipSet
  ipParts  = partition (aritySplitter sp) partNums ipSet
  opParts  = partition (aritySplitter sp) partNums opSet
  msgLen   = msg (functionModel2model theTree)
                 (zip ipSet opSet)
 in
  if msgLen < bestML
  then alternatives sps msgLen theTree ipParts opParts
  else alternatives sps bestML bestCTree bestIpParts bestOpParts

let the search begin:


  splts = splits ipSet     -- partitions of input space

 in
  case  alternatives splts leafMsg leaf [ipSet] [opSet]  of

   -- subtrees?  search continues...
   ((CTfork msgLen pf leaves), ipParts, opParts) ->
     CTfork msgLen pf (zipWith search ipParts opParts);

   -- the single leaf wins?  Search is over!
   (t, _, _) -> t


in search ipSet opSet

-- That's all --

Summary

Can estimate (fit, infer, learn) a classification-tree given (i) ways of partitioning the data on input attributes and (ii) an estimator for leaf models.
 
NB. Any kind of leaf model can be used -- discrete, continuous, or multivariate;   even...
 
...FunctionModel- (Regression-) Trees
Use a conversion function estFunctionModel2estModel:
estFunctionModel2estModel estFn ipOpPairs =
  functionModel2model (uncurry estFn (unzip ipOpPairs))

ft = estCTree (estFunctionModel2estModel estFnMdl)
        splits trainingIp trainingOp

See: L. Allison. Models for Machine Learning and Data Mining in Functional Programming, J. Functional Programming (JFP), 15(1), pp.15-32, January 2005, and also [II(click)].