Inductive Programming

Classification- (Decision-) Trees

Problem: Given a data-set from a data-space, 'Ip × Op', with input-space, 'Ip', and output-space, 'Op', learn a classification-tree function-model of Op conditional upon Ip.

E.g. from Ip = <Appetite, Fever, Stomach_pain> conditionally model Op = Has_appendicitis.


Partitioning on input attributes (input space)

Discrete attribute:
k-way split where k is the `arity' of the attribute.
Ordered attribute:
Try the median, lower- and upper-quartiles, octiles, ... .
NB. depends on current (part of) the data set.
Continuous attribute:
As for ordered attribute.
Interleave the splitting options of all component attributes.
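For an ordered (or continuous) attribute the candidate split points depend on the current part of the data set, as noted above. A minimal sketch (not the paper's code; `splitPoints' is a hypothetical helper) of choosing quartile thresholds from the current data:

```haskell
import Data.List (sort)

-- hypothetical helper: candidate thresholds for an ordered attribute,
-- taken at the quartile positions of the *current* (part of the) data
splitPoints :: Ord a => [a] -> [a]
splitPoints xs = map pick [1 .. 3]          -- the three quartile positions
  where sorted = sort xs
        n      = length xs
        pick k = sorted !! min (n - 1) (k * n `div` 4)
```

Each threshold then induces a 2-way split; a search interleaves such options over all component attributes.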

Classification Trees:

data CTreeType inSpace opSpace =

    CTleaf (ModelType opSpace)

  | CTfork MessageLength
           (Splitter inSpace)
           [CTreeType inSpace opSpace]

  | ...  -- can also consider other non-traditional options

A `Splitter' is used to partition the input space.
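The code later uses `aritySplitter' and `applySplitter' on a Splitter. One concrete representation consistent with those uses (an assumption for illustration; the paper's actual type may differ) is a record pairing the arity, k, with a function mapping an input datum to its part number:

```haskell
-- hypothetical concrete representation of a Splitter:
-- k-way partition of the input space
data Splitter ip = Splitter
  { aritySplitter :: Int        -- number of parts, k
  , applySplitter :: ip -> Int  -- part number, 0 .. k-1
  }

-- e.g. a 2-way split of a numeric input attribute at a threshold
thresholdSplit :: Double -> Splitter Double
thresholdSplit t = Splitter 2 (\x -> if x < t then 0 else 1)
```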

instance SuperModel (CTreeType inSpace opSpace) where

 -- NB. For simplicity only, this costs the
 -- structure at 1-bit per node.  This is
 -- only optimal for binary trees.

 msg1 (CTleaf leafModel)
   = log2 + msg1 leafModel

 msg1 (CTfork fnLen f dts)
   = log2 + fnLen + (foldl (+) 0 (map msg1 dts))

Message length of a tree is that of the node plus those of the sub-trees, if any.
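As a standalone toy illustration of that recursion (simplified types, lengths in nits as in `msg1' above): each node costs log 2, i.e. 1 bit, for the structure, a fork also pays its splitter cost, and a leaf pays its leaf-model cost.

```haskell
-- toy stand-in for CTreeType: the Double is the splitter cost (fork)
-- or the leaf-model cost (leaf), both in nits
data Toy = TLeaf Double | TFork Double [Toy]

-- 1 bit (log 2 nits) of structure per node, plus the node's own cost,
-- plus the costs of any sub-trees -- mirroring msg1 above
msgLen1 :: Toy -> Double
msgLen1 (TLeaf c)    = log 2 + c
msgLen1 (TFork f ts) = log 2 + f + sum (map msgLen1 ts)
```

So a fork (with zero splitter cost) over two leaves costing 1 and 2 nits comes to 3 × log 2 + 3 nits in total.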

Make a Classification Tree an instance of a FunctionModel:

instance FunctionModel CTreeType where

  condModel (CTleaf leafModel) i = leafModel

  condModel (CTfork fnLen f dts) i
    = condModel (dts !! (applySplitter f i)) i
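The routing in `condModel' can be seen in isolation with a simplified, standalone tree type (an illustrative sketch, not the class instance above): at each fork the splitter picks a sub-tree, and recursion ends at a leaf model.

```haskell
-- simplified stand-in for CTreeType, with the splitter as a bare function
data Tree ip op = CLeaf op | CFork (ip -> Int) [Tree ip op]

-- route an input datum down the tree to its leaf model,
-- as condModel does above
route :: Tree ip op -> ip -> op
route (CLeaf m)    _ = m
route (CFork f ts) i = route (ts !! f i) i
```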

estCTree : Estimate a Classification Tree

It is convenient to have an inner, local ``search'' function:

estCTree estLeafMdl splits ipSet opSet =

  let
    search ipSet opSet =
       . . .

Simplest possible tree is a single leaf node:

leaf    = CTleaf leafMdl

leafMdl = estLeafMdl opSet

leafMsg = msg (functionModel2model leaf)
              (zip ipSet opSet)

A simple recursive search for the best (1- or) 2-level tree;

base case:

alternatives  []
 bestML bestCTree bestIpParts bestOpParts
 = (bestCTree, bestIpParts, bestOpParts)  -- done


...general case:

alternatives (sp:sps)
 bestML bestCTree bestIpParts bestOpParts =
  let
   -- NB. costing the choice of splitter at log |splts| nits
   -- is acceptable but not optimal
   theTree  = CTfork (log (fromIntegral (length splts)))
                     sp leaves
   leaves   = map CTleaf leafMdls     -- one leaf ...
   leafMdls = map estLeafMdl opParts  -- ... per part
   partNums = map (applySplitter sp) ipSet
   ipParts  = partition (aritySplitter sp) partNums ipSet
   opParts  = partition (aritySplitter sp) partNums opSet
   msgLen   = msg (functionModel2model theTree)
                  (zip ipSet opSet)
  in
   if msgLen < bestML
   then alternatives sps msgLen theTree ipParts opParts
   else alternatives sps bestML bestCTree bestIpParts bestOpParts
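The `partition' used above distributes the data among the k parts of a split. A plausible standalone definition, assuming (as in the calls above) that it takes the arity, each element's part number, and the elements:

```haskell
-- hypothetical definition of the `partition' used above: given the
-- arity k, each element's part number, and the elements themselves,
-- collect the elements of part j, for j = 0 .. k-1, preserving order
partition :: Int -> [Int] -> [a] -> [[a]]
partition k partNums xs =
  [ [ x | (n, x) <- zip partNums xs, n == j ] | j <- [0 .. k - 1] ]
```

Note that ipSet and opSet are partitioned with the *same* partNums, so corresponding inputs and outputs stay together.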

let the search begin:

  splts = splits ipSet     -- partitions of input space

  case  alternatives splts leafMsg leaf [ipSet] [opSet]  of

   -- subtrees?  search continues...
   ((CTfork msgLen pf leaves), ipParts, opParts) ->
     CTfork msgLen pf (zipWith search ipParts opParts);

   -- the single leaf wins?  Search is over!
   (t, _, _) -> t

in search ipSet opSet

-- That's all --


Can estimate (fit, infer, learn) a classification-tree given (i) ways of partitioning the data on input attributes and (ii) an estimator for leaf models.
NB. Any kind of leaf model can be used -- discrete, continuous, or multivariate; even...

...FunctionModel- (Regression-) Trees

Use a conversion function estFunctionModel2estModel:

estFunctionModel2estModel estFn ipOpPairs =
  functionModel2model (uncurry estFn (unzip ipOpPairs))
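The plumbing here is just unzip and uncurry: an estimator of a function-model takes separate input and output lists, so the (input, output) pairs are unzipped, the two-argument estimator applied, and the result converted to a plain model. A standalone sketch of that shape (with the conversion step omitted, and a dummy estimator standing in for estFnMdl):

```haskell
-- the unzip/uncurry plumbing of estFunctionModel2estModel, in isolation:
-- turn a two-list estimator into an estimator over (input, output) pairs
estFnToEstModel :: ([i] -> [o] -> m) -> [(i, o)] -> m
estFnToEstModel estFn = uncurry estFn . unzip
```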

ft = estCTree (estFunctionModel2estModel estFnMdl)
        splits trainingIp trainingOp

See: L. Allison, Models for Machine Learning and Data Mining in Functional Programming, J. Functional Programming (JFP), 15(1), pp. 15-32, January 2005.