ID3

This programming assignment will introduce you to the wonderful world of machine learning.  The assignment will be to use an algorithm that learns a classifier from a number of labeled examples.  But first a little...

Background

Machine learning can be described as the study of algorithms that can change their behaviour when they encounter new information.  An important part of machine learning is inductive learning, in which the program is shown many examples and learns a general rule that allows it to predict things about future examples it might see (such rules are called hypotheses).  Usually in inductive learning, the program is shown a number of training examples and generalizes from those examples. One type of inductive learning problem is classification. A classifier is a rule that takes an example and puts it into one of several distinct categories. Suppose we're trying to learn a function that tells us if a particular day is a good day for playing tennis.  The examples might have the following form:
 
Temperature   Precipitation   Court-busy   Play-tennis
Low           Clear           No           Yes
Low           Rain            No           No
Med           Clear           Yes          No
High          Clear           No           Yes

The goal is to learn a function which, given a new example, say (Temp=Med, Prec=Rain, Busy=Yes), can classify it correctly into one of the two classes "Good tennis day" and "Bad tennis day". The specific problem we will be looking at is supervised learning of a decision tree.  Supervised learning means that the training examples include the answers (in this example, they include whether it's a good day to play tennis).  A decision tree is a particular form of classifier.  It is a tree (duh), where each internal node is a feature, with branches for each possible value of the feature.  The leaf nodes are classifications. A decision tree that could be used to classify the above examples might look like this:

[Figure: an example decision tree for these examples -- split first on Precipitation (Rain -> bad tennis day), then, for Clear days, on Court-busy (busy -> bad day, free -> good day).]

ID3 algorithm

The ID3 algorithm builds decision trees recursively.  It uses the features to split the examples up into smaller and smaller groups until all the examples in each group have the same classification.  The decision tree is then the series of features it chose for the splits.  Continuing with the tennis example, it might first pick the precipitation feature and split the examples up into rainy and clear days.  The rainy days are all bad for playing tennis, so that branch is done, but it must recurse on the clear-day examples, splitting again on court-busy. Here's the algorithm for ID3:

ID3(examples, features)
BEGIN
    IF no examples THEN something's wrong, fail
    ELSE IF no features THEN something's wrong, fail
    ELSE IF all examples + THEN return new LEAF_NODE(+)
    ELSE IF all examples - THEN return new LEAF_NODE(-)
    ELSE
        feature = choose some feature to split on
        group1..groupn = examples grouped by their value for feature
        subtree1..subtreen = for group in group1..groupn
                                ID3(group, features - feature)
        return new NODE(label = feature, children = subtree1..subtreen)
    ENDIF
END
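
To make the recursion concrete, here is a minimal Common Lisp sketch of the same algorithm.  It is not the provided course code; it assumes each example is an association list with a class entry, and it takes the feature-selection function as an argument (a gain-based chooser is sketched at the end of the next section).

;;; A minimal sketch of the recursion above -- NOT the provided course code.
;;; It assumes each example is an association list such as
;;;   ((temperature . low) (precipitation . clear) (court-busy . no) (class . yes))
;;; and that CHOOSE-FEATURE is any function that picks one feature name from
;;; FEATURES given the current EXAMPLES.

(defun example-class (example)
  "Return the classification stored in an example."
  (cdr (assoc 'class example)))

(defun feature-value (example feature)
  "Return EXAMPLE's value for FEATURE."
  (cdr (assoc feature example)))

(defun id3 (examples features choose-feature)
  "Build a tree of nested lists: either (LEAF class) or (NODE feature (value . subtree) ...)."
  (cond ((null examples) (error "No examples to split on"))
        ;; Pure group: make a leaf (checked before running out of features).
        ((let ((c (example-class (first examples))))
           (every (lambda (e) (eql (example-class e) c)) examples))
         (list 'leaf (example-class (first examples))))
        ((null features) (error "Out of features but the group is still mixed"))
        (t
         (let* ((feature (funcall choose-feature examples features))
                (values  (remove-duplicates
                          (mapcar (lambda (e) (feature-value e feature)) examples))))
           ;; One branch per value of the chosen feature, built recursively.
           (list* 'node feature
                  (mapcar (lambda (v)
                            (cons v (id3 (remove-if-not
                                          (lambda (e) (eql (feature-value e feature) v))
                                          examples)
                                         (remove feature features)
                                         choose-feature)))
                          values))))))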
 

Selecting features

The heart of ID3 is the way it chooses which feature to split on.  It uses a notion called Information gain.  The idea is that you want to pick the feature that tells you the most about what the final classification of each example will be.  The computation of information gain depends on another concept called Entropy.  The entropy of a set is a measure of how diverse it is.  If a set of examples all have the same classification, then its entropy is 0.  If there are an equal number of examples from each of the two classes, then the entropy is 1 (the maximum possible entropy for two classes).  One way to conceptualize this is that entropy is a measure of how much I need to tell you about a set of examples for you to be able to correctly classify them.  If all the examples are the same, I don't have to tell you anything, but if they are all very different, I have to tell you a lot.  The formula for the entropy of a set of examples S is:

    Entropy(S) = - sum over classes i of  p_i * log2(p_i)
Where p_i is the probability of an example being in class i (which can be computed as the number of examples in class i divided by the total number of examples). Once you know the entropy for a set of examples (S), you can find the information gain for a given feature (F). To do this, first split the examples up according to their values for the feature in question, then compute the entropy of each of the resulting subsets, weight each one by the size of its subset, and subtract all of these weighted values from the entropy of the whole example set. Or, in mathematical terms,

    Gain(S, F) = Entropy(S) - sum over values v of F of  (|S_v| / |S|) * Entropy(S_v)
Where |S| is the size of the set S and S_v is the subset of examples whose value for feature F is v. ID3 selects the feature that maximizes this computed information gain. Continuing with the tennis example above, we show the computed values below for entropy and information gain when the examples are split on temperature, precipitation and court-busy.

[Figure: the four tennis examples split separately on Temperature, Precipitation, and Court-busy, with each branch annotated by its classification counts (n+/n-), its entropy E, and the resulting information gain.]

In this figure, a number followed by a + or a - indicates how many examples in that set are good or bad tennis days (respectively). E is the entropy computed from those classification counts, and the information gain is computed as
"E at the top of the tree - sum of E's at the bottom of the tree weighted by the number of examples in each branch"
(which is exactly the formula given above). Because half of the examples can be completely classified by just looking at temperature, that feature has the highest information gain value.
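
For the four training examples above, the arithmetic works out as follows (2 good days and 2 bad days overall, so the entropy at the top is 1):

    Temperature:    Low = 1+/1- (E = 1),  Med = 0+/1- (E = 0),  High = 1+/0- (E = 0)
                    gain = 1 - (2/4)(1) - (1/4)(0) - (1/4)(0) = 0.5
    Precipitation:  Clear = 2+/1- (E ≈ 0.918),  Rain = 0+/1- (E = 0)
                    gain = 1 - (3/4)(0.918) - (1/4)(0) ≈ 0.311
    Court-busy:     No = 2+/1- (E ≈ 0.918),  Yes = 0+/1- (E = 0)
                    gain = 1 - (3/4)(0.918) - (1/4)(0) ≈ 0.311

So, on this data, ID3 would split on temperature first.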

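Before moving on, here is one way those two formulas might look in Common Lisp.  This is only a sketch, not the course code you will download: it assumes the same association-list representation of examples as the earlier ID3 sketch (the two small helpers are repeated so the block stands on its own), and the last function is a gain-based feature chooser that could be passed to that sketch.

;;; A sketch of the entropy and information-gain formulas above, assuming the
;;; association-list example representation used in the earlier ID3 sketch.

(defun example-class (example)
  (cdr (assoc 'class example)))

(defun feature-value (example feature)
  (cdr (assoc feature example)))

(defun entropy (examples)
  "Entropy(S) = - sum_i p_i * log2(p_i), over the classes present in EXAMPLES."
  (let ((total (length examples)))
    (- (loop for c in (remove-duplicates (mapcar #'example-class examples))
             for p = (/ (count c examples :key #'example-class) total)
             sum (* p (log p 2))))))

(defun information-gain (examples feature)
  "Gain(S, F) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."
  (let ((total (length examples))
        (values (remove-duplicates
                 (mapcar (lambda (e) (feature-value e feature)) examples))))
    (- (entropy examples)
       (loop for v in values
             for subset = (remove-if-not
                           (lambda (e) (eql (feature-value e feature) v))
                           examples)
             sum (* (/ (length subset) total) (entropy subset))))))

;; ID3's feature chooser then just picks the feature with the largest gain:
(defun choose-by-gain (examples features)
  (first (sort (copy-list features) #'>
               :key (lambda (f) (information-gain examples f)))))
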
Well, with that somewhat lengthy description of the algorithm you will be using, let's move on to

The assignment

1) Download the code that implements the ID3 algorithm and the sample data file. Load them into Lisp and run ID3 by typing:

(setq *trace-id3* t) ;; If you want to see debugging information

(train tennis-examples)

This will print the decision tree and return a list representation of it.

2) Come up with a data file of your own. Include 20 or more examples, each with several (5 or so) features that have 2 or more possible values, as well as a classification.

3) Run ID3 with your data.

4) Make up a few more examples, and use the decision tree learned by ID3 to classify them (work it out by hand, or write code to do it if you are ambitious; a sketch of one possible classifier appears below). How well does it do?
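
If you do want to write code for step 4, here is a sketch of a classifier that walks a learned tree.  Note the assumption: the exact list representation returned by the provided train function is not described above, so this sketch uses the (LEAF class) / (NODE feature (value . subtree) ...) format built by the earlier ID3 sketch, and you would need to adapt it to whatever the provided code actually returns.

;;; Sketch of classifying a new example by walking an assumed tree format.

(defun classify (tree example)
  "Follow EXAMPLE's feature values down TREE and return the class at the leaf."
  (if (eq (first tree) 'leaf)
      (second tree)
      (let* ((feature (second tree))
             (value   (cdr (assoc feature example)))
             (branch  (assoc value (cddr tree))))
        (if branch
            (classify (cdr branch) example)
            nil))))   ; no branch for this value: the tree cannot classify it

;; For instance, with the association-list representation used earlier:
;;   (classify tree '((temperature . med) (precipitation . rain) (court-busy . yes)))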

Extra credit:

Describe how you might change ID3 to handle any of the following extensions:

  1. missing data (some examples don't have data for all features)
  2. noisy data (some training examples have incorrect feature values or categories)
  3. continuous data (feature values can be real numbers instead of a set of discrete values)
  4. some other interesting extension of your own choosing