ID3
This programming assignment will introduce you to the wonderful world of
machine learning. The assignment will be to use an algorithm that
learns a classifier from a number of labeled examples.
But first a little...
Background
Machine learning can be described as the study of algorithms that can change
their behaviour when they encounter new information. An important
part of machine learning is inductive learning, in which the program
is shown many examples and learns a general rule that allows it to predict
things about future examples it might see (such rules are called hypotheses).
Usually in inductive learning, the program is shown a number of training
examples and generalizes from those examples. One type of inductive
learning problem is classification. A classifier is a rule that
takes an example and puts it into one of several distinct categories. Suppose
we're trying to learn a function that tells us if a particular day is a
good day for playing tennis. The examples might have the following
form:
Temperature | Precipitation | Court-busy | Play-tennis
------------+---------------+------------+------------
Low         | Clear         | No         | Yes
Low         | Rain          | No         | No
Med         | Clear         | Yes        | No
High        | Clear         | No         | Yes
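In Lisp (the language the provided code is written in), one natural way to
write these examples down is as a list of lists. The layout below is only a
guess for illustration; the real format is whatever the provided sample data
file uses, and the variable is named *tennis-examples* here to keep it
distinct from the tennis-examples variable defined by that file:

(defparameter *tennis-examples*
  ;; temperature  precipitation  court-busy  play-tennis?
  '((low   clear  no   yes)
    (low   rain   no   no)
    (med   clear  yes  no)
    (high  clear  no   yes))
  "Each example is a list of feature values with its classification last.")

(defun example-class (example)
  "The classification of an example (its last element)."
  (first (last example)))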
The goal is to learn a function which, given a new example, say (Temp=Med,
Prec=Rain, Busy=Yes), can classify it correctly into one of the two classes
"Good tennis day" and "Bad tennis day". The specific problem we will be
looking at is supervised learning of a decision tree.
Supervised learning means that the training examples include the answers
(in this example, they include whether it's a good day to play tennis).
A decision tree is a particular form of classifier. It is a tree
(duh), where each internal node is a feature, with branches for each possible
value of the feature. The leaf nodes are classifications. A decision
tree that could be used to classify the above examples might look like
this:
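For instance, a tree that tests precipitation first and then, for clear days,
court-busy classifies all four examples above correctly (this is also the
split order walked through in the next section):

Precipitation?
  Rain  -> Bad tennis day
  Clear -> Court-busy?
            Yes -> Bad tennis day
            No  -> Good tennis day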
ID3 algorithm
The ID3 algorithm builds decision trees recursively. It uses the
features to split the examples up into smaller and smaller groups until
all the examples in each group have the same classification. Then
the decision tree is the series of features it chose for the splits.
Continuing with the tennis example, it might first pick the precipitation
feature and split the examples up into rainy and clear days. The
rainy days are all bad for playing tennis, so it is done with that branch,
but it must recur on those examples which are clear, splitting again on
court-busy. Here's the algorithm for ID3:
ID3(examples, features)
BEGIN
    IF no examples THEN something's wrong, fail
    ELSE IF no features THEN something's wrong, fail
    ELSE IF all examples + THEN return new LEAF_NODE(+)
    ELSE IF all examples - THEN return new LEAF_NODE(-)
    ELSE
        feature = choose some feature to split on
        group1..groupN = examples grouped by their value for feature
        subtree1..subtreeN = for each group in group1..groupN:
                                 ID3(group, features - feature)
        return new NODE(label = feature, children = subtree1..subtreeN)
    ENDIF
END
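The code you will download already implements this, so the following is purely
an illustration: a minimal Common Lisp sketch of the same recursion, built on
the hypothetical list layout and example-class accessor sketched in the
Background section. The choose-feature function here is a stand-in that just
grabs the first remaining feature; how ID3 really makes that choice is the
subject of the next section.

(defun feature-value (example feature)
  "The value of EXAMPLE for the feature at position FEATURE."
  (nth feature example))

(defun choose-feature (examples features)
  "Stand-in chooser: take the first remaining feature.  Real ID3 picks
the feature with the highest information gain (see the next section)."
  (declare (ignore examples))
  (first features))

(defun id3 (examples features)
  "Grow a decision tree.  EXAMPLES are lists of feature values with the
classification last; FEATURES are the positions still available for
splitting.  A leaf is a classification symbol; an internal node is a
list of the form (feature (value . subtree) ...)."
  (let ((classes (remove-duplicates (mapcar #'example-class examples))))
    (cond ((null examples) (error "No examples to split"))
          ;; Purity is checked before feature exhaustion, so a group that
          ;; is already all one class simply becomes a leaf.
          ((= (length classes) 1) (first classes))
          ((null features) (error "Ran out of features"))
          (t (let ((feature (choose-feature examples features)))
               (cons feature
                     (loop for value in (remove-duplicates
                                         (mapcar #'(lambda (e)
                                                     (feature-value e feature))
                                                 examples))
                           collect (cons value
                                         (id3 (remove-if-not
                                               #'(lambda (e)
                                                   (eql (feature-value e feature)
                                                        value))
                                               examples)
                                              (remove feature features))))))))))

Calling (id3 *tennis-examples* '(0 1 2)) grows a tree over the three features,
identified by their positions within each example.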
Selecting features
The heart of ID3 is the way it chooses which feature to select. It
uses a notion called Information gain.
The idea is that you want to pick the feature that tells you
the most about what the final classification of each example will be.
The computation of information gain is dependent on another concept called
Entropy. The entropy of a set is a measure of how diverse
it is. If a set of examples all have the same classification, then
its entropy is 0. If there are an equal number of examples from
every class, then the entropy is at its maximum (1, when there are two
classes as in the tennis example).
One way to conceptualize this is that entropy is a measure of how much
I need to tell you about a set of examples for you to be able to correctly
classify them. If all the examples are the same, I don't have to
tell you anything, but if they are all very different, I have to tell
you a lot. The formula for the entropy of a set of examples S is

    Entropy(S) = - sum over all classes i of ( p_i * log2(p_i) )

where p_i is the probability of an example being in class i (which can be
computed as the number of examples in class i divided by the total number
of examples). Once you know the entropy for a set of examples
(S), you can find the information gain for a given feature (F). To do this,
first split the examples up according to their values for the feature in
question, then compute the entropy of each subset, weight each one by the
fraction of the examples that fall in that subset, and subtract all of
these weighted values from the entropy of the whole example set. Or, in
mathematical terms,

    Gain(S, F) = Entropy(S) - sum over values v of F of ( (|S_v| / |S|) * Entropy(S_v) )

where S_v is the subset of examples whose value for the feature F is v, and
|S| is the size of the set S. ID3 selects the feature that maximizes this
information gain. Continuing with the tennis example above,
we show the computed values below for entropy and information gain when
the examples are split on temperature, precipitation and court-busy.
In the breakdown below, a number followed by a + or a - indicates how many examples
in that set are good or bad tennis days (respectively). E is the entropy
computed from those classification counts and information gain is computed
as
"E at the top of the tree - sum of E's at the bottom of the tree weighted
by the number of examples in each branch"
(which is exactly the formula given above). Because half of the examples
can be completely classified by just looking at temperature, that feature
has the highest information gain value.
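Worked out by hand for the four training examples above (entropies rounded to
two places), the three candidate splits come out as follows:

Split on Temperature:    whole set 2+ / 2-, E = 1.00
  Low  -> 1+ / 1-, E = 1.00
  Med  -> 0+ / 1-, E = 0.00
  High -> 1+ / 0-, E = 0.00
  Gain = 1.00 - (2/4)(1.00) - (1/4)(0.00) - (1/4)(0.00) = 0.50

Split on Precipitation:  whole set 2+ / 2-, E = 1.00
  Clear -> 2+ / 1-, E = 0.92
  Rain  -> 0+ / 1-, E = 0.00
  Gain = 1.00 - (3/4)(0.92) - (1/4)(0.00) = 0.31

Split on Court-busy:     whole set 2+ / 2-, E = 1.00
  No  -> 2+ / 1-, E = 0.92
  Yes -> 0+ / 1-, E = 0.00
  Gain = 1.00 - (3/4)(0.92) - (1/4)(0.00) = 0.31

For concreteness, here is a minimal Common Lisp sketch of the entropy and
information gain computations, again using the hypothetical list layout and
example-class accessor from the Background section rather than whatever
representation the provided code uses. The gain-maximizing choose-feature at
the end could replace the stand-in version in the recursion sketch above:

(defun entropy (examples)
  "Entropy of a set of examples: minus the sum over classes of p * log2(p)."
  (let ((total (length examples))
        (classes (remove-duplicates (mapcar #'example-class examples))))
    (- (loop for class in classes
             for p = (/ (count class examples :key #'example-class) total)
             sum (* p (log p 2))))))

(defun information-gain (examples feature)
  "Gain(S, F): the entropy of EXAMPLES minus the size-weighted entropies
of the subsets produced by splitting on FEATURE."
  (let ((total (length examples))
        (values (remove-duplicates
                 (mapcar #'(lambda (e) (nth feature e)) examples))))
    (- (entropy examples)
       (loop for value in values
             for subset = (remove-if-not
                           #'(lambda (e) (eql (nth feature e) value))
                           examples)
             sum (* (/ (length subset) total) (entropy subset))))))

(defun choose-feature (examples features)
  "Pick the remaining feature with the highest information gain."
  (let ((best (first features)))
    (dolist (feature (rest features) best)
      (when (> (information-gain examples feature)
               (information-gain examples best))
        (setq best feature)))))

For example, (information-gain *tennis-examples* 0) returns 0.5 for the
temperature split, while the precipitation and court-busy splits (features
1 and 2) both come out near 0.31, matching the hand computation above.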
Well, with that somewhat lengthy description of the algorithm you will
be using, let's move on to
The assignment
1) Download the code that
implements the ID3 algorithm and the sample data
file. Load them into Lisp and run ID3 by typing:
(setq *trace-id3* t) ;; If you want to see debugging information
(train tennis-examples)
This will print the decision tree and return a list representation of
it.
2) Come up with a data file of your own. It should include 20 or more
examples with several (5 or so) features, each with 2 or more possible
values, as well as a classification.
3) Run ID3 with your data
4) Make up a few more examples, and use the decision tree learned by
ID3 to classify them (work it out by hand, or write code to do it if you
are ambitious). How well does it do?
Extra credit:
Describe how you might change ID3 to handle any of the following extensions:
- missing data (some examples don't have data for all features)
- noisy data (some training examples have incorrect feature values or categories)
- continuous data (feature values can be real numbers instead of a set of
  discrete values)
- some other interesting extension of your own choosing