Second Project: Learning Ensembles
Due Date: Wednesday, December 8, 1999
In this project you will implement two ensemble learning methods
(bagging and boosting), apply them to a decision tree learner, study their
behavior experimentally, and design and test your own improved ensemble
learner.
What to do:
-
Form into groups of two.
-
Implement:
-
The ID3 decision tree learner with either reduced-error pruning or
chi-square pruning (see Exercise 5.6 of Dean). You may treat numeric
values by pre-discretizing them into equal-sized bins, and missing
values by filling in with the most frequent value (or the average, for
numeric attributes).
-
The Bagging ensemble learning method.
-
The AdaBoost ensemble learning method. (A sketch of the bagging and boosting loops follows this list.)
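For concreteness, here is a minimal, illustrative sketch of the two ensemble loops in Python (your submitted implementation must of course be in one of the accepted languages). It assumes a base learner with fit(examples, labels[, weights]) and predict(example) methods; this interface and all names are placeholders rather than a required design, and boosting can equally be implemented by resampling according to the weights:

import math
import random
from collections import Counter

def bagging(make_learner, examples, labels, rounds=10, rng=None):
    # Train `rounds` models, each on a bootstrap sample of the training set.
    rng = rng or random.Random(0)
    n = len(examples)
    models = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]    # sample n indices with replacement
        model = make_learner()
        model.fit([examples[i] for i in idx], [labels[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, example):
    # Unweighted majority vote over the bagged models.
    return Counter(m.predict(example) for m in models).most_common(1)[0][0]

def adaboost(make_learner, examples, labels, rounds=10):
    # AdaBoost.M1-style loop: reweight the examples after each round and
    # give each model a vote proportional to log(1/beta).
    n = len(examples)
    weights = [1.0 / n] * n
    models, alphas = [], []
    for _ in range(rounds):
        model = make_learner()
        model.fit(examples, labels, weights)          # assumes weighted training
        preds = [model.predict(x) for x in examples]
        error = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if error <= 0 or error >= 0.5:                # perfect or too weak: stop
            break
        beta = error / (1.0 - error)
        # Down-weight correctly classified examples, then renormalize.
        weights = [w * (beta if p == y else 1.0)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        models.append(model)
        alphas.append(math.log(1.0 / beta))           # this model's vote weight
    return models, alphas

def adaboost_predict(models, alphas, example):
    # Vote weighted by each model's alpha.
    scores = Counter()
    for model, alpha in zip(models, alphas):
        scores[model.predict(example)] += alpha
    return scores.most_common(1)[0][0]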
-
Download at least 10 (and ideally 20 or more) classification datasets
from the UCI
repository. In addition to using these datasets directly, you may
modify them (e.g., by adding noise or irrelevant attributes), and you
may generate synthetic datasets using a generator (or generators) of
your own design.
-
Study and compare the two ensemble methods empirically. The
fundamental dependent variable to consider is the methods' accuracy
(i.e., the percentage of test examples for which they make the correct
class prediction). To measure accuracy, you may use 10-fold
cross-validation (i.e., divide the examples randomly into 10 sets, and
for i = 1 to 10 learn on all but set i and test on set i; average the
results). Other relevant dependent variables are learning time and
comprehensibility of the models produced (e.g., roughly measured by
the total number of decision tree nodes). A sketch of the cross-validation
loop appears below.
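The following is a minimal sketch of such a 10-fold cross-validation loop in Python; the names train_fn, examples and labels are placeholders, and train_fn is assumed to return a prediction function:

import random

def cross_validate(train_fn, examples, labels, k=10, seed=0):
    # Returns the average accuracy of train_fn over k folds.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]         # k roughly equal folds
    accuracies = []
    for i in range(k):
        held_out = set(folds[i])
        train_x = [examples[j] for j in indices if j not in held_out]
        train_y = [labels[j] for j in indices if j not in held_out]
        predict = train_fn(train_x, train_y)          # learn on all but fold i
        correct = sum(1 for j in folds[i] if predict(examples[j]) == labels[j])
        accuracies.append(correct / len(folds[i]))    # test on fold i
    return sum(accuracies) / k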
- Based on your interpretation of the experimental results and your
own ideas, propose and implement a new ensemble learning method. The
goal is to produce a method that: a) outperforms both bagging and
boosting on the UCI datasets or some identifiable subset of them; and
b) is significantly different from any ensemble method previously
proposed in the literature. However, you may still obtain a passing
grade on the project if one of these two constraints is not
met. Specifically, you could implement a novel solution which doesn't
turn out to outperform bagging and boosting, or you could implement an
extension that is largely based on one in the literature. In the
first case, you should have a sensible rationale for your method, and
propose an empirically-supported explanation of why it didn't work.
In the second case, your experimental study should add to our
understanding of the algorithm. If your ensemble learner does
outperform bagging and boosting, you should provide empirical evidence
of this. Your method can refine bagging, refine boosting, combine
elements of both, extend another ensemble approach (e.g., stacking or
error-correcting output codes), or incorporate a new approach of your
own design. It can outperform bagging and boosting by
being more accurate, by producing more comprehensible results while
being similarly accurate, etc. Developing a new ensemble method that
outperforms bagging and boosting will likely involve several
iterations of design, experimentation and interpretation of
results. Below you will find a reading list that will give you some
background on the current state of ensemble research and guidance on
how to conduct your experiments. You can also use this list as a
source of inspiration, as a way to check that your ideas are new, and
as a starting point for further pointers to the literature.
Turn in by Wednesday, December 8, 1999:
-
The code you wrote: ID3, bagging, boosting, your own algorithm, and any
other code you used to run experiments. The code should be clear and reasonably
documented (comments, brief description of architecture / guide to the
code, and brief user manual). Also include instructions for building the
executables and any command line options the grading program should use.
-
A description of what parts of the project were done by each of the two
group members.
-
An article describing your proposals, experiments and results. This article
should be written as a research article for submission to a conference.
It should be at most 20 letter-sized pages in 12pt font, with 1.5 line
spacing and 1" margins, including all tables and figures but excluding
references. The project will be graded according to the AAAI
review form. Research articles typically have sections describing:
the problem they address and its motivation; the new solution(s) proposed,
and the rationale for them; empirical and/or theoretical evidence for their
superiority to previous approaches; a discussion of relations between the
new proposals and previous research, including a candid description of
the new approach's limitations and disadvantages; and directions for future
research. Citations made in the body of the paper are collected in a list
of references at the end. If the algorithm(s) you proposed didn't outperform
bagging and boosting, you can propose (and possibly test) your explanation(s)
in the empirical and/or discussion sections.
We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java.
Other languages may be allowed by special request.
Recommended reading:
-
Leo Breiman, Bagging Predictors. In Machine Learning, vol. 24, 1996.
[bagging.ps; the original link ftp://ftp.stat.berkeley.edu/pub/users/breiman/bagging.ps.Z
is broken]
-
Robert Schapire, Theoretical Views of Boosting and Applications.
In Proc. 10th International Conference on Algorithmic Learning Theory,
1999.
[http://www.research.att.com/~schapire/papers/Schapire99d.ps.gz]
-
Eric Bauer and Ron Kohavi, An Empirical Comparison of Voting Classification
Algorithms: Bagging, Boosting and Variants. In Machine Learning, vol. 36, 1999.
[http://robotics.stanford.edu/~ronnyk/vote.ps.gz]
-
Pedro Domingos, Why Does Bagging Work? A Bayesian Account and its Implications.
In Proc. KDD-97, Newport Beach, CA, 1997.
[http://www.cs.washington.edu/homes/pedrod/kdd97.ps.gz]
-
Dennis Kibler and Pat Langley, Machine Learning as an Experimental Science.
In Proc. 3rd European Working Session on Learning, 1988. [To be distributed.]
Recent research on learning ensembles has appeared in the
International Conference on Machine Learning, the National Conference
on Artificial Intelligence (AAAI), the International Joint Conference
on Artificial Intelligence, and others. The proceedings of these
conferences are available in the library, and many of the papers can
be found online, often from the authors' home pages. A list of home
pages of machine learning researchers is maintained by David Aha
[http://www.aic.nrl.navy.mil/~aha/].
Standard file formats to be used:
Your learners should accept files in C4.5 format. For a dataset named "foo",
you will have three files: foo.data, foo.test, and foo.names. Foo.data
contains the training examples and foo.test contains the test examples,
in the following format: one example per line, attribute values separated
by commas, class last, missing values represented by "?". For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
where the class is "good" or "bad". Some UCI datasets may require minor
adjustments to fit this format. The "foo.names" file contains the definitions
of the attributes. The first line is a comma-separated list of the possible
class values. Each successive line then defines an attribute, in the order
in which they will appear in the .data and .test files. Each line is of
the form "attribute_name: continuous", if the attribute is numeric, or
"attribute_name: comma-separated list of values", if the attribute is symbolic.
Every line ends with a full stop. For example:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
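As an illustration of the two input formats described above, here is a rough Python sketch of a reader (the function names are placeholders, and the sketch assumes exactly the simplified format given here):

def read_names(path):
    # Returns (class_values, attributes), where attributes is a list of
    # (name, "continuous") or (name, [value, ...]) pairs, in file order.
    with open(path) as f:
        lines = [line.strip().rstrip(".") for line in f if line.strip()]
    class_values = [v.strip() for v in lines[0].split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous"))
        else:
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return class_values, attributes

def read_examples(path):
    # Returns (attribute_value_lists, class_labels) from a .data or .test
    # file; "?" marks a missing value and the class is the last field.
    xs, ys = [], []
    with open(path) as f:
        for line in f:
            fields = [v.strip() for v in line.strip().split(",")]
            if len(fields) < 2:
                continue            # skip blank or malformed lines
            xs.append(fields[:-1])  # attribute values, in .names order
            ys.append(fields[-1])
    return xs, ys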
For a dataset named "foo", your learners should produce an output file
called "foo.out", containing a white-space-separated list of class predictions,
where the ith prediction corresponds to the ith example in the foo.test
file. For example:
good bad bad bad good good
bad good bad good good
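A corresponding sketch of writing the output file (again in Python, with placeholder names):

def write_predictions(path, predictions):
    # predictions[i] is the predicted class of the ith example in foo.test.
    with open(path, "w") as f:
        f.write(" ".join(predictions) + "\n")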
Code provided:
To help with the experimentation phase, we are providing some infrastructure.
The files cross-validate.pl
and check-accuracy.pl
will apply a series of learners to a series of datasets, measure the
accuracy of each learner on each dataset by 10-fold cross-validation, and
return the average accuracy. (You only need check-accuracy.pl if
you are writing your learners in Lisp and can't run them from the command
line.) As before, cross-validate.pl takes as its input a driver file
of the format:
TARGETS:
id3
bag
adaboost
INPUTS:
input1
input2
input3
where input[1-3] are the basenames of the input files (i.e., input[1-3].data
and input[1-3].names). It will create the input files for 10-fold
cross-validation, run your learners, and print the average accuracy results.
If you are programming in Lisp, run it with the -norun option; it will
create the input files but not attempt to run the targets.
Good luck!