Second Project: Learning Ensembles
Due Date: Friday, June 2, 2000
In this project you will implement two ensemble learning methods
(bagging and boosting), apply them to a decision tree learner, and compare the
results experimentally.
What to do:
- Implement:
- The ID3 decision tree learner. You may treat numeric values by pre-discretizing them into equal-sized bins, and missing values by filling them in with the most frequent value (or the average, for numeric attributes); a short preprocessing sketch appears after this list. For extra credit (an additional 2% of the class grade), implement reduced-error pruning.
- The Bagging ensemble learning method, applied to ID3, using 25 replicates (see the bootstrap-sampling sketch after this list).
- The AdaBoost ensemble learning method, applied to ID3, using a maximum of 25 rounds of boosting. (Important detail: if the weight of an example ever gets below one in a billion, make it zero. This will avoid lots of niggling numerical problems.) A weight-update sketch appears after this list.
- Choose 10 or so datasets from the package emailed to the class list; pick whichever ones are most interesting to you (for some of them, more information is available in the UCI repository).
- Compare the error rates of bagging, boosting, and ID3 on these datasets. (The error rate of a learner is the percentage of test examples for which it predicts the wrong class.) To measure error rate, you may use 10-fold cross-validation: divide the examples randomly into 10 sets, and for i = 1 to 10 learn on all but set i and test on set i, then average the results. A cross-validation sketch appears after this list.
- For extra credit (an additional 5% of the class grade, for a total of 30%
for this project, or 32% if you implemented reduced-error pruning), implement
and test your own ensemble learner. Your learner can be a refinement of
bagging, a refinement of boosting, combine elements of both, or incorporate a
new approach of your own design. Your grade in this part of the project will
depend on how original and well-justified your ensemble learner is, and on how
well it does in the experiments compared to bagging and boosting.
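The sketches below are illustrative only: every identifier outside the Java standard library is invented, none of this is required code, and you should adapt the details to your own design.

For ID3 preprocessing, one reading of "equal-sized bins" is equal-width intervals; the helper below discretizes a numeric column that way and fills missing symbolic values with the most frequent value.

import java.util.*;

// Illustrative preprocessing helpers (hypothetical names, not required code).
class PreprocessSketch {

    // Map each numeric value to one of `bins` equal-width intervals between min and max.
    static int[] discretizeEqualWidth(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double width = (max - min) / bins;
        int[] binned = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = (width == 0) ? 0 : (int) ((values[i] - min) / width);
            binned[i] = Math.min(b, bins - 1);   // the maximum value falls in the last bin
        }
        return binned;
    }

    // Replace "?" in a symbolic column with the most frequent observed value.
    static void fillMostFrequent(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column)
            if (!"?".equals(v)) counts.merge(v, 1, Integer::sum);
        if (counts.isEmpty()) return;            // column is entirely missing; leave it alone
        String mode = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        for (int i = 0; i < column.length; i++)
            if ("?".equals(column[i])) column[i] = mode;
    }
}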
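For bagging, each replicate is a bootstrap sample: n examples drawn uniformly at random with replacement from the n training examples. Train ID3 on each of the 25 replicates and predict by majority vote over the resulting trees.

import java.util.*;

// Bootstrap sampling for bagging (sketch).
class BaggingSketch {
    static <T> List<T> bootstrapReplicate(List<T> examples, Random rng) {
        int n = examples.size();
        List<T> replicate = new ArrayList<>(n);
        for (int i = 0; i < n; i++)
            replicate.add(examples.get(rng.nextInt(n)));   // draw with replacement
        return replicate;
    }
}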
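For boosting, the detail worth getting right early is the weight update. The sketch below follows standard AdaBoost.M1; the Classifier interface and trainWeightedID3 are placeholders for your own ID3 code, and weights below one in a billion are zeroed as suggested above.

import java.util.*;

// AdaBoost.M1 weight-update sketch (placeholder interfaces, not required code).
interface Classifier { String predict(String[] example); }

class AdaBoostSketch {
    static final int MAX_ROUNDS = 25;
    static final double MIN_WEIGHT = 1e-9;   // weights below this are set to zero

    // Placeholder: train ID3 under the given weight distribution (or on a sample drawn from it).
    static Classifier trainWeightedID3(String[][] x, String[] y, double[] w) {
        throw new UnsupportedOperationException("plug in your ID3 learner here");
    }

    static void boost(String[][] x, String[] y) {
        int n = x.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);
        List<Classifier> hs = new ArrayList<>();
        List<Double> votes = new ArrayList<>();

        for (int round = 0; round < MAX_ROUNDS; round++) {
            Classifier h = trainWeightedID3(x, y, w);
            double eps = 0.0;                                    // weighted training error
            for (int i = 0; i < n; i++)
                if (!h.predict(x[i]).equals(y[i])) eps += w[i];
            if (eps == 0.0 || eps >= 0.5) break;                 // usual AdaBoost.M1 stopping rule
            double beta = eps / (1.0 - eps);
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                if (h.predict(x[i]).equals(y[i])) w[i] *= beta;  // shrink correctly classified examples
                if (w[i] < MIN_WEIGHT) w[i] = 0.0;               // drop negligible weights
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;             // renormalize to a distribution
            hs.add(h);
            votes.add(Math.log(1.0 / beta));                     // vote weight for the final prediction
        }
        // Final prediction: weighted majority vote over hs using votes (omitted here).
    }
}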
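Finally, a 10-fold cross-validation driver can be as simple as the following; evaluate() is a placeholder for "train the learner on the training split and return its error rate on the held-out fold".

import java.util.*;

// Minimal 10-fold cross-validation loop (sketch).
class CrossValidationSketch {
    static double evaluate(List<String[]> train, List<String[]> test) {
        throw new UnsupportedOperationException("train a learner, return its test-set error rate");
    }

    static double tenFoldErrorRate(List<String[]> examples, long seed) {
        List<String[]> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));          // random assignment to folds
        int k = 10;
        double totalError = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<String[]> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++)
                (i % k == fold ? test : train).add(shuffled.get(i));
            totalError += evaluate(train, test);                  // learn on all but fold i, test on fold i
        }
        return totalError / k;                                    // average over the 10 folds
    }
}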
What to turn in:
- The code you wrote: ID3, bagging, boosting, your own algorithm if you
implemented one, and any other code you wrote to run experiments. The code
should be clear and reasonably documented (comments, brief description of
architecture / guide to the code, and brief user manual). Also include
instructions for building the executables and any command line options the
grading program should use.
- A description of what parts of the project were done by each of the two
group members.
- A short report describing your experiments and results (and your learner, if
you implemented one). This report should have a maximum of four letter-sized
pages in 12pt font with 1" margins, including all tables and figures (six
pages if you implemented your own learner).
We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java. Other
languages may be allowed by special request.
Background reading:
(Not indispensable, but helpful.)
Recent research on learning ensembles has appeared in the
International Conference on Machine Learning, the National Conference
on Artificial Intelligence (AAAI), the International Joint Conference
on Artificial Intelligence, and others. The proceedings of these
conferences are available in the library, and many of the papers can
be found online, often from the authors' home pages. A list of home
pages of machine learning researchers is maintained by
David Aha.
Standard file formats to be used:
Your learners should accept files in C4.5 format. For a dataset named "foo",
you will have three files: foo.data, foo.test, and foo.names. Foo.data
contains the training examples and foo.test contains the test examples,
in the following format: one example per line, attribute values separated
by commas, class last, missing values represented by "?". For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
where the class is "good" or "bad". The "foo.names" file contains the definitions
of the attributes. The first line is a comma-separated list of the possible
class values. Each successive line then defines an attribute, in the order
in which they will appear in the .data and .test files. Each line is of
the form "attribute_name: continuous", if the attribute is numeric, or
"attribute_name: comma-separated list of values", if the attribute is symbolic.
Every line ends with a full stop. For example:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
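A reader for this format does not need to be elaborate. The sketch below (class and field names are invented) strips the trailing full stop from each line of foo.names and splits each line of foo.data on commas, with the class value last; it assumes each attribute definition fits on one line.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal C4.5-format reader sketch (hypothetical names, happy path only).
class DatasetSketch {
    List<String> classValues = new ArrayList<>();
    List<String> attrNames = new ArrayList<>();
    List<List<String>> attrValues = new ArrayList<>();    // an empty list marks a continuous attribute
    List<String[]> examples = new ArrayList<>();          // attribute values in order, class last

    static DatasetSketch load(String stem) throws IOException {
        DatasetSketch d = new DatasetSketch();
        List<String> names = Files.readAllLines(Paths.get(stem + ".names"));
        // First line: comma-separated list of class values, ending with a full stop.
        for (String c : stripDot(names.get(0).trim()).split(","))
            d.classValues.add(c.trim());
        // Remaining lines: "attr: continuous." or "attr: value1, value2, ...".
        for (String line : names.subList(1, names.size())) {
            line = line.trim();
            if (line.isEmpty()) continue;
            String[] parts = stripDot(line).split(":", 2);
            d.attrNames.add(parts[0].trim());
            String rhs = parts[1].trim();
            List<String> vals = new ArrayList<>();
            if (!rhs.equals("continuous"))
                for (String v : rhs.split(",")) vals.add(v.trim());
            d.attrValues.add(vals);
        }
        // One example per line: comma-separated values, class last, "?" for missing.
        for (String line : Files.readAllLines(Paths.get(stem + ".data"))) {
            line = line.trim();
            if (!line.isEmpty()) d.examples.add(line.split(","));
        }
        return d;
    }

    static String stripDot(String s) {
        return s.endsWith(".") ? s.substring(0, s.length() - 1) : s;
    }
}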
For each run, your learners should output to standard output a line
containing the error rate on the test set and the size of the model learned,
separated by white space:
error-rate-on-test-set model-size
(This is for compatibility with the toolkit described below. Since the
project doesn't require measuring model size, you can use a dummy value for
it.)
Code provided:
To help with the experimentation phase, we are providing some infrastructure.
Check out the University of Washington Data Mining Lab, which is being developed by
Geoff Hulten. Email Geoff with questions, suggestions, or bug reports.
Good luck!