Second Project: Learning Ensembles
Due Date: Friday, June 2, 2000
In this project you will implement two ensemble learning methods
(bagging and boosting), apply them to a decision tree learner, and compare the
results experimentally.
What to do:
- Implement:
- The ID3 decision tree learner. You may treat numeric values by pre-discretizing them into equal-sized bins, and missing values by filling them in with the most frequent value (or the average, for numeric attributes); a short preprocessing sketch appears after this list. For extra credit (an additional 2% of the class grade), implement reduced-error pruning.
- The Bagging ensemble learning method, applied to ID3, using 25 replicates (see the bootstrap-sampling sketch after this list).
- The AdaBoost ensemble learning method, applied to ID3, using a maximum of 25 rounds of boosting. (Important detail: if the weight of an example ever gets below one in a billion, make it zero. This will avoid lots of niggling numerical problems.) A weight-update sketch appears after this list.
- Choose 10 or so datasets from the package emailed to the class list; pick whichever ones are most interesting to you (for some of them, more information is available in the UCI repository).
- Compare the error rates of bagging, boosting, and ID3 on these datasets. (The error rate of a learner is the percentage of test examples for which it predicts the wrong class.) To measure error rate, you may use 10-fold cross-validation: divide the examples randomly into 10 sets, and for i = 1 to 10 learn on all but set i and test on set i, then average the results. A cross-validation sketch appears after this list.
- For extra credit (an additional 5% of the class grade, for a total of 30%
for this project, or 32% if you implemented reduced-error pruning), implement
and test your own ensemble learner. Your learner can be a refinement of
bagging, a refinement of boosting, combine elements of both, or incorporate a
new approach of your own design. Your grade in this part of the project will
depend on how original and well-justified your ensemble learner is, and on how
well it does in the experiments compared to bagging and boosting.
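The sketches below are illustrative only: every identifier outside the Java standard library is invented, none of this is required code, and you should adapt the details to your own design.

For ID3 preprocessing, one reading of "equal-sized bins" is equal-width intervals; the helper below discretizes a numeric column that way and fills missing symbolic values with the most frequent value.

import java.util.*;

// Illustrative preprocessing helpers (hypothetical names, not required code).
class PreprocessSketch {

    // Map each numeric value to one of `bins` equal-width intervals between min and max.
    static int[] discretizeEqualWidth(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double width = (max - min) / bins;
        int[] binned = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = (width == 0) ? 0 : (int) ((values[i] - min) / width);
            binned[i] = Math.min(b, bins - 1);   // the maximum value falls in the last bin
        }
        return binned;
    }

    // Replace "?" in a symbolic column with the most frequent observed value.
    static void fillMostFrequent(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column)
            if (!"?".equals(v)) counts.merge(v, 1, Integer::sum);
        if (counts.isEmpty()) return;            // column is entirely missing; leave it alone
        String mode = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        for (int i = 0; i < column.length; i++)
            if ("?".equals(column[i])) column[i] = mode;
    }
}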
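For bagging, each replicate is a bootstrap sample: n examples drawn uniformly at random with replacement from the n training examples. Train ID3 on each of the 25 replicates and predict by majority vote over the resulting trees.

import java.util.*;

// Bootstrap sampling for bagging (sketch).
class BaggingSketch {
    static <T> List<T> bootstrapReplicate(List<T> examples, Random rng) {
        int n = examples.size();
        List<T> replicate = new ArrayList<>(n);
        for (int i = 0; i < n; i++)
            replicate.add(examples.get(rng.nextInt(n)));   // draw with replacement
        return replicate;
    }
}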
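For boosting, the detail worth getting right early is the weight update. The sketch below follows standard AdaBoost.M1; the Classifier interface and trainWeightedID3 are placeholders for your own ID3 code, and weights below one in a billion are zeroed as suggested above.

import java.util.*;

// AdaBoost.M1 weight-update sketch (placeholder interfaces, not required code).
interface Classifier { String predict(String[] example); }

class AdaBoostSketch {
    static final int MAX_ROUNDS = 25;
    static final double MIN_WEIGHT = 1e-9;   // weights below this are set to zero

    // Placeholder: train ID3 under the given weight distribution (or on a sample drawn from it).
    static Classifier trainWeightedID3(String[][] x, String[] y, double[] w) {
        throw new UnsupportedOperationException("plug in your ID3 learner here");
    }

    static void boost(String[][] x, String[] y) {
        int n = x.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);
        List<Classifier> hs = new ArrayList<>();
        List<Double> votes = new ArrayList<>();

        for (int round = 0; round < MAX_ROUNDS; round++) {
            Classifier h = trainWeightedID3(x, y, w);
            double eps = 0.0;                                    // weighted training error
            for (int i = 0; i < n; i++)
                if (!h.predict(x[i]).equals(y[i])) eps += w[i];
            if (eps == 0.0 || eps >= 0.5) break;                 // usual AdaBoost.M1 stopping rule
            double beta = eps / (1.0 - eps);
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                if (h.predict(x[i]).equals(y[i])) w[i] *= beta;  // shrink correctly classified examples
                if (w[i] < MIN_WEIGHT) w[i] = 0.0;               // drop negligible weights
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;             // renormalize to a distribution
            hs.add(h);
            votes.add(Math.log(1.0 / beta));                     // vote weight for the final prediction
        }
        // Final prediction: weighted majority vote over hs using votes (omitted here).
    }
}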
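Finally, a 10-fold cross-validation driver can be as simple as the following; evaluate() is a placeholder for "train the learner on the training split and return its error rate on the held-out fold".

import java.util.*;

// Minimal 10-fold cross-validation loop (sketch).
class CrossValidationSketch {
    static double evaluate(List<String[]> train, List<String[]> test) {
        throw new UnsupportedOperationException("train a learner, return its test-set error rate");
    }

    static double tenFoldErrorRate(List<String[]> examples, long seed) {
        List<String[]> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));          // random assignment to folds
        int k = 10;
        double totalError = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<String[]> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++)
                (i % k == fold ? test : train).add(shuffled.get(i));
            totalError += evaluate(train, test);                  // learn on all but fold i, test on fold i
        }
        return totalError / k;                                    // average over the 10 folds
    }
}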
What to turn in:
- The code you wrote: ID3, bagging, boosting, your own algorithm if you
implemented one, and any other code you wrote to run experiments. The code
should be clear and reasonably documented (comments, brief description of
architecture / guide to the code, and brief user manual). Also include
instructions for building the executables and any command line options the
grading program should use.
- A description of what parts of the project were done by each of the two
group members.
- A short report describing your experiments and results (and your learner, if
you implemented one). This report should have a maximum of four letter-sized
pages in 12pt font with 1" margins, including all tables and figures (six
pages if you implemented your own learner).
We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java. Other
languages may be allowed by special request.
Background reading:
(Not indispensable, but helpful.)
Recent research on learning ensembles has appeared in the
International Conference on Machine Learning, the National Conference
on Artificial Intelligence (AAAI), the International Joint Conference
on Artificial Intelligence, and others. The proceedings of these
conferences are available in the library, and many of the papers can
be found online, often from the authors' home pages. A list of home
pages of machine learning researchers is maintained by
David Aha.
Standard file formats to be used:
Your learners should accept files in C4.5 format. For a dataset named "foo",
you will have three files: foo.data, foo.test, and foo.names. Foo.data
contains the training examples and foo.test contains the test examples,
in the following format: one example per line, attribute values separated
by commas, class last, missing values represented by "?". For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
where the class is "good" or "bad". The "foo.names" file contains the definitions
of the attributes. The first line is a comma-separated list of the possible
class values. Each successive line then defines an attribute, in the order
in which they will appear in the .data and .test files. Each line is of
the form "attribute_name: continuous", if the attribute is numeric, or
"attribute_name: comma-separated list of values", if the attribute is symbolic.
Every line ends with a full stop. For example:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
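A reader for this format does not need to be elaborate. The sketch below (class and field names are invented) strips the trailing full stop from each line of foo.names and splits each line of foo.data on commas, with the class value last; it assumes each attribute definition fits on one line.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal C4.5-format reader sketch (hypothetical names, happy path only).
class DatasetSketch {
    List<String> classValues = new ArrayList<>();
    List<String> attrNames = new ArrayList<>();
    List<List<String>> attrValues = new ArrayList<>();    // an empty list marks a continuous attribute
    List<String[]> examples = new ArrayList<>();          // attribute values in order, class last

    static DatasetSketch load(String stem) throws IOException {
        DatasetSketch d = new DatasetSketch();
        List<String> names = Files.readAllLines(Paths.get(stem + ".names"));
        // First line: comma-separated list of class values, ending with a full stop.
        for (String c : stripDot(names.get(0).trim()).split(","))
            d.classValues.add(c.trim());
        // Remaining lines: "attr: continuous." or "attr: value1, value2, ...".
        for (String line : names.subList(1, names.size())) {
            line = line.trim();
            if (line.isEmpty()) continue;
            String[] parts = stripDot(line).split(":", 2);
            d.attrNames.add(parts[0].trim());
            String rhs = parts[1].trim();
            List<String> vals = new ArrayList<>();
            if (!rhs.equals("continuous"))
                for (String v : rhs.split(",")) vals.add(v.trim());
            d.attrValues.add(vals);
        }
        // One example per line: comma-separated values, class last, "?" for missing.
        for (String line : Files.readAllLines(Paths.get(stem + ".data"))) {
            line = line.trim();
            if (!line.isEmpty()) d.examples.add(line.split(","));
        }
        return d;
    }

    static String stripDot(String s) {
        return s.endsWith(".") ? s.substring(0, s.length() - 1) : s;
    }
}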
For each run, your learners should output to standard output a line
containing the error rate on the test set and the size of the model learned,
separated by white space:
error-rate-on-test-set model-size
(This is for compatibility with the toolkit described below. Since the
project doesn't require measuring model size, you can use a dummy value for
it.)
Code provided:
To help with the experimentation phase, we are providing some infrastructure.
Check out the University of Washington Data Mining Lab, which is being developed by
Geoff Hulten. Email Geoff with questions, suggestions, or bug reports.
Good luck!