Second Project: Learning Ensembles

Due Date: Wednesday, December 6, 2000

In this project you will implement the boosting ensemble learning method, apply it to a decision tree learner, and compare the experimental results to the base decision tree learner.

What to do:

Implement:
The AdaBoost ensemble learning method using the C4.5 decision tree learner as the base learner, and a maximum of 25 rounds of boosting. (Important detail: If the weight of an example ever gets below one in a billion, make it zero. This will avoid lots of niggly problems.) C4.5 will be made available via the University of Washington Data Mining Lab (UWML) infrastructure, which provides “wrapper” code for running C4.5 and acquiring the trees it generates.
Choose 10 or so datasets from the UWML; choose whichever ones are most interesting to you (for some of them, more info is available in the UCI repository).
Compare the error rate of boosting with the un-augmented C4.5 learner on these datasets. (The error rate of a learner is the percentage of test examples for which it predicts the wrong class.) To measure error rate, you may use 10-fold cross-validation (i.e., divide the examples randomly into 10 sets, and for i = 1 to 10 learn on all but set i and test on set i; average the results).
For extra credit, implement and test your own ensemble learner. Your learner can be a refinement of boosting, an application of bagging to C4.5, or it can incorporate a new approach of your own design. Your grade in this part of the project will depend on how original and well-justified your ensemble learner is, and on how well it does in the experiments compared to boosting.

What to turn in:

The code you wrote: boosting applied to C4.5, your own algorithm if you implemented one, and any other code you wrote to run experiments. The code should be clear and reasonably documented (comments, brief description of architecture / guide to the code, and brief user manual). Also include instructions for building the executables and any command line options the grading program should use.
A description of which parts of the project were done by each of the two group members.
A short report describing your experiments and results (and your learner, if you implemented one). This report should have a maximum of four letter-sized pages in 12pt font with 1" margins, including all tables and figures (six pages if you implemented your own learner).

We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java. Other languages may be allowed by special request.

Background reading:

(Not indispensable, but helpful.) NOTE: The basic reference for implementing boosting is Dietterich’s paper.

Robert Schapire, Theoretical Views of Boosting and Applications. In Proc. 10th International Conference on Algorithmic Learning Theory, 1999.
[http://www.cs.washington.edu/education/courses/473/00sp/Schapire99d.ps]
Eric Bauer and Ron Kohavi, An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants. In Machine Learning, vol. 36.
[http://www.cs.washington.edu/education/courses/473/00sp/vote.ps]
Leo Breiman, Bagging Predictors. In Machine Learning, vol. 24.
[ftp://ftp.stat.berkeley.edu/pub/users/breiman/bagging.ps]

Recent research on learning ensembles has appeared in the International Conference on Machine Learning, the National Conference on Artificial Intelligence (AAAI), the International Joint Conference on Artificial Intelligence, and others. The proceedings of these conferences are available in the library, and many of the papers can be found online, often from the authors' home pages. A list of home pages of machine learning researchers is maintained by David Aha.

Standard file formats to be used:

Your learner(s) should accept files in C4.5 format. For a dataset named "foo", you will have three files: foo.data, foo.test, and foo.names. Foo.data contains the training examples and foo.test contains the test examples, in the following format: one example per line, attribute values separated by commas, class last, missing values represented by "?". For example:

2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...

where the class is "good" or "bad". The "foo.names" file contains the definitions of the attributes. The first line is a comma-separated list of the possible class values. Each successive line then defines an attribute, in the order in which they will appear in the .data and .test files. Each line is of the form "attribute_name: continuous", if the attribute is numeric, or "attribute_name: comma-separated list of values", if the attribute is symbolic. Every line ends with a full stop. For example:

good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average=2E
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.

For each run, your learners should output to standard output a line containing the error rate on the test set and the size of the model learned, separated by white space:

error-rate-on-test-set model-size

(This is for compatibility with the toolkit described below. Since the project doesn't require measuring model size, you can use a dummy value for it.)

Code provided:

The University of Washington Data Mining Lab is being developed by Geoff Hulten. Email Geoff with questions, suggestions or bug reports pertaining to UWML.

Good luck!