Second Project: Learning Ensembles

Due Date: Friday, June 2, 2000

In this project you will implement two ensemble learning methods (bagging and boosting), apply them to a decision tree learner, and compare the results experimentally.
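To make the two methods concrete, here is a minimal sketch of bagging in Java (one of the acceptable languages). The Learner and Classifier interfaces are hypothetical stand-ins for your own decision-tree code, not part of any provided library. Boosting (e.g., AdaBoost) follows the same train-then-vote structure, but reweights the training examples after each round and takes a weighted rather than unweighted vote.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-ins for your own decision-tree learner.
interface Classifier { String classify(String[] example); }
interface Learner { Classifier train(List<String[]> xs, List<String> ys); }

class Bagger {
    // Train `rounds` classifiers, each on a bootstrap sample of the
    // training set (drawn with replacement, same size as the original).
    static List<Classifier> bag(Learner base, List<String[]> xs,
                                List<String> ys, int rounds, long seed) {
        Random rng = new Random(seed);
        List<Classifier> ensemble = new ArrayList<>();
        int n = xs.size();
        for (int t = 0; t < rounds; t++) {
            List<String[]> sx = new ArrayList<>(n);
            List<String> sy = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                int j = rng.nextInt(n);  // sample an index with replacement
                sx.add(xs.get(j));
                sy.add(ys.get(j));
            }
            ensemble.add(base.train(sx, sy));
        }
        return ensemble;
    }

    // Classify by unweighted majority vote over the ensemble.
    static String vote(List<Classifier> ensemble, String[] example) {
        Map<String, Integer> counts = new HashMap<>();
        for (Classifier c : ensemble)
            counts.merge(c.classify(example), 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}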

What to do:

What to turn in:

We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java. Other languages may be allowed by special request.

Background reading:

(Not indispensable, but helpful.) Recent research on learning ensembles has appeared in the International Conference on Machine Learning, the National Conference on Artificial Intelligence (AAAI), the International Joint Conference on Artificial Intelligence, and others. The proceedings of these conferences are available in the library, and many of the papers can be found online, often from the authors' home pages. A list of home pages of machine learning researchers is maintained by David Aha.

Standard file formats to be used:

Your learners should accept files in C4.5 format. For a dataset named "foo", there will be three files: foo.data, foo.test, and foo.names. The foo.data file contains the training examples and foo.test contains the test examples, in the following format: one example per line, attribute values separated by commas, the class value last, and missing values represented by "?". For example:

2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...

where the class is "good" or "bad". The "foo.names" file contains the definitions of the attributes. Its first line is a comma-separated list of the possible class values. Each subsequent line defines one attribute, in the order in which the attributes appear in the .data and .test files. Each such line has the form "attribute_name: continuous" if the attribute is numeric, or "attribute_name: comma-separated list of values" if it is symbolic. Every line ends with a full stop. For example:

good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
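As a sanity check on the file handling, the following is a minimal sketch of reading a .data or .test file in this format (Java; the class name is hypothetical). Missing values are kept as the literal "?" string, leaving it to the learner to decide how to treat them. Parsing foo.names is analogous: strip the trailing full stop, split the first line on commas for the class values, and split each attribute line on ":".

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class C45Reader {
    // Each example comes back as its attribute values followed by the
    // class label, exactly as in the file; "?" marks a missing value.
    static List<String[]> readExamples(String path) throws IOException {
        List<String[]> examples = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;           // skip blank lines
                examples.add(line.split("\\s*,\\s*"));  // comma-separated fields
            }
        }
        return examples;
    }
}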

For each run, your learners should write to standard output a single line containing the error rate on the test set and the size of the learned model, separated by whitespace:

error-rate-on-test-set model-size

(This is for compatibility with the toolkit described below. Since the project doesn't require measuring model size, you can use a dummy value for it.)
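For example, a run that misclassifies 38 of 250 test examples could end with something like the following (a sketch; the method name is hypothetical, and 0 serves as the dummy model size):

class Report {
    // errors: number of misclassified test examples; total: test-set size.
    // The second field is a dummy model size, as allowed above.
    static void report(int errors, int total) {
        System.out.printf("%.4f %d%n", errors / (double) total, 0);
    }
}

Called as report(38, 250), this prints "0.1520 0".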

Code provided:

To help with the experimentation phase, we are providing some infrastructure. Check out the University of Washington Data Mining Lab, which is being developed by Geoff Hulten. Email Geoff with questions, suggestions, or bug reports.

Good luck!