Second Project: Learning Ensembles

Due Date: Wednesday, December 8, 1999


In this project you will implement two ensemble learning methods (bagging and boosting), apply them to a decision tree learner, study the results experimentally, and design and test your own improved ensemble learner.
 

What to do:

Turn in by Wednesday, December 8, 1999:

We may ask you to give a demo or take part in an oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java.  Other languages may be allowed by special request.

Recommended reading:

Recent research on learning ensembles has appeared in the International Conference on Machine Learning, the National Conference on Artificial Intelligence (AAAI), the International Joint Conference on Artificial Intelligence, and others. The proceedings of these conferences are available in the library, and many of the papers can be found online, often from the authors' home pages. A list of home pages of machine learning researchers is maintained by David Aha [http://www.aic.nrl.navy.mil/~aha/].
 

Standard file formats to be used:

Your learners should accept files in C4.5 format. For a dataset named "foo", you will have three files: foo.data, foo.test, and foo.names. The file foo.data contains the training examples and foo.test contains the test examples, both in the following format: one example per line, attribute values separated by commas, class last, missing values represented by "?". For example:

2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...

where the class is "good" or "bad". Some UCI datasets may require minor adjustments to fit this format. The "foo.names" file contains the definitions of the attributes. The first line is a comma-separated list of the possible class values. Each subsequent line then defines an attribute, in the order in which the attributes appear in the .data and .test files. Each line is of the form "attribute_name: continuous", if the attribute is numeric, or "attribute_name: comma-separated list of values", if the attribute is symbolic. Every line ends with a full stop. For example:

good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
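
The format described above can be read with very little code. The following is a minimal sketch, written in Python only for brevity (your project must still use one of the accepted languages). The function names parse_names and parse_example are hypothetical, and the choice to represent a missing value as None is an assumption, not a requirement of the project.

```python
from typing import List, Optional, Tuple

# An attribute is (name, values); values is None for a continuous attribute.
Attribute = Tuple[str, Optional[List[str]]]


def parse_names(text: str) -> Tuple[List[str], List[Attribute]]:
    """Parse the contents of a .names file into (classes, attributes)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: comma-separated class values, ending with a full stop.
    classes = [c.strip() for c in lines[0].rstrip(".").split(",")]
    attributes: List[Attribute] = []
    for ln in lines[1:]:
        name, _, spec = ln.rstrip(".").partition(":")
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), None))
        else:
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return classes, attributes


def parse_example(line: str, attributes: List[Attribute]):
    """Parse one line of a .data or .test file into (values, class_label)."""
    fields = [f.strip() for f in line.strip().split(",")]
    *raw, label = fields  # the class is always the last field
    if len(raw) != len(attributes):
        raise ValueError("expected %d attribute values, got %d"
                         % (len(attributes), len(raw)))
    values = []
    for field, (_, symbols) in zip(raw, attributes):
        if field == "?":
            values.append(None)          # missing value (assumed representation)
        elif symbols is None:
            values.append(float(field))  # continuous attribute
        else:
            values.append(field)         # symbolic attribute
    return values, label
```

Applied to the example above, parse_names would yield the two classes and sixteen attributes, and each example line would split into sixteen values followed by its class.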

For a dataset named "foo", your learners should produce an output file called "foo.out", containing a white-space-separated list of class predictions, where the ith prediction corresponds to the ith example in the foo.test file. For example:

good bad bad bad good good
bad good bad good good
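
To check a learner by hand, the predictions in foo.out can be compared with the classes in foo.test. The sketch below is in Python for brevity and is not the provided check-accuracy script; the function names are hypothetical, and it assumes only what is stated above: predictions are separated by whitespace, and the class is the last comma-separated field of each test line.

```python
from typing import List


def format_predictions(predictions: List[str]) -> str:
    """Render predictions as the contents of a .out file (one line, space-separated)."""
    return " ".join(predictions) + "\n"


def read_predictions(out_text: str) -> List[str]:
    """Split the contents of a .out file on any whitespace, including newlines."""
    return out_text.split()


def true_labels(test_text: str) -> List[str]:
    """Extract the class (last field) from each non-empty line of a .test file."""
    return [ln.rsplit(",", 1)[-1].strip()
            for ln in test_text.splitlines() if ln.strip()]


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of test examples whose predicted class matches the true class."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need one prediction per test example")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)
```

Because the ith prediction corresponds to the ith test example, the line breaks inside the .out file carry no meaning; only the order of the predictions matters.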
 

Code provided:

To help with the experimentation phase, we are providing some infrastructure. The files cross-validate.pl and check-accuracy.pl apply a series of learners to a series of datasets, measure the accuracy of each learner on each dataset by 10-fold cross-validation, and report the average accuracy.  (You only need check-accuracies.pl if you are writing your learner in Lisp and can't run it from the command line.)  As before, cross-validate.pl takes as its input a driver file of the following format:

TARGETS:
id3
bag
adaboost
INPUTS:
input1
input2
input3

where input[1-3] are the basenames of the input files (i.e., input[1-3].data and input[1-3].names). The script creates the input files for 10-fold cross-validation, runs your learners, and prints the average accuracy results. If you are programming in Lisp, run it with the -norun option; it will then create the input files but not attempt to run the targets.
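
To make the driver format and the cross-validation procedure concrete, here is a minimal sketch in Python (for brevity only). It is not the code of cross-validate.pl: the function names are hypothetical, and the way examples are assigned to folds (every tenth example, without shuffling or stratification) is an assumption made purely for illustration.

```python
from typing import Iterator, List, Tuple


def parse_driver(text: str) -> Tuple[List[str], List[str]]:
    """Read a driver file and return (targets, input basenames)."""
    sections = {"TARGETS": [], "INPUTS": []}
    current = None
    for ln in text.splitlines():
        ln = ln.strip()
        if not ln:
            continue
        if ln.endswith(":") and ln[:-1] in sections:
            current = ln[:-1]            # start of a new section
        elif current is None:
            raise ValueError("entry before any section header: " + ln)
        else:
            sections[current].append(ln)
    return sections["TARGETS"], sections["INPUTS"]


def k_fold_splits(examples: List, folds: int = 10) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs; each example is in the test set exactly once."""
    for k in range(folds):
        test = examples[k::folds]        # examples whose index i has i % folds == k
        train = [e for i, e in enumerate(examples) if i % folds != k]
        yield train, test
```

Each learner is trained on the train part and evaluated on the test part of every fold, and the ten accuracies are averaged to give the reported figure.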

Good luck!