Second Project: Learning Ensembles
Due Date: Wednesday, December 8, 1999
In this project you will implement two ensemble learning methods
(bagging and boosting), apply them to a decision tree learner, study their
behavior experimentally, and design and test your own improved ensemble
learner.
What to do:
-
Form into groups of two.
-
Implement:
-
The ID3 decision tree learner with either reduced-error pruning or
chi-square pruning (see Exercise 5.6 of Dean). You may treat numeric
values by pre-discretizing them into equal-sized bins, and missing
values by filling in with the most frequent value (or the average, for
numeric attributes).
-
The Bagging ensemble learning method.
-
The AdaBoost ensemble learning method. (A sketch of the bagging and boosting loops follows this list.)
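For concreteness, here is a minimal, illustrative sketch of the two ensemble loops in Python (your submitted implementation must of course be in one of the accepted languages). It assumes a base learner with fit(examples, labels[, weights]) and predict(example) methods; this interface and all names are placeholders rather than a required design, and boosting can equally be implemented by resampling according to the weights:

import math
import random
from collections import Counter

def bagging(make_learner, examples, labels, rounds=10, rng=None):
    # Train `rounds` models, each on a bootstrap sample of the training set.
    rng = rng or random.Random(0)
    n = len(examples)
    models = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]    # sample n indices with replacement
        model = make_learner()
        model.fit([examples[i] for i in idx], [labels[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, example):
    # Unweighted majority vote over the bagged models.
    return Counter(m.predict(example) for m in models).most_common(1)[0][0]

def adaboost(make_learner, examples, labels, rounds=10):
    # AdaBoost.M1-style loop: reweight the examples after each round and
    # give each model a vote proportional to log(1/beta).
    n = len(examples)
    weights = [1.0 / n] * n
    models, alphas = [], []
    for _ in range(rounds):
        model = make_learner()
        model.fit(examples, labels, weights)          # assumes weighted training
        preds = [model.predict(x) for x in examples]
        error = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if error <= 0 or error >= 0.5:                # perfect or too weak: stop
            break
        beta = error / (1.0 - error)
        # Down-weight correctly classified examples, then renormalize.
        weights = [w * (beta if p == y else 1.0)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        models.append(model)
        alphas.append(math.log(1.0 / beta))           # this model's vote weight
    return models, alphas

def adaboost_predict(models, alphas, example):
    # Vote weighted by each model's alpha.
    scores = Counter()
    for model, alpha in zip(models, alphas):
        scores[model.predict(example)] += alpha
    return scores.most_common(1)[0][0]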
-
Download at least 10 (and ideally 20 or more) classification datasets
from the UCI
repository. In addition to using these datasets directly, you may
modify them (e.g., by adding noise or irrelevant attributes), and you
may generate synthetic datasets using a generator (or generators) of
your own design.
-
Study and compare the two ensemble methods empirically. The
fundamental dependent variable to consider is the methods' accuracy
(i.e., the percentage of test examples for which they make the correct
class prediction). To measure accuracy, you may use 10-fold
cross-validation (i.e., divide the examples randomly into 10 sets, and
for i = 1 to 10 learn on all but set i and test on set i; average the
results). Other relevant dependent variables are learning time and
comprehensibility of the models produced (e.g., roughly measured by
the total number of decision tree nodes). A sketch of the cross-validation
loop appears below.
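The following is a minimal sketch of such a 10-fold cross-validation loop in Python; the names train_fn, examples and labels are placeholders, and train_fn is assumed to return a prediction function:

import random

def cross_validate(train_fn, examples, labels, k=10, seed=0):
    # Returns the average accuracy of train_fn over k folds.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]         # k roughly equal folds
    accuracies = []
    for i in range(k):
        held_out = set(folds[i])
        train_x = [examples[j] for j in indices if j not in held_out]
        train_y = [labels[j] for j in indices if j not in held_out]
        predict = train_fn(train_x, train_y)          # learn on all but fold i
        correct = sum(1 for j in folds[i] if predict(examples[j]) == labels[j])
        accuracies.append(correct / len(folds[i]))    # test on fold i
    return sum(accuracies) / k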
- Based on your interpretation of the experimental results and your
own ideas, propose and implement a new ensemble learning method. The
goal is to produce a method that: a) outperforms both bagging and
boosting on the UCI datasets or some identifiable subset of them; and
b) is significantly different from any ensemble method previously
proposed in the literature. However, you may still obtain a passing
grade on the project if one of these two constraints is not
met. Specifically, you could implement a novel solution which doesn't
turn out to outperform bagging and boosting, or you could implement an
extension that is largely based on one in the literature. In the
first case, you should have a sensible rationale for your method, and
propose an empirically-supported explanation of why it didn't work.
In the second case, your experimental study should add to our
understanding of the algorithm. If your ensemble learner does
outperform bagging and boosting, you should provide empirical evidence
of this. Your method can refine bagging, refine boosting, combine
elements of both, extend another ensemble approach (e.g., stacking or
error-correcting output codes), or incorporate a new approach of your
own design. It can outperform bagging and boosting by
being more accurate, by producing more comprehensible results while
being similarly accurate, etc. Developing a new ensemble method that
outperforms bagging and boosting will likely involve several
iterations of design, experimentation and interpretation of
results. Below you will find a reading list that will give you some
background on the current state of ensemble research and guidance on
how to conduct your experiments. You can also use this list as a
source of inspiration, as a way to check that your ideas are new, and
as a starting point for further pointers to the literature.
Turn in by Wednesday, December 8, 1999:
-
The code you wrote: ID3, bagging, boosting, your own algorithm, and any
other code you used to run experiments. The code should be clear and reasonably
documented (comments, brief description of architecture / guide to the
code, and brief user manual). Also include instructions for building the
executables and any command line options the grading program should use.
-
A description of what parts of the project were done by each of the two
group members.
-
An article describing your proposals, experiments and results. This article
should be written as a research article for submission to a conference.
It should be at most 20 letter-sized pages in 12pt font, with 1.5 line
spacing and 1" margins, including all tables and figures but excluding
references. The project will be graded according to the AAAI
review form. Research articles typically have sections describing:
the problem they address and its motivation; the new solution(s) proposed,
and the rationale for them; empirical and/or theoretical evidence for their
superiority to previous approaches; a discussion of relations between the
new proposals and previous research, including a candid description of
the new approach's limitations and disadvantages; and directions for future
research. Citations made in the body of the paper are collected in a list
of references at the end. If the algorithm(s) you proposed didn't outperform
bagging and boosting, you can propose (and possibly test) your explanation(s)
in the empirical and/or discussion sections.
We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java.
Other languages may be allowed by special request.
Recommended reading:
-
Leo Breiman, Bagging Predictors. In Machine Learning, vol. 24, 1996.
[bagging.ps; the original link ftp://ftp.stat.berkeley.edu/pub/users/breiman/bagging.ps.Z
is broken]
-
Robert Schapire, Theoretical Views of Boosting and Applications.
In Proc. 10th International Conference on Algorithmic Learning Theory,
1999.
[http://www.research.att.com/~schapire/papers/Schapire99d.ps.gz]
-
Eric Bauer and Ron Kohavi, An Empirical Comparison of Voting Classification
Algorithms: Bagging, Boosting and Variants. In Machine Learning, vol. 36, 1999.
[http://robotics.stanford.edu/~ronnyk/vote.ps.gz]
-
Pedro Domingos, Why Does Bagging Work? A Bayesian Account and its Implications.
In Proc. KDD-97, Newport Beach, CA, 1997.
[http://www.cs.washington.edu/homes/pedrod/kdd97.ps.gz]
-
Dennis Kibler and Pat Langley, Machine Learning as an Experimental Science.
In Proc. 3rd European Working Session on Learning, 1988. [To be distributed.]
Recent research on learning ensembles has appeared in the
International Conference on Machine Learning, the National Conference
on Artificial Intelligence (AAAI), the International Joint Conference
on Artificial Intelligence, and others. The proceedings of these
conferences are available in the library, and many of the papers can
be found online, often from the authors' home pages. A list of home
pages of machine learning researchers is maintained by David Aha
[http://www.aic.nrl.navy.mil/~aha/].
Standard file formats to be used:
Your learners should accept files in C4.5 format. For a dataset named "foo",
you will have three files: foo.data, foo.test, and foo.names. Foo.data
contains the training examples and foo.test contains the test examples,
in the following format: one example per line, attribute values separated
by commas, class last, missing values represented by "?". For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
where the class is "good" or "bad". Some UCI datasets may require minor
adjustments to fit this format. The "foo.names" file contains the definitions
of the attributes. The first line is a comma-separated list of the possible
class values. Each successive line then defines an attribute, in the order
in which they will appear in the .data and .test files. Each line is of
the form "attribute_name: continuous", if the attribute is numeric, or
"attribute_name: comma-separated list of values", if the attribute is symbolic.
Every line ends with a full stop. For example:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
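As an illustration of the two input formats described above, here is a rough Python sketch of a reader (the function names are placeholders, and the sketch assumes exactly the simplified format given here):

def read_names(path):
    # Returns (class_values, attributes), where attributes is a list of
    # (name, "continuous") or (name, [value, ...]) pairs, in file order.
    with open(path) as f:
        lines = [line.strip().rstrip(".") for line in f if line.strip()]
    class_values = [v.strip() for v in lines[0].split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous"))
        else:
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return class_values, attributes

def read_examples(path):
    # Returns (attribute_value_lists, class_labels) from a .data or .test
    # file; "?" marks a missing value and the class is the last field.
    xs, ys = [], []
    with open(path) as f:
        for line in f:
            fields = [v.strip() for v in line.strip().split(",")]
            if len(fields) < 2:
                continue            # skip blank or malformed lines
            xs.append(fields[:-1])  # attribute values, in .names order
            ys.append(fields[-1])
    return xs, ys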
For a dataset named "foo", your learners should produce an output file
called "foo.out", containing a white-space-separated list of class predictions,
where the ith prediction corresponds to the ith example in the foo.test
file. For example:
good bad bad bad good good
bad good bad good good
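A corresponding sketch of writing the output file (again in Python, with placeholder names):

def write_predictions(path, predictions):
    # predictions[i] is the predicted class of the ith example in foo.test.
    with open(path, "w") as f:
        f.write(" ".join(predictions) + "\n")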
Code provided:
To help with the experimentation phase, we are providing some infrastructure.
The files cross-validate.pl
and check-accuracy.pl
will apply a series of learners to a series of datasets, measure the
accuracy of each learner on each dataset by 10-fold cross-validation, and
return the average accuracy. (You only need check-accuracy.pl if
you are writing your learners in Lisp and can't run them from the command
line.) As before, cross-validate.pl takes as its input a driver file
of the format:
TARGETS:
id3
bag
adaboost
INPUTS:
input1
input2
input3
where input[1-3] are the basenames of the input files (i.e., input[1-3].data
and input[1-3].names). It will create the input files for 10-fold
cross-validation, run your learners, and print the average accuracy results.
If you are programming in Lisp, run it with the -norun option; it will
create the input files but not attempt to run the targets.
Good luck!