In this project you will implement the boosting ensemble learning method, apply it to a decision tree learner, and compare the experimental results with those of the base decision tree learner.
We may ask you to do a demo / oral discussion of the project.
Acceptable languages for the project are: LISP, C/C++, and Java. Other languages may be allowed by special request.
NOTE: The basic reference for implementing boosting is Dietterich’s paper (reading it is helpful, though not indispensable).
Recent research on learning ensembles has appeared in the International Conference on Machine Learning, the National Conference on Artificial Intelligence (AAAI), the International Joint Conference on Artificial Intelligence, and others. The proceedings of these conferences are available in the library, and many of the papers can be found online, often from the authors' home pages. A list of home pages of machine learning researchers is maintained by David Aha.
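If you want a starting point, the sketch below outlines one widely used boosting variant, AdaBoost.M1 (Freund and Schapire); check the reference above for the exact variant you are expected to implement. The Learner, Model, and Example types are hypothetical placeholders for your own decision tree code, not part of any required interface:

// Minimal AdaBoost.M1 sketch (an assumption: this may not be the exact
// variant required). Learner, Model, and Example are hypothetical
// stand-ins for your own decision tree code.
import java.util.Arrays;
import java.util.List;

interface Model { String classify(Example e); }
interface Learner { Model train(List<Example> data, double[] weights); }
class Example { String[] attrs; String label; }

class AdaBoost {
    Model[] models;
    double[] alphas;

    void train(Learner base, List<Example> data, int rounds) {
        int n = data.size();
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                       // uniform initial weights
        models = new Model[rounds];
        alphas = new double[rounds];
        for (int t = 0; t < rounds; t++) {
            Model h = base.train(data, w);             // weak learner on weighted data
            double err = 0;                            // weighted training error
            for (int i = 0; i < n; i++)
                if (!h.classify(data.get(i)).equals(data.get(i).label))
                    err += w[i];
            if (err == 0 || err >= 0.5) break;         // too good or too weak: stop
            double beta = err / (1 - err);
            double sum = 0;
            for (int i = 0; i < n; i++) {              // down-weight correct examples
                if (h.classify(data.get(i)).equals(data.get(i).label))
                    w[i] *= beta;
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;   // renormalize to a distribution
            models[t] = h;
            alphas[t] = Math.log(1 / beta);            // vote weight of round t
        }
    }

    String classify(Example e, String[] classes) {     // weighted majority vote
        double[] votes = new double[classes.length];
        for (int t = 0; t < models.length && models[t] != null; t++)
            for (int c = 0; c < classes.length; c++)
                if (classes[c].equals(models[t].classify(e)))
                    votes[c] += alphas[t];
        int best = 0;
        for (int c = 1; c < classes.length; c++)
            if (votes[c] > votes[best]) best = c;
        return classes[best];
    }
}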
Your learner(s) should accept files in C4.5 format. For a dataset named "foo", you will have three files: foo.data, foo.test, and foo.names. The file foo.data contains the training examples and foo.test contains the test examples, both in the following format: one example per line, attribute values separated by commas, class last, missing values represented by "?". For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
where the class is "good" or "bad". The "foo.names" file contains the definitions of the attributes. The first line is a comma-separated list of the possible class values. Each successive line then defines an attribute, in the order in which they will appear in the .data and .test files. Each line is of the form "attribute_name: continuous", if the attribute is numeric, or "attribute_name: comma-separated list of values", if the attribute is symbolic. Every line ends with a full stop. For example:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
holidays: continuous.
vacation: average, generous, below_average.
lngtrm_disabil: yes, no.
dntl_ins: half, none, full.
bereavement: yes, no.
empl_hplan: half, full, none.
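As a starting point, here is a minimal, self-contained sketch of a reader for these files (the C45Reader class and the raw String[] representation are placeholders; a real implementation will want typed attribute values):

// Minimal reader for the file formats described above.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class C45Reader {
    // Reads foo.data or foo.test: one example per line, comma-separated,
    // class last, "?" for missing values.
    static List<String[]> readExamples(String path) throws IOException {
        List<String[]> examples = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty())
                    examples.add(line.split(","));   // last field is the class
            }
        }
        return examples;
    }

    // Reads foo.names: the first entry is the list of class values; each
    // later entry is an attribute name followed by "continuous" or by its
    // list of symbolic values. The trailing full stop is stripped.
    static List<String[]> readNames(String path) throws IOException {
        List<String[]> defs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.endsWith(".")) line = line.substring(0, line.length() - 1);
                defs.add(line.split("[:,]\\s*"));    // split on ":" and ","
            }
        }
        return defs;
    }
}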
For each run, your learners should output to standard output a line containing the error rate on the test set and the size of the model learned, separated by white space:
error-rate-on-test-set model-size
(This is for compatibility with the toolkit described below. Since the project doesn't require measuring model size, you can use a dummy value for it.)
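For example, a hypothetical end-of-run helper (the Report class and its arguments are placeholders, not part of the toolkit):

// Hypothetical end-of-run reporting: error rate on the test set and a
// dummy model size, whitespace-separated, on standard output.
class Report {
    static void print(int misclassified, int numTestExamples) {
        double errorRate = (double) misclassified / numTestExamples;
        System.out.println(errorRate + " " + 1);   // 1 is a dummy model size
    }
}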
The University of Washington Data Mining Lab (UWML) toolkit is being developed by Geoff Hulten. Email Geoff with questions, suggestions, or bug reports pertaining to UWML.
Good luck!