Due Date: Thursday, Mar 4, 2010. Hand in a copy of the report in class and submit your code online (the 9th week).
- Model ensembles.
- 1.1. A decision stump is a decision tree of depth one
(i.e., it branches on one attribute and then immediately makes
its predictions). Which algorithm has higher bias: a decision stump
or a perceptron? Which one has higher variance? (Assume
that all attributes are either Boolean or numeric.) Which
ensemble method would you use to improve the accuracy of
decision stumps, bagging or boosting? Justify your answers.
- 1.2. Implement the bagging and discrete AdaBoost
algorithms. Make your code as modular as
possible. In particular, your main bagging and AdaBoost modules
should treat the base learner as a black box and communicate
with it via a generic interface that takes weighted examples as input
and outputs a classifier, which can then classify any
instance. Create a wrapper for ID3 that takes a training set
of weighted examples and uses sampling with replacement to
generate a new training set of the same size, drawing each example
with probability proportional to its weight; the wrapper then passes this set to
ID3 to obtain a classifier. Plug this wrapper and your ID3
implementation into both the bagging and the AdaBoost
algorithms. (A minimal sketch of this interface and of both
ensemble methods appears after this problem.) Run experiments
using the dataset from Homework 1
and answer the following questions. Use information gain as
the evaluation criterion in all experiments. For bagging, try
9, 17, or 33 samplings; for boosting, run 10, 20, or 30 rounds.
(The higher, the better.)
- Does boosting help ID3 more when there is pruning or when
there isn't? Why? Measure bias and variance in each case. What
does this say about your hypothesis? (Use a 99% confidence level
for pruning.)
- From now on, do not use pruning. Modify your ID3 to accept
a maximum-depth parameter, beyond which no further
splitting is allowed. Considering the ID3 learner alone,
do you expect its bias to increase or decrease with this
parameter? What about its variance? Why? Verify your
conclusions by actually measuring bias and variance at
different maximum depths. At a minimum, try
maximum depths of 1, 5, and 20.
- How does the accuracy obtained by ID3 with bagging compare
with that obtained by ID3 with boosting as the maximum depth varies?
Why? Measure bias and variance in each case. What does this
say about your hypothesis?
- How does the number of boosting rounds affect your test
accuracy? Try maximum depths of 1 and 20. Measure bias and
variance in each case. How does this explain what you observe?
- To learn more about the bias-variance decomposition,
see Pedro's ICML paper. Pedro also has a reference
implementation. In this homework, generate at least 10
bootstrap samples (30 is recommended if you can afford it)
from the original training set and use them when you compute
the bias and variance. (A sketch of this estimation procedure
also appears after this problem.)
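To make the generic interface in Problem 1.2 concrete, here is a minimal Python sketch of the resampling wrapper, bagging, and discrete AdaBoost. It assumes examples are (x, y) pairs with labels in {-1, +1} and that the wrapped base learner (e.g., your ID3) is a function from a list of examples to a classifier; all names here are illustrative, not a required design.

```python
import math
import random

def resampling_wrapper(base_learn, seed=0):
    """Adapt an unweighted learner (e.g., ID3) to the weighted-example
    interface: draw a same-size sample with replacement, each example
    chosen with probability proportional to its weight, then train the
    base learner on that sample."""
    rng = random.Random(seed)
    def weighted_learn(data, weights):        # data: list of (x, y) pairs
        sample = rng.choices(data, weights=weights, k=len(data))
        return base_learn(sample)             # classifier: x -> -1 or +1
    return weighted_learn

def bagging(data, weighted_learn, rounds):
    """Bagging is the degenerate case: uniform weights in every round, so
    the wrapper's resampling is exactly the ordinary bootstrap sample."""
    uniform = [1.0] * len(data)
    hs = [weighted_learn(data, uniform) for _ in range(rounds)]
    return lambda x: 1 if sum(h(x) for h in hs) >= 0 else -1   # majority vote

def adaboost(data, weighted_learn, rounds):
    """Discrete AdaBoost for two-class labels in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n
    hs, alphas = [], []
    for _ in range(rounds):
        h = weighted_learn(data, w)
        err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
        if err == 0.0:                 # perfect learner: give it a large vote
            hs.append(h)
            alphas.append(10.0)        # arbitrary large weight; a design choice
            break
        if err >= 0.5:                 # no better than chance: stop early
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        hs.append(h)
        alphas.append(alpha)
        # Up-weight misclassified examples, down-weight the rest, renormalize.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
```

Your ID3 wrapper plugs in as base_learn; bagging then calls the wrapper with uniform weights every round, while AdaBoost reweights the examples between rounds.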
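In the same spirit, a sketch of the bootstrap bias-variance estimation for zero-one loss, following the idea in Pedro's paper: the "main prediction" at a test point is the most frequent prediction across the bootstrap-trained models; bias is whether that main prediction is wrong, and variance is how often individual models deviate from it. The function signature and defaults below are assumptions, not the reference implementation.

```python
import random
from collections import Counter

def bias_variance(train, test, learn, n_boot=10, seed=0):
    """Bootstrap bias/variance estimate for zero-one loss. train and test
    are lists of (x, y) pairs; learn maps a training set to a classifier."""
    rng = random.Random(seed)
    models = [learn(rng.choices(train, k=len(train))) for _ in range(n_boot)]
    bias_sum, var_sum = 0.0, 0.0
    for x, y in test:
        preds = [m(x) for m in models]
        main = Counter(preds).most_common(1)[0][0]    # main prediction
        bias_sum += float(main != y)                  # biased at this point?
        var_sum += sum(p != main for p in preds) / float(len(preds))
    return bias_sum / len(test), var_sum / len(test)
```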
- Learning theory.
- 2.1. Mitchell 7.3
- 2.2. Mitchell 7.4
- Support vector machines.
- Download LIBSVM,
currently the most widely used SVM implementation. Peruse the
documentation to understand how to use it.
- Download the new promoters
dataset, which is the same as in Homework 1 except that it
is now in LIBSVM format.
- Run LIBSVM to classify promoters. Try the different kernels (types 0-3)
and use default parameters for everything else. How does the accuracy
vary with the choice of kernel? (A sketch using LIBSVM's Python
bindings appears at the end of this section.)
- Modify your perceptron classifier from Homework 3 to apply
to this dataset. How does its accuracy compare to what can be
obtained using LIBSVM? Why?
- Vary the number of features used in the classifier (always
choosing the ones with the highest information gain among those
available). How does the relative accuracy of the decision tree
and the SVM vary? How do you explain this? (See the
information-gain sketch at the end of this section.)
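One way to run the kernel comparison above, assuming you use the Python bindings bundled with LIBSVM (python/svmutil.py) rather than the svm-train/svm-predict command-line tools; the data file names below are placeholders.

```python
# Assumes LIBSVM's bundled Python bindings (python/svmutil.py) are importable,
# and that the promoters data has been split into these (placeholder) files.
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('promoters.train')
y_test, x_test = svm_read_problem('promoters.test')

# -t selects the kernel: 0 = linear, 1 = polynomial, 2 = RBF, 3 = sigmoid;
# everything else stays at LIBSVM's defaults, as the problem asks.
for t in range(4):
    model = svm_train(y_train, x_train, '-t %d' % t)
    labels, (acc, mse, scc), values = svm_predict(y_test, x_test, model)
    print('kernel %d: accuracy %.2f%%' % (t, acc))
```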
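And a sketch of the information-gain feature ranking for the last question, assuming Boolean attributes stored as dense per-example lists; the data layout and function names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y)."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(labels).values())

def information_gain(xs, ys, f):
    """Gain(Y; X_f) = H(Y) - sum_v P(X_f = v) * H(Y | X_f = v)."""
    by_value = {}
    for x, y in zip(xs, ys):
        by_value.setdefault(x[f], []).append(y)
    remainder = sum(len(g) / float(len(ys)) * entropy(g)
                    for g in by_value.values())
    return entropy(ys) - remainder

def top_k_features(xs, ys, k):
    """Indices of the k features with the highest information gain."""
    gains = [(information_gain(xs, ys, f), f) for f in range(len(xs[0]))]
    return [f for _, f in sorted(gains, reverse=True)[:k]]
```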
- Turn in the following:
- Your code for Problem 1. Submit your
files here. The
due date is 12 noon, March 4, 2010. Your code should contain
appropriate comments to facilitate understanding.
- A report with your answers to the homework problems. For the programming task, in addition to answering the questions, describe at a high level how your code works.
Good luck, and have fun!