CSE 546: Data Mining (Spring 2008)

Assignment 4: Model Ensembles, Learning Theory, and SVM

Due Date: Monday, June 2, 2008 in class and submit code online (the tenth week)

  1. Model ensemble.

    • 1.1. A decision stump is a decision tree of depth one (i.e., it branches on one attribute and makes decisions). Which algorithm has higher bias: a decision stump or a perceptron? And which one has higher variance? (Assume that all attributes are either Boolean or numeric.) Which ensemble method would you use to improve the accuracy of decision stumps - bagging or boosting? Justify your answers.

    • 1.2. Implement the discrete AdaBoosting algorithm. You should make your code as modular as possible. Namely, your main module of AdaBoosting should treat the base learner as a blackbox and communicate with it via a generic interface that inputs weighted examples and outputs a classifier, which then can classify any instances. Create a wrapper for ID3 which takes a training set of weighted examples and use sampling with replacement to generate a new training set of the same size, according to the weight of each example; the wrapper then passes on this set to ID3 to obtain a classifier. Plug this wrapper and your ID3 implementation into the AdaBoost algorithm. Run experiments using the dataset from Homework 1 and answer the following questions. Use information-gain as the evaluation criterion for all experiments. For boosting you can run either 10, 20, or 30 rounds. (The higher the better.)

      • Does boosting help ID3 more when there's pruning or when there isn't? Why? Measure bias and variance in each case. What does this say about your hypothesis? (Use 1% confidence level for pruning.)

      • From now on, do not use pruning. Modify your ID3 to accept a parameter of maximum depth, beyond which no further splitting is allowed. Consider the ID3 learner alone, speculate whether its bias increase or decrease with this parameter? What about its variance? Why? Verify your conclusion by actually measuring bias and variance for different numbers of maximum depth. At the minimum, try maximum depths of 1, 5, and 20.

      • How does the accuracy obtained by ID3 alone compare with that by ID3 with boosting, when the maximum depth varies? Why? Measure bias and variance in each case. What does this say about your hypothesis?

      • How does the number of boosting rounds affect your test accuracy? Try with the maximum depth being 1 and being 20. Measure bias and variance in each case. What does this say about what you observe?

      • To learn more about bias-variance decomposition, see Pedro's ICML paper. Pedro also has a reference implementation. In this homework, generate at least 10 bootstrap samples (30 is recommended if you could afford it) from the original training set and use them when you compute the bias and variance.

  2. Learning theory.

    • 2.1. Mitchell 7.1

    • 2.2. Mitchell 7.2

  3. Support vector machine.

    • Download LIBSVM, currently the most widely used SVM implementation. Peruse the documentations to understand how to use it.

    • Download the new spam dataset, which is the same as in Homework 3 except that it is now in LIBSVM format.

    • Run LIBSVM to classify spams. Try different kernels (0-3) and use default parameters for everything else. How does it vary with different choice of kernel? How does the accuracy compare to what can be obtained using Naive Bayes and perceptron? Why?

    • Vary the number of words used in the classifier (always choosing the highest-information-gain ones available). How does the relative accuracy of NB and SVM vary? How do you explain this? Here is the list of words in descending order of information gain. For your convenience, here is the index-to-word mapping I used when creating the dataset.

  4. Turn in the following:

    • Your code for Problem 1. Submit your files here. The due date is 12PM, June 2, 2008. Your code should contain appropriate comments to facilitate understanding.

    • A report with your answers to the homework problems. For the programming task, in addition to answering the questions, you should describe in high-level how your code works.

Good luck, and have fun!