CSE 446: Machine Learning (Winter 2009)

Assignment 4: Model Ensembles, Learning Theory, and SVM

Due Date: Friday, March 13, 2009 (the ninth week). Turn in the written portion in class and submit your code online.

  1. Model ensemble.

    • 1.1. A decision stump is a decision tree of depth one (i.e., it branches on only one attribute and then makes a decision). Compare the bias and variance of the following learning algorithms: decision stump, Naive Bayes, and perceptron. For each of these algorithms, which ensemble method would you use to improve its accuracy? Explain.

    • 1.2. Implement the discrete AdaBoost algorithm using an ID3 decision stump as the base learner. Make your code as modular as possible: the main AdaBoost module should treat the base learner as a black box and communicate with it through a generic interface that takes weighted examples as input and returns a classifier, which can then classify any instance. For the decision stump, you may modify your ID3 implementation from Homework 1 or implement one from scratch. To accommodate weights, create a wrapper that takes a training set of weighted examples and uses sampling with replacement to generate a new training set of the same size, drawn according to the weight of each example; the wrapper then passes this set to the decision stump and obtains the classifier. Plug this wrapper and your decision stump implementation into the AdaBoost algorithm (a sketch of this structure appears after the list below). Run the following experiments with the dataset from Homework 1 and answer the questions. Use information gain as the evaluation criterion. Do not use pruning.

      • Compare and analyze the accuracies obtained by the different learners: the decision stump alone, boosting with 30 rounds, and your full ID3 implementation.

      • Compare and analyze the accuracies obtained by boosting with different numbers of rounds: 5, 10, 20, and 30.
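      The following is a minimal sketch of the modular structure described in 1.2, not a reference solution. The names (adaboost, resample, base_learner) are illustrative; it assumes examples are (features, label) pairs with labels in {-1, +1}, and the base learner is any black box mapping a training set to a classifier.

        import math
        import random

        def resample(examples, weights):
            """The wrapper step: draw a new training set of the same size by
            sampling with replacement, proportional to each example's weight."""
            return random.choices(examples, weights=weights, k=len(examples))

        def adaboost(examples, base_learner, rounds):
            """Discrete AdaBoost over a black-box base learner."""
            n = len(examples)
            weights = [1.0 / n] * n
            ensemble = []  # list of (alpha, classifier) pairs
            for _ in range(rounds):
                h = base_learner(resample(examples, weights))
                # Weighted training error of this round's hypothesis.
                err = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
                if err >= 0.5:  # no better than chance on the weighted set: skip
                    continue
                err = max(err, 1e-10)  # guard against division by zero
                alpha = 0.5 * math.log((1.0 - err) / err)
                ensemble.append((alpha, h))
                # Reweight: mistakes go up, correct examples go down; renormalize.
                weights = [w * math.exp(-alpha * y * h(x))
                           for w, (x, y) in zip(weights, examples)]
                z = sum(weights)
                weights = [w / z for w in weights]

            def classify(x):
                return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
            return classify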

  2. Learning theory.

    • 2.1. (Mitchell 7.2) Consider the class C of concepts of the form (a <= x <= b) ^ (c <= y <= d), where a, b, c, and d are integers in the interval (0, 99). Note that each concept in this class corresponds to a rectangle with integer-valued boundaries on a portion of the x, y plane.

      (a) Give an upper bound on the number of randomly drawn training examples sufficient to assure that for any target concept c in C, any consistent learner using H = C will, with probability at least 95%, output a hypothesis with error at most 0.15.

      (b) Now suppose the rectangle boundaries a, b, c, and d take on real values instead of integer values. Update your answer to part (a).
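      As a reminder of the relevant tool (not part of the answer), Mitchell's sample-complexity bound for a consistent learner over a finite hypothesis space H is

        m \ge \frac{1}{\epsilon} \left( \ln|H| + \ln\frac{1}{\delta} \right)

      with epsilon = 0.15 and delta = 0.05 here. Part (a) reduces to counting |H|; in part (b) H is infinite, so a bound in terms of the VC dimension is needed instead.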

    • 2.2. (Mitchell 7.5) Consider the space of instances X corresponding to all points in the x,y plane. Give the VC dimension of the following hypothesis spaces:

      (a) Hr = the set of all rectangles in the x, y plane, i.e., Hr = {(a < x < b) ^ (c < y < d) | a, b, c, d are reals}.

      (b) Hc = circles in the x, y plane. Points inside the circle are classified as positive examples.

      (c) Ht = triangles in the x, y plane. Points inside the triangle are classified as positive examples.

  3. Support vector machine.

    • LIBSVM is one of the most popular publicly available SVM implementations. In this exercise, you will become familiar with using the software for classification. Download the software and read the documentation to understand how to use it.

    • Write a script to convert the Reuters dataset into LIBSVM format (a sketch appears below). Make sure you use tools/checkdata.py to verify that the format is correct. See the LIBSVM README for details.
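      A minimal sketch of such a conversion script follows. It assumes a hypothetical dense input layout (one example per line: a label followed by whitespace-separated feature values); adapt the parsing to the actual Reuters file format, then verify the output with tools/checkdata.py.

        def convert_to_libsvm(in_path, out_path):
            """Write LIBSVM's sparse format: <label> <index>:<value> ...,
            with 1-based increasing indices and zero-valued features omitted.
            Assumes a dense input file: label then feature values per line."""
            with open(in_path) as src, open(out_path, "w") as dst:
                for line in src:
                    fields = line.split()
                    if not fields:
                        continue
                    label, values = fields[0], fields[1:]
                    pairs = ["%d:%s" % (i, v)
                             for i, v in enumerate(values, start=1)
                             if float(v) != 0.0]
                    dst.write("%s %s\n" % (label, " ".join(pairs)))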

    • Run LIBSVM to classify the documents. Try each of the four built-in kernels (-t 0 through 3) and use the default parameters for everything else (see the sketch below). How does the accuracy vary with the choice of kernel? Compare the accuracy with that obtained by k-NN and Naive Bayes, and analyze your findings.
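      One way to automate the comparison, sketched below: loop over LIBSVM's four kernel types via svm-train's -t option and evaluate each model with svm-predict (both are the standard LIBSVM command-line tools; the file names are placeholders for your converted Reuters files).

        import subprocess

        KERNELS = {0: "linear", 1: "polynomial", 2: "RBF", 3: "sigmoid"}
        TRAIN, TEST = "reuters.train", "reuters.test"  # placeholder names

        for t, name in KERNELS.items():
            model = "reuters.%d.model" % t
            # Train with kernel type t; all other parameters at their defaults.
            subprocess.check_call(["svm-train", "-t", str(t), TRAIN, model])
            # svm-predict prints the test accuracy to stdout.
            print("kernel %d (%s):" % (t, name))
            subprocess.check_call(["svm-predict", TEST, model, "pred.%d" % t])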

  4. Turn in the following:

    • Your code for Problem 1, submitted online. The submission deadline is 11:30 PM, March 13, 2009. Your code should contain appropriate comments to facilitate understanding.

    • A report with your answers to the homework problems. For the programming task, in addition to answering the questions, describe at a high level how your code works.

Good luck, and have fun!