Problem Set 4 - Ensembles

Due Mar 9, 11:59pm. Revised late homework policy: 10% off if late by less than one day; 25% off if late by one to two days; an additional 25% off for each extra day after that.

Your goal in this homework is to implement and run experiments comparing two ensemble methods, bagging and boosting, on two different base learners: decision stumps and ID3 decision trees. Our data comes from the UCI Machine Learning Repository's Molecular Biology (Promoter Gene Sequences) Data Set. It has 106 instances and 57 features. We randomly split the data set into a training set (71 instances) and a test set (35 instances), both of which are well balanced.

You can download training.txt and test.txt. Each DNA sequence is represented by one line: the first 57 characters (each one of 'a', 'g', 'c', and 't') give the sequence, and the last character (separated from the sequence by a space) indicates the class ('+' for promoter, '-' for non-promoter).
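
For concreteness, a minimal loader for this format could look like the sketch below. It is an illustration only, not the support code's reader; it maps '+'/'-' to +1/-1 labels, a convention the later sketches reuse.

    # Minimal loader for the promoter data format described above.
    # Each line: 57 nucleotide characters, a space, then '+' or '-'.
    def load_data(path):
        examples = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                sequence, label = line.split()
                # Treat each of the 57 positions as one categorical feature.
                examples.append((list(sequence), 1 if label == '+' else -1))
        return examples

    train_data = load_data('training.txt')
    test_data = load_data('test.txt')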

We provide support code (Ensemble.java / ensemble.py) for Part A. We have implemented the less interesting functions (e.g. reading datasets into memory, I/O, computing accuracy) for you. Your job is to follow the instructions below and complete the ensemble algorithms.

Note: Since this problem set involves random sampling, when reporting your accuracy we suggest you run each experiment 10 times and take the average.
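
For example, a small helper along the following lines can do the averaging; run_once is a hypothetical placeholder for whatever trains one model and returns its test accuracy.

    # Average a randomized experiment over several runs.
    def average_accuracy(run_once, n_runs=10):
        # run_once() should train one model and return its test accuracy.
        return sum(run_once() for _ in range(n_runs)) / n_runs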

Part A

  • Stump: a decision stump is a decision tree of depth one (i.e., it branches on a single attribute and immediately makes its predictions). You can find more details on the Wikipedia page. A hedged implementation sketch appears after this list.
  • Bagging: Given a standard training set D of size N, bagging generates M new training sets D_m, each of size N (the same size as D), by sampling examples from D uniformly and with replacement. Because sampling is done with replacement, some examples are likely to be repeated in each D_m. This kind of sample is known as a bootstrap sample. The M models are trained on these M bootstrap samples and combined by voting. You can find more details on the Wikipedia page, and a sketch after this list.

  • Adaboost: You have learned boosting in class. Let us first use the stump as the base learner. You should follow the pseudocode below (an implementation sketch also appears after this list):

    [AdaBoost pseudocode]

    Note that you must store all K stumps because you need all of them to classify test instances.
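
To make the bullets above concrete, here are minimal sketches in the spirit of ensemble.py. They are illustrations under the conventions of the loader sketch above (labels in {+1, -1}, each of the 57 positions treated as a categorical attribute), not necessarily the interface of our support code. First, a decision stump: it tries every position, predicts the weighted-majority label for each nucleotide value at that position, and keeps the position with the lowest weighted error. The optional weights argument exists so AdaBoost can reuse the same routine; with no weights it reduces to an ordinary stump.

    from collections import defaultdict

    def train_stump(examples, weights=None):
        # examples: list of (features, label) pairs with label in {+1, -1};
        # weights: one non-negative weight per example (uniform if None).
        n = len(examples)
        if weights is None:
            weights = [1.0 / n] * n
        best = None
        num_attrs = len(examples[0][0])
        for attr in range(num_attrs):
            # Weighted vote per attribute value: a positive sum means '+'.
            votes = defaultdict(float)
            for (x, y), w in zip(examples, weights):
                votes[x[attr]] += w * y
            rule = {v: (1 if s >= 0 else -1) for v, s in votes.items()}
            error = sum(w for (x, y), w in zip(examples, weights)
                        if rule.get(x[attr], 1) != y)
            if best is None or error < best[0]:
                best = (error, attr, rule)
        _, attr, rule = best
        # The stump branches on one attribute and predicts the weighted
        # majority label for each of that attribute's values.
        return lambda x: rule.get(x[attr], 1)

    def accuracy(model, examples):
        return sum(model(x) == y for x, y in examples) / len(examples)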
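
A bagging sketch under the same assumptions: draw M bootstrap samples of size N with replacement, train one base model on each, and combine them by unweighted majority vote. M = 20 is just a placeholder; the assignment does not fix it here.

    import random

    def bagging(examples, train_base, M=20):
        # Train M base models, each on a bootstrap sample of size N drawn
        # from the training set uniformly and with replacement.
        N = len(examples)
        models = [train_base(random.choices(examples, k=N)) for _ in range(M)]
        # Combine by majority vote (labels are +1/-1, so the sign decides).
        def predict(x):
            return 1 if sum(m(x) for m in models) >= 0 else -1
        return predict

For example, accuracy(bagging(train_data, train_stump), test_data) would give the test accuracy of bagged stumps.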
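
Since the pseudocode figure is not reproduced here, the sketch below shows one common formulation of AdaBoost; its details (e.g. the exact weight update) may differ from the pseudocode the assignment refers to. Each round trains a stump on the current example weights, computes its weighted error, gives it a vote alpha = 0.5 * ln((1 - error) / error), and reweights the examples so that misclassified ones count more in the next round. All K (alpha, stump) pairs are stored, because the final prediction is the sign of their weighted vote. Clamping the error away from 0 and 1 avoids division by zero when a stump fits the weighted sample perfectly.

    import math

    def adaboost(examples, train_base, K=20):
        # One common AdaBoost formulation; labels are assumed to be +1/-1.
        N = len(examples)
        w = [1.0 / N] * N               # example weights, kept normalized
        hypotheses = []                 # all K (alpha, model) pairs are stored
        for _ in range(K):
            h = train_base(examples, w)
            # Weighted training error of this round's base model.
            eps = sum(wi for (x, y), wi in zip(examples, w) if h(x) != y)
            eps = min(max(eps, 1e-10), 1 - 1e-10)
            alpha = 0.5 * math.log((1 - eps) / eps)
            hypotheses.append((alpha, h))
            # Up-weight misclassified examples, down-weight correct ones.
            w = [wi * math.exp(-alpha * y * h(x))
                 for (x, y), wi in zip(examples, w)]
            total = sum(w)
            w = [wi / total for wi in w]
        # Final classifier: sign of the alpha-weighted vote of all rounds.
        def predict(x):
            return 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
        return predict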

Part B

Now we consider using an ID3 decision tree as the base learner for bagging and Adaboost. You can try to incorporate your ps1 ID3 implementation into your ensemble implementations. If you feel your ID3 from ps1 is not perfect, you are free to use any open-source ID3 implementation you can find (e.g. the Weka library), or you can use our ID3 implementation, ID3.java. One possible way to plug in a different base learner is sketched below.
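
Because the Part A sketches take the base learner as a function argument, switching from stumps to ID3 can be as small a change as passing a different training routine. The adapter below is purely hypothetical: build_id3_tree and classify_with_tree stand in for whatever interface your ps1 code, ID3.java, or Weka actually exposes, and handling the example weights by resampling is only one of several reasonable options.

    import random

    # Hypothetical adapter: wrap an ID3 trainer so it matches the
    # (examples, weights) -> classifier signature used by the sketches above.
    def train_id3(examples, weights=None):
        if weights is not None:
            # For Adaboost, one simple option is to resample the training
            # set according to the example weights and train ID3 on that.
            examples = random.choices(examples, weights=weights, k=len(examples))
        tree = build_id3_tree(examples)          # your ps1 / ID3.java / Weka tree
        return lambda x: classify_with_tree(tree, x)

    # bagged_id3  = bagging(train_data, train_id3)
    # boosted_id3 = adaboost(train_data, train_id3)
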
  • Bias and Variance (Extra Credit)

Code

We provide support code to help you focus on the most interesting parts of the homework. We have answered all of the questions by completing this code ourselves and did not run into problems, but since we cannot anticipate how you will use it, we cannot guarantee it is bug-free under all circumstances. You may change anything you want (or just start from scratch). If you do, remember to add a section called "Support code changes" to your report.

Submission

  • writeup: You should first answer our questions. We recommend highlighting the question numbers (e.g. Q1, Q2, ...) so that we do not miss your answers. We expect to see not only an accuracy number but also your understanding. For example, if you just give us a number saying the accuracy of your stump is 99% (which is wrong), we can hardly tell whether the error comes from a simple bug or from a misunderstanding of the algorithm. One thing you could do is include the one or two most important lines of code in your report; this will help you earn a higher grade when your implementation is not perfect. You should also tell us how you have changed our support code, if at all, and anything else that is important. At most 4 pages.
  • a directory named "partA" under your root directory: complete our support code for Part A by following the instructions above. We expect to see training.txt, test.txt, Ensemble.java/ensemble.py, and run.sh/run.bat in this directory. We will execute
     ./run.sh
    and expect to see the accuracies of the different algorithms printed to the screen. For example, if you run our support code, you will see
    Stump accuracy on training      0.0
    Stump accuracy on test    0.0
    Bagging accuracy on training    0.0
    Bagging accuracy on test  0.0
    Adaboost accuracy on training        0.0
    Adaboost accuracy on test      0.0
    
    You can use the same printing style. (You are not required to run your learner on the training set, but we suggest you do, because the training accuracy sometimes gives you hints for debugging your algorithms.)
  • a directory named "partB" under your root directory: anything you need for Part B. At a minimum, include training.txt, test.txt, and run.sh/run.bat in this directory. We will execute
     ./run.sh
    in this directory and expect to see the accuracies of the different algorithms printed to the screen.
  • If you have measured the bias and variance of your algorithms, try to print these numbers to the screen when we run your code.