Problem Set 4 - Ensembles
Due Mar 9, 11:59pm. Revised late homework policy: 10% off if late by less than one day; 25% off if late by one to two days; an additional 25% off for each extra day after that.
Your goal in this homework is to implement and run experiments comparing two ensemble methods, bagging and boosting, with two different base learners: decision stumps and ID3 decision trees.
Our data was provided by the UCI Machine Learning Repository and can be found in the Molecular Biology (Promoter Gene Sequences) Data Set. It has 106 instances and 57 features. We randomly split the data set into training (71 instances) and test (35 instances) sets, both of which are well balanced.
You can download training.txt and test.txt. Each DNA sequence is represented by one line, with the first 57 characters (each one of 'a', 'g', 'c', and 't') representing the sequence, and the last character (separated from the sequence by a space) indicating the class ('+' for promoter, '-' for non-promoter).
We provide support code (Ensemble.java / ensemble.py) for Part A. We have implemented the uninteresting functions (e.g. reading datasets into memory, I/O, computing accuracy) for you. Your job is to follow the instructions below and finish the ensemble algorithms.
Note: Since this problem set involves random sampling, when reporting your accuracy we suggest you run the experiment 10 times and take the average.
Part A
Stump: a decision stump is a decision tree of depth one (i.e., it branches on one attribute and then makes its decisions). You can find more details on the Wikipedia page.
- (1 point) Implement class Stump by finishing two functions: learning is done in Stump.learn and classification in Stump.classify. You can use either misclassification impurity or information gain as the evaluation criterion. (Our suggestion: you may find it easier to debug your AdaBoost with misclassification impurity, since it tends to lead to fewer bugs.) Question 1: In your writeup, tell us the accuracy when you apply this stump to the test set.
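If you are not sure where to start, here is a minimal sketch of a stump that uses misclassification impurity. It assumes each instance is a (features, label) pair, where features is the 57-character sequence and label is '+' or '-'; the actual Stump.learn / Stump.classify signatures in the support code may differ.

    from collections import Counter

    class Stump:
        def learn(self, instances):
            # Try every attribute; keep the one whose per-value majority votes
            # make the fewest mistakes (misclassification impurity).
            best_errors = len(instances) + 1
            for attr in range(len(instances[0][0])):
                votes = {}
                for features, label in instances:
                    votes.setdefault(features[attr], Counter())[label] += 1
                errors = sum(sum(c.values()) - max(c.values()) for c in votes.values())
                if errors < best_errors:
                    best_errors = errors
                    self.attr = attr
                    self.branch = {v: c.most_common(1)[0][0] for v, c in votes.items()}
            # Fallback label for attribute values never seen during training.
            self.default = Counter(label for _, label in instances).most_common(1)[0][0]
            return self

        def classify(self, features):
            return self.branch.get(features[self.attr], self.default)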
Bagging: Given a standard training set D of size N, bagging generates M new training sets D_m, each of size L = N, by sampling examples from D uniformly and with replacement. By sampling with replacement, it is likely that some examples will be repeated in each D_m. This kind of sample is known as a bootstrap sample. The M models are trained on the M bootstrap samples and combined by voting. You can find more details on the Wikipedia page.
- (0.5 point) Uniform resampling with replacement: implement the function Ensemble.sampleByUniform. The return value of this function is a list of L elements sampled uniformly, with replacement, from the input instances.
- (Extra credit 1 point) Question 2: Wikipedia tells us that if L = N, then for large N the set D_m is expected to contain 63.2% of the unique examples of D, the rest being duplicates.
You need to validate this conclusion empirically: sample 71 instances from the training set and count how many unique examples you get; repeat this process 100 times and tell us the average. (One way to run this check is included in the sketch after this list.)
- (2 points) Bagging: implement the function Ensemble.baggingClassify, using the Ensemble.sampleByUniform you implemented in the last step. The return value of Ensemble.baggingClassify is a list of predictions for the test instances. Question 3: In your writeup, tell us your accuracy on the test instances with L = 71 and M = 100. (A rough sketch of how these pieces might fit together follows this list.)
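Below is a rough sketch of how uniform resampling and the bagging vote could fit together, using the same instance representation as the stump sketch above; the real Ensemble.sampleByUniform / Ensemble.baggingClassify signatures in the support code may differ. The last two lines show one way to run the Question 2 check.

    import random
    from collections import Counter

    def sample_by_uniform(instances, L):
        # Bootstrap sample: draw L instances uniformly at random, with replacement.
        return [random.choice(instances) for _ in range(L)]

    def bagging_classify(train, test, L, M):
        # Train M base learners on M bootstrap samples, then combine by majority vote.
        stumps = [Stump().learn(sample_by_uniform(train, L)) for _ in range(M)]
        predictions = []
        for features, _ in test:
            votes = Counter(s.classify(features) for s in stumps)
            predictions.append(votes.most_common(1)[0][0])
        return predictions

    # Question 2 check: average number of unique indices in a bootstrap sample of
    # size L = N = 71, over 100 trials (should land near 0.632 * 71).
    unique = [len(set(random.choices(range(71), k=71))) for _ in range(100)]
    print(sum(unique) / 100.0)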
AdaBoost: You have learned boosting in class. Let us first use Stump as the base learner. You should follow the AdaBoost pseudocode (a sketch of one standard resampling formulation appears after the bullets below).
Note that you must store all K stumps because you need all of them to classify test instances.
- (1 point) Resampling by weight: finish Ensemble.sampleByWeight(S, pr, L), as described in the pseudocode. The idea is to generate a new training set S_r, resampled from S according to the weights. For example, suppose the training set S has three examples (A, B, C) and the weights are (0.8, 0.1, 0.1). If we draw 100 elements from S, we would expect about 80 of them to be A, about 10 to be B, and about 10 to be C in S_r. Compared to uniform sampling with replacement, the only difference is that the distribution is defined by the weights instead of being uniform.
- (2 points) AdaBoost: finish Ensemble.adaboostClassify by following the pseudocode. Question 4: In your writeup, tell us the accuracy with round number K = 100, always resampling L = 71 instances each round. Do you achieve better accuracy than bagging? Why?
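Since the pseudocode figure is not reproduced in this text, here is a sketch of one standard AdaBoost-by-resampling formulation consistent with the function names above; the error handling, the choice of alpha, and the signatures are assumptions, not the support code's exact interface.

    import math
    import random
    from collections import defaultdict

    def sample_by_weight(instances, weights, L):
        # Draw L instances with probability proportional to the given weights.
        return random.choices(instances, weights=weights, k=L)

    def adaboost_classify(train, test, K, L):
        N = len(train)
        weights = [1.0 / N] * N
        stumps, alphas = [], []
        for _ in range(K):
            stump = Stump().learn(sample_by_weight(train, weights, L))
            # Weighted error is measured on the original training set, not the resample.
            error = sum(w for w, (x, y) in zip(weights, train) if stump.classify(x) != y)
            if error >= 0.5:               # no better than chance: discard this round
                continue
            error = max(error, 1e-10)      # guard against a perfect (error = 0) round
            alpha = 0.5 * math.log((1.0 - error) / error)
            # Increase weights of misclassified examples, decrease the rest, normalize.
            weights = [w * math.exp(alpha if stump.classify(x) != y else -alpha)
                       for w, (x, y) in zip(weights, train)]
            total = sum(weights)
            weights = [w / total for w in weights]
            stumps.append(stump)
            alphas.append(alpha)
        predictions = []
        for features, _ in test:
            score = defaultdict(float)
            for stump, alpha in zip(stumps, alphas):
                score[stump.classify(features)] += alpha   # weighted vote over all stored stumps
            predictions.append(max(score, key=score.get))
        return predictions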
Part B
Now we consider using an ID3 decision tree as the base learner for bagging and AdaBoost.
You can try to incorporate your ps1 ID3 implementation into your ensemble implementations. If you feel your ID3 from ps1 is not perfect, you are free to use any open-source ID3 implementation (e.g. the Weka library) you can find, or you can use our ID3 implementation, ID3.java.
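If you end up writing your own, a minimal unpruned ID3 with information gain can expose the same learn/classify interface as the stump sketch above, so the bagging and AdaBoost code does not need to change. The sketch below is illustrative only; it is not the interface of the provided ID3.java.

    import math
    from collections import Counter

    class ID3:
        def learn(self, instances, attrs=None):
            labels = [y for _, y in instances]
            self.default = Counter(labels).most_common(1)[0][0]
            if attrs is None:
                attrs = list(range(len(instances[0][0])))
            if len(set(labels)) == 1 or not attrs:
                self.attr = None                      # leaf: predict the majority label
                return self
            # Split on the attribute with the highest information gain.
            self.attr = max(attrs, key=lambda a: self._gain(instances, a))
            rest = [a for a in attrs if a != self.attr]
            self.children = {}
            for value in set(x[self.attr] for x, _ in instances):
                subset = [(x, y) for x, y in instances if x[self.attr] == value]
                self.children[value] = ID3().learn(subset, rest)
            return self

        def classify(self, features):
            if self.attr is None or features[self.attr] not in self.children:
                return self.default
            return self.children[features[self.attr]].classify(features)

        @staticmethod
        def _entropy(labels):
            n = float(len(labels))
            return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

        def _gain(self, instances, attr):
            groups = {}
            for x, y in instances:
                groups.setdefault(x[attr], []).append(y)
            n = float(len(instances))
            return self._entropy([y for _, y in instances]) - sum(
                (len(ys) / n) * self._entropy(ys) for ys in groups.values())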
- (0.5 point) Question 5: Alice thinks ID3 without pruning cannot be used as the base learner for AdaBoost because the error will be 0 in the first round. Tell us why she is wrong.
- (3 points) Question 6: tell us the accuracy of the four strategies, (Stump, ID3) × (Bagging, AdaBoost), on the test set. For ID3, use the full tree without pruning, and use information gain as the criterion. For bagging and boosting, set L = 71. For bagging, set M = 100. For AdaBoost, set the round number to 100. We want you to carefully consider the differences among these four strategies. In your report, give us your ranking: Strategy 1, Strategy 2, Strategy 3, Strategy 4. We expect to see your analysis: why would strategy 2 be more accurate than strategy 1? Strategy 3 more accurate than 2? Strategy 4 more accurate than 3? If two of your strategies are very close to each other, tell us what you make of that as well.
Bias and Variance (Extra Credit)
- (2 points) Question 7: Measure the bias and variance of the four strategies, (Stump, ID3) × (Bagging, AdaBoost). Compare the results and explain why (e.g., why strategy A's bias is larger than B's). You can refer to Pedro's paper and implementation (paper and C code) for more details. Please generate 30 bootstrap samples from the original training set and use them when you compute the bias and variance.
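One common simplification of the 0-1-loss bias-variance decomposition (ignoring the noise term, since each test instance here has a single observed label) is sketched below; train_strategy is a hypothetical placeholder for whichever ensemble strategy you are measuring.

    from collections import Counter

    def bias_variance(models, test):
        # models: one classifier per bootstrap sample. For each test point, the
        # "main prediction" is the most common prediction across models; bias is
        # whether it differs from the true label, and variance is how often the
        # individual predictions differ from the main prediction.
        bias = variance = 0.0
        for features, label in test:
            preds = [m.classify(features) for m in models]
            main = Counter(preds).most_common(1)[0][0]
            bias += 1.0 if main != label else 0.0
            variance += sum(1 for p in preds if p != main) / float(len(preds))
        n = float(len(test))
        return bias / n, variance / n

    # Example use (train_strategy is a placeholder for your bagged/boosted learner):
    # models = [train_strategy(sample_by_uniform(train, len(train))) for _ in range(30)]
    # print(bias_variance(models, test))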
Code
We provide support code to help you focus on the most interesting parts of the homework. We answered all the questions by completing this code ourselves and did not find any problems. However, we are not sure how you will use this code, so we cannot guarantee it is bug-free under all circumstances. You can change anything you want (or just start from scratch). Remember to add a section "Support code changes" to your report if you do.
Submission
writeup: You should first answer our questions. We recommend you highlight the question numbers (e.g. Q1, Q2, ...) so that we do not miss your answers. We expect to see not only an accuracy number but also your understanding. For example, if you just give us a number saying the accuracy of your Stump is 99% (which is wrong), we can hardly figure out whether the error comes from a simple bug or from a misunderstanding of the algorithm. One thing you can do is include the 1-2 most important lines of code in your report; this will help you get a higher grade when your implementation is not perfect. You should also tell us how you have changed our support code, if at all. You can also mention anything else important. At most 4 pages.
a directory named "partA" under your root directory: finish the support code for Part A by following our instructions. We expect to see training.txt, test.txt, Ensemble.java/ensemble.py, and run.sh/run.bat in this directory. We will execute
./run.sh
and expect to see the accuracies of the different algorithms printed to the screen. For example, if you run our support code, you will see
Stump accuracy on training 0.0
Stump accuracy on test 0.0
Bagging accuracy on training 0.0
Bagging accuracy on test 0.0
Adaboost accuracy on training 0.0
Adaboost accuracy on test 0.0
You can use the same printing style. (You are not required to run your learner on the training set, but we suggest you do so because it sometimes gives you hints for debugging your algorithms.)
a directory named "partB" under your root directory: anything you need for Part B. At minimum you should include training.txt, test.txt, and run.bat/run.sh in this directory. We will execute
./run.sh
in this directory and expect to see the accuracies of the different algorithms printed to the screen.
If you have measured the bias and variance of your algorithms, try to print these numbers to the screen when we run your code.